Nonstarter - Githubissues

kpalin commented 5 years ago

A few issues:

1. The installation instructions are insufficient.
2. The Python wheel in the releases doesn't appear to have anything to do with the HDF5 library, so installing it does not help with e.g. h5repack.
3. (More of an HDF5 issue) If h5repack doesn't find the filter, it quietly expands the fast5 files to about 2.5x their size.
4. Extracting libvbz_hdf_plugin.so from ont-vbz-hdf-plugin_1.0.0-1.xenial_amd64.deb and setting HDF5_PLUGIN_PATH lets the h5repack commands in the README find the library, but repacking crashes (silently) on the first read.

Command for step 4 above:

$ h5repack -v -f UD=32020,5,0,0,2,1,1 PAD64960_b924920b26f29f6118f165b040093705d5beb46b_106_fast5_pass.fast5 PAD64960_b924920b26f29f6118f165b040093705d5beb46b_106_fast5_pass.VBZ_afterdeb.fast5 
No all objects to modify layout
All objects to apply filter are...
 User Defined 32020
Making new file ...
-----------------------------------------
 Type     Filter (Compression)     Name
-----------------------------------------
 group                       /
  attr                        file_version
 group                       /read_00086572-ac87-47d5-8a2d-8e64435e6833
  attr                        run_id
  attr                        pore_type
 group                       /read_00086572-ac87-47d5-8a2d-8e64435e6833/Analyses
 group                       /read_00086572-ac87-47d5-8a2d-8e64435e6833/Analyses/Basecall_1D_000
  attr                        name
  attr                        version
  attr                        time_stamp
  attr                        model_type
 group                       /read_00086572-ac87-47d5-8a2d-8e64435e6833/Analyses/Basecall_1D_000/BaseCalled_template
 dset       (1.000:1)        /read_00086572-ac87-47d5-8a2d-8e64435e6833/Analyses/Basecall_1D_000/BaseCalled_template/Fastq
 group                       /read_00086572-ac87-47d5-8a2d-8e64435e6833/Analyses/Basecall_1D_000/Summary
  attr                        return_status
 group                       /read_00086572-ac87-47d5-8a2d-8e64435e6833/Analyses/Basecall_1D_000/Summary/basecall_1d_template
  attr                        num_events
  attr                        sequence_length
  attr                        mean_qscore
  attr                        strand_score
  attr                        skip_prob
  attr                        stay_prob
  attr                        step_prob
  attr                        block_stride
  attr                        basecall_location
  attr                        basecall_scale
 group                       /read_00086572-ac87-47d5-8a2d-8e64435e6833/Analyses/Basecall_1D_001
  attr                        component
  attr                        model_type
  attr                        name
  attr                        segmentation
  attr                        time_stamp
  attr                        version
 group                       /read_00086572-ac87-47d5-8a2d-8e64435e6833/Analyses/Basecall_1D_001/BaseCalled_template
 dset       (1.000:1)        /read_00086572-ac87-47d5-8a2d-8e64435e6833/Analyses/Basecall_1D_001/BaseCalled_template/Fastq
  attr                        table_version
 dset     UD   (0.000:1)     /read_00086572-ac87-47d5-8a2d-8e64435e6833/Analyses/Basecall_1D_001/BaseCalled_template/ModBaseProbs
  attr                        modified_base_long_names
  attr                        output_alphabet
  attr                        table_version

iiSeymour commented 5 years ago

Thanks for the feedback @kpalin, and sorry for the issues.

1. We will improve the README.md with more detailed installation instructions.
2. The wheel provides pyvbz, which is a stand-alone Python library and doesn't provide (or need) the plugin (again, I'll update the main README.md).
3. Can you run h5dump on a read before and after the h5repack to see how the filters change, for example:

$ h5dump -d "${DATASET_NAME}" -H -p read.fast5 | grep -A6 FILTERS
FILTERS
  FILTERS {
     USER_DEFINED_FILTER {
        FILTER_ID 32020
        COMMENT vbz
        PARAMS { 0 2 1 1 }
     }
  }

Is this on the original read, or after step 3, where the file size has increased by 2.5x?

kpalin commented 5 years ago

Attached is the complete h5dump of PAD64960_b924920b26f29f6118f165b040093705d5beb46b_106_fast5_pass.VBZ_afterdeb.fast5, generated with the above command.

The input file PAD64960_b924920b26f29f6118f165b040093705d5beb46b_106_fast5_pass.fast5 is the direct output of stand-alone Guppy v3.2.1 with the dna_r9.4.1_450bps_modbases_dam-dcm-cpg_hac.cfg model. Before that, the file had been basecalled live with the fast model on a PromethION.

iiSeymour commented 5 years ago

So that file doesn't appear to contain any raw signal datasets. Can you share the original input file?

$ grep -c Raw PAD64960_b924920b26f29f6118f165b040093705d5beb46b_106_fast5_pass.txt
0
$ grep -c vbz PAD64960_b924920b26f29f6118f165b040093705d5beb46b_106_fast5_pass.txt
1
$ sed -n '/DATASET/,/FILLVALUE/p' PAD64960_b924920b26f29f6118f165b040093705d5beb46b_106_fast5_pass.txt
               DATASET "Fastq" {
                  DATATYPE  H5T_STRING {
                     STRSIZE 2830;
                     STRPAD H5T_STR_NULLTERM;
                     CSET H5T_CSET_ASCII;
                     CTYPE H5T_C_S1;
                  }
                  DATASPACE  SCALAR
                  STORAGE_LAYOUT {
                     CONTIGUOUS
                     SIZE 2830
                     OFFSET 6056
                  }
                  FILTERS {
                     NONE
                  }
                  FILLVALUE {
               DATASET "Fastq" {
                  DATATYPE  H5T_STRING {
                     STRSIZE 2736;
                     STRPAD H5T_STR_NULLTERM;
                     CSET H5T_CSET_ASCII;
                     CTYPE H5T_C_S1;
                  }
                  DATASPACE  SCALAR
                  STORAGE_LAYOUT {
                     CONTIGUOUS
                     SIZE 2736
                     OFFSET 14206
                  }
                  FILTERS {
                     NONE
                  }
                  FILLVALUE {
               DATASET "ModBaseProbs" {
                  DATATYPE  H5T_STD_U8LE
                  DATASPACE  SIMPLE { ( 1284, 6 ) / ( 1284, 6 ) }
                  STORAGE_LAYOUT {
                     CHUNKED ( 128, 6 )
                     SIZE 0 (0.000:1 COMPRESSION)
                  }
                  FILTERS {
                     USER_DEFINED_FILTER {
                        FILTER_ID 32020
                        COMMENT vbz
                     }
                  }
                  FILLVALUE {

VBZ was applied to the ModBaseProbs dataset, but it won't be effective on that dataset and is probably the cause of the increased size.
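The grep and sed checks above can also be done in one pass over the h5dump text. This is a rough sketch under assumptions: `filters_by_dataset` is a hypothetical helper, and it relies only on the `DATASET "..."` and `FILTER_ID ...` lines appearing as they do in the dump shown here. It reports bare dataset names rather than full paths, because h5dump nests groups instead of printing the path on each dataset line.

```python
import re

DATASET = re.compile(r'DATASET "(?P<name>[^"]+)"')
FILTER_ID = re.compile(r"FILTER_ID (?P<id>\d+)")


def filters_by_dataset(dump):
    """Return [(dataset_name, [filter_ids])] in the order datasets appear.

    `dump` is the text produced by h5dump with properties enabled (-p).
    A dataset whose FILTERS block is NONE ends up with an empty list.
    """
    result = []
    for line in dump.splitlines():
        match = DATASET.search(line)
        if match:
            result.append((match.group("name"), []))
            continue
        match = FILTER_ID.search(line)
        if match and result:
            # Attribute the filter to the most recently opened dataset.
            result[-1][1].append(int(match.group("id")))
    return result
```

For the dump in this comment, the result would list two Fastq datasets with no filters and ModBaseProbs with filter 32020, which makes it easy to see that no raw signal dataset received VBZ.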

kpalin commented 5 years ago

The dump above ('afterdeb') is from the result of step 4 in the original report, i.e. when h5repack really uses VBZ. And you're right, the output does not contain any data.

The input data is 100k reads, a total of 8.8GB 'normally' compressed (8.7GB with GZIP=1, 8.3GB with GZIP=9). With no VBZ installed, the h5repack command in the README still 'works' but produces a file of size 19GB. The results (specifically the 'crash' with VBZ) are the same with the original fast5 file from PromethION live calling.

Unfortunately I can't share the read data.

iiSeymour commented 5 years ago

Okay, so the repack without vbz, which increases the size of the file, is probably writing the data with no compression (instead of the original GZIP). I will investigate further with input sets containing other dataset types, such as ModBaseProbs, to try to reproduce the silent crashing behaviour.

For now, can you try the script I just added, fast5vbz.py (requires h5py and the vbz plugin to be installed)? It copies the input file and compresses the signal datasets only; a repack is still needed after running fast5vbz.py.

$ python fast5vbz.py /data/test/input.fast5
/data/test/input.fast5.tmp
$ h5repack /data/test/input.fast5.tmp /data/test/input.fast5.vbz
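Because a missing plugin does not stop h5repack, as reported earlier in this thread, a quick pre-flight check before either step may save a wasted run. The sketch below is illustrative only: `find_plugin` is a hypothetical helper, it looks only at the directories listed in HDF5_PLUGIN_PATH (not at HDF5's compiled-in default plugin directory), and finding the file does not prove that HDF5 can actually load it.

```python
import os
from pathlib import Path


def find_plugin(filename="libvbz_hdf_plugin.so", env=None):
    """Return the first matching plugin file on HDF5_PLUGIN_PATH, or None.

    `env` defaults to os.environ; passing a dict makes the check testable.
    Entries are split on os.pathsep (':' on Linux and macOS).
    """
    env = os.environ if env is None else env
    for directory in env.get("HDF5_PLUGIN_PATH", "").split(os.pathsep):
        if not directory:
            continue  # skip empty entries such as a trailing separator
        candidate = Path(directory) / filename
        if candidate.is_file():
            return candidate
    return None


if __name__ == "__main__":
    found = find_plugin()
    print(found if found else "vbz plugin not found on HDF5_PLUGIN_PATH")
```

The default file name matches the library extracted from the .deb earlier in this thread; other platforms would need a different name.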

0x55555555 commented 4 years ago

https://github.com/nanoporetech/ont_fast5_api now has tools to repack files using vbz compression.

nanoporetech / vbz_compression

Nonstarter #1