Issue opened by kpalin; closed 4 years ago.
Thanks for the feedback @kpalin and sorry for the issues.
1) Updated the README.md with more detailed installation instructions.
2) Run h5dump on a read pre and post the h5repack to see how the filters change, i.e.:

```
$ h5dump -d "${DATASET_NAME}" -H -p read.fast5 | grep -A6 FILTERS
   FILTERS {
      USER_DEFINED_FILTER {
         FILTER_ID 32020
         COMMENT vbz
         PARAMS { 0 2 1 1 }
      }
   }
```
Attached is the complete h5dump of PAD64960_b924920b26f29f6118f165b040093705d5beb46b_106_fast5_pass.VBZ_afterdeb.fast5, generated with the above command.

The input file PAD64960_b924920b26f29f6118f165b040093705d5beb46b_106_fast5_pass.fast5 is the direct output of stand-alone Guppy v3.2.1 with the dna_r9.4.1_450bps_modbases_dam-dcm-cpg_hac.cfg model. Before that, the file had been basecalled live with the fast model on a PromethION.
So that file doesn't appear to contain any raw signal datasets. Can you share the original input file?
```
$ grep -c Raw PAD64960_b924920b26f29f6118f165b040093705d5beb46b_106_fast5_pass.txt
0
$ grep -c vbz PAD64960_b924920b26f29f6118f165b040093705d5beb46b_106_fast5_pass.txt
1
$ sed -n '/DATASET/,/FILLVALUE/p' PAD64960_b924920b26f29f6118f165b040093705d5beb46b_106_fast5_pass.txt
DATASET "Fastq" {
   DATATYPE H5T_STRING {
      STRSIZE 2830;
      STRPAD H5T_STR_NULLTERM;
      CSET H5T_CSET_ASCII;
      CTYPE H5T_C_S1;
   }
   DATASPACE SCALAR
   STORAGE_LAYOUT {
      CONTIGUOUS
      SIZE 2830
      OFFSET 6056
   }
   FILTERS {
      NONE
   }
   FILLVALUE {
DATASET "Fastq" {
   DATATYPE H5T_STRING {
      STRSIZE 2736;
      STRPAD H5T_STR_NULLTERM;
      CSET H5T_CSET_ASCII;
      CTYPE H5T_C_S1;
   }
   DATASPACE SCALAR
   STORAGE_LAYOUT {
      CONTIGUOUS
      SIZE 2736
      OFFSET 14206
   }
   FILTERS {
      NONE
   }
   FILLVALUE {
DATASET "ModBaseProbs" {
   DATATYPE H5T_STD_U8LE
   DATASPACE SIMPLE { ( 1284, 6 ) / ( 1284, 6 ) }
   STORAGE_LAYOUT {
      CHUNKED ( 128, 6 )
      SIZE 0 (0.000:1 COMPRESSION)
   }
   FILTERS {
      USER_DEFINED_FILTER {
         FILTER_ID 32020
         COMMENT vbz
      }
   }
   FILLVALUE {
```
VBZ was applied to the ModBaseProbs but won't be effective on that dataset and is probably the cause of the increased size.
The dump above ('afterdeb') is from the result of step 4 in the original, i.e. when h5repack really uses VBZ. And you're right, the output does not contain any raw signal data.

The input data is 100k reads, 8.8GB in total with the 'normal' compression (8.7GB with GZIP=1, 8.3GB with GZIP=9). With no VBZ installed, the h5repack command in the README still 'works' but produces a file of 19GB. The results (specifically the silent 'crash' with VBZ) are the same with the original fast5 file from PromethION live calling.

Unfortunately I can't share the read data.
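(For reference, GZIP figures like those above are typically produced with h5repack's built-in DEFLATE filter; the exact commands below are an assumption, not taken from the thread, and the file names are illustrative.)

```
h5repack -f GZIP=1 input.fast5 output_gzip1.fast5   # fast, larger output
h5repack -f GZIP=9 input.fast5 output_gzip9.fast5   # slow, smaller output
```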
Okay, so the repack without vbz that increases the size of the file is probably applying no compression at all (rather than GZIP). I will investigate further with input sets containing other dataset types such as ModBaseProbs to try and reproduce the silent crashing behaviour.

For now, can you try the script I just added, fast5vbz.py (requires h5py and the vbz plugin installed)? It will copy the input file and compress the signal datasets only; a repack is still needed after fast5vbz.py:
```
$ python fast5vbz.py /data/test/input.fast5
/data/test/input.fast5.tmp
$ h5repack /data/test/input.fast5.tmp /data/test/input.fast5.vbz
```
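For readers without access to the repo, here is a minimal sketch of what a fast5vbz.py-style pass could look like; it is not the actual script. It assumes the multi-read fast5 layout /<read_id>/Raw/Signal and reuses the filter ID 32020 and PARAMS { 0 2 1 1 } shown in the h5dump output earlier in the thread:

```python
# Sketch only, NOT the actual fast5vbz.py. Assumes the multi-read
# fast5 layout /<read_id>/Raw/Signal; the filter ID and cd_values are
# taken from the h5dump output above. Requires h5py and the vbz
# plugin discoverable via HDF5_PLUGIN_PATH.
import shutil
import sys

import h5py

VBZ_FILTER_ID = 32020    # registered HDF5 filter number for vbz
VBZ_OPTS = (0, 2, 1, 1)  # cd_values, as reported in PARAMS above


def recompress_signals(path):
    """Copy `path` to `path + '.tmp'` and rewrite every Raw/Signal
    dataset in the copy using the vbz filter."""
    tmp_path = path + ".tmp"
    shutil.copyfile(path, tmp_path)
    with h5py.File(tmp_path, "r+") as f:
        for read in f.values():
            raw = read.get("Raw")
            if raw is None or "Signal" not in raw:
                continue
            data = raw["Signal"][()]   # load the signal into memory
            del raw["Signal"]          # unlink the old dataset
            # h5py treats an integer `compression` argument as a
            # dynamically loaded filter number.
            raw.create_dataset(
                "Signal",
                data=data,
                chunks=True,
                compression=VBZ_FILTER_ID,
                compression_opts=VBZ_OPTS,
            )
    return tmp_path


if __name__ == "__main__":
    # Print the tmp path so it can be fed to h5repack, as above.
    print(recompress_signals(sys.argv[1]))
```

Unlinking a dataset does not free its storage within the HDF5 file, which is why the follow-up h5repack pass is still needed.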
https://github.com/nanoporetech/ont_fast5_api now has tools to repack files using vbz compression.
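For example, something like the following (flag names are from memory of ont_fast5_api's compress_fast5 tool; check `compress_fast5 --help` for the exact interface of your installed version):

```
pip install ont-fast5-api
compress_fast5 --input_path fast5_in/ --save_path fast5_vbz/ --compression vbz --threads 8
```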
A few issues:

Using libvbz_hdf_plugin.so from ont-vbz-hdf-plugin_1.0.0-1.xenial_amd64.deb and setting HDF5_PLUGIN_PATH makes the h5repack commands in the readme find the library, but repacking crashes (silently) on the first read (command from step 4 above).
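For context, a hedged reconstruction of that setup; the plugin path is illustrative (it depends on where the .deb installs libvbz_hdf_plugin.so) and h5repack's user-defined filter syntax varies between HDF5 versions:

```
# Illustrative paths/values, not taken verbatim from the report.
sudo dpkg -i ont-vbz-hdf-plugin_1.0.0-1.xenial_amd64.deb
export HDF5_PLUGIN_PATH=/path/to/dir/containing/libvbz_hdf_plugin.so
# Recent h5repack takes UD=<filter_id>,<flag>,<n_cd_values>,<values...>;
# the cd_values 0,2,1,1 mirror the PARAMS reported by h5dump above.
h5repack -f UD=32020,0,4,0,2,1,1 input.fast5 output.fast5
```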