Hi @biolene -- there are some other compression options that can affect file size, such as shuffling. Do you see anything else in the compression settings when you use h5dump? Have you tried a comparison without using --in_place?
Both vbz and gzip are 100% lossless, so you will never lose any data. I'd be happy to look at one of your files to try and identify differences, but I would only expect to find small changes in compression settings.
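(As an aside, for anyone wanting to check those settings programmatically rather than reading h5dump output: a minimal sketch with h5py, assuming a multi-read fast5 layout; the path is a placeholder.)

```python
import h5py

path = "/path/to/your.fast5"  # placeholder -- substitute a real file

with h5py.File(path, "r") as f:
    for read_name in f:  # multi-read fast5: one "read_<uuid>" group per read
        signal = f[read_name]["Raw/Signal"]
        print(read_name)
        print("  compression:", signal.compression)       # 'gzip' for DEFLATE; third-party filters such as vbz show up as None here
        print("  level      :", signal.compression_opts)  # gzip level, if applicable
        print("  shuffle    :", signal.shuffle)           # True if the shuffle filter is enabled
        print("  chunks     :", signal.chunks)
        break  # first read only, for illustration
```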
If I don't use --in_place, I end up with a similar result, so also in that case the final gzip file is somewhat larger than the initial one (in fact also exactly 101.155 MB for the example file above, but I haven't tested it on more files).
I don't see a reference to shuffling in the compression settings. Below is a sample of 1) the h5dump output I get for the original fast5 file, and 2) the fast5 file we obtain after reconverting from vbz to gzip. I'm not that familiar with the hdf5 format, so I might be missing something obvious. I also tried to run h5diff to compare both files, but this gave me an error that some objects cannot be compared, so with that I more or less ran out of ideas...
1)

```
DATASET "Signal" {
   DATATYPE  H5T_STD_I16LE
   DATASPACE  SIMPLE { ( 15208 ) / ( H5S_UNLIMITED ) }
   STORAGE_LAYOUT {
      CHUNKED ( 15208 )
      SIZE 19088 (1.593:1 COMPRESSION)
   }
   FILTERS {
      COMPRESSION DEFLATE { LEVEL 1 }
   }
   FILLVALUE {
      FILL_TIME H5D_FILL_TIME_IFSET
      VALUE  H5D_FILL_VALUE_DEFAULT
   }
   ALLOCATION_TIME {
      H5D_ALLOC_TIME_INCR
   }
}
```
2)

```
DATASET "Signal" {
   DATATYPE  H5T_STD_I16LE
   DATASPACE  SIMPLE { ( 10520 ) / ( 10520 ) }
   STORAGE_LAYOUT {
      CHUNKED ( 5260 )
      SIZE 13979 (1.505:1 COMPRESSION)
   }
   FILTERS {
      COMPRESSION DEFLATE { LEVEL 1 }
   }
   FILLVALUE {
      FILL_TIME H5D_FILL_TIME_ALLOC
      VALUE  H5D_FILL_VALUE_DEFAULT
   }
   ALLOCATION_TIME {
      H5D_ALLOC_TIME_INCR
   }
}
```
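(In case it helps, a hedged sketch of pulling those same storage properties out of both files for matching reads with h5py, instead of eyeballing h5dump snippets; the paths are placeholders and a multi-read layout is assumed.)

```python
import h5py

before = "/path/to/original_gzip.fast5"   # placeholder paths
after = "/path/to/vbz_then_gzip.fast5"

with h5py.File(before, "r") as f1, h5py.File(after, "r") as f2:
    for read_name in f1:
        d1 = f1[read_name]["Raw/Signal"]
        d2 = f2[read_name]["Raw/Signal"]
        print(read_name)
        print("  shape    :", d1.shape, "vs", d2.shape)
        print("  maxshape :", d1.maxshape, "vs", d2.maxshape)  # None means H5S_UNLIMITED
        print("  chunks   :", d1.chunks, "vs", d2.chunks)
        print("  on disk  :", d1.id.get_storage_size(), "vs", d2.id.get_storage_size(), "bytes")
```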
Hi @biolene -- thanks very much for the detailed answer. Are those h5dump results for the same read_id?
Usually the reason you see "some objects cannot be compared" errors when running h5diff is because hdf5 doesn't know how to deal with vbz-compressed datasets. I'm not sure that applies here, as it's not clear to me if you're comparing two gzipped files or gzip and vbz, but you can solve that particular issue by setting the HDF5_PLUGIN_PATH environment variable to point to the location of the vbz plugin (there's one inside ont-fast5-api, around about here if you're using a virtual environment: ./venv/lib/pythonX.Y/site-packages/ont_fast5_api/vbz_plugin).
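(For completeness, a rough sketch of doing the same from Python; the plugin location is the virtual-environment example above and the file path is a placeholder, so both will differ per install.)

```python
import os

# HDF5 reads HDF5_PLUGIN_PATH when it loads filter plugins, so set it before
# the first vbz-compressed dataset is read (simplest: before importing h5py).
os.environ["HDF5_PLUGIN_PATH"] = "./venv/lib/pythonX.Y/site-packages/ont_fast5_api/vbz_plugin"  # adjust to your install

import h5py

with h5py.File("/path/to/vbz_compressed.fast5", "r") as f:  # placeholder path
    read_name = next(iter(f))
    print(f[read_name]["Raw/Signal"][:10])  # raises a filter error if the plugin isn't found
```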
The h5dump results were not for the same read, I just took two arbitrary snippets from the output.
I tried to compare two gzipped files, so I guess the issue you mention does not apply.
Unfortunately I'm not authorized to send you data (confidentiality reasons). If you have any suggestions on what I could still try, please let me know, and thanks in advance for your advice!
Hi @biolene -- that's ok, we'll try and work with it. If you have a chance to try and specify HDF5_PLUGIN_PATH and see if h5diff works, then that would be very helpful.
Hi @fbrennen, I've done as you suggested and specified the HDF5_PLUGIN_PATH, and then run h5diff again, but still got the 'Some objects are not comparable' message. When I run it with -c to get a list of objects, I get:
'Not comparable:
Excellent, that's great progress -- is that the only field that's raising an error? Could you show me what the same file_version field looks like in the two versions, when running h5dump?
file_version field for the 'original' gzipped file:
ATTRIBUTE "file_version" { DATATYPE H5T_STRING { STRSIZE 4; STRPAD H5T_STR_NULLTERM; CSET H5T_CSET_ASCII; CTYPE H5T_C_S1; } DATASPACE SCALAR DATA { (0): "1.0" } }
file_version field for the file that was compressed with vbz, and then back to gzip:
ATTRIBUTE "file_version" { DATATYPE H5T_STRING { STRSIZE H5T_VARIABLE; STRPAD H5T_STR_NULLTERM; CSET H5T_CSET_UTF8; CTYPE H5T_C_S1; } DATASPACE SCALAR DATA { (0): "2.0" }
Ah, great, ok -- ont_fast5_api always sets the version of the files it writes to the one it knows about (as it doesn't explicitly know what "version 1.0" requires), and it looks like we're also changing the type of the string at the same time (from ASCII to UTF-8).
If that's the only issue you're seeing then I would hope you're happy that, other than the version string, the contents of the files are the same. We'll add a section to the documentation on how to confirm it, for future reference -- this has been very helpful!
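(Until that documentation exists, here is a rough sketch of one way such a check could look, assuming a multi-read fast5 layout and using placeholder paths: read every raw signal from both gzip files and compare them element-wise.)

```python
import h5py
import numpy as np

original = "/path/to/original_gzip.fast5"    # placeholder paths
roundtrip = "/path/to/vbz_then_gzip.fast5"

with h5py.File(original, "r") as f1, h5py.File(roundtrip, "r") as f2:
    assert sorted(f1.keys()) == sorted(f2.keys()), "the two files contain different reads"
    for read_name in f1:
        s1 = f1[read_name]["Raw/Signal"][()]
        s2 = f2[read_name]["Raw/Signal"][()]
        assert np.array_equal(s1, s2), f"signal differs for {read_name}"

print("all raw signals are identical")
```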
Yes, maybe one last thing: if I run h5diff in verbose mode (-v) then I also see two types of warnings popping up, e.g.
```
group : and
0 differences found
Warning: different storage datatype
attribute: <pore_type of > and <pore_type of >
0 differences found
Warning: different storage datatype

dataset: </read_ffa16da7-9093-4f0f-ad3e-95da27ca37fc/Raw/Signal> and </read_ffa16da7-9093-4f0f-ad3e-95da27ca37fc/Raw/Signal>
Warning: different maximum dimensions
</read_ffa16da7-9093-4f0f-ad3e-95da27ca37fc/Raw/Signal> has max dimensions [18446744073709551615]
</read_ffa16da7-9093-4f0f-ad3e-95da27ca37fc/Raw/Signal> has max dimensions [247985]
0 differences found
```
But then again it also reports that 0 differences are found, so I should probably not be concerned about it?
Yes, that's right -- those are down to some of the slight differences in storage format you posted above, such as using H5S_UNLIMITED for storing Signal data instead of a finite size.
OK, I figured that. But that's great, thanks a million for your support! I'll close the issue now, since it's resolved :)
Glad to help, and good to see everything's coming out (almost) as expected. =)
Hi
We've been testing ont_fast5_api, and we encounter a recurring issue if we reconvert vbz-compressed fast5 files back to gzip. The gzipped fast5 files that we get after reconversion have a larger size than the original ones, although the compression level is 1 in both cases (according to h5dump results). An example of the commands we run:
Original fast5 file (gzipped): 97.072 MB
Conversion to vbz: compress_fast5 -i /path/to/fast5 --in_place --compression vbz
Conversion back to gzip: compress_fast5 -i /path/to/fast5 --in_place --compression gzip
And then the file size of this final file is 101.155 MB
What may be causing these differences, and is this cause for concern? We want to be sure that we do not lose any data with this compression.
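(Not an official recipe, just a sketch that may help narrow down where the extra ~4 MB lives: sum the on-disk storage of every dataset with h5py and compare it with the total file size, to see whether the growth is in the compressed data itself or in HDF5 metadata and free space. Paths are placeholders.)

```python
import os
import h5py

def dataset_bytes(path):
    """Sum the on-disk (compressed) storage of every dataset in an HDF5 file."""
    total = 0
    def visit(name, obj):
        nonlocal total
        if isinstance(obj, h5py.Dataset):
            total += obj.id.get_storage_size()
    with h5py.File(path, "r") as f:
        f.visititems(visit)
    return total

for path in ["/path/to/original_gzip.fast5", "/path/to/vbz_then_gzip.fast5"]:  # placeholders
    print(path)
    print("  datasets on disk:", dataset_bytes(path), "bytes")
    print("  whole file      :", os.path.getsize(path), "bytes")
```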