nanoporetech / ont_fast5_api

Oxford Nanopore Technologies fast5 API software

filesize increased reconverting vbz compressed fast5 back to gzip compression #56

Closed biolene closed 3 years ago

biolene commented 3 years ago

Hi

We've been testing ont_fast5_api, and we encounter a recurring issue if we reconvert vbz-compressed fast5 files back to gzip. The gzipped fast5 files that we get after reconversion have a larger size than the original ones, although the compression level is 1 in both cases (according to h5dump results). An example of the commands we run:

Original fast5 file (gzipped): 97.072 MB

Conversion to vbz: compress_fast5 -i /path/to/fast5 --in_place --compression vbz

Conversion back to gzip: compress_fast5 -i /path/to/fast5 --in_place --compression gzip

And then the file size of this final file is 101.155 MB

What may be causing these differences, and is this cause for concern? We want to be sure that we do not lose any data with this compression.

fbrennen commented 3 years ago

Hi @biolene -- there are some other compression options that can affect file size, such as shuffling. Do you see anything else in the compression settings when you use h5dump? Have you tried a comparison without using --in_place?
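If it helps, one way to surface those settings (the filename below is a placeholder) is h5dump's property mode, which prints storage layout and filter settings without dumping the data:

```shell
# Placeholder filename; -p (--properties) shows storage/filter settings,
# -H (--header) suppresses the data itself.
h5dump -p -H reads.fast5 | grep -A 3 FILTERS
```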

Both vbz and gzip are 100% lossless, so you will never lose any data. I'd be happy to look at one of your files to try and identify differences, but I would only expect to find small changes in compression settings.
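As a stdlib-only sketch (not real fast5 data) of why two lossless round-trips can disagree on size: chunking the same bytes differently changes the compressed total even at an identical DEFLATE level, while decompression still recovers every byte exactly.

```python
import zlib

# Synthetic stand-in for a raw signal trace (real fast5 Signal data is
# int16 samples; the exact bytes don't matter for the point being made).
data = bytes(range(256)) * 64  # 16 KiB of repetitive "signal"

# Compress the same bytes once as a single chunk and once as two chunks,
# both at DEFLATE level 1 -- the level both files report in h5dump.
whole = zlib.compress(data, 1)
half = len(data) // 2
first = zlib.compress(data[:half], 1)
second = zlib.compress(data[half:], 1)

# Chunked storage carries per-chunk overhead, so the split total is larger
# even though the codec and level are identical.
print(len(whole), len(first) + len(second))

# Both layouts are 100% lossless: the original bytes come back exactly.
assert zlib.decompress(whole) == data
assert zlib.decompress(first) + zlib.decompress(second) == data
```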

biolene commented 3 years ago

If I don't use --in_place, I end up with a similar result: in that case too, the final gzipped file is somewhat larger than the initial one (in fact exactly 101.155 MB for the example file above, though I haven't tested it on more files).

I don't see a reference to shuffling in the compression settings. Below is a sample of 1) the h5dump output I get for the original fast5 file, and 2) the fast5 file we obtain after reconverting from vbz to gzip. I'm not that familiar with the hdf5 format, so I might be missing something obvious. I also tried to run h5diff to compare both files, but this gave me an error that some objects cannot be compared, so with that I more or less ran out of ideas...

1) DATASET "Signal" {
      DATATYPE H5T_STD_I16LE
      DATASPACE SIMPLE { ( 15208 ) / ( H5S_UNLIMITED ) }
      STORAGE_LAYOUT {
         CHUNKED ( 15208 )
         SIZE 19088 (1.593:1 COMPRESSION)
      }
      FILTERS {
         COMPRESSION DEFLATE { LEVEL 1 }
      }
      FILLVALUE {
         FILL_TIME H5D_FILL_TIME_IFSET
         VALUE H5D_FILL_VALUE_DEFAULT
      }
      ALLOCATION_TIME {
         H5D_ALLOC_TIME_INCR

2) DATASET "Signal" {
      DATATYPE H5T_STD_I16LE
      DATASPACE SIMPLE { ( 10520 ) / ( 10520 ) }
      STORAGE_LAYOUT {
         CHUNKED ( 5260 )
         SIZE 13979 (1.505:1 COMPRESSION)
      }
      FILTERS {
         COMPRESSION DEFLATE { LEVEL 1 }
      }
      FILLVALUE {
         FILL_TIME H5D_FILL_TIME_ALLOC
         VALUE H5D_FILL_VALUE_DEFAULT
      }
      ALLOCATION_TIME {
         H5D_ALLOC_TIME_INCR
      }

fbrennen commented 3 years ago

Hi @biolene -- thanks very much for the detailed answer. Are those h5dump results for the same read_id?

Usually the reason you see "some objects cannot be compared" errors when running h5diff is because hdf5 doesn't know how to deal with vbz-compressed datasets. I'm not sure that applies here, as it's not clear to me if you're comparing two gzipped files or gzip and vbz, but you can solve that particular issue by either:

biolene commented 3 years ago

The h5dump results were not for the same read; I just took two arbitrary snippets from the output. I was comparing two gzipped files, so I guess the issue you mention does not apply. Unfortunately I'm not authorized to send you data (confidentiality reasons). If you have any suggestions on what I could still do or try, please let me know, and thanks in advance for your advice!

fbrennen commented 3 years ago

Hi @biolene -- that's ok, we'll try and work with it. If you have a chance to set HDF5_PLUGIN_PATH and see whether h5diff works, that would be very helpful.
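For reference, a sketch of what that would look like (paths and filenames are placeholders; the exact location of the vbz plugin depends on how it was installed):

```shell
# Point HDF5's command-line tools at the directory holding the vbz
# decompression plugin, then rerun the comparison. Paths are placeholders.
export HDF5_PLUGIN_PATH=/path/to/vbz_plugin
h5diff original.fast5 reconverted.fast5
```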

biolene commented 3 years ago

Hi @fbrennen, I've done as you suggested and specified the HDF5_PLUGIN_PATH, and then ran h5diff again, but still got the 'Some objects are not comparable' message. When I run it with -c to get a list of objects, I get: 'Not comparable: or is of mixed string type'

fbrennen commented 3 years ago

Excellent, that's great progress -- is that the only field that's raising an error? Could you show me what the same file_version field looks like in the two versions, when running h5dump?

biolene commented 3 years ago

file_version field for 'original' gzipped file:

ATTRIBUTE "file_version" { DATATYPE H5T_STRING { STRSIZE 4; STRPAD H5T_STR_NULLTERM; CSET H5T_CSET_ASCII; CTYPE H5T_C_S1; } DATASPACE SCALAR DATA { (0): "1.0" } }

file_version field for the file that was compressed with vbz, and then back to gzip:

ATTRIBUTE "file_version" { DATATYPE H5T_STRING { STRSIZE H5T_VARIABLE; STRPAD H5T_STR_NULLTERM; CSET H5T_CSET_UTF8; CTYPE H5T_C_S1; } DATASPACE SCALAR DATA { (0): "2.0" }

fbrennen commented 3 years ago

Ah, great, ok -- ont_fast5_api always sets the version of the files it writes to the one it knows about (as it doesn't explicitly know what "version 1.0" requires), and it looks like we're also changing the type of the string at the same time (from ASCII to UTF-8).
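A small stdlib illustration of that datatype difference (simplified; the actual HDF5 string types are the ones shown in the h5dump output above):

```python
# The original file stores file_version as a fixed-size (STRSIZE 4),
# null-terminated ASCII string; the rewritten file uses a variable-length
# (STRSIZE H5T_VARIABLE) UTF-8 string. The text content is tiny either way,
# but the HDF5 datatypes differ, which is what h5diff flags.
fixed_ascii = "1.0".encode("ascii") + b"\x00"  # 4 bytes incl. terminator
variable_utf8 = "2.0".encode("utf-8")          # length stored separately

print(len(fixed_ascii), len(variable_utf8))  # 4 3
```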

If that's the only issue you're seeing then I would hope you're happy that, other than the version string, the contents of the files are the same. We'll add a section to the documentation on how to confirm it, for future reference -- this has been very helpful!

biolene commented 3 years ago

Yes, maybe one last thing: if I run h5diff in verbose mode (-v) then I see also two types of warnings popping up, e.g.

group : and
0 differences found
Warning: different storage datatype

attribute: <pore_type of > and <pore_type of >
0 differences found
Warning: different storage datatype

dataset: </read_ffa16da7-9093-4f0f-ad3e-95da27ca37fc/Raw/Signal> and </read_ffa16da7-9093-4f0f-ad3e-95da27ca37fc/Raw/Signal>
Warning: different maximum dimensions
</read_ffa16da7-9093-4f0f-ad3e-95da27ca37fc/Raw/Signal> has max dimensions [18446744073709551615]
</read_ffa16da7-9093-4f0f-ad3e-95da27ca37fc/Raw/Signal> has max dimensions [247985]
0 differences found

But then again it also reports that 0 differences are found, so I should probably not be concerned about it?

fbrennen commented 3 years ago

Yes, that's right -- those are down to some of the slight differences in storage format you posted above, such as using H5S_UNLIMITED for storing Signal data instead of a finite size.
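As a side note on that very large number in the warning: HDF5 represents H5S_UNLIMITED as the all-ones 64-bit unsigned value, i.e. (hsize_t)(-1), which is exactly what h5diff printed:

```python
# H5S_UNLIMITED is (hsize_t)(-1): the maximum unsigned 64-bit value,
# matching the "max dimensions" figure in the h5diff warning above.
H5S_UNLIMITED = 2**64 - 1
print(H5S_UNLIMITED)  # 18446744073709551615
```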

biolene commented 3 years ago

OK, I figured that. But that's great, thanks a million for your support! I'll close the issue now, since it's resolved :)

fbrennen commented 3 years ago

Glad to help, and good to see everything's coming out (almost) as expected. =)