hasindu2008 / slow5tools

Slow5tools is a toolkit for converting (FAST5 <-> SLOW5), compressing, viewing, indexing and manipulating data in SLOW5 format.
https://hasindu2008.github.io/slow5tools
MIT License

Converting back to fast5 increases size #58

Closed DrownedMala closed 2 years ago

DrownedMala commented 2 years ago

Hello there, I was trying the conversion f2s and it all worked out pretty well, but once I tried s2f it generated a folder bigger than the original one:

$ du -sh fast5_1st blow5_1st fast5_again
4.5G    fast5_1st
2.8G    blow5_1st
6.0G    fast5_again

Commands I used:

$ slow5tools f2s fast5_1st/ -d blow5_1st/
$ slow5tools s2f blow5_1st/ -d fast5_again/

I don't know if it's an issue or if it has something to do with compression mechanisms I'm just not aware of; it just felt right to report back! Thanks for any reply, have a good day and keep up the good work!

Simone

hiruna72 commented 2 years ago

Hello @DrownedMala

Thank you for reporting this. I suspect that the original fast5 had ONT's latest compression called vbz (which uses Zstandard and StreamVByte with zig-zag delta encoding). s2f creates gzip-compressed fast5 files by default, which can be larger.
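To give an intuition for why vbz compresses raw signal so well, here is an illustrative sketch (not the actual vbz or slow5tools code) of the delta and zig-zag steps mentioned above. Delta encoding turns a slowly varying signal into small signed differences, and zig-zag mapping turns those into small unsigned integers, which byte-packers like StreamVByte can store in very few bytes before Zstandard entropy-codes the result:

```python
def delta_encode(samples):
    """Replace each sample with its difference from the previous sample."""
    prev = 0
    deltas = []
    for s in samples:
        deltas.append(s - prev)
        prev = s
    return deltas

def zigzag(n):
    """Interleave signed ints into unsigned: 0, -1, 1, -2, 2 -> 0, 1, 2, 3, 4."""
    return (n << 1) if n >= 0 else ((-n) << 1) - 1

def zigzag_decode(z):
    """Invert zigzag()."""
    return (z >> 1) if z % 2 == 0 else -((z + 1) >> 1)

signal = [512, 514, 513, 511, 515]      # toy raw-signal samples (made up)
deltas = delta_encode(signal)           # [512, 2, -1, -2, 4]
packed = [zigzag(d) for d in deltas]    # [1024, 4, 1, 3, 8] -- all small, unsigned
```

A generic gzip/zlib recompression of the raw samples has none of this signal-aware preprocessing, which is one reason the round-tripped fast5 ends up larger.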

Could you please,

  1. Make sure the number of original and s2f-converted files is the same: ls DIR | wc
  2. Use h5stat to check the compression method of fast5 file(s).
$ h5stat a_vbz_compressed.fast5 | grep filter
Dataset filters information:
        NO filter: 0
        GZIP filter: 0
        SHUFFLE filter: 0
        FLETCHER32 filter: 0
        SZIP filter: 0
        NBIT filter: 0
        SCALEOFFSET filter: 0
        USER-DEFINED filter: 100

$ h5stat a_gzip_compressed.fast5 | grep filter
Dataset filters information:
        NO filter: 0
        GZIP filter: 100
        SHUFFLE filter: 0
        FLETCHER32 filter: 0
        SZIP filter: 0
        NBIT filter: 0
        SCALEOFFSET filter: 0
        USER-DEFINED filter: 0
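Check 1 above can be sketched like this (the directory names in the issue are replaced with throwaway temporary directories so the snippet is self-contained; substitute fast5_1st/ and fast5_again/ in practice):

```shell
# Sketch of check 1: compare file counts between two directories.
# mktemp -d dirs stand in for the real fast5_1st/ and fast5_again/.
orig=$(mktemp -d)
conv=$(mktemp -d)
touch "$orig"/read1.fast5 "$orig"/read2.fast5
touch "$conv"/read1.fast5 "$conv"/read2.fast5

n_orig=$(ls "$orig" | wc -l)
n_conv=$(ls "$conv" | wc -l)
if [ "$n_orig" -eq "$n_conv" ]; then
    echo "file counts match: $n_orig"
else
    echo "MISMATCH: $n_orig vs $n_conv"
fi
rm -rf "$orig" "$conv"
```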

Thank you.

Regards,
Hiruna

DrownedMala commented 2 years ago

Yes, thank you for your reply! So, I checked, and the number of files is the same in both directories. As for the compression method, here is what I get:

$ h5stat fast5_1st/FAL46657_19f232d7_0.fast5 | grep filter
Dataset filters information:
        NO filter: 0
        GZIP filter: 0
        SHUFFLE filter: 0
        FLETCHER32 filter: 0
        SZIP filter: 0
        NBIT filter: 0
        SCALEOFFSET filter: 0
        USER-DEFINED filter: 4000

$ h5stat fast5_again/FAL46657_19f232d7_0.fast5 | grep filter
Dataset filters information:
        NO filter: 0
        GZIP filter: 4000
        SHUFFLE filter: 0
        FLETCHER32 filter: 0
        SZIP filter: 0
        NBIT filter: 0
        SCALEOFFSET filter: 0
        USER-DEFINED filter: 0

hasindu2008 commented 2 years ago

Ah yes. Your original data is in 'vbz' compressed format. However, s2f at the moment writes files in zlib-compressed format. That is why the file size gets bigger.

In theory, you can convert any zlib fast5 to vbz fast5 using ONT's compression program. But I highly discourage this as their compression program is buggy and damages fields in their own format (see #59).

Perhaps in the future, we could add an option to slow5tools s2f to directly generate FAST5 in vbz. But at the moment this is not a priority: once data is converted to SLOW5 for archival purposes, the only need for converting back to FAST5 is re-basecalling with Guppy (as Guppy is not open source, we cannot contribute SLOW5 support to it), and for such converted FAST5 the compression format does not matter much, as it is temporary.

DrownedMala commented 2 years ago

I see, thanks for the support!

Good work, keep it up!

Cheers,
Simone