hasindu2008 / slow5tools

Slow5tools is a toolkit for converting (FAST5 <-> SLOW5), compressing, viewing, indexing and manipulating data in SLOW5 format.
https://hasindu2008.github.io/slow5tools
MIT License

how to compare fast5 files #50

Closed RichardCorbett closed 2 years ago

RichardCorbett commented 2 years ago

Hi folks.

I'm very happy to see that you worked on this project and published it. I hope the folks in Oxford pick up some of what you showed and take advantage of these findings.

I am running some tests using version 0.3.0.

I have the following datasets based on a single .fast5 file containing 4000 reads of genomic ONT promethion reads.

| dataset | method | file size (kB) |
| --- | --- | --- |
| 1 | original zlib fast5 | 3027586 |
| 2 | input: 1; f2s, zlib record and svb-zd signal compression | 1839624 |
| 3 | input: 1; f2s, zstd record and svb-zd signal compression | 1771464 |
| 4 | input: 2; s2f | 2701116 |
| 5 | input: 3; s2f | 2701116 |

The .fast5 -> blow5 -> .fast5 round trips in rows 4 and 5 produced the exact same file regardless of the compression parameters, but both files differ in size from the original.
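
For reference, these datasets were produced with commands along these lines (file names are hypothetical, and the flags assume slow5tools v0.3.0, where `-c` sets record compression and `-s` sets signal compression):

```sh
# dataset 2: zlib record compression + svb-zd signal compression
slow5tools f2s reads.fast5 -o ds2.blow5 -c zlib -s svb-zd
# dataset 3: zstd record compression + svb-zd signal compression
slow5tools f2s reads.fast5 -o ds3.blow5 -c zstd -s svb-zd
# datasets 4 and 5: convert back to fast5
slow5tools s2f ds2.blow5 -o ds4.fast5
slow5tools s2f ds3.blow5 -o ds5.fast5
```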

I've been poking around with h5diff to verify that everything in my original fast5 is recovered in the resulting .fast5 after round-tripping through blow5, but I can't get it either to report differences or to confirm that the file contents are the same. Can you share the approach you used to compare .fast5 file contents? I also tried basecalling multiple times with guppy, but the non-deterministic way the basecalling output is produced limits its utility here.

Also, is the size difference between datasets 1, 4, and 5 possibly just due to a difference in compression? I don't see any information describing whether the .fast5 -> blow5 -> .fast5 output file is zlib, vbz, or otherwise compressed.
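
One way to inspect which filters the datasets in a fast5 carry is to dump the object headers with h5dump; zlib typically appears as a DEFLATE filter, while vbz appears as a user-defined filter. A sketch, with a hypothetical file name:

```sh
# -H prints headers only; -p includes dataset creation properties (filters)
h5dump -p -H reads.fast5 | grep -i -A 3 filter
```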

thanks Richard

hiruna72 commented 2 years ago

Hello Richard,

Thank you for trying slow5tools!

There is no direct fast5 comparison method. We run guppy on both the original fast5 and the s2f-converted fast5, and then compare the outputs (fastq and sequencing summary files). This is the ultimate test we do before a release. (https://github.com/hasindu2008/slow5tools/blob/master/test/test_with_guppy.sh)
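
In outline, that comparison can be done like this (file names hypothetical; the `paste`/`sort` idiom keeps each 4-line fastq record intact while sorting by read header):

```sh
sort_fastq() {
    # flatten each 4-line fastq record onto one tab-separated line,
    # sort by the header field, then restore the 4-line layout
    paste - - - - < "$1" | sort -k1,1 | tr '\t' '\n'
}
diff <(sort_fastq guppy_original.fastq) <(sort_fastq guppy_s2f.fastq) \
    && echo "basecalls identical"
```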

While developing, we compare the outputs of f2s and f2s->s2f->f2s. This should be run on a small dataset as it creates ASCII slow5. (https://github.com/hasindu2008/slow5tools/blob/master/test/test_f2s_s2f_integrity.sh)
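
The idea, in a minimal sketch (paths hypothetical; this assumes the output format is inferred from the .slow5 extension and that `-d` names an output directory):

```sh
slow5tools f2s fast5_dir/ -o a.slow5      # ASCII SLOW5 of the original
slow5tools s2f a.slow5 -d fast5_regen/    # reconstruct fast5
slow5tools f2s fast5_regen/ -o b.slow5    # ASCII SLOW5 of the round trip
diff a.slow5 b.slow5 && echo "round trip lossless"
```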

The reason for the size reduction could be the 'unaccounted space' in the HDF5 file format. When a fast5 file is updated after creation, garbage space accumulates (https://docs.hdfgroup.org/hdf5/rfc/FileSpaceManagement.pdf). I assume the original fast5 files have more unaccounted space than the s2f output. You can check it with the following command:

```sh
h5stat -S [fast5 file]
```

For example, a dataset of size ~1.5 TB had ~50 GB of unaccounted space. Anyone who wants to get rid of this unaccounted space has to use h5repack; as you have observed, the same can now be done using s2f!
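
Concretely, with HDF5's own tools (file names hypothetical):

```sh
h5stat -S reads.fast5                       # shows unaccounted space
h5repack reads.fast5 reads.repacked.fast5   # rewrites the file compactly
h5stat -S reads.repacked.fast5              # unaccounted space should drop to ~0
```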

Please let us know how it goes.

Regards, Hiruna

RichardCorbett commented 2 years ago

Wonderful, thanks @hiruna72. I hadn't appreciated that the differences I see in repeated guppy runs are due to the sort order of the reads. After re-sorting, I can confirm that the resulting .fastq files from my original and blow5-cycled fast5 files are the same.

I ran h5stat -S and got the following results:

| Summary of file space information | original fast5 | blow5 default -> fast5 | blow5 vbz -> fast5 |
| --- | --- | --- | --- |
| File metadata | 69768286 | 95031240 | 95031240 |
| Raw data | 2660055579 | 2660055579 | 2660055579 |
| Amount/Percent of tracked free space | 0 | 0 | 0 |
| Unaccounted space | 19923406 | 176 | 176 |
| Total space | 2749747271 | 2755086995 | 2755086995 |
| Disk space | 2.9Gb | 2.6Gb | 2.6Gb |

So if I'm interpreting this correctly, there is ~20 MB of unaccounted space in the original .fast5 file, yet it looks like I'm saving ~300 MB of disk space by cycling through blow5. Does this make sense to you?

hiruna72 commented 2 years ago

Hi Richard,

Thank you for summarizing the results. It is really helpful. I assume the Total space in the table is the sum of the Total spaces of each fast5 file (from h5stat). How did you get the Disk space?

RichardCorbett commented 2 years ago

Hi @hiruna72, to get the disk usage of each file I ran `du -sh`.
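
i.e., something like the following, with hypothetical file names:

```sh
# -s summarizes each argument; -h prints human-readable sizes (e.g., 2.9G)
du -sh original.fast5 s2f_default.fast5 s2f_vbz.fast5
```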

hasindu2008 commented 2 years ago

@RichardCorbett

I am glad that you guys are trying it out and appreciating it. Every new user encourages us to keep improving the tools and to put even more effort into making them better. Thank you.

To add to what @hiruna72 said, HDF5/FAST5 is a very complex format with a number of different storage allocation schemes, and we do not know exactly which parameters ONT's MinKNOW uses, as it is closed source. When generating fast5s, slow5tools uses the default HDF5 allocation scheme, so these file size differences are likely due to differences in those storage allocation parameters. If you convert a dataset of relatively short reads (e.g., cDNA or viral amplicons) to BLOW5, you will see that the space saving becomes even higher and the converted-back fast5 files are significantly smaller than the original FAST5 files.
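
As a sketch of that experiment (directory names hypothetical; this assumes f2s and s2f both accept directory inputs and `-d` for directory output):

```sh
slow5tools f2s short_read_fast5/ -d blow5/ -c zstd -s svb-zd
slow5tools s2f blow5/ -d fast5_regen/
du -sh short_read_fast5/ blow5/ fast5_regen/   # fast5_regen/ should be the smaller fast5 set
```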

Despite the differences in size, the ultimate test is to basecall the original files and the reconverted files using Guppy and see whether the diff passes on sorted fastqs and sorted sequencing summaries. If the diff passes, all the raw signal data has been preserved without loss and we do not need to worry.

If you have any more questions or comments please let us know.