hasindu2008 / slow5tools

Slow5tools is a toolkit for converting (FAST5 <-> SLOW5), compressing, viewing, indexing and manipulating data in SLOW5 format.
https://hasindu2008.github.io/slow5tools
MIT License

data loss #70

Closed jrhendrix closed 2 years ago

jrhendrix commented 2 years ago

I converted a test fast5 file to slow5 and then back to fast5. I thought that by using the --lossless flag the final fast5 would be identical to the original; however, it appears that I am losing data during the conversion.

Is it possible to preserve all of the data from the original fast5?

Commands:

# Convert fast5 to slow5
slow5tools f2s --lossless true -o slow.slow5 original.fast5

# Convert slow5 back to fast5
slow5tools s2f -o new.fast5 slow.slow5

Resulting file sizes:

original.fast5  97M
slow.slow5      116M
new.fast5       61M

Psy-Fer commented 2 years ago

Hey,

Thanks for trying out the tools. Could you please let us know what data is missing?

Some fast5 files contain basecalling data. These will not be kept, as they are not related to signal data.

There are also some data packing/unallocated space issues with existing fast5 files, so even a normal fast5 with no basecalling info will be more bloated space-wise than a fast5 reconverted from slow5.

So looking at space alone doesn't indicate data loss.

Let us know if we can help you look at your example file to investigate further.

James

jrhendrix commented 2 years ago

Hi,

The test file did contain basecalling data, which explains a lot. Thank you!

So looking at space alone doesn't indicate data loss.

Can you suggest a better way to compare the original and twice-converted file?

Thanks

Psy-Fer commented 2 years ago

Both file types consist of metadata and signal-associated data, the majority of which is signal data.

Per read you have a number of primary and auxiliary fields. --lossless means it keeps all the auxiliary fields as well.

We have kept the field names the same, so using something like h5py and pyslow5 you could read both files and compare them.
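For instance, a minimal sketch of that comparison in Python. The comparison itself only needs per-read signal arrays keyed by read_id; the two loader functions are assumptions based on the usual multi-read FAST5 layout and the pyslow5 API, so check each library's docs for your versions:

```python
# Sketch: compare raw signal between a FAST5 file and its SLOW5 counterpart.
# signals_match() works on plain dicts of read_id -> signal samples; the two
# loaders below are illustrative (h5py group names and the pyslow5 seq_reads()
# call are assumptions, not guaranteed APIs).

def signals_match(signals_a, signals_b):
    """Return True if both dicts map the same read_ids to identical signals."""
    if signals_a.keys() != signals_b.keys():
        return False
    return all(list(signals_a[rid]) == list(signals_b[rid]) for rid in signals_a)

def load_fast5_signals(path):
    # Hypothetical loader assuming a multi-read FAST5 layout.
    import h5py
    signals = {}
    with h5py.File(path, "r") as f:
        for read in f.values():
            rid = read["Raw"].attrs["read_id"].decode()
            signals[rid] = read["Raw/Signal"][:].tolist()
    return signals

def load_slow5_signals(path):
    # Hypothetical loader assuming the pyslow5 Open/seq_reads() interface.
    import pyslow5
    s5 = pyslow5.Open(path, "r")
    return {rec["read_id"]: list(rec["signal"]) for rec in s5.seq_reads()}
```

You would then call `signals_match(load_fast5_signals("original.fast5"), load_slow5_signals("slow.slow5"))` and expect True for a lossless conversion.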

An easier way is to just basecall the original, and converted (back and forth) files and compare the fastq files.

@hiruna72 and @hasindu2008 have an example script for this. I'll put it here soon when I'm back at my computer, if they don't beat me to it first.

hasindu2008 commented 2 years ago

@jrhendrix

The ultimate test would be to basecall the original files and the reconverted files using Guppy and see if the diff passes on sorted fastqs and sorted sequencing summaries. If the diff passes, it means all the raw signal data is preserved without loss and we do not need to worry.

Here is a code snippet:

A=dir_guppy_output_from_slow5_input
B=dir_guppy_output_from_fast5_input

# We sort the fastq files based on the read_ids because the output order from guppy is not deterministic
find $A -name '*.fastq' -exec cat {} + | paste - - - -  | sort -k1,1  | tr '\t' '\n' > a.fastq
find $B -name '*.fastq' -exec cat {} + | paste - - - -  | sort -k1,1  | tr '\t' '\n' > b.fastq

diff -q a.fastq b.fastq || echo "Basecalled fastq files differ"

# We strip out the file names and then sort before comparing
cut -f2,3,5- $A/sequencing_summary.txt | sort -k1,1 > a.txt
cut -f2,3,5- $B/sequencing_summary.txt | sort -k1,1 > b.txt

diff -q a.txt b.txt || echo "sequencing summary files differ"

Also, you could refer to #50 for a detailed discussion on the file size differences.

I note that you are explicitly specifying --lossless true, which is not really necessary as the default behaviour of slow5tools is lossless. I also recommend using -o slow.blow5 for archiving and processing purposes, as this is the binary version of SLOW5 with compression; -o slow.slow5 produces the ASCII version, which is meant for human readability. Just like SAM and BAM.

jrhendrix commented 2 years ago

@Psy-Fer @hasindu2008

Running Guppy and comparing the fastq files worked smoothly. Thanks for the code snippet!

I wanted to mention though that the sequencing_summary files did not pass the diff test. When I dug into one of the 'discrepancies', I found that one file had a mean_qscore_template that was 0.000001 higher than the other while everything else was identical. Ugh. I'm not concerned, as this seems like a tiny inconsistency with Guppy, but it might be relevant to your script.
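A tolerance-based comparison could sidestep such tiny float differences when a byte-exact diff is too strict. A minimal Python sketch (the function name and the tolerance value are assumptions, not part of the slow5tools snippet above):

```python
# Sketch: compare two tab-separated sequencing_summary rows, treating numeric
# fields that differ by less than `tol` (e.g. mean_qscore_template off by
# 0.000001) as equal, while requiring exact matches for text fields.
import math

def fields_equal(row_a, row_b, tol=1e-4):
    """Return True if two tab-separated rows match, numerics within `tol`."""
    cols_a = row_a.rstrip("\n").split("\t")
    cols_b = row_b.rstrip("\n").split("\t")
    if len(cols_a) != len(cols_b):
        return False
    for x, y in zip(cols_a, cols_b):
        try:
            # Numeric fields: compare with an absolute tolerance.
            if not math.isclose(float(x), float(y), abs_tol=tol):
                return False
        except ValueError:
            # Non-numeric fields (read_id, pass/fail flags): exact match.
            if x != y:
                return False
    return True
```

Applied line-by-line to the two sorted sequencing summaries, this would flag real discrepancies while ignoring sub-tolerance basecaller jitter.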

Otherwise I'm convinced that the signal data is safe. This is a great tool and I hope ONT picks up on the file format.

Cheers

Psy-Fer commented 2 years ago

Which version of Guppy were you using, and were you using more than one GPU? (we have seen this before...)

jrhendrix commented 2 years ago

I was running Guppy on a CPU-based system (still working on getting access to a GPU).

Guppy version: 4.5.3+

Psy-Fer commented 2 years ago

Ahh interesting.

What happens when you run it twice on the original fast5 file? Do you see the same issue with the sequencing summary file, or does it go away?

(We have seen this issue with 4.* versions of Guppy, but saw it go away in versions 5+.)

jrhendrix commented 2 years ago

When I reran Guppy on the original fast5 file, the same issue arose: the sequencing_summary files were not the same. Again it was the mean_qscore_template varying by 0.000001 or 0.000002.

Glad to hear that it is resolved in versions 5+.

hasindu2008 commented 2 years ago

Great, so that tiny difference has nothing to do with SLOW5 but is something in Guppy. We also saw some Guppy differences like this when doing sanity checks for converted in-house data, and it ultimately turned out to be an issue in Guppy 4 that produced different results on each run in a multi-GPU environment.

Thanks for doing all these tests. As more community members use SLOW5 and see its benefits, ONT will hopefully think seriously about providing it as an option. Even if not, it's not a problem, as we have developed a real-time fast5-to-slow5 conversion script that can be run alongside MinKNOW, so users can have SLOW5 at the end of the sequencing run. The script is currently available in the dev branch: https://github.com/hasindu2008/slow5tools/tree/dev/scripts/realtime-f2s.

jrhendrix commented 2 years ago

that tiny difference is not anything to do with SLOW5, but something with Guppy.

Sounds good! I updated Guppy to v6.0.1 and those slight differences went away.

Thanks for the tool