hasindu2008 / slow5tools

Slow5tools is a toolkit for converting (FAST5 <-> SLOW5), compressing, viewing, indexing and manipulating data in SLOW5 format.
https://hasindu2008.github.io/slow5tools
MIT License

Error encountered during repeated f2s #96

Closed hengjwj closed 1 year ago

hengjwj commented 1 year ago

Hi @hasindu2008,

This is an atypical use case, but I was curious to see whether conversion between FAST5 and SLOW5 is truly lossless. The plan was to compare the MD5SUMs of the FAST5s with and without conversion, so I converted FAST5 to SLOW5 and back as follows:

# iteration 1
slow5tools f2s -p $numcore $dir_fast5 -d $dir_blow5
slow5tools merge -t $numcore $dir_blow5 -o $blow5 && rm -rf $dir_blow5
slow5tools split -g -t $numcore $blow5 -d $dir_blow5_new
slow5tools s2f -p $numcore $dir_blow5_new -d $dir_fast5_new

# iteration 2
slow5tools f2s -p $numcore $dir_fast5_new -d $dir_blow5_new2

No warnings or errors came up in iteration 1, but this came up in iteration 2 and I'm not sure why:

[list_all_items] Looking for '*.fast5' files in fast51
[f2s_main] 2 fast5 files found - took 0.002s
[f2s_main] Just before forking, peak RAM = 0.000 GB
[f2s_iop] 2 proceses will be used.
[f2s_iop] Spawning 2 I/O processes to circumvent HDF hell.
[add_aux_slow5_attribute::ERROR] Could not set the slow5 record auxiliary attribute 'Raw/end_reason' in fast51/xen_s9_r1_50k_0.fast5
[fast5_attribute_itr::ERROR] Could not add the auxiliary attribute Raw/end_reason in fast51/xen_s9_r1_50k_0.fast5 to the slow5 record
[read_fast5::ERROR] Bad fast5: Could not iterate over the read groups in the fast5 file fast51/xen_s9_r1_50k_0.fast5.
[f2s_child_worker::ERROR] Bad fast5: Could not read contents of the fast5 file 'fast51/xen_s9_r1_50k_0.fast5'.
[f2s_child_worker::INFO] Summary - total fast5: 1, bad fast5: 0
[f2s_iop] Child process 378491 exited with status=1.

Do you see anything wrong with my code? I'm using slow5tools 1.0.0.

Joel

Psy-Fer commented 1 year ago

Hey there,

MD5SUM won't match, as there are some other things going on that don't have anything to do with the actual data.

For more info on that, please see issues https://github.com/hasindu2008/slow5tools/issues/70, https://github.com/hasindu2008/slow5tools/issues/76, and https://github.com/hasindu2008/slow5tools/issues/58

It's better to basecall the resulting files, then compare the fastq files. You can see how to do this here https://hasindu2008.github.io/slow5tools/archive.html

As for the error above, I'll let @hasindu2008 look into that.

Cheers, James

hasindu2008 commented 1 year ago

Is this the same dataset as yesterday or is this something different? I would like to reproduce it on my side, as that error should not have happened.

hengjwj commented 1 year ago

Thanks for the quick responses!

It's a different dataset (generated by our lab) that can be found in this Code Ocean capsule under /data/xen_s9_r1_50k: https://codeocean.com/capsule/4038948/tree/v1

Joel

hengjwj commented 1 year ago

Thanks for the tip!


hasindu2008 commented 1 year ago

Hi @hengjwj

I was able to reproduce the error. It is a bug when writing FAST5 files in the s2f module. The attribute called end_reason is not present in your original FAST5 files (ONT introduced it about 2 years ago). This is handled properly when converting FAST5->S/BLOW5. However, when doing S/BLOW5->FAST5, an end_reason with a value of -1 gets written to the FAST5, which should not be the case. That is why the newly created FAST5 files fail to convert. If you try to basecall the newly created FAST5 files in fast51/ using Guppy, I think it will still succeed.
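The difference between the buggy and fixed behaviour described above can be sketched in Python (illustrative only; the actual slow5tools implementation is in C, and the function and field names here are invented for the example):

```python
def s2f_attrs_buggy(record: dict) -> dict:
    # Buggy behaviour: always write end_reason, inventing a -1 sentinel
    # when the original FAST5 never had the attribute.
    return {"end_reason": record.get("end_reason", -1)}

def s2f_attrs_fixed(record: dict) -> dict:
    # Fixed behaviour: only write end_reason when it was actually present,
    # so FAST5s predating the attribute round-trip cleanly.
    return {"end_reason": record["end_reason"]} if "end_reason" in record else {}

old_record = {"read_id": "r1"}  # a read from a FAST5 predating end_reason
assert s2f_attrs_buggy(old_record) == {"end_reason": -1}  # breaks re-conversion
assert s2f_attrs_fixed(old_record) == {}                  # attribute omitted
```

The invented -1 sentinel is what tripped up the second f2s pass: it saw an end_reason attribute that the format did not expect for these files.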

@hiruna72 could you please fix this in the dev branch or a separate branch? Thanks @hengjwj for finding this, and apologies for the trouble.

hasindu2008 commented 1 year ago

@hengjwj The error has been fixed in the latest dev branch. For your convenience, I am attaching binaries compiled on my laptop, which are likely to work on yours. Could you run the "s2f" command in your pipeline using this latest version? For the other commands, you can continue to use v1.0.0, which is well tested. slow5tools-v1.0.0-16-g39e3f5d-x86_64-linux-binaries.tar.gz

But note that, as @Psy-Fer said, md5sum will not work for comparing FAST5 files with the same content, as explained in Q10 at https://hasindu2008.github.io/slow5tools/faq.html. The ultimate test is to basecall and compare (as in the "Sanity check by basecalling" section under https://hasindu2008.github.io/slow5tools/archive.html, which is more involved than a simple diff).

hasindu2008 commented 1 year ago

@hengjwj

Did the above suggestion work? Also, have you downloaded the BLOW5 file from AWS (associated with your other issue)? If so, I could delete it.

hengjwj commented 1 year ago

Hi @hasindu2008,

I tried using your compiled slow5tools (slow5tools 1.0.0-dirty) to do FAST5->BLOW5->FAST5->guppy as follows:

slow5tools f2s -p 16 -d blow5 fast5/
slow5tools merge -t 16 blow5/ -o blow51.blow5
slow5tools split -g -d blow51_split -t 16 blow51.blow5
slow5tools s2f -p 16 -d fast52 blow51_split/
guppy_basecaller -i fast52 -s guppy_f2s2f --flowcell FLO-MIN106 --kit SQK-RNA002 -x cuda:0 -r --disable_qscore_filtering

However, I encountered multiple similar-looking errors. Here's a snippet:

[guppy/warning] data_source::FileDataHandler1D::load_read: Error loading read "a2a8038f-b9d5-4335-bc10-a6cc7a4176e2" from file "fast52/blow51_1.fast5": Fast5 read file is invalid. Raw data field 'median_before' has wrong type.
[guppy/warning] data_source::FileDataHandler1D::load_read: Error loading read "ba1ebed6-d838-4b83-a7aa-f1958e5e536c" from file "fast52/blow51_1.fast5": Fast5 read file is invalid. Raw data field 'median_before' has wrong type.
[guppy/warning] data_source::FileDataHandler1D::load_read: Error loading read "c0e8d976-544e-4c5c-9bb5-ea3b24801b29" from file "fast52/blow51_1.fast5": Fast5 read file is invalid. Raw data field 'median_before' has wrong type.
[guppy/warning] data_source::FileDataHandler1D::load_read: Error loading read "c531897a-eae0-453b-a9b0-63bde12ad744" from file "fast52/blow51_1.fast5": Fast5 read file is invalid. Raw data field 'median_before' has wrong type.

Is the split by read group (-g) ok or should it be split some other way?

Joel

hengjwj commented 1 year ago


Hi Hasindu, I've downloaded it. Thanks again!

hengjwj commented 1 year ago

Sorry, just saw the "Sanity check by basecalling" section has a specific way of splitting. Let me give that a go and get back to you.

hengjwj commented 1 year ago

Splitting by read group and then into 4000 reads/file produces a similar error to when I split by read group only:

[guppy/warning] data_source::FileDataHandler1D::load_read: Error loading read "216ac7cf-b561-439c-ab9b-c9e1359cce66" from file "s2f_fast5/blow51_0_6.fast5": Fast5 read file is invalid. Raw data field 'median_before' has wrong type.
[guppy/warning] data_source::FileDataHandler1D::load_read: Error loading read "248e7c3b-6e85-4174-878e-9edf9e4b459f" from file "s2f_fast5/blow51_0_6.fast5": Fast5 read file is invalid. Raw data field 'median_before' has wrong type.
[guppy/warning] data_source::FileDataHandler1D::load_read: Error loading read "2c6be16e-0a9e-4064-85d9-73dfedeba2eb" from file "s2f_fast5/blow51_0_6.fast5": Fast5 read file is invalid. Raw data field 'median_before' has wrong type.

hiruna72 commented 1 year ago

Hello @hengjwj,

What is the guppy version you are using?

hengjwj commented 1 year ago

Hi @hiruna72,

I'm using Guppy 5.0.11.

Joel

hasindu2008 commented 1 year ago

@hengjwj It seems like median_before in some of the reads in the original FAST5s is actually set to "nan". It should work with this latest fix - would you mind testing it?
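To illustrate the type issue behind Guppy's "wrong type" warning (a hedged sketch, not the actual HDF5-writing code in slow5tools): median_before is a floating-point field, so a NaN value must still be stored as a float; serialising it as text or a sentinel changes the field's type.

```python
import math
import struct

def encode_median_before(value: float) -> bytes:
    # Keep the IEEE-754 double representation even when the value is NaN,
    # so the field's type never changes based on the value.
    return struct.pack("<d", value)

def decode_median_before(raw: bytes) -> float:
    return struct.unpack("<d", raw)[0]

round_tripped = decode_median_before(encode_median_before(float("nan")))
assert math.isnan(round_tripped)  # NaN survives, and the type stays float
```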

slow5tools-v1.0.0-22-g499683b-x86_64-linux-binaries.tar.gz

Apologies for taking your time, but this greatly helps us improve the tool for these old FAST5s.

hengjwj commented 1 year ago

Hey @hasindu2008,

Happy to help! Lemme give it a go.

Joel

hengjwj commented 1 year ago

Hi @hasindu2008,

It appears to be working on this small test dataset. Here's a summary of what I did:

#slow5tools to do FAST5->SLOW5->FAST5 conversion
~/bin/slow5tools-v1.0.0-22-g499683b/slow5tools f2s -p 16 -d blow51_new fast5/
~/bin/slow5tools-v1.0.0-22-g499683b/slow5tools merge -t 16 -o blow51.blow5 blow51_new
~/bin/slow5tools-v1.0.0-22-g499683b/slow5tools split blow51.blow5 -d split_groups_tmp/ -g -t 16
~/bin/slow5tools-v1.0.0-22-g499683b/slow5tools split split_groups_tmp/ -d split_reads_tmp/ -r 4000 -t 16
rm -r split_groups_tmp/
~/bin/slow5tools-v1.0.0-22-g499683b/slow5tools s2f split_reads_tmp -d s2f_fast5/ -p 16
rm -r split_reads_tmp/

#basecall newly re-generated FAST5
guppy_basecaller -i s2f_fast5/ -s guppy_s2f --flowcell FLO-MIN106 --kit SQK-RNA002 -x cuda:0 -r --disable_qscore_filtering

#compare basecalled FAST5 from earlier with newly basecalled FAST5; from https://hasindu2008.github.io/slow5tools/archive.html
A=guppy_fast5
B=guppy_s2f
die() { echo "$1" >&2; exit 1; }  # helper not defined in the archive snippet

test -e $A || die "$A not present."
test -e $B || die "$B not present."

find $A -name '*.fastq' -exec cat {} + | paste - - - -  | LC_ALL=C sort -k1,1  | tr '\t' '\n' > a.fastq
find $B -name '*.fastq' -exec cat {} + | paste - - - -  | LC_ALL=C sort -k1,1  | tr '\t' '\n' > b.fastq

diff -q a.fastq b.fastq || { echo "Basecalls differ"; exit 1; }

cut -f2,3,5- $A/sequencing_summary.txt | LC_ALL=C sort -k1  > a.txt
cut -f2,3,5- $B/sequencing_summary.txt | LC_ALL=C sort -k1  > b.txt

diff -q a.txt b.txt  || { echo "sequencing summary files differ"; exit 1; }

None of the failure messages printed. Additionally, the file sizes of a.fastq and b.fastq match, as do those of a.txt and b.txt. Awesome!

For existing BLOW5 files, do they need to be regenerated using slow5tools-v1.0.0-22-g499683b, or did the previous errors only affect the split and s2f parts of the program?

Joel

Psy-Fer commented 1 year ago

Hey, nice work.

Yea, it was in how the field was handled in s2f. Any data previously converted with f2s should be fine.

Basically, when the value is NaN or a similar null value, we use "." in SLOW5. When converting back, we didn't handle that correctly and Guppy got mad (Guppy should have been coded to handle that better, but we can't control that).
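The convention described above can be sketched in Python (illustrative only; the real implementation is C inside slow5tools, and these function names are invented for the example):

```python
import math

def float_to_slow5(value: float) -> str:
    # SLOW5 text encoding: a missing/NaN float auxiliary field is written as ".".
    return "." if math.isnan(value) else repr(value)

def slow5_to_float(field: str) -> float:
    # Decoding must map "." back to NaN; mishandling this step is what
    # produced the wrong-typed median_before in the regenerated FAST5s.
    return float("nan") if field == "." else float(field)

assert float_to_slow5(slow5_to_float(".")) == "."  # "." round-trips cleanly
assert slow5_to_float("3.5") == 3.5                # normal values unchanged
```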

Thanks so much for your patience and working with us to sort this out. We really appreciate it.

Cheers, James

hasindu2008 commented 1 year ago

Sweet! To add to what @Psy-Fer said, f2s, merge and split all work well in the previous version, and there is no need for reconversions. In fact, I suggest sticking to that old release version for the actual FAST5->BLOW5 conversion, as it is tested on many datasets. It is only for converting back to FAST5 (s2f) that you need the development version.

Also, it is possible to basecall directly from BLOW5 through Guppy using the buttery-eel wrapper (https://github.com/Psy-Fer/buttery-eel/), which is in fact much more efficient and faster [https://academic.oup.com/bioinformatics/article/39/6/btad352/7186500]. We always basecall directly from BLOW5 without needing to convert back to FAST5, which is one of the reasons this s2f program's anomalies were not exposed before. Note that, due to differences in the metadata tags output by buttery-eel vs Guppy, the above comparison method will not work. How I compare buttery-eel with Guppy is with a method like https://github.com/hasindu2008/biorand/blob/master/bin/identityrna.sh, which computes identity score stats by mapping reads using minimap2. I run this script separately on the FASTQ output by Guppy and the FASTQ output by buttery-eel, then check whether the stats are similar.
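The gist of that comparison can be sketched as follows (a simplified Python illustration, not the identityrna.sh script itself): minimap2's PAF output reports residue matches in column 10 and alignment block length in column 11, and their ratio gives a per-read identity that can be summarised for each basecaller and compared.

```python
def paf_identity(paf_line: str) -> float:
    # PAF columns (1-based): 10 = number of residue matches,
    # 11 = alignment block length.
    cols = paf_line.rstrip("\n").split("\t")
    matches, block_len = int(cols[9]), int(cols[10])
    return matches / block_len

# A made-up PAF line for illustration:
line = "read1\t1000\t5\t995\t+\tref\t5000\t100\t1100\t950\t1000\t60"
assert abs(paf_identity(line) - 0.95) < 1e-9
```

Running this over the PAF from each FASTQ and comparing the identity distributions (mean, median, etc.) gives a content-level check that survives metadata differences.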

Thank you @Psy-Fer and @hiruna72 for helping to solve this issue promptly.

hengjwj commented 1 year ago

@hasindu2008,

I tried buttery-eel on our computer cluster but couldn't get the basecall server to work, so I decided to just basecall the FAST5s first and then convert to BLOW5 for f5c and storage. Thanks for the suggestion nonetheless, and also for refining slow5tools! I'll close this issue. Thanks again!

Joel

Psy-Fer commented 1 year ago

Hey Joel,

If you have any issues with buttery-eel, feel free to create an issue over on the repo. I'll be happy to help you there.

Cheers, James

hengjwj commented 1 year ago

Hey James, thanks for offering! Maybe I'll give it another go in the future.

Joel

hasindu2008 commented 1 year ago

Thanks - this error report could be quite useful for improving buttery-eel. It works well on the two HPC systems we have access to, but your environment could be slightly different. It would be good to get it working on as many HPC systems as possible.