hasindu2008 / slow5tools

Slow5tools is a toolkit for converting (FAST5 <-> SLOW5), compressing, viewing, indexing and manipulating data in SLOW5 format.
https://hasindu2008.github.io/slow5tools
MIT License

slow5tools f2s returns different bad reads when run on all single FAST5s vs. when run on an individual folder of single FAST5s #99

Closed hengjwj closed 1 year ago

hengjwj commented 1 year ago

Hi again @hasindu2008!

After converting two problematic multi-FAST5 files to single-FAST5s, I attempted f2s like so:

multi_to_single_fast5 -t 16 -i q/ -s q_single
slow5tools f2s -p 16 q_single -d q_single_f2s

This then returned some errors naming bad reads like so:

[fast5_attribute_itr::ERROR] Ancient fast5: Different run_ids found in an individual multi-fast5 file. Cannot create a single header slow5/blow5. Consider --allow option.
[read_fast5::ERROR] Bad fast5: Could not iterate over the read groups in the fast5 file q_single//0/0002adf6-54fa-4f1f-a070-48b320d3e82b.fast5.
[f2s_child_worker::ERROR] Could not read contents of the fast5 file 'q_single//0/0002adf6-54fa-4f1f-a070-48b320d3e82b.fast5'.

I removed the named bad read and repeated the conversion several times, removing the newly named bad read each time. After removing 10 reads in total from the folder "q_single/1", I wondered whether the same bad reads would be named if f2s were run on individual folders (the error above is for folder "q_single/0").

When I ran f2s on just the "1" folder to speed things up (since the bad reads only came from "1"), it seemed to work after removing only one bad read:

multi_to_single_fast5 -t 16 -i q/ -s q_single2
slow5tools f2s -p 16 q_single2/1 -d q_single2_f2s_1

[search_and_warn::WARNING] slow5tools-v1.0.0: Weird or ancient fast5: converting the attribute /file_version from H5T_IEEE_F64LE to string for consitency. This warning is suppressed now onwards.
...
[search_and_warn::WARNING] slow5tools-v1.0.0: Weird or ancient fast5: converting the attribute /file_version from H5T_IEEE_F64LE to string for consitency. This warning is suppressed now onwards.
[f2s_child_worker::INFO] Summary - total fast5: 249, bad fast5: 0
[f2s_child_worker::INFO] Summary - total fast5: 250, bad fast5: 0
[f2s_child_worker::INFO] Summary - total fast5: 250, bad fast5: 0
[f2s_child_worker::INFO] Summary - total fast5: 250, bad fast5: 0
[f2s_child_worker::INFO] Summary - total fast5: 250, bad fast5: 0
[f2s_child_worker::INFO] Summary - total fast5: 250, bad fast5: 0
[f2s_child_worker::INFO] Summary - total fast5: 250, bad fast5: 0
[f2s_child_worker::INFO] Summary - total fast5: 250, bad fast5: 0
[f2s_child_worker::INFO] Summary - total fast5: 250, bad fast5: 0
[f2s_child_worker::INFO] Summary - total fast5: 250, bad fast5: 0
[f2s_child_worker::INFO] Summary - total fast5: 250, bad fast5: 0
[f2s_child_worker::INFO] Summary - total fast5: 250, bad fast5: 0
[f2s_child_worker::INFO] Summary - total fast5: 250, bad fast5: 0
[f2s_child_worker::INFO] Summary - total fast5: 250, bad fast5: 0
[f2s_child_worker::INFO] Summary - total fast5: 250, bad fast5: 0
[f2s_child_worker::INFO] Summary - total fast5: 250, bad fast5: 0
[f2s_main] Converting 3999 fast5 files took 2.304s
[f2s_main] Children processes: CPU time = 28.997 sec | peak RAM = 0.026 GB

[main] cmd: slow5tools f2s -p 16 q_single2/1 -d q_single_f2s_1

Does that mean the bad reads named previously may not be bad reads after all?

Edit: I used slow5tools 1.0.0 for this.

Psy-Fer commented 1 year ago

Hey,

Hasindu is in San Fran at a conference so might be a bit slow answering.

Basically, we take the metadata from the first read encountered, then compare incoming reads to see if there are differences that can't be rectified into a stable output.

So when you remove the "bad reads", those reads might actually be consistent with each other ("bad" in the same way), in which case we can still make a stable format out of them.

You can probably merge all groups of good and bad reads afterwards if you can convert them, as once it's in slow5, it's all compatible using our read group method.

Also consider using the --allow flag if it works with what you are trying to do.

Getting slow5tools to be compatible with the many, many, many changes ONT made to FAST5 was probably the most time-consuming and complicated part of building slow5tools, but also one of the main reasons we made SLOW5.

Does that help with some explanation? I can go more into specifics if you need to.

Cheers, James

hengjwj commented 1 year ago

Hey James,

Thanks for the explanation.

For sequencing runs with problematic multi-read FAST5s, would it be better if f2s processed single-read FAST5s (generated with multi_to_single_fast5) in smaller batches (e.g. by the generated folders, so that "1/" -> "1.blow5")? That is, instead of taking the data from the first read encountered and using it as the reference for all incoming reads, how about using the first read within each batch as the reference for subsequent reads in that batch, so that fewer reads are lost?

Aside: a dumb question. I'm not sure what "run_id" does. What was the rationale for making the --allow flag optional instead of on by default?

Appreciate the time and effort you guys put into optimising this for the community!

Joel

Psy-Fer commented 1 year ago

Hey,

So run_id is the unique ID given to each run. Since each run will have its own global metadata, we add this warning to alert the user to a mixed dataset. Even the same flowcell, if the run was restarted or stopped and started again, will produce a new run_id.

The --allow option will just use the metadata from the first encountered read as the default for the whole run, essentially losing that mixed information.

If you want to keep that info, you would need to split your fast5s by run_id, then convert those batches, then you can merge them at the end, and inside the slow5 file you will have multiple read groups that contain the various meta data for each run_id.
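That split-by-run_id workflow can be sketched as a dry run (all directory and run_id names below are hypothetical, and the commands are printed rather than executed, since the real run depends on your data):

```shell
# Dry-run sketch of the split-by-run_id workflow (hypothetical names):
# one folder of FAST5s per run_id, each converted to its own BLOW5,
# then merged so that every run_id becomes a read group in one file.
plan=""
for rid in runid_A runid_B runid_C; do
    plan="${plan}slow5tools f2s fast5_by_runid/$rid -d blow5/$rid -p 16
"
done
plan="${plan}slow5tools merge blow5 -o all_runs.blow5 -t 16"
printf '%s\n' "$plan"
```

The `merge` at the end is what preserves the per-run metadata: each input BLOW5 becomes its own read group in the merged file.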

We didn't make --allow the default because conversion usually happens on a per-run basis, and merge should be used to mix runs, rather than the conversion step.

Does that make sense?

I'm about to drive 3h to a conference in Canberra, but I'll have a think about a simple way to pre split the reads by run_id on the drive to help. If you come up with a way though let us know.

James

hasindu2008 commented 1 year ago

@hengjwj

Sorry for the delayed response; I was busy with a talk at a conference. Actually, this error about "Different run_ids found in an individual multi-fast5 file" is different from the one you encountered in #95. What we encountered in the previous issue was "[read_dataset::ERROR] Failed to read raw data from dataset Signal.". The fix I mentioned there is only suitable for that kind of error; sorry for not being clearer about this.

The error you see here can be fixed in a different way. It happens due to run_ids being mixed (as @Psy-Fer said), in most cases when someone has used ONT's tool to pack single-FAST5s into multi-FAST5s. For this, I wrote a script some time ago to handle the issue: https://github.com/hasindu2008/slow5tools/blob/master/scripts/mixed-multi-fast5-to-blow5.sh

You can call it as mixed-multi-fast5-to-blow5.sh <multi-fast5 directory>. Note that the script is NOT efficient, but it does the job without losing any reads or run_id information. Could you launch the script and let me know how it goes? You can look at the comments in the script if you would like to understand how it works; if it is not clear, I can write up an explanation.

hasindu2008 commented 1 year ago

Also, about the suggestion of taking the first read in each batch as the reference: some sort of batching like this was implemented by @hiruna72, but I do not remember the details. @hiruna72, can you explain how you implemented the handling of mixed datasets?

hiruna72 commented 1 year ago

Hello @hengjwj,

As @Psy-Fer mentioned handling mixed datasets is tricky.

If there are N multi-fast5 files and M single-fast5 files to be converted by a process, N blow5s will be created (one for each multi-fast5). The run_id of the first read of a multi-fast5 is checked against the rest of the reads in that file; if there is a mismatch, an error is thrown. If the user still wants to convert, they can supply --allow. Note that it is okay for different multi-fast5 files to have different run_ids, just not within the same file.

However, the M single-fast5 files will be converted into a single blow5 file. Similarly, the run_id of the first file is set in the blow5 header, and the rest of the single-fast5 files must have the same run_id; if not, an error is thrown (again, supply --allow to continue past it). The reason for not creating a blow5 for each single-fast5 file, as is done in the multi-fast5 case, is to support the common scenario where the user converts a folder of single-fast5 files (with the same run_id) to one blow5 file. This skips the merge/cat step and is faster, as we open and close a limited number of files.

In total N + 1 blow5 files will be created by the process.
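As a toy model of that N + 1 rule (the `*_multi.fast5` naming convention below is invented purely for illustration): two multi-FAST5 inputs plus any number of single-FAST5s handled by one process yield two per-file BLOW5s plus one shared BLOW5 for all the singles.

```shell
# Toy model of the N + 1 output rule described above; file names invented.
outputs_for() {
    for f in "$@"; do
        case "$f" in
            *_multi.fast5) echo "${f%_multi.fast5}.blow5" ;;  # each multi-fast5 -> its own blow5
        esac
    done
    echo "singles.blow5"   # all single-fast5s in the batch -> one shared blow5
}
outputs_for a_multi.fast5 b_multi.fast5 read1.fast5 read2.fast5
# -> a.blow5, b.blow5, singles.blow5 (N = 2, plus 1)
```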

To solve the issue you have, I suggest using the script Hasindu mentioned above. Please let us know how it goes.

Thank you.

hengjwj commented 1 year ago

Hi all,

Sorry for the delay in replying. I ran the script and got the following back:

(np) [user@hpc-amd004 HEK293T-WT-rep3]$ sh mixed-multi-fast5-to-blow5.sh fast5/
4.1.1
h5dump: Version 1.12.2
GNU parallel 20211222
Copyright (C) 2007-2021 Ole Tange, http://ole.tange.dk and Free Software
Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <https://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
GNU parallel comes with no warranty.

Web site: https://www.gnu.org/software/parallel

When using programs that use GNU Parallel to process data for publication
please cite as described in 'parallel --citation'.
slow5tools 1.0.0

[main] cmd: slow5tools --version
[main] real time = 0.000 sec | CPU time = 0.004 sec | peak RAM = 0.001 GB
| 134 of 134|##############################################################################################################################|100% Time: 0:06:10
Converting tmp_fast5/4f128f30d8e1d613c3f5204b914f67830bb9276f
[list_all_items] Looking for '*.fast5' files in tmp_fast5/4f128f30d8e1d613c3f5204b914f67830bb9276f/
[f2s_main] 250055 fast5 files found - took 49.204s
[f2s_main] Just before forking, peak RAM = 0.000 GB
[f2s_iop] 16 proceses will be used.
[f2s_iop] Spawning 16 I/O processes to circumvent HDF hell.
[search_and_warn::WARNING] slow5tools-v1.0.0: Weird or ancient fast5: converting the attribute /file_version from H5T_IEEE_F64LE to string for consitency.
...
[search_and_warn::WARNING] slow5tools-v1.0.0: Weird or ancient fast5: converting the attribute /file_version from H5T_IEEE_F64LE to string for consitency.
[read_dataset::ERROR] Failed to read raw data from dataset Signal.
[read_fast5::ERROR] Bad fast5: Could not iterate over the read groups in the fast5 file tmp_fast5/4f128f30d8e1d613c3f5204b914f67830bb9276f//9fce1221-de7c-478f-b74a-ba444a17528e.fast5.
[f2s_child_worker::ERROR] Could not read contents of the fast5 file 'tmp_fast5/4f128f30d8e1d613c3f5204b914f67830bb9276f//9fce1221-de7c-478f-b74a-ba444a17528e.fast5'.
[f2s_child_worker::INFO] Summary - total fast5: 15629, bad fast5: 0
[f2s_child_worker::INFO] Summary - total fast5: 15629, bad fast5: 0
[f2s_child_worker::INFO] Summary - total fast5: 15629, bad fast5: 0
[f2s_child_worker::INFO] Summary - total fast5: 15629, bad fast5: 0
[f2s_child_worker::INFO] Summary - total fast5: 15620, bad fast5: 0
[f2s_child_worker::INFO] Summary - total fast5: 15629, bad fast5: 0
[f2s_child_worker::INFO] Summary - total fast5: 15629, bad fast5: 0
[f2s_child_worker::INFO] Summary - total fast5: 15629, bad fast5: 0
[f2s_child_worker::INFO] Summary - total fast5: 15629, bad fast5: 0
[f2s_child_worker::INFO] Summary - total fast5: 15629, bad fast5: 0
[f2s_child_worker::INFO] Summary - total fast5: 15629, bad fast5: 0
[f2s_child_worker::INFO] Summary - total fast5: 15629, bad fast5: 0
[f2s_child_worker::INFO] Summary - total fast5: 15629, bad fast5: 0
[f2s_child_worker::INFO] Summary - total fast5: 15629, bad fast5: 0
[f2s_iop] Child process 14421 exited with status=1.
Converting tmp_fast5/4f128f30d8e1d613c3f5204b914f67830bb9276f to BLOW5 failed

(np) [user@hpc-amd004 HEK293T-WT-rep3]$ [f2s_child_worker::INFO] Summary - total fast5: 15629, bad fast5: 0

FAST5 from here: ftp://ftp.sra.ebi.ac.uk/vol1/run/ERR470/ERR4706307/HEK293T-WT-rep3.tar.gz

Joel

hasindu2008 commented 1 year ago

Sorry for the late response. I have been on flights with no Internet access.

So I downloaded your dataset and here is how I solved it:

  1. run mixed-multi-fast5-to-blow5.sh fast5/ and I faced the same issue as you:

    [read_dataset::ERROR] Failed to read raw data from dataset Signal.
    [read_fast5::ERROR] Bad fast5: Could not iterate over the read groups in the fast5 file tmp_fast5/4f128f30d8e1d613c3f5204b914f67830bb9276f//9fce1221-de
    7c-478f-b74a-ba444a17528e.fast5.
    [f2s_child_worker::ERROR] Could not read contents of the fast5 file 'tmp_fast5/4f128f30d8e1d613c3f5204b914f67830bb9276f//9fce1221-de7c-478f-b74a-ba444a17528e.fast5'.
  2. The above error message starts with [read_dataset::ERROR] Failed to read raw data from dataset Signal., which means the read tmp_fast5/4f128f30d8e1d613c3f5204b914f67830bb9276f//9fce1221-de7c-478f-b74a-ba444a17528e.fast5 is fully corrupted and must be removed:

    rm tmp_fast5/4f128f30d8e1d613c3f5204b914f67830bb9276f//9fce1221-de7c-478f-b74a-ba444a17528e.fast5
  3. Now I look at the tmp_fast5/:

    ls tmp_fast5/
    4f128f30d8e1d613c3f5204b914f67830bb9276f  6250ed52a40a7a70570d4d6f9db4dd00ea40f485  e9dce33cf98e372a4f6d3f839b59a70a8f7ae3bb
    592a61243d7421ac90e6a183645d5dbf16fc888f  771a38df9d9b678a485923dc79a6a5d5d59c115c

    There are 5 run_ids, and tmp_fast5/4f128f30d8e1d613c3f5204b914f67830bb9276f//9fce1221-de7c-478f-b74a-ba444a17528e.fast5 is in the first run_id. We remove the failed BLOW5 conversion:

    rm -r tmp_blow5/4f128f30d8e1d613c3f5204b914f67830bb9276f/
  4. Now that we have deleted the offending FAST5 file in step 2, let us manually launch a conversion on that runID:

slow5tools f2s tmp_fast5/4f128f30d8e1d613c3f5204b914f67830bb9276f -d tmp_blow5/4f128f30d8e1d613c3f5204b914f67830bb9276f -p40

This succeeded

  5. Now launch the conversion for the next run_ids:

    slow5tools f2s tmp_fast5/6250ed52a40a7a70570d4d6f9db4dd00ea40f485 -d tmp_blow5/6250ed52a40a7a70570d4d6f9db4dd00ea40f485 -p40
    slow5tools f2s tmp_fast5/e9dce33cf98e372a4f6d3f839b59a70a8f7ae3bb -d tmp_blow5/e9dce33cf98e372a4f6d3f839b59a70a8f7ae3bb -p40
    slow5tools f2s tmp_fast5/592a61243d7421ac90e6a183645d5dbf16fc888f -d tmp_blow5/592a61243d7421ac90e6a183645d5dbf16fc888f -p40
    slow5tools f2s tmp_fast5/771a38df9d9b678a485923dc79a6a5d5d59c115c -d tmp_blow5/771a38df9d9b678a485923dc79a6a5d5d59c115c -p40
  6. In run ID 592a61243d7421ac90e6a183645d5dbf16fc888f, two reads failed with the same "Failed to read raw data from dataset Signal." error, so I manually deleted them and repeated the conversion for that run_id:

rm tmp_fast5/592a61243d7421ac90e6a183645d5dbf16fc888f/79e10211-bb53-4f7d-ab66-d63affca6cc3.fast5
rm tmp_fast5/592a61243d7421ac90e6a183645d5dbf16fc888f/79e10211-bb53-4f7d-ab66-d63affca6cc3.fast5

rm -r tmp_blow5/592a61243d7421ac90e6a183645d5dbf16fc888f
slow5tools f2s tmp_fast5/592a61243d7421ac90e6a183645d5dbf16fc888f -d tmp_blow5/592a61243d7421ac90e6a183645d5dbf16fc888f -p40

Then it succeeded.

7. Now merge the BLOW5 files:

slow5tools merge tmp_blow5 -o HEK293T-WT-rep3.blow5 -t40


8. It is a good practice to try indexing, as it will error out if there are any duplicate read IDs (a sanity check to prevent any user mistakes)

slow5tools index HEK293T-WT-rep3.blow5


9. Now let us run a sanity check that compares the number of reads in the merged BLOW5 against the original multi-FAST5s:

slow5tools stats HEK293T-WT-rep3.blow5 | grep "number of records" | awk '{print $NF}'
513668
find fast5/ -name '*.fast5' | parallel -I% --max-args 1 strings % | grep "read_.*-.*-.*" | wc -l | awk 'BEGIN {count=0;} {count=count+$0} END {print count;}'
513670

Note that the above method of counting the reads in FAST5 is a horrible hack (there is no easy alternative) and thus might be slightly off. But you can see that the counts almost match, meaning things are good enough.

10. Finally, delete the temporary directories:

rm -r tmp_blow5 tmp_fast5 tmp_single_fast5
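The per-run_id part of the recovery above can be condensed into a loop. This is only a sketch, under the assumption that all corrupted FAST5s have already been deleted; the commands are echoed (dry run) rather than executed, and the `demo_*` directory names are made up stand-ins for `tmp_fast5/` and `tmp_blow5/`:

```shell
# Dry-run sketch of the per-run_id recovery workflow above.
# demo_fast5/<run_id>/ stands in for tmp_fast5/<run_id>/; assumes the
# corrupted FAST5 files have already been removed.
mkdir -p demo_fast5/runA demo_fast5/runB demo_blow5
run() { echo "+ $*"; }            # print the command instead of executing it
for d in demo_fast5/*/; do
    id=$(basename "$d")
    run slow5tools f2s "demo_fast5/$id" -d "demo_blow5/$id" -p 40
done
run slow5tools merge demo_blow5 -o merged.blow5 -t 40   # one read group per run_id
run slow5tools index merged.blow5                       # sanity check for duplicate read IDs
rm -r demo_fast5 demo_blow5                             # clean up the toy layout
```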



---

Since I have converted this now, I will upload the BLOW5 file shortly so that you can just download it and save time.

Happy to help with any future error you would encounter when converting another dataset.

As you may realise, FAST5 is a mess and there are many complex issues that can arise. That is why we cannot fully rely on an automated script; the number of such complexities is unknown.

In my opinion, all these issues would not have occurred had those who made FAST5 put some thought into designing and developing their file format specs, rather than doing ad-hoc, implementation-driven development (with POD5, they are repeating the same mistakes as far as I can see; wait for more fun ahead).

hasindu2008 commented 1 year ago

@hengjwj The link is here https://unsw-my.sharepoint.com/:u:/g/personal/z5136909_ad_unsw_edu_au/EUcFf5Z31pVMp8x8O__G8FUBvDIe8nPJAL3t5yv0fMzMqg?download=1

You can use wget as:

 wget -O ERR4706307_HEK293T-WT-rep3.blow5 https://unsw-my.sharepoint.com/:u:/g/personal/z5136909_ad_unsw_edu_au/EUcFf5Z31pVMp8x8O__G8FUBvDIe8nPJAL3t5yv0fMzMqg?download=1
hasindu2008 commented 1 year ago

@hengjwj Any updates on this? Let me know after you download the file.

hengjwj commented 1 year ago

Hi Hasindu,

Sorry for the late reply. Yes, I've downloaded the BLOW5 file. Thank you for the detailed explanation and for converting it for me. So from this, I gather I should only delete reads if the error, "Failed to read raw data from dataset Signal" shows up. Thanks again!! I'll close this issue.

Joel

hasindu2008 commented 1 year ago

If you face any new issues, feel free to open a new issue. Happy to help.