Closed: hengjwj closed this issue 1 year ago.
Hey,
Hasindu is in San Fran at a conference so might be a bit slow answering.
Basically we take the data from the first read encountered, then compare incoming reads to see if there are differences that can't be rectified while keeping the output stable.
So if you separate out the "bad reads", those reads may well be similar to each other in the way they are "bad", in which case we can still make a stable format out of them.
You can probably merge all groups of good and bad reads afterwards if you can convert them, as once it's in slow5, it's all compatible using our read group method.
Also consider using the --allow flag if it works with what you are trying to do.
Getting slow5tools to be compatible with the many many many changes ont made to fast5 was probably the most time consuming and complicated part of building slow5tools, but also one of the main reasons we made slow5.
Does that help with some explanation? I can go more into specifics if you need to.
Cheers, James
Hey James,
Thanks for the explanation.
For sequencing runs with problematic multi-read FAST5s, would it be better if f2s processes single-read FAST5s (generated with multi_to_single_fast5) in smaller batches (e.g. by the folders generated where "1/" -> "1.blow5")? So, instead of taking the data from the first read encountered, and using that as a reference for incoming reads, how about taking the first read within each batch for subsequent reads in the batch so that fewer reads are lost?
Aside, a dumb question: I'm not sure what "run_id" is. What was the rationale for making the --allow flag optional instead of on by default?
Appreciate the time and effort you guys put into optimising this for the community!
Joel
Hey,
So run_id is the unique id given to each run. Given each run will have its own global metadata, we add this warning because it alerts the user to a mixed dataset. Even the same flowcell, if the run was restarted, or stopped and started again, will produce a new run_id.
The --allow option will just use the metadata from the first encountered read as the default for the whole run, essentially losing that mixed information.
If you want to keep that info, you would need to split your fast5s by run_id, then convert those batches, then merge them at the end; inside the slow5 file you will then have multiple read groups that contain the various metadata for each run_id.
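The split-convert-merge idea can be sketched in shell. This is only a toy illustration: `run_id_of` below is a hypothetical stub standing in for however you actually extract a run_id from a FAST5 (e.g. querying the HDF5 attributes, which is what real scripts do), the dummy files just contain their run_id as plain text, and the slow5tools commands are shown as comments only.

```shell
# Toy sketch only. ASSUMPTION: run_id_of is a stub; with real FAST5s you
# would query the HDF5 run_id attribute (e.g. via h5dump). Here each
# dummy "fast5" just contains its run_id as plain text.
set -e
mkdir -p demo_fast5
printf 'runA' > demo_fast5/read1.fast5
printf 'runA' > demo_fast5/read2.fast5
printf 'runB' > demo_fast5/read3.fast5

run_id_of() { cat "$1"; }   # stub for real run_id extraction

# Group the files into one directory per run_id
for f in demo_fast5/*.fast5; do
    rid=$(run_id_of "$f")
    mkdir -p "tmp_fast5/$rid"
    mv "$f" "tmp_fast5/$rid/"
done

# Each per-run_id batch can then be converted and the results merged, e.g.:
#   slow5tools f2s tmp_fast5/runA -d tmp_blow5/runA
#   slow5tools f2s tmp_fast5/runB -d tmp_blow5/runB
#   slow5tools merge tmp_blow5 -o merged.blow5
ls tmp_fast5    # now contains runA/ and runB/
```

The merged file then carries one read group per run_id, so no metadata is lost.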
We didn't make --allow the default because conversion would usually be happening on a per-run basis, and it is merge that should be used to mix runs, rather than the conversion step.
Does that make sense?
I'm about to drive 3h to a conference in Canberra, but I'll have a think about a simple way to pre split the reads by run_id on the drive to help. If you come up with a way though let us know.
James
@hengjwj
Sorry for the delayed response. I was busy with a talk at a conference. Actually, this error about "Different run_ids found in an individual multi-fast5 file" is a different error from what you encountered in #95. What we encountered in the previous issue was "[read_dataset::ERROR] Failed to read raw data from dataset Signal." The fix I mentioned there is only suitable for that kind of error; sorry for not being more explicit about this.
The error you see here can be fixed in a different way. This new error happens due to run_ids being mixed (as @Psy-Fer said); in most cases this is when someone has used ONT's tool to pack single FAST5s into multi FAST5s. I wrote a script some time ago to handle this issue: https://github.com/hasindu2008/slow5tools/blob/master/scripts/mixed-multi-fast5-to-blow5.sh
You can call it as mixed-multi-fast5-to-blow5.sh <multi-fast5 directory>.
Note that the script is NOT an efficient script, but it does the job without losing any reads or run_id information.
Could you run that script and let me know how it goes? You could look at the comments in the script if you would like to understand how it works. If it is not clear, I can write up an explanation.
Also, about the suggestion of taking the first read as a reference: some sort of batching like this was implemented by @hiruna72, but I do not remember the details; can you explain how you implemented handling of mixed datasets?
Hello @hengjwj,
As @Psy-Fer mentioned handling mixed datasets is tricky.
If there are N multi-fast5 files and M single-fast5 files to be converted by a process, N blow5 files will be created (one for each multi-fast5). The run_id of the first read of a multi-fast5 is checked against the rest of the reads in that multi-fast5 file. If there is a mismatch, an error will be thrown. If the user still wants to convert, they can supply --allow. Note that it is okay for different multi-fast5 files to have different run_ids, but not within the same file.
However, the M single-fast5 files will be converted into a single blow5 file. Similarly, the run_id of the first file will be set in the blow5 header, and the rest of the single-fast5 files must have the same run_id; if not, an error will be thrown. Again, supply --allow to continue without the error.
The reason for not creating a blow5 for each single-fast5 file, as is done in the multi-fast5 case, is to support the common scenario where the user converts a folder of single-fast5 files (with the same run_id) to one blow5 file. This skips the merge/cat step and is faster, as we open/close a limited number of files.
In total, N + 1 blow5 files will be created by the process.
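As a toy illustration of how the N + 1 count comes about (one assumption flagged: real multi- vs single-read FAST5s are distinguished by their HDF5 contents, not by filename; the classification below is hard-coded purely for the example, and no real conversion happens):

```shell
# Toy arithmetic only: no real conversion happens here.
set -e
mkdir -p in
: > in/multi_1.fast5      # pretend these two are multi-read FAST5s
: > in/multi_2.fast5
: > in/single_a.fast5     # pretend these three are single-read FAST5s
: > in/single_b.fast5
: > in/single_c.fast5

N=$(ls in/multi_*.fast5 | wc -l | tr -d ' ')   # N multi-fast5 -> N blow5 files
M=$(ls in/single_*.fast5 | wc -l | tr -d ' ')  # M single-fast5 -> 1 combined blow5
echo "expected blow5 files: $((N + 1))"        # prints: expected blow5 files: 3
```

So with 2 multi-fast5 and 3 single-fast5 inputs, one process would emit 3 blow5 files: one per multi-fast5 plus one combined file for all the single-fast5s.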
To solve the issue you have, I suggest using the script Hasindu mentioned above. Please let us know how it goes.
Thank you.
Hi all,
Sorry for the delay in replying. I ran the script and got the following back:
(np) [user@hpc-amd004 HEK293T-WT-rep3]$ sh mixed-multi-fast5-to-blow5.sh fast5/
4.1.1
h5dump: Version 1.12.2
GNU parallel 20211222
Copyright (C) 2007-2021 Ole Tange, http://ole.tange.dk and Free Software
Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <https://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
GNU parallel comes with no warranty.
Web site: https://www.gnu.org/software/parallel
When using programs that use GNU Parallel to process data for publication
please cite as described in 'parallel --citation'.
slow5tools 1.0.0
[main] cmd: slow5tools --version
[main] real time = 0.000 sec | CPU time = 0.004 sec | peak RAM = 0.001 GB
| 134 of 134|##############################################################################################################################|100% Time: 0:06:10
Converting tmp_fast5/4f128f30d8e1d613c3f5204b914f67830bb9276f
[list_all_items] Looking for '*.fast5' files in tmp_fast5/4f128f30d8e1d613c3f5204b914f67830bb9276f/
[f2s_main] 250055 fast5 files found - took 49.204s
[f2s_main] Just before forking, peak RAM = 0.000 GB
[f2s_iop] 16 proceses will be used.
[f2s_iop] Spawning 16 I/O processes to circumvent HDF hell.
[search_and_warn::WARNING] slow5tools-v1.0.0: Weird or ancient fast5: converting the attribute /file_version from H5T_IEEE_F64LE to string for consitency.
...
[search_and_warn::WARNING] slow5tools-v1.0.0: Weird or ancient fast5: converting the attribute /file_version from H5T_IEEE_F64LE to string for consitency.
[read_dataset::ERROR] Failed to read raw data from dataset Signal.
[read_fast5::ERROR] Bad fast5: Could not iterate over the read groups in the fast5 file tmp_fast5/4f128f30d8e1d613c3f5204b914f67830bb9276f//9fce1221-de7c-478f-b74a-ba444a17528e.fast5.
[f2s_child_worker::ERROR] Could not read contents of the fast5 file 'tmp_fast5/4f128f30d8e1d613c3f5204b914f67830bb9276f//9fce1221-de7c-478f-b74a-ba444a17528e.fast5'.
[f2s_child_worker::INFO] Summary - total fast5: 15629, bad fast5: 0
[f2s_child_worker::INFO] Summary - total fast5: 15629, bad fast5: 0
[f2s_child_worker::INFO] Summary - total fast5: 15629, bad fast5: 0
[f2s_child_worker::INFO] Summary - total fast5: 15629, bad fast5: 0
[f2s_child_worker::INFO] Summary - total fast5: 15620, bad fast5: 0
[f2s_child_worker::INFO] Summary - total fast5: 15629, bad fast5: 0
[f2s_child_worker::INFO] Summary - total fast5: 15629, bad fast5: 0
[f2s_child_worker::INFO] Summary - total fast5: 15629, bad fast5: 0
[f2s_child_worker::INFO] Summary - total fast5: 15629, bad fast5: 0
[f2s_child_worker::INFO] Summary - total fast5: 15629, bad fast5: 0
[f2s_child_worker::INFO] Summary - total fast5: 15629, bad fast5: 0
[f2s_child_worker::INFO] Summary - total fast5: 15629, bad fast5: 0
[f2s_child_worker::INFO] Summary - total fast5: 15629, bad fast5: 0
[f2s_child_worker::INFO] Summary - total fast5: 15629, bad fast5: 0
[f2s_iop] Child process 14421 exited with status=1.
Converting tmp_fast5/4f128f30d8e1d613c3f5204b914f67830bb9276f to BLOW5 failed
(np) [user@hpc-amd004 HEK293T-WT-rep3]$ [f2s_child_worker::INFO] Summary - total fast5: 15629, bad fast5: 0
FAST5 from here: ftp://ftp.sra.ebi.ac.uk/vol1/run/ERR470/ERR4706307/HEK293T-WT-rep3.tar.gz
Joel
Sorry for the late response. I have been on flights with no Internet access.
So I downloaded your dataset and here is how I solved it:
1. I ran mixed-multi-fast5-to-blow5.sh fast5/ and faced the same issue as you:
[read_dataset::ERROR] Failed to read raw data from dataset Signal.
[read_fast5::ERROR] Bad fast5: Could not iterate over the read groups in the fast5 file tmp_fast5/4f128f30d8e1d613c3f5204b914f67830bb9276f//9fce1221-de
7c-478f-b74a-ba444a17528e.fast5.
[f2s_child_worker::ERROR] Could not read contents of the fast5 file 'tmp_fast5/4f128f30d8e1d613c3f5204b914f67830bb9276f//9fce1221-de7c-478f-b74a-ba444a17528e.fast5'.
2. The above error message starts with [read_dataset::ERROR] Failed to read raw data from dataset Signal., which means the read tmp_fast5/4f128f30d8e1d613c3f5204b914f67830bb9276f//9fce1221-de7c-478f-b74a-ba444a17528e.fast5 is fully corrupted and must be removed:
rm tmp_fast5/4f128f30d8e1d613c3f5204b914f67830bb9276f//9fce1221-de7c-478f-b74a-ba444a17528e.fast5
3. Now look at tmp_fast5/:
ls tmp_fast5/
4f128f30d8e1d613c3f5204b914f67830bb9276f 6250ed52a40a7a70570d4d6f9db4dd00ea40f485 e9dce33cf98e372a4f6d3f839b59a70a8f7ae3bb
592a61243d7421ac90e6a183645d5dbf16fc888f 771a38df9d9b678a485923dc79a6a5d5d59c115c
There are 5 run_ids, and tmp_fast5/4f128f30d8e1d613c3f5204b914f67830bb9276f//9fce1221-de7c-478f-b74a-ba444a17528e.fast5 was in the first run_id.
4. Remove the failed BLOW5 conversion:
rm -r tmp_blow5/4f128f30d8e1d613c3f5204b914f67830bb9276f/
5. Now that we have deleted the offending FAST5 file in step 2, let us manually launch a conversion for that run_id:
slow5tools f2s tmp_fast5/4f128f30d8e1d613c3f5204b914f67830bb9276f -d tmp_blow5/4f128f30d8e1d613c3f5204b914f67830bb9276f -p40
This succeeded.
6. Now launch the conversion for the remaining run_ids:
slow5tools f2s tmp_fast5/6250ed52a40a7a70570d4d6f9db4dd00ea40f485 -d tmp_blow5/6250ed52a40a7a70570d4d6f9db4dd00ea40f485 -p40
slow5tools f2s tmp_fast5/e9dce33cf98e372a4f6d3f839b59a70a8f7ae3bb -d tmp_blow5/e9dce33cf98e372a4f6d3f839b59a70a8f7ae3bb -p40
slow5tools f2s tmp_fast5/592a61243d7421ac90e6a183645d5dbf16fc888f -d tmp_blow5/592a61243d7421ac90e6a183645d5dbf16fc888f -p40
slow5tools f2s tmp_fast5/771a38df9d9b678a485923dc79a6a5d5d59c115c -d tmp_blow5/771a38df9d9b678a485923dc79a6a5d5d59c115c -p40
In run ID 592a61243d7421ac90e6a183645d5dbf16fc888f, two reads failed with the same "Failed to read raw data from dataset Signal." error. So I manually deleted them and repeated the conversion for that run_id:
rm tmp_fast5/592a61243d7421ac90e6a183645d5dbf16fc888f/79e10211-bb53-4f7d-ab66-d63affca6cc3.fast5
rm -r tmp_blow5/592a61243d7421ac90e6a183645d5dbf16fc888f
slow5tools f2s tmp_fast5/592a61243d7421ac90e6a183645d5dbf16fc888f -d tmp_blow5/592a61243d7421ac90e6a183645d5dbf16fc888f -p40
Then it succeeded.
7. Now merge the BLOW5 files
slow5tools merge tmp_blow5 -o HEK293T-WT-rep3.blow5 -t40
8. It is a good practice to try indexing, as it will error out if there are any duplicate read IDs (a sanity check to prevent any user mistakes)
slow5tools index HEK293T-WT-rep3.blow5
9. Now let us run a sanity check that counts the number of reads in the merged BLOW5 and in the original multi-fast5 files:
slow5tools stats HEK293T-WT-rep3.blow5 | grep "number of records" | awk '{print $NF}'
513668
find fast5/ -name '*.fast5' | parallel -I% --max-args 1 strings % | grep "read_.*-.*-.*" | wc -l | awk 'BEGIN {count=0;} {count=count+$0} END {print count;}'
513670
Note that the above method of getting the number of reads from FAST5 is a horrible hack (there is no other easy way anyway) and thus might be slightly off. But you can see that the counts almost match, meaning things are good enough.
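To make the shape of that counting hack clearer, here is a toy reproduction on dummy plain-text files. Assumptions flagged: `cat` stands in for `strings` (real FAST5s are binary HDF5, which is why the real one-liner needs `strings` and GNU parallel), and each read id sits on its own line, which is what the real pipeline relies on too.

```shell
# Toy reproduction only. ASSUMPTION: cat stands in for strings, since
# these dummy "fast5" files are plain text rather than binary HDF5.
set -e
mkdir -p fake_fast5
printf 'read_1-2-3\nread_4-5-6\n' > fake_fast5/a.fast5
printf 'read_7-8-9\n'             > fake_fast5/b.fast5

# Same idea as the one-liner above, minus GNU parallel
count=$(find fake_fast5 -name '*.fast5' -exec cat {} \; \
        | grep 'read_.*-.*-.*' | wc -l | tr -d ' ')
echo "$count"   # prints 3: one matching line per read id
```

The count is per matching line, which is why the real version can be slightly off when a file's binary content happens to contain extra (or split) matches.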
10. Finally, delete the temporary directories:
rm -r tmp_blow5 tmp_fast5 tmp_single_fast5
---
Given I converted this now, I will upload the BLOW5 file shortly so that you can just download it and save time.
Happy to help with any future error you would encounter when converting another dataset.
As you may realise, FAST5 is a mess and there are so many complex issues that can arise. That is why we cannot fully rely on an automated script as the number of such complexities is unknown.
In my opinion, all these issues would not have occurred had those who made FAST5 put some thought into designing and developing their file format specs, rather than doing ad-hoc implementation-driven development (with POD5, they are repeating the same mistakes as far as I can see; wait for more fun ahead).
@hengjwj The link is here https://unsw-my.sharepoint.com/:u:/g/personal/z5136909_ad_unsw_edu_au/EUcFf5Z31pVMp8x8O__G8FUBvDIe8nPJAL3t5yv0fMzMqg?download=1
You can use wget as:
wget -O ERR4706307_HEK293T-WT-rep3.blow5 https://unsw-my.sharepoint.com/:u:/g/personal/z5136909_ad_unsw_edu_au/EUcFf5Z31pVMp8x8O__G8FUBvDIe8nPJAL3t5yv0fMzMqg?download=1
@hengjwj Any updates on this? Let me know after you download the file.
Hi Hasindu,
Sorry for the late reply. Yes, I've downloaded the BLOW5 file. Thank you for the detailed explanation and for converting it for me. So from this, I gather I should only delete reads if the error, "Failed to read raw data from dataset Signal" shows up. Thanks again!! I'll close this issue.
Joel
If you face any new issues, feel free to open a new issue. Happy to help.
Hi again @hasindu2008!
After converting two problematic multi-FAST5 files to single FAST5s, I attempted f2s like so:
This then returned some errors naming bad reads like so:
I removed the named bad read and repeated this several times, removing the newly named bad read each time. I had removed 10 reads in total from the folder "q_single/1" before wondering whether the same bad reads would be named if f2s were run on individual folders (the error above is for folder "q_single/0").
When I attempted f2s on just the "1" folder to speed things up (since the bad reads only came from "1"), it seemed to work after removing only one bad read:
Does that mean the reads previously named as bad may not be bad reads after all?
Edit: I used slow5tools 1.0.0 for this.
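For what it's worth, the remove-and-retry loop described above can be automated. Below is a heavily stubbed shell sketch: `fake_f2s` is a hypothetical stand-in for slow5tools f2s that "fails" while any `bad_*.fast5` file remains, printing the offending path in roughly the shape of the real error message; with the real tool you would parse its actual stderr instead. And per the advice earlier in this thread, a real version should only auto-delete reads failing with the "Failed to read raw data from dataset Signal" error, not every error.

```shell
# Stubbed sketch: fake_f2s is NOT slow5tools; it only mimics the
# "fail and name one bad file" behaviour so the retry loop can be shown.
set -e
mkdir -p q_single_demo
: > q_single_demo/good1.fast5
: > q_single_demo/bad_x.fast5
: > q_single_demo/bad_y.fast5

fake_f2s() {
    bad=$(ls "$1"/bad_*.fast5 2>/dev/null | head -n 1)
    if [ -n "$bad" ]; then
        echo "ERROR Could not read contents of the fast5 file '$bad'" >&2
        return 1
    fi
    echo "converted $(ls "$1" | wc -l | tr -d ' ') file(s)"
}

# Retry until conversion succeeds, deleting the named bad read each time
while ! out=$(fake_f2s q_single_demo 2>&1); do
    bad=$(echo "$out" | sed -n "s/.*fast5 file '\(.*\)'.*/\1/p")
    echo "removing $bad"
    rm "$bad"
done
echo "$out"
```

After the loop, only the good read remains and the (stubbed) conversion succeeds, mirroring the manual delete-and-rerun cycle described above.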