Closed tjprins closed 5 months ago
Hi @tjprins
Would you mind grepping this "261cc311-3a2c-40bc-ab69-df48a048aa2c" from the FASTQ? I reckon these are "split reads" where new Dorado splits a given read into two reads when basecalling. See if the FASTQ header has something like a different parent read ID?
Which version of Dorado are you using? I can try on my end too.
By the way, I see that a lot of reads have been skipped for having MAPQ < 20. You can ask f5c to include them in the analysis with --min-mapq 0. The m6anet authors have tested this and confirmed that a MAPQ threshold of 0 is all right: https://github.com/hasindu2008/f5c/issues/154#issuecomment-1960747253.
Hi hasindu2008, thanks for such a quick response.
> Would you mind grepping this "261cc311-3a2c-40bc-ab69-df48a048aa2c" from the FASTQ? I reckon these are "split reads" where new Dorado splits a given read into two reads when basecalling. See if the FASTQ header has something like a different parent read ID?
I saw a thread about this here when I was searching my issue. I looked into dorado's documentation as best I could to try to find a way to not split the reads, but I didn't see anything. If I run this data through guppy -> nanopolish I don't get this error, so you could be right, but I need dorado since guppy doesn't have a model for the new chemistry.
Below is the output of grep '261cc311-3a2c-40bc-ab69-df48a048aa2c' FA68LB4766WT624_master.fastq:
@261cc311-3a2c-40bc-ab69-df48a048aa2c st:Z:2024-06-21T00:16:42.545+00:00 RG:Z:4f29a5c9f8af0ea181004166b475ce6b83155cfc_rna004_130bps_fast@v5.0.0 DS:Z:gpu:NVIDIA GeForce RTX 3090 Ti
> Which version of Dorado are you using? I can try on my end too.
I am using dorado 0.7.0. If you want me to make these files available to you so you can try on your end I can certainly do that.
> By the way, I see that a lot of reads have been skipped as < mapq 20. You can ask f5c to include in the analysis through --min-mapq 0. M6anet authors have tested this and have confirmed mapq 0 threshold is all right https://github.com/hasindu2008/f5c/issues/154#issuecomment-1960747253.
Good to know, I will add that into my pipeline. For what it's worth, when I tried this, I got the same result.
Thanks again for all your help. I know it's gotta be rough to help us all troubleshoot our issues, but it makes a huge difference!!
Hello @tjprins
While it is time consuming to troubleshoot issues, it is also a pleasure to help the community in the way we can for the software we developed and continue to maintain.
I tested on my rna004 dataset using the latest Dorado, and it seems read splitting is enabled by default (as we guessed).
I called:
/install/dorado-0.7.2-linux-x64/bin/dorado basecaller /data/install/dorado-0.7.2-linux-x64/model/rna004_130bps_fast@v5.0.0/ pod5_shit/ > tmp.sam
samtools view tmp.sam | cut -f 1 > tmp.list
slow5tools get PNXRXX240010_reads_20k.blow5 --list tmp.list > /dev/null
and it gave errors like:
[slow5_idx_get::ERROR] Read ID 'f64dbcee-68ae-4cc4-8a86-2eef57d3dc14' was not found. At src/slow5_idx.c:539
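To see up front which basecalled read IDs have no counterpart in the signal file, rather than getting one error per read, the two ID lists can be compared directly. Below is a minimal sketch, assuming both lists are plain text files with one read ID per line. The function name missing_ids and the demo IDs are made up for illustration.

```shell
# Print every read ID that is in the basecalled list ($1)
# but not in the signal-file list ($2). One ID per line in both files.
missing_ids() {
    # -F fixed strings, -x whole-line match, -f patterns from file, -v invert.
    # "|| true" keeps the exit status zero when nothing is missing.
    grep -vxFf "$2" "$1" || true
}

# Tiny demonstration with made-up IDs
tmpdir=$(mktemp -d)
printf 'read-a\nread-b\nread-c\n' > "$tmpdir/basecalled.list"
printf 'read-a\nread-c\n' > "$tmpdir/signal.list"
missing_ids "$tmpdir/basecalled.list" "$tmpdir/signal.list"   # prints read-b
rm -r "$tmpdir"
```

With the files above, tmp.list already serves as the basecalled list. One way to build the other list is to take the first column of the non-header lines printed by slow5tools view, for example: slow5tools view PNXRXX240010_reads_20k.blow5 | grep -v '^[#@]' | cut -f 1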
Now grepping for that read ID in the SAM file:
samtools view tmp.sam | grep "f64dbcee-68ae-4cc4-8a86-2eef57d3dc14"
f64dbcee-68ae-4cc4-8a86-2eef57d3dc14 4 * 0 0 * * 0 0 GGCGGCGGCGGCGGCCA """"""""""""""""" qs:f:1 du:f:2.07425 ns:i:8297 ts:i:0 mx:i:4 ch:i:1177 st:Z:2024-01-11T07:50:55.557+00:00 rn:i:-1 fn:Z:PNXRXX240010_reads_20k.pod5 sm:f:806.16 sd:f:118.119 sv:Z:pa dx:i:0 RG:Z:844f1e7a1a1f3f41ac89a9971f789f21b3e161d2_rna004_130bps_fast@v5.0.0 pi:Z:ff6b33b0-f24f-4af8-9edb-0bae0b7f0d10 sp:i:0
That pi:Z:ff6b33b0-f24f-4af8-9edb-0bae0b7f0d10 tag indicates that this missing read is indeed a split read whose parent is ff6b33b0-f24f-4af8-9edb-0bae0b7f0d10.
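The same check can be done for all reads at once instead of grepping one ID at a time. Below is a minimal sketch, assuming that only split reads carry a pi:Z tag, as the record above suggests. The function name split_read_parents is made up for illustration, and the demo input is a shortened copy of the record above.

```shell
# Read SAM records on stdin and print "read_id<TAB>parent_id"
# for every record that carries a pi:Z (parent read ID) tag.
split_read_parents() {
    awk -F'\t' '!/^@/ {
        # optional tags start at column 12 of a SAM record
        for (i = 12; i <= NF; i++)
            if ($i ~ /^pi:Z:/) print $1 "\t" substr($i, 6)
    }'
}

# Demo: the record shown above with most tags omitted (tab-separated, as in a real SAM file)
printf 'f64dbcee-68ae-4cc4-8a86-2eef57d3dc14\t4\t*\t0\t0\t*\t*\t0\t0\tGGCGGCGGCGGCGGCCA\t"""""""""""""""""\tqs:f:1\tpi:Z:ff6b33b0-f24f-4af8-9edb-0bae0b7f0d10\tsp:i:0\n' \
    | split_read_parents
```

On a real file the input would come from samtools view tmp.sam rather than printf, and the first output column lists the read IDs that f5c will report as missing.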
This parent read ID is available in the BLOW5 file, as shown by:
slow5tools get PNXRXX240010_reads_20k.blow5 ff6b33b0-f24f-4af8-9edb-0bae0b7f0d10 > /dev/null
[main] cmd: slow5tools get PNXRXX240010_reads_20k.blow5 ff6b33b0-f24f-4af8-9edb-0bae0b7f0d10
[main] real time = 0.009 sec | CPU time = 0.010 sec | peak RAM = 0.006 GB
Now that I have confirmed this is due to split reads, you can simply continue to ignore these missing reads in f5c. From your report, only 23954 reads out of 650162 are bad reads (these missing reads), which means about 4% of reads are lost.
[meth_main] total entries: 650162, qc fail: 115, could not calibrate: 16766, no alignment: 39878, bad reads: 23954
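For reference, the lost fraction can be computed straight from that summary line. Below is a minimal sketch, assuming the line keeps the "total entries: N ... bad reads: M" layout shown above. The function name bad_read_pct is made up for illustration.

```shell
# Read an f5c [meth_main] summary line on stdin and print
# bad reads as a percentage of total entries.
bad_read_pct() {
    sed -E 's/.*total entries: ([0-9]+).*bad reads: ([0-9]+).*/\1 \2/' \
        | awk '{ printf "%.2f\n", 100 * $2 / $1 }'
}

summary='[meth_main] total entries: 650162, qc fail: 115, could not calibrate: 16766, no alignment: 39878, bad reads: 23954'
printf '%s\n' "$summary" | bad_read_pct   # prints 3.68
```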
In theory, I could handle these parent read IDs in f5c, but this is not practical because the parent read ID information is missing from the FASTQ file. Even if it were there, nanopore will keep changing the names and conventions, making it time-consuming to maintain.
When I skimmed through the options, Dorado did not appear to have an option to disable read splitting. The easiest option is to simply ignore the lost 4% of reads. I know those missing reads spam the terminal with errors and warnings, but in the next release I am going to suppress these warnings for split reads and print a single warning at the end with the number of skipped reads. Alternatively, you can use buttery-eel, which uses Dorado server to basecall blow5 files directly. I tested with Dorado server 7.2.13 as:
/install/buttery-eel-0.4.2+dorado7.2.13/scripts/eel --config rna_rp4_130bps_fast.cfg -i PNXRXX240010_reads_20k.blow5 -o tmp.fastq -x cuda:all
and the read splitting option is disabled by default. rna_rp4_130bps_fast.cfg is for rna004, but the model version is rna004_130bps_fast@v3.0.1, which is behind the rna004_130bps_fast@v5.0.0 that the latest Dorado has. I haven't had a chance to test buttery-eel with the latest dorado-server to confirm what the read splitting behaviour is, but @Psy-Fer should be able to comment on that.
A lot of information here, hope I was clear.
Thanks, Hasindu. I followed your example and gave buttery-eel a shot with the dorado-server and can confirm that I am no longer having the issue and that all of my reads are now being faithfully reported. Thank you so much for your assistance, this has been immensely helpful. I will probably use dorado-server going forward (especially because I can use it directly on blow5 files with buttery-eel).
Great! I will close the issue. If you run into another issue or any other question, feel free to open a new issue.
Hello,
I am using m6ANet to detect m6A modifications in RNA and recently swapped over to the new nanopore RNA004 chemistry/kit. Because of this, my pipeline uses ONT's new basecaller, dorado, and I am using f5c instead of nanopolish, per the updated m6anet documentation. Everything seems to go smoothly with the new pipeline until I get to the f5c eventalign step, in which I get a lot of these messages:
At the end of the eventalign execution, it spits out the following:
Not sure what is going on, but it looks like it is having trouble finding a lot of the reads after basecalling or pod5 -> blow5 conversion. My pipeline is the following: basecalling (dorado, pod5 as input) > alignment (minimap2) > pod5 to blow5 conversion (blue-crab) > f5c (index) > f5c (eventalign). For completeness, I have included the commands and console output of each of these steps below.
Any insight or help is very much appreciated. Thank you!
Dorado:
Minimap2
Blue-crab
f5c (index)