Closed Yang990-sys closed 7 months ago
Hi @Yang990-sys thank you for reporting this. Nothing should have changed between 0.5.2 and 0.4.3 for RNA. I will have a look at this tomorrow and get back to you. This clearly looks like a regression.
Hi @Yang990-sys - I think I've found the bug. We have an issue with trimming. Can you run your new dataset with dorado v0.5.0? That shouldn't have the problem and will confirm my suspicion.
You're right. dorado v0.5.0 does not have this issue.
Thank you for confirming! We're working on a patch fix and will release it in a couple of days.
Hi @Yang990-sys - dorado v0.5.3 was just released which fixes this issue. You can download it from here - https://github.com/nanoporetech/dorado?tab=readme-ov-file#installation . Thank you for your patience!
Dear Dorado team,
First, I want to express my sincere appreciation for developing such a valuable tool - Dorado!
I'm writing to inquire about a potential issue I'm encountering while using Dorado 0.5.3 to analyse direct RNA-seq datasets generated from Arabidopsis thaliana total RNA (with the RNA004 kit).
Issue: When examining the poly(A) length distribution in my data, I observe a bimodal pattern peaking at <10 nt and ~100 nt (attached) instead of the expected single distribution around 60-70 nt for Arabidopsis total RNA (from current literature). This distribution resembles the one reported in issue #588. Additionally, the number of reads containing poly(A) tails is significantly lower than anticipated (attached).
Context: I have basecalled two such datasets using Dorado 0.5.3. I plan to generate several more datasets in the coming weeks and am concerned about these observations impacting my downstream analyses.
Question: I'm unsure whether this is an intrinsic issue with my experimental setup or a potential analytical problem with Dorado 0.5.3. Could you please advise on this and suggest any troubleshooting steps or alternative approaches?
Additional information:
I've attached the poly(A) length distribution plot and information about the command line used for analysis. Thank you once again for your time and assistance.
Sincerely,
Ranj
Command used: dorado basecaller -r -v --min-qscore 10 --estimate-poly-a rna004_130bps_sup@v3.0.1 pod5/ > dorado0.53_polya_Q10.bam
Dorado version
Sample 1 Sample 2
Poly A tail length distribution
Hi @Papareddy - would you be able to call this dataset with dorado v0.5.0 as well? 0.5.3 should have fixed a regression we introduced but wondering if there's something more nuanced going on there -
Here's where you can download those binaries -
Hi @tijyojwad,
Thanks again for your quick response!
I'm attaching information for sample 2; the results appear similar to those from dorado 0.5.3. We're still seeing a low number of reads with poly(A) tails and a bimodal distribution like before (attached).
It appears that ribosomal RNA (rRNA)-derived reads are present in substantial numbers. For reference, I've attached an IGV view showing one of the highly abundant rRNA loci. I was under the impression that only poly(A) RNAs are enriched by oligo(dT) priming, even when using total RNA with the RNA004 kit. While this could (perhaps) explain the low number of poly A reads it doesn't explain the distribution!
Could you please give me your thoughts on this?
I appreciate your continued help!
Sincerely,
Ranj
rRNA locus
Distribution
metrics
Looking at the logs, it looks like dorado isn't calling 80% of the tails. This is much higher than what we'd expect (that number should be < 5% or so).
Is only a portion of your data expected to have tails? Since you're seeing the same distribution across samples there might actually be a peak around 100. Would you be open to sharing a pod5 with a few samples from the ~100 population and the ~10 population?
While this could (perhaps) explain the low number of poly A reads it doesn't explain the distribution!
Yep I'm somewhat dubious of the first peak as well, and I suspect it may have something to do with our RNA adapter detection mechanism. We have made some improvements internally that we hope to release soon, and it's possible that first peak is already taken care of. If you are able to share a few reads that show up from the first peak, I can verify that
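For anyone who wants to check the uncalled-tail fraction on their own output rather than reading it from the logs, here is a minimal sketch. It assumes the BAM has been converted to SAM text (for example with samtools view) and that a read without a tail estimate either lacks the pt:i tag or carries a negative value; both of those are assumptions about the output, not confirmed dorado behaviour. The function names are made up for illustration.

```python
from typing import Iterable, Optional


def poly_a_length(sam_line: str) -> Optional[int]:
    """Return the pt:i value of one SAM record, or None if the tag is absent."""
    fields = sam_line.rstrip("\n").split("\t")
    for tag in fields[11:]:  # optional tags start at column 12
        if tag.startswith("pt:i:"):
            return int(tag[5:])
    return None


def uncalled_fraction(sam_lines: Iterable[str]) -> float:
    """Fraction of records that have no usable poly(A) estimate."""
    total = 0
    uncalled = 0
    for line in sam_lines:
        # Skip blank lines and SAM header lines (which start with '@').
        if not line.strip() or line.startswith("@"):
            continue
        total += 1
        length = poly_a_length(line)
        # Assumption: a missing tag or a negative value both mean "no tail called".
        if length is None or length < 0:
            uncalled += 1
    return uncalled / total if total else 0.0
```

Passing sys.stdin to uncalled_fraction lets this sit at the end of a pipe that prints SAM records. A value near 0.8 would match what the logs show here.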
With the exception of ribosomal RNA, all of my messenger RNAs (mRNAs) are anticipated to possess poly A tails! I employed dorado recursively on a directory of pod5 files, and presently, I'm uncertain about which pod files contain reads with low poly A tail lengths. I will extract the reads based on these sizes and provide them to you. Thank you very much Joyjit for your consistently prompt responses!
Cheers, Ranj
thanks @Papareddy - you can determine the pod5 name from the BAM record. Each record will have a fn:Z tag which has the name of the POD5 from which that read came. So if you can send me a few reads where the pt:i tag is < 20, that would be great.
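One way to pull those reads out is sketched below. It assumes the BAM has been dumped to SAM text (for example with samtools view) and only relies on the fn:Z and pt:i tags described above. The function name and the min_len/max_len parameters are hypothetical, and the read ID is taken to be the first SAM column.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def reads_by_pod5(sam_lines: Iterable[str], min_len: int = 0,
                  max_len: int = 20) -> Dict[str, List[str]]:
    """Map POD5 file name (fn:Z) to IDs of reads with min_len <= pt:i < max_len."""
    by_file: Dict[str, List[str]] = defaultdict(list)
    for line in sam_lines:
        # Skip blank lines and SAM header lines (which start with '@').
        if not line.strip() or line.startswith("@"):
            continue
        fields = line.rstrip("\n").split("\t")
        tags = {}
        for tag in fields[11:]:  # optional tags are TAG:TYPE:VALUE
            name, _type, value = tag.split(":", 2)
            tags[name] = value
        # Reads without a tail estimate or without a source file are ignored.
        if "pt" not in tags or "fn" not in tags:
            continue
        if min_len <= int(tags["pt"]) < max_len:
            by_file[tags["fn"]].append(fields[0])
    return dict(by_file)
```

Calling it with the defaults collects the short-tail population; something like min_len=90, max_len=110 would collect reads from the ~100 peak instead. The resulting read IDs per file can then be used to subset the POD5 files before sharing.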
Hello Ranj,
I guess the difference may come from your sample itself. I am the company analyst in the issue #588. After reanalyzing the data using dorado v0.5.3, both the problems of poly(A)'s detection rate and poly(A)'s length distribution have been solved. The following figure shows the distribution of polyA length after reanalysis, which is consistent with the results of dorado v0.4.3 in issue #588.
Hello tijyojwad, does this mean that I still need to trim adapters with dorado trim? After trimming, the positions of m6A have not changed, and the redundant sequence after the poly(A) has not been trimmed either. The BAM file is indeed different, but that may just be caused by a different header. How can I obtain the correctly trimmed sequence?
And this is my trim code
Hello,
The same data and script show a significant difference in the number of polyA tails recognized between 0.4.3 and 0.5.2. Dorado 0.4.3 detected polyA in 99% of the data, while Dorado 0.5.2 only detected it in 20% of the total data, which is very similar to issue #588.
At the same time, I found that although Dorado's documentation mentions that RNA adapters are always trimmed, in the basecall result of 0.5.2 the adapters were not completely trimmed, which is different from the result of 0.4.3. Could this be the reason for the low detection rate in 0.5.2? Which result should I use?
0.5.2 reads: 0.4.3 reads:
0.5.2 polyA rate: 22%
0.4.3 polyA rate: 99%
0.5.2 basecall script: ~/soft/dorado-0.5.2-linux-x64/bin/dorado basecaller -x cuda:3 --estimate-poly-a ~/soft/model/rna004_130bps_sup@v3.0.1 pod5_pass/ --modified-bases-models ~/soft/model/rna004_130bps_sup@v3.0.1_m6A_DRACH@v1 >pass.bam
0.4.3 basecall script: ~/soft/dorado-0.4.3-linux-x64/bin/dorado basecaller -x cuda:3 --estimate-poly-a ~/soft/model/rna004_130bps_sup@v3.0.1 pod5_pass/ --modified-bases-models ~/soft/model/rna004_130bps_sup@v3.0.1_m6A_DRACH@v1 >pass.bam