Poly(A) issue about RNA004

Yang990-sys commented 7 months ago

Hello,

The same data and script give very different poly(A) detection rates under Dorado 0.4.3 and 0.5.2. Dorado 0.4.3 detected poly(A) tails in 99% of reads, while Dorado 0.5.2 detected them in only about 20%, which is very similar to issue #588.

At the same time, although Dorado's documentation says RNA adapters are always trimmed, the adapters were not completely trimmed in the 0.5.2 basecall results, unlike in the 0.4.3 results. Could this be the reason for the low detection rate in 0.5.2? Which result should I use?

0.5.2 reads: [image: WeCom screenshot 17068453579782]
0.4.3 reads: [image: WeCom screenshot 17068454516884]

0.5.2 polyA rate: 22%
0.4.3 polyA rate: 99%

0.5.2 basecall script:
~/soft/dorado-0.5.2-linux-x64/bin/dorado basecaller -x cuda:3 --estimate-poly-a ~/soft/model/rna004_130bps_sup@v3.0.1 pod5_pass/ --modified-bases-models ~/soft/model/rna004_130bps_sup@v3.0.1_m6A_DRACH@v1 > pass.bam

0.4.3 basecall script:
~/soft/dorado-0.4.3-linux-x64/bin/dorado basecaller -x cuda:3 --estimate-poly-a ~/soft/model/rna004_130bps_sup@v3.0.1 pod5_pass/ --modified-bases-models ~/soft/model/rna004_130bps_sup@v3.0.1_m6A_DRACH@v1 > pass.bam
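For anyone wanting to reproduce the detection-rate comparison, here is a minimal sketch of how the poly(A) rate could be computed from the SAM text of each BAM (for example, the output of samtools view). It assumes that a read counts as "detected" when it carries a pt:i tag with a positive value; how a given Dorado version marks reads without an estimate (missing tag versus a placeholder value) is an assumption worth checking against that version's documentation. The function names are illustrative and not part of Dorado.

```python
def polya_tag(sam_line):
    """Return the pt:i value of one SAM record, or None if the tag is absent."""
    fields = sam_line.rstrip("\n").split("\t")
    for field in fields[11:]:  # optional tags start at column 12
        if field.startswith("pt:i:"):
            return int(field[5:])
    return None


def polya_rate(sam_lines):
    """Fraction of records with a positive pt:i value; '@' header lines are skipped."""
    total = detected = 0
    for line in sam_lines:
        if line.startswith("@") or not line.strip():
            continue
        total += 1
        value = polya_tag(line)
        # Assumption: a missing tag or a value <= 0 means no tail was estimated.
        if value is not None and value > 0:
            detected += 1
    return detected / total if total else 0.0
```

Running this over the SAM lines of the 0.4.3 and 0.5.2 outputs would give two fractions comparable to the rates quoted above.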

tijyojwad commented 7 months ago

Hi @Yang990-sys thank you for reporting this. Nothing should have changed between 0.5.2 and 0.4.3 for RNA. I will have a look at this tomorrow and get back to you. This clearly looks like a regression.

tijyojwad commented 7 months ago

Hi @Yang990-sys - I think I've found the bug. We have an issue with trimming. Can you run your new dataset with dorado v0.5.0? That shouldn't have the problem and will confirm my suspicion.

Yang990-sys commented 7 months ago

> Hi @Yang990-sys - I think I've found the bug. We have an issue with trimming. Can you run your new dataset with dorado v0.5.0? That shouldn't have the problem and will confirm my suspicion.

You're right. dorado v0.5.0 does not have this issue.

tijyojwad commented 7 months ago

Thank you for confirming! We're working on a patch fix and will release it in a couple of days.

tijyojwad commented 7 months ago

Hi @Yang990-sys - dorado v0.5.3 was just released which fixes this issue. You can download it from here - https://github.com/nanoporetech/dorado?tab=readme-ov-file#installation . Thank you for your patience!

Papareddy commented 7 months ago

Dear Dorado team,

First, I want to express my sincere appreciation for developing such a valuable tool - Dorado!

I'm writing to inquire about a potential issue I'm encountering while using Dorado 0.5.3 to analyse direct RNA-seq datasets generated from Arabidopsis thaliana total RNA (with the RNA004 kit).

Issue: When examining the poly(A) length distribution in my data, I observe a bimodal pattern peaking at <10 nt and ~100 nt (attached) instead of the expected single distribution around 60-70 nt for Arabidopsis total RNA (from current literature). This distribution resembles the one reported in issue #588. Additionally, the number of reads containing poly(A) tails is significantly lower than anticipated (attached).

Context: I have basecalled two such datasets using Dorado 0.5.3. I plan to generate several more datasets in the coming weeks and am concerned about these observations affecting my downstream analyses.

Question: I'm unsure whether this is an intrinsic issue with my experimental setup or a potential analytical problem with Dorado 0.5.3. Could you please advise on this and suggest any troubleshooting steps or alternative approaches?

Additional information:

I've attached the poly(A) length distribution plot and information about the command line used for analysis. Thank you once again for your time and assistance.

Sincerely,

Ranj

Command used: dorado basecaller -r -v --min-qscore 10 --estimate-poly-a rna004_130bps_sup@v3.0.1 pod5/ > dorado0.53_polya_Q10.bam

Dorado version: [image: which_dorado]

Sample 1: [image: CT_polyA]
Sample 2: [image: CD_polyA]

Poly A tail length distribution: [image: polyA_distribution]

tijyojwad commented 7 months ago

Hi @Papareddy - would you be able to call this dataset with dorado v0.5.0 as well? 0.5.3 should have fixed a regression we introduced, but I'm wondering if there's something more nuanced going on there.

Here's where you can download those binaries -

Papareddy commented 7 months ago

Hi @tijyojwad,

Thanks again for your quick response!

I'm attaching information for sample 2, and the results appear similar to those from dorado 0.5.3. We're still seeing a low number of reads with poly(A) tails and a bimodal distribution like before (attached).

It appears that ribosomal RNA (rRNA) derived reads are present in substantial numbers. For reference, I've attached an IGV view showing one of the highly abundant rRNA loci. I was under the impression that only poly(A) RNAs are enriched by oligo(dT) priming, even when using total RNA with the RNA004 kit. While this could (perhaps) explain the low number of poly(A) reads, it doesn't explain the distribution!

Could you please give me your thoughts on this?

I appreciate your continued help!

Sincerely,

Ranj

rRNA locus: [image: Screenshot 2024-02-14 at 18 23 13]

Distribution metrics: [images: PolyA_dorado_0 5, dorado0 5]

tijyojwad commented 7 months ago

Looking at the logs, dorado isn't calling a tail for about 80% of the reads. That is much higher than we'd expect (the number should be < 5% or so).

Is only a portion of your data expected to have tails? Since you're seeing the same distribution across samples there might actually be a peak around 100. Would you be open to sharing a pod5 with a few samples from the ~100 population and the ~10 population?

tijyojwad commented 7 months ago

> While this could (perhaps) explain the low number of poly A reads it doesn't explain the distribution!

Yep, I'm somewhat dubious of the first peak as well, and I suspect it may have something to do with our RNA adapter detection mechanism. We have made some improvements internally that we hope to release soon, and it's possible that the first peak is already taken care of. If you are able to share a few reads from the first peak, I can verify that.

Papareddy commented 7 months ago

With the exception of ribosomal RNA, all of my messenger RNAs (mRNAs) are expected to have poly(A) tails! I ran dorado recursively on a directory of pod5 files, and at present I'm not sure which pod5 files contain the reads with short poly(A) tails. I will extract the reads based on these sizes and provide them to you. Thank you very much, Joyjit, for your consistently prompt responses!

Cheers, Ranj

tijyojwad commented 7 months ago

Thanks @Papareddy - you can determine the pod5 name from the BAM record. Each record has an fn:Z tag holding the name of the POD5 file that the read came from. So if you can send me a few reads where the pt:i tag is < 20, that would be great.
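A minimal sketch of the lookup described above, working on SAM text (for example, the output of samtools view on the basecalled BAM): it groups the IDs of short-tail reads by the POD5 file named in their fn:Z tag. The function names, the default threshold of 20, and the choice to ignore non-positive pt:i values are assumptions made for illustration, not Dorado behaviour.

```python
from collections import defaultdict


def sam_tags(sam_line):
    """Parse optional SAM tags (column 12 onward) into a {tag: value} dict of strings."""
    tags = {}
    for field in sam_line.rstrip("\n").split("\t")[11:]:
        parts = field.split(":", 2)  # TAG:TYPE:VALUE
        if len(parts) == 3:
            tags[parts[0]] = parts[2]
    return tags


def short_tail_reads(sam_lines, max_len=20):
    """Map POD5 file name (fn:Z) to the read IDs whose pt:i value is below max_len."""
    by_file = defaultdict(list)
    for line in sam_lines:
        if line.startswith("@") or not line.strip():
            continue  # skip header and blank lines
        tags = sam_tags(line)
        if "pt" not in tags or "fn" not in tags:
            continue
        # Assumption: non-positive values mean "no estimate" and are ignored.
        if 0 < int(tags["pt"]) < max_len:
            read_id = line.split("\t", 1)[0]
            by_file[tags["fn"]].append(read_id)
    return dict(by_file)
```

The returned read IDs, grouped per file, could then be used to subset the matching POD5 files before sharing them.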

Yang990-sys commented 7 months ago

> With the exception of ribosomal RNA, all of my messenger RNAs (mRNAs) are anticipated to possess poly A tails! I employed dorado recursively on a directory of pod5 files, and presently, I'm uncertain about which pod files contain reads with low poly A tail lengths. I will extract the reads based on these sizes and provide them to you. Thank you very much Joyjit for your consistently prompt responses!
>
> Cheers, Ranj

Hello Ranj,

I guess the difference may come from your sample itself. I am the company analyst from issue #588. After reanalyzing the data with dorado v0.5.3, both the poly(A) detection rate and the poly(A) length distribution problems have been solved. The following figure shows the distribution of poly(A) length after reanalysis, which is consistent with the dorado v0.4.3 results in issue #588.

Yang990-sys commented 6 months ago

> thanks @Papareddy - you can determine the pod5 name from the BAM record. Each record will have a fn:Z tag which has the name of the POD5 from which that read came. So if you can send me a few reads where pt:i tag < 20 that would be great.

Hello tijyojwad, does this mean that I still need to trim adapters with dorado trim? After trimming, the positions of m6A have not changed, and the redundant sequence after the poly(A) tail has not been trimmed either. The BAM file is indeed different, but that may just be caused by a different header. How can I obtain the correctly trimmed sequence?
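One way to tell whether two BAM files differ only in their headers is to compare the read sequences directly. The sketch below takes the SAM text of both files (for example, from samtools view) and reports reads whose sequence changed. The function names are hypothetical, and the sketch assumes each read ID appears once per file.

```python
def read_sequences(sam_lines):
    """Map read ID (column 1) to SEQ (column 10), skipping '@' header lines."""
    seqs = {}
    for line in sam_lines:
        if line.startswith("@") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        seqs[fields[0]] = fields[9]
    return seqs


def changed_reads(before_lines, after_lines):
    """Return {read_id: (len_before, len_after)} for reads whose sequence differs."""
    before = read_sequences(before_lines)
    after = read_sequences(after_lines)
    return {
        read_id: (len(before[read_id]), len(after[read_id]))
        for read_id in before.keys() & after.keys()
        if before[read_id] != after[read_id]
    }
```

An empty result would suggest that trimming left the sequences untouched and that any byte-level difference between the files comes from elsewhere, such as the header.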

Yang990-sys commented 6 months ago

And this is my trim code:

nanoporetech / dorado

Poly(A) issue about RNA004 #613