nanoporetech / dorado

Oxford Nanopore's Basecaller
https://nanoporetech.com/

Discrepancies between actual reads and polyA tail estimation #687

Open ChristopherAdelmann opened 4 months ago

ChristopherAdelmann commented 4 months ago

Issue Report

Please describe the issue:

While basecalling direct RNA samples, I've noticed a discrepancy between the number of basecalled bases and the estimated length of the polyA tail at the 3' end of the reads. Particularly with shorter reads, it's common for the total number of bases in a read to be smaller than the estimated polyA length. This discrepancy might stem from two possible reasons:

A. The polyA estimation might run before adapter trimming, so that some A bases are trimmed off afterwards, leaving shorter reads.

B. The number of basecalled bases might not correspond directly to the estimated polyA length.

Either way, including the region of the read corresponding to the polyA tail in the SAM file would be immensely helpful in resolving this issue.

Steps to reproduce the issue:

Basecall short direct RNA reads with polyA tail estimation enabled.
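To make the discrepancy easy to check, here is a minimal sketch that scans SAM text and reports reads whose polyA estimate exceeds the read length. It assumes the estimate is stored in the `pt:i` tag of each record; the function name `reads_shorter_than_tail` is hypothetical and only for illustration.

```python
def reads_shorter_than_tail(sam_lines):
    """Yield (read_id, seq_len, polya_estimate) for SAM records whose
    polyA estimate is larger than the number of basecalled bases.

    Assumption: the polyA estimate is stored in the `pt:i` tag.
    """
    for line in sam_lines:
        if line.startswith("@"):
            continue  # skip SAM header lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:
            continue  # not a complete SAM record
        read_id, seq = fields[0], fields[9]
        if seq == "*":
            continue  # sequence not stored in this record
        estimate = None
        for tag in fields[11:]:
            if tag.startswith("pt:i:"):
                estimate = int(tag[len("pt:i:"):])
                break
        if estimate is not None and estimate > len(seq):
            yield read_id, len(seq), estimate
```

Passing an open SAM file handle (or any iterable of SAM lines) yields one tuple per affected read, which can then be counted or written out for inspection.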

Run environment:

tijyojwad commented 4 months ago

Hi @ChristopherAdelmann

Yes, this discrepancy is currently expected, especially when the tail is long. This is because the basecaller currently has difficulty calling long homopolymers. The tail estimation, however, uses a different algorithm that is tuned specifically for determining polyA lengths.

A couple of other people have also asked for the estimated polyA sequence to be added to the read. We'll discuss this internally and look into adding it in a subsequent release.