nanoporetech / dorado

Oxford Nanopore's Basecaller
https://nanoporetech.com/

Fine tuning Poly A tail estimation in plasmids with interrupted polyA tail #903

Closed RunningMatcha closed 3 weeks ago

RunningMatcha commented 3 months ago

Hi,

I am using the polyA tail estimation on plasmids and I get shorter lengths than expected. I have a few questions about how to set up this feature. The plasmid I sequenced has a very long polyA tail interrupted in the middle by a short sequence; in this example I set tail_interrupt_length to 10.

I configured the TOML file according to your documentation and the following diagram:

    5' ---- ADAPTER ---- DNA ---- FRONT_FLANK ---- poly(A) ---- REAR_FLANK ---- DNA ---- 3'
    OR
    5' ---- ADAPTER ---- RC(DNA) ---- RC(REAR_FLANK) ---- poly(T) ---- RC(FRONT_FLANK) ---- RC(DNA) ---- 3'

TOML file:

    [anchors]
    front_primer = "NNNN"
    rear_primer = "NNNN"
    plasmid_front_flank = "CGATCG"
    plasmid_rear_flank = "TGACTGC"

    [threshold]
    flank_threshold = 0.6

    [tail]
    tail_interrupt_length = 10

Thank you and kind regards!

malton-ont commented 3 months ago

Hi @RunningMatcha,

  1. tail_interrupt_length sets the maximum interruption for which two adjacent regions will be considered a single polyA tail. If there is a larger interruption than this, dorado considers the two sections separately and reports only the "better" one. If the sections are combined, the gap length is included in the tail length estimation. I would typically make this slightly larger than the fixed sequence interruption you are expecting.
  2. Yes, the flanks should surround the entire polyA region. Longer flanks are more likely to be uniquely identified.
  3. flank_threshold is the minimum score required to positively identify a flank region. The score is calculated as 1 - (editDistance / flank_length), i.e. a score of 1 is a perfect match and 0 is a complete mismatch. If dorado cannot find either flank, it does not attempt to perform the polyA estimation on that read.
  4. The polyA estimation uses a separate algorithm from basecalling, but it partially depends on the called sequence (for flank matching and for catching trailing bases caught up in the flank identification). The quality of the called sequence will affect both of these, though I'm afraid I don't have numbers on by how much.
  5. This is under consideration for a future release.
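
To make point 3 concrete, here is a rough Python sketch of how a flank score could be computed. It reads the score as 1 - editDistance / flank_length, which is the form consistent with 1 being a perfect match and 0 a complete mismatch. The function names, the plain Levenshtein distance, and the `>=` comparison are assumptions for illustration only; this is not dorado's actual implementation, which locates the flank inside the read rather than comparing against a pre-cut window.

```python
def edit_distance(a: str, b: str) -> int:
    """Plain Levenshtein distance (illustrative, not dorado's aligner)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def flank_score(flank: str, observed: str) -> float:
    """Score in the sense of point 3: 1.0 is a perfect match."""
    return 1.0 - edit_distance(flank, observed) / len(flank)


def flank_found(flank: str, observed: str, flank_threshold: float = 0.6) -> bool:
    """Assumed rule: the flank counts as found when the score meets the threshold."""
    return flank_score(flank, observed) >= flank_threshold


# The front flank from the config above, against hypothetical observed windows.
print(flank_score("CGATCG", "CGATCG"))  # perfect match
print(flank_found("CGATCG", "CGTTCG"))  # one substitution
print(flank_found("CGATCG", "CGAAAA"))  # three substitutions
```

Under this reading, a 6-base flank with flank_threshold = 0.6 tolerates at most two edits (1 - 2/6 ≈ 0.67, while 1 - 3/6 = 0.5), which is one way to see why longer flanks give more room for basecalling errors while remaining uniquely identifiable.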

Note that the front/rear_primer values are not required for plasmid polyA estimation.
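
As a rough illustration of the tail_interrupt_length behaviour described in point 1, the sketch below merges candidate polyA regions whose gap is within the limit and then reports the length of a single region. Everything here is hypothetical: the (start, end) coordinates, the function names, and in particular the choice of "longest region" as the "better" one, which is an assumption and not necessarily how dorado ranks regions.

```python
def merge_tail_regions(regions, tail_interrupt_length):
    """Merge (start, end) half-open regions separated by a gap of at most
    tail_interrupt_length. The gap becomes part of the merged region."""
    merged = []
    for start, end in sorted(regions):
        if merged and start - merged[-1][1] <= tail_interrupt_length:
            # Close enough to the previous region: extend it across the gap.
            merged[-1] = (merged[-1][0], end)
        else:
            merged.append((start, end))
    return merged


def estimate_tail_length(regions, tail_interrupt_length):
    """Length of the 'better' region after merging.
    Assumption: 'better' is taken to mean 'longest' in this sketch."""
    merged = merge_tail_regions(regions, tail_interrupt_length)
    if not merged:
        return 0
    start, end = max(merged, key=lambda r: r[1] - r[0])
    return end - start


# Two hypothetical 60-base polyA runs separated by a 10-base interruption.
runs = [(100, 160), (170, 230)]
print(estimate_tail_length(runs, tail_interrupt_length=10))  # runs merged, gap included
print(estimate_tail_length(runs, tail_interrupt_length=5))   # runs kept separate
```

In this toy example the merged estimate is 130 (60 + 10 + 60), while a limit smaller than the interruption yields only 60, which matches the symptom of shorter-than-expected tails when the interruption exceeds tail_interrupt_length.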