Closed RunningMatcha closed 3 weeks ago
Hi @RunningMatcha,
tail_interrupt_length sets the maximum interruption for which two adjacent regions will be considered a single polyA tail. If the interruption is larger than this, dorado considers the two sections separately and reports only the "better" one. If the sections are combined, the gap length is included in the tail length estimate. I would typically set this slightly larger than the fixed-sequence interruption you are expecting.
flank_threshold is the minimum score required to positively identify a flank region. The score is calculated as 1 - editDistance/flank_length, i.e. a score of 1 is a perfect match and 0 is a complete mismatch. If dorado cannot find either flank, it does not attempt polyA estimation on that read.
Note that the front_primer and rear_primer values are not required for plasmid polyA estimation.
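As a rough illustration of the two parameters described above, here is a minimal Python sketch. It is not dorado's implementation. It reads the flank score as 1 - editDistance/flank_length, which is the form that gives 1 for a perfect match and 0 for a complete mismatch. The function names (`edit_distance`, `flank_score`, `merge_tail_regions`), the use of plain Levenshtein distance against a same-length window, and the (start, end) region representation are all assumptions made for the example.

```python
def edit_distance(a: str, b: str) -> int:
    """Plain Levenshtein distance (assumed metric for this sketch)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match / substitution
        prev = curr
    return prev[-1]


def flank_score(flank: str, candidate: str) -> float:
    """1.0 is a perfect match; 0.0 is a complete mismatch."""
    return 1.0 - edit_distance(flank, candidate) / len(flank)


def merge_tail_regions(regions, tail_interrupt_length):
    """Merge adjacent (start, end) regions, end exclusive, when the gap
    between them is at most tail_interrupt_length. A merged region spans
    the gap, so the gap counts towards the tail length. Picking the
    "better" region when they stay separate is not modelled here.
    """
    merged = []
    for start, end in sorted(regions):
        if merged and start - merged[-1][1] <= tail_interrupt_length:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


# One substitution in a 6 nt flank scores 1 - 1/6, above a 0.6 threshold.
score = flank_score("CGATCG", "CGTTCG")

# An 8 nt gap is merged when tail_interrupt_length = 10; a 15 nt gap is not.
joined = merge_tail_regions([(0, 60), (68, 120)], 10)
split = merge_tail_regions([(0, 60), (75, 120)], 10)
```

With these made-up coordinates, `joined` is one region of length 120, which includes the 8 nt interruption, while `split` keeps the two regions separate. With a 6 nt flank, each mismatch costs about 0.17 of the score, so a longer flank gives the threshold finer resolution.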
Hi,
I am using the polyA tail estimation on plasmids and I get shorter lengths than expected. I have a few questions about how to set up this feature. The plasmid I sequenced has a very long polyA tail interrupted in the middle by a short sequence; in this example I set tail_interrupt_length to 10.
I wanted to confirm: by setting tail_interrupt_length to 10, does dorado expect 10 nt (non-A) in the middle of the polyA tail and exclude this region from the total length?
For interrupted polyA tails with a fixed sequence in the middle, does "plasmid_front_flank" apply only to the first half and "plasmid_rear_flank" to the second half? Would it help if I input a longer sequence (in the examples provided there are 6 nt)?
What does "flank_threshold" mean? I do not understand this from the documentation.
Does the mode used for dorado basecaller (fast/hac/sup) have an impact on the accuracy of the polyA tail estimation, or does this feature use a different algorithm?
By the way, I read in a previous thread that the polyA tails estimated by this method are not present in the ubam file. I would greatly appreciate it if the estimated tails were represented in the sequences. Do you plan to include this in the future?
I configured the TOML file according to your documentation and the following diagram:
5' ---- ADAPTER ---- DNA ---- FRONT_FLANK ---- poly(A) ---- REAR_FLANK ---- DNA ---- 3'
OR
5' ---- ADAPTER ---- RC(DNA) ---- RC(REAR_FLANK) ---- poly(T) ---- RC(FRONT_FLANK) ---- RC(DNA) ---- 3'
TOML file:
[anchors]
front_primer = "NNNN"
rear_primer = "NNNN"
plasmid_front_flank = "CGATCG"
plasmid_rear_flank = "TGACTGC"
[threshold]
flank_threshold = 0.6
[tail]
tail_interrupt_length = 10
Thank you and kind regards!