nanoporetech / medaka

Sequence correction provided by ONT Research
https://nanoporetech.com

Using cDNA reads to polish exons of a genome #421

Closed SalvadorGJ closed 1 year ago

SalvadorGJ commented 1 year ago

Hi! I have accurate long cDNA reads which I want to use as input for genome polishing, of course expecting that they will only fix exons across the draft reference. I was wondering if medaka is an appropriate tool to perform this task.

After reading the minimap2 documentation and looking at its output, I found that in spliced cDNA-to-reference alignments, regions skipped on the reference are marked with the "N" operation in the CIGAR string. For mRNA-to-genome alignments, an "N" operation is taken to represent an intron.
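To make the role of "N" concrete, here is a minimal sketch (not medaka or minimap2 code) that walks a CIGAR string and separates the aligned reference blocks from the N-skipped spans. The function name `split_exons` and the example CIGAR are made up for illustration; coordinates are assumed to be 0-based and half-open.

```python
import re

# One CIGAR operation: a length followed by an op code from the SAM spec.
CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")


def split_exons(pos, cigar):
    """Return (exons, introns) as lists of 0-based, half-open reference intervals.

    `pos` is the 0-based reference start of the alignment (hypothetical helper,
    for illustration only).
    """
    exons, introns = [], []
    ref = pos            # current position on the reference
    block_start = pos    # start of the current aligned block
    for length, op in CIGAR_RE.findall(cigar):
        length = int(length)
        if op in "MD=X":
            # These operations consume reference bases inside an aligned block.
            ref += length
        elif op == "N":
            # Reference skip: close the current block and record the skipped span.
            if ref > block_start:
                exons.append((block_start, ref))
            introns.append((ref, ref + length))
            ref += length
            block_start = ref
        # I, S, H and P do not consume reference bases.
    if ref > block_start:
        exons.append((block_start, ref))
    return exons, introns


exons, introns = split_exons(100, "50M1000N30M2I20M")
print(exons)    # [(100, 150), (1150, 1200)]
print(introns)  # [(150, 1150)]
```

Note that the 2-base insertion inside the second block stays part of that block, while the 1000-base "N" span is reported separately as a skipped region.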

So I want to ask whether medaka takes this kind of alignment into account to avoid merging contiguous exons; in other words, whether it differentiates between real introns, which should not be removed, and insertions in the draft's exons, which should be polished.

cjw85 commented 1 year ago

Medaka uses htslib to parse BAM data. It has no handling of the biological context or meaning of such alignments; it processes them at face value, according to the bases which are indicated as aligned to the reference. The "N" CIGAR operation is handled by this line in the code.

The code will likely do what you expect, though we've never explicitly tested it with cDNA alignments. Be aware also that medaka only uses primary alignments: if your cDNA reads have been split into multiple alignments by minimap2, the polishing may not be as effective as it could be if all alignments were used.
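One way to gauge how much this matters for a given dataset is to count how many alignment records are not primary. The sketch below is an assumption-laden illustration, not part of medaka: it reads SAM-formatted text lines (for example from `samtools view -h`), uses the standard SAM flag bits 0x4 (unmapped), 0x100 (secondary) and 0x800 (supplementary), and assumes that "primary" means neither of the last two bits is set. The names `classify` and `summarize` are hypothetical.

```python
from collections import Counter

# Flag bits defined by the SAM specification.
UNMAPPED = 0x4
SECONDARY = 0x100
SUPPLEMENTARY = 0x800


def classify(flag):
    """Label one alignment record from its SAM flag."""
    if flag & UNMAPPED:
        return "unmapped"
    if flag & SUPPLEMENTARY:
        return "supplementary"
    if flag & SECONDARY:
        return "secondary"
    return "primary"


def summarize(sam_lines):
    """Count records per class, skipping SAM header lines (starting with '@')."""
    counts = Counter()
    for line in sam_lines:
        if line.startswith("@"):
            continue
        fields = line.rstrip("\n").split("\t")
        counts[classify(int(fields[1]))] += 1
    return counts


# Made-up records: only the read name and flag columns matter here.
example = [
    "@HD\tVN:1.6",
    "read1\t0\tchr1\t101\t60\t50M1000N50M\t*\t0\t0\t*\t*",
    "read1\t2048\tchr1\t5001\t60\t60H40M\t*\t0\t0\t*\t*",
    "read2\t16\tchr1\t201\t60\t100M\t*\t0\t0\t*\t*",
    "read3\t256\tchr2\t301\t0\t100M\t*\t0\t0\t*\t*",
    "read4\t4\t*\t0\t0\t*\t*\t0\t0\t*\t*",
]
print(summarize(example))
```

A large share of supplementary records would suggest that many cDNA reads were split, in which case a polisher that only uses primary alignments sees only part of each read.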