tseemann / samclip

Filter SAM file for soft and hard clipped alignments
GNU General Public License v3.0

Underestimation of alignment end position for long-reads #5

Open Adamtaranto opened 6 years ago

Adamtaranto commented 6 years ago

Samclip calculates the alignment end position as the alignment start position plus the length of the read sequence, minus 1:

my $end = $start + length($sam[SAM_SEQ]) - 1;

This works fine for Illumina data, but often falls short of the true alignment end when dealing with long reads, which may contain many deletions relative to the reference. I expect this will cause samclip to falsely exclude some long-read alignments that are actually soft clipped at the 3' end of contigs.

I fixed this in teloclip by calculating the alignment length (on the reference) directly from the CIGAR string.
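
For illustration, the CIGAR-based calculation could look roughly like the Python sketch below. This is not teloclip's actual code: the names `reference_span` and `alignment_end` are hypothetical, and the set of reference-consuming operations (M, D, N, =, X) is taken from the SAM specification.

```python
import re

# CIGAR operations that advance the position on the reference (per the SAM spec).
# I, S, H and P do not consume reference bases, so they are ignored here.
REF_CONSUMING_OPS = set("MDN=X")
CIGAR_TOKEN = re.compile(r"(\d+)([MIDNSHP=X])")


def reference_span(cigar):
    """Return the number of reference bases covered by a CIGAR string."""
    if cigar == "*":
        # CIGAR unavailable: no reference span can be derived.
        return 0
    tokens = CIGAR_TOKEN.findall(cigar)
    # Reject strings that contain anything other than <length><op> pairs.
    if "".join(length + op for length, op in tokens) != cigar:
        raise ValueError("malformed CIGAR string: " + repr(cigar))
    return sum(int(length) for length, op in tokens if op in REF_CONSUMING_OPS)


def alignment_end(start, cigar):
    """Return the 1-based inclusive end position of an alignment on the reference."""
    return start + reference_span(cigar) - 1


# Hypothetical read: 5 soft-clipped bases, 40 matches, a 10 bp deletion, 50 matches.
cigar = "5S40M10D50M"
start = 100
seq_length = 5 + 40 + 50  # SEQ holds 95 bases (deleted bases are absent)

naive_end = start + seq_length - 1       # the SEQ-length approach
cigar_end = alignment_end(start, cigar)  # the CIGAR-based approach
print(naive_end, cigar_end)
```

In this made-up example the SEQ-length approach places the end at 194, while the CIGAR-based end is 199, so an alignment that really reaches the contig end would look as if it stopped short.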

tseemann commented 6 years ago

This tool was only designed for short reads really - and I hadn't considered your use case. But you are exactly right! I will have a look at teloclip - I assume you just correct for the I and D operators.

Adamtaranto commented 6 years ago

Yep, also potential splices and mismatches. See lenCIGAR function.

tseemann commented 6 years ago

Ah yes, the infamous X and = operators. I've never seen them used in practice. Do any of the nanopore tools use them? For short reads they would make the SAM files way too big.

Adamtaranto commented 6 years ago

I haven't seen them in any of my data but figured I should support them just in case.