mbreese / swalign

Smith-Waterman local aligner
Other
67 stars 22 forks source link

cigar #4

Closed JasonR055 closed 4 years ago

JasonR055 commented 10 years ago

The cigar function needs to be updated for softclips or hardclips.

In the below two instances, the cigar should be 5S9M or 9M5S, respectively (or 5H9M or 9M5H if hardclipping).

la = sw.LocalAlignment(scoringmatrix, -16, -4)

a = la.align(ref='TTTTAGYTT', query='AAAAATTTTAGCTT')

a.dump()
Query:  6 TTTTAGCTT 14
          ||||||.||
Ref  :  1 TTTTAGYTT 9

Score: 18
Matches: 8 (88.9%)
Mismatches: 1
CIGAR: 9M

a2 = la.align(ref='TTTTAGYTT', query='TTTTAGCTTAAAAA')

a2.dump()
Query:  1 TTTTAGCTT 9
          ||||||.||
Ref  :  1 TTTTAGYTT 9

Score: 18
Matches: 8 (88.9%)
Mismatches: 1
CIGAR: 9M
mbreese commented 10 years ago

I'm not sure that I need to add support for clipping, since that isn't used much outside of high throughput sequencing. The traditional S/W alignment algorithm doesn't use them and the query sequence should just be silently clipped. The hard clipping can be determined from the q_pos and q_end values. And the hard clips would be the only applicable type in this case.

Thanks for the reports though! I'll make my way through them soon.

Marcus

On Nov 21, 2013, at 7:39 PM, Jason Ross notifications@github.com wrote:

The cigar function needs to be updated for softclips or hardclips.

In the below two instances, the cigar should be 5S9M or 9M5S, respectively (or 5H9M or 9M5H if hardclipping).

la = sw.LocalAlignment(scoringmatrix, -16, -4) a = la.align(ref='TTTTAGYTT', query='AAAAATTTTAGCTT') a.dump()Query: 6 TTTTAGCTT 14 ||||||.||Ref : 1 TTTTAGYTT 9 Score: 18Matches: 8 (88.9%)Mismatches: 1CIGAR: 9M a2 = la.align(ref='TTTTAGYTT', query='TTTTAGCTTAAAAA') a2.dump()Query: 1 TTTTAGCTT 9 ||||||.||Ref : 1 TTTTAGYTT 9 Score: 18Matches: 8 (88.9%)Mismatches: 1CIGAR: 9M

— Reply to this email directly or view it on GitHubhttps://github.com/mbreese/swalign/issues/4 .

JasonR055 commented 10 years ago

Is a wrapper function in the Alignment class for softclipped cigars useful? I will write a wrapper for the swalign.Alignment class as my use case is the realignment of selected reads in a bam file. I also need to modify the cigars for use with pysam. Pysam gives the op codes instead of M, I, D, etc. So a pysam cigar is a list of tuples, such that the alignment of ('ACGTACGT', '-CGTACG-') has a cigar:

[(4, 1), (0, 6), (4, 1)]
JasonR055 commented 10 years ago

After giving this some further thought, there are grounds for implementing softclips in the cigar string. The SAM format specification document states that "sum of lengths of the M/I/S/=/X operations shall equal the length of SEQ"

mbreese commented 10 years ago

The SAM specification changes, and my goal isn’t necessarily to keep in sync with that. I’m not trying to produce SAM files with this alignment package - it would be overkill for this. Hard clipping is acceptable in a SAM file as well, where you remove the clipped sequences in the query (read) sequence that is reported in the file. The sum of M/I/S/=/X should equal the length of the reported query sequence in the SAM file, which isn’t necessarily the same thing as the raw read sequence.

Again, you could make an argument for hard clipping, but that isn’t traditionally done with S/W aligners… it is part of the algorithm that they will be clipped. Reporting the clipping is really only done in SAM files. Look at the results of a BLAST report - they show you the query and reference positions, and they don’t return a CIGAR string.

Thanks!

Marcus

On Nov 21, 2013, at 11:56 PM, Jason Ross notifications@github.com wrote:

After giving this some further thought, there are grounds for implementing softclips in the cigar string. The SAM format specification document states that "sum of lengths of the M/I/S/=/X operations shall equal the length of SEQ"

— Reply to this email directly or view it on GitHub.