Open macieksk opened 4 years ago
Hi,
The example code here uses an aligner with the following scoring scheme: Match:2, Mismatch:-1, all_gaps_penalties=1
import ssw_aligner query='CAGACAATCAGCATGTTTCCGGCAGCGCCGGTAG' target='TTCCACCATTTGTCCGGACCGGGC' def get_striped_ed_aligner(query_seq,match_score=2): return ssw_aligner.StripedSmithWaterman( query_seq, gap_open_penalty=1, gap_extend_penalty=1, match_score=match_score, mismatch_score=-1, suppress_sequences=True, score_only=False, score_size=2, ## score is < 255 this should be 0; >255 then 1; 2 don't know zero_index=True, mask_length=0, ## Turn off suboptimal mask_auto=False, ## Turn off suboptimal ) algn=get_striped_ed_aligner(query,match_score=2)(target) cigcnt=defaultdict(lambda:0) def add_tl(d,t,l): d[t]+=l; return [add_tl(cigcnt,t,l) for l, t in algn._tuples_from_cigar()] print(cigcnt) print(algn.__repr__())
This results with CIGAR counts and the alignment stats:
defaultdict(<function <lambda> at 0x7f35c0ec2048>, {'M': 19, 'I': 5, 'D': 1}) { 'optimal_alignment_score': 28, 'suboptimal_alignment_score': 0, 'query_begin': 7, 'query_end': 30, 'target_begin': 1, 'target_end_optimal': 21, 'target_end_suboptimal': -1, 'cigar': '7M1I2M1D5M1I1M3I4M', 'query_sequence': '', 'target_sequence': '' }
The issue here is that the CIGAR string returned in the alignment above is for a suboptimal alignment of score 26, see the computations below:
#CAGACAA |TCAGCATGTT TCCGG CAGCGCCGG| TAG # T |TCCACAT TTGTCCGG A CCGG| GC # 'cigar': '7M 1I2M1D 5M1I1M 3I 4M', # M=5+2+2+5+1+4=19 Mismatch=2 Match=17 I=5 D=1 # score=2*17-2-5-1=26 print(2*17-2-5-1) 26
The optimal alignment with the reported score has a different CIGAR string, see below.
#CAGACAA |TC AG CATGTT TCCGGCAGCGCCGG| TAG # T |TC CA CAT TTG TCCGG A CCGG| GC # 'cigar':'2M1D1M1I3M1I2M1D 5M1I1M3I 4M', # M=18 Mismatch=0 I=6 D=2 # score=2*18-6-2=28 print(2*18-6-2) 28
Is it possible to fix it such that the reported CIGAR string is always for an optimal alignment?
Possibly solved in https://github.com/biocore/scikit-bio/pull/1679 ... though seems not quite from tests.
Hi,
The example code here uses an aligner with the following scoring scheme: Match:2, Mismatch:-1, all_gaps_penalties=1
This results with CIGAR counts and the alignment stats:
The issue here is that the CIGAR string returned in the alignment above is for a suboptimal alignment of score 26, see the computations below:
The optimal alignment with the reported score has a different CIGAR string, see below.
Is it possible to fix it such that the reported CIGAR string is always for an optimal alignment?