scikit-bio / scikit-bio

scikit-bio: a community-driven Python library for bioinformatics, providing versatile data structures, algorithms and educational resources.
https://scikit.bio
BSD 3-Clause "New" or "Revised" License
884 stars 268 forks source link

Different lengths for aligned query/target sequence (StripedSmithWaterman) #1654

Open diegozea opened 5 years ago

diegozea commented 5 years ago

I've found the following problem/error in scikit-bio 0.5.5 (Python 3.6.7): aligned_query_sequence and aligned_target_sequence has different lengths for some alignments (using StripedSmithWaterman). For example:

code

from skbio.alignment import StripedSmithWaterman                                                                                                                                                                                      
from skbio.alignment._pairwise import blosum50                                                                                                                                                                                        
query = StripedSmithWaterman(
'SFLSFCCVMANIDLSSIFLSCLILDIDQWNKVIEQLGTPCPEFMKKLQPTVRTYVENRPKYAGYSFEKLFPDVLFPADSEHNKLKA',
gap_open_penalty=10,
gap_extend_penalty=1,
substitution_matrix=blosum50)
aln = query('CICFLAASALHYDFSSFYFGLIKTSLQNVIHRIQGSLSHLS')
assert len(aln.aligned_query_sequence) == len(aln.aligned_target_sequence)

code + output

In [1]: from skbio.alignment import StripedSmithWaterman                                                                                                                                                                                      

In [2]: from skbio.alignment._pairwise import blosum50                                                                                                                                                                                        

In [3]: query = StripedSmithWaterman( 
   ...: 'SFLSFCCVMANIDLSSIFLSCLILDIDQWNKVIEQLGTPCPEFMKKLQPTVRTYVENRPKYAGYSFEKLFPDVLFPADSEHNKLKA', 
   ...: gap_open_penalty=10, 
   ...: gap_extend_penalty=1, 
   ...: substitution_matrix=blosum50)                                                                                                                                                                                                         

In [4]: aln = query('CICFLAASALHYDFSSFYFGLIKTSLQNVIHRIQGSLSHLS')                                                                                                                                                                              

In [5]: aln                                                                                                                                                                                                                                   
Out[5]: 
{
    'optimal_alignment_score': 221,
    'suboptimal_alignment_score': 0,
    'query_begin': 5,
    'query_end': 48,
    'target_begin': 0,
    'target_end_optimal': 40,
    'target_end_suboptimal': 0,
    'cigar': '31M4I9M',
    'query_sequence': 'SFLSFCCVMANIDLSSIFLSCLILDIDQWNKVIEQLGTPCPEFMKKLQPTVRTYVENRPKYAGYSFEKLFPDVLFPADSEHNKLKA',
    'target_sequence': 'CICFLAASALHYDFSSFYFGLIKTSLQNVIHRIQGSLSHLS'
}

In [6]: aln.aligned_query_sequence                                                                                                                                                                                                            
Out[6]: 'CCVMANIDLSSIFLSCLILDIDQWNKVIEQLGTPCPEFMKKLQP'

In [7]: aln.aligned_target_sequence                                                                                                                                                                                                           
Out[7]: 'CICFLAASALHYDFSSFYFGLIKTSLQNVIH----RIQGSLSHLS'

In [8]: len(aln.aligned_query_sequence) == len(aln.aligned_target_sequence)                                                                                                                                                                   
Out[8]: False
ebolyen commented 4 years ago

Just confirmed what I learned in #1682 also applies here:

The query must always be shorter than the reference, otherwise the cigar returned is incorrect (resulting in missing indels which cause the begin/ends to result in a diff length). At the moment I believe this is a limitation of SSW itself.

diegozea commented 4 years ago

Hi! I saw that ensuring that the query is shorter than the reference is not always working. Example:

from skbio.alignment import StripedSmithWaterman
from skbio.alignment._pairwise import blosum50

query = StripedSmithWaterman('CLRLLNHTFNRDYSHVCVSASESK',
                             gap_open_penalty=10,
                             gap_extend_penalty=1,
                             substitution_matrix=blosum50)

aln = query('TPYTFAVCTEHRGILLQASNDKEMHDWLYAFNPLLAGSI')

assert len(aln.aligned_query_sequence) == len(aln.aligned_target_sequence)
{
    'optimal_alignment_score': 137,
    'suboptimal_alignment_score': 72,
    'query_begin': 0,
    'query_end': 23,
    'target_begin': 7,
    'target_end_optimal': 33,
    'target_end_suboptimal': 16,
    'cigar': '24M',
    'query_sequence': 'CLRLLNHTFNRDYSHVCVSASESK',
    'target_sequence': 'TPYTFAVCTEHRGILLQASNDKEMHDWLYAFNPLLAGSI'
}

In this example, I get the correct result using the longest sequence as the query.