Open diegozea opened 5 years ago
Just confirmed what I learned in #1682 also applies here:
The query must always be shorter than the reference, otherwise the cigar returned is incorrect (resulting in missing indels which cause the begin/ends to result in a diff length). At the moment I believe this is a limitation of SSW itself.
Hi! I found that ensuring the query is shorter than the reference does not always work. Example:
from skbio.alignment import StripedSmithWaterman
from skbio.alignment._pairwise import blosum50
query = StripedSmithWaterman('CLRLLNHTFNRDYSHVCVSASESK',
                             gap_open_penalty=10,
                             gap_extend_penalty=1,
                             substitution_matrix=blosum50)
aln = query('TPYTFAVCTEHRGILLQASNDKEMHDWLYAFNPLLAGSI')
assert len(aln.aligned_query_sequence) == len(aln.aligned_target_sequence)
{
    'optimal_alignment_score': 137,
    'suboptimal_alignment_score': 72,
    'query_begin': 0,
    'query_end': 23,
    'target_begin': 7,
    'target_end_optimal': 33,
    'target_end_suboptimal': 16,
    'cigar': '24M',
    'query_sequence': 'CLRLLNHTFNRDYSHVCVSASESK',
    'target_sequence': 'TPYTFAVCTEHRGILLQASNDKEMHDWLYAFNPLLAGSI'
}
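To make the mismatch in the output above explicit, here is a minimal sketch that compares the spans implied by the begin/end coordinates with the lengths implied by the CIGAR string. It assumes standard CIGAR semantics (`M` consumes both sequences, `I` only the query, `D` only the target) and that the end coordinates are inclusive, which the query values here (0 to 23 for a 24-residue query) suggest. `cigar_lengths` is a hypothetical helper written for this illustration, not part of scikit-bio.

```python
import re


def cigar_lengths(cigar):
    """Return (query_len, target_len) consumed by a CIGAR string.

    Hypothetical helper; only M/=/X, I and D operations are handled.
    """
    query_len = target_len = 0
    for count, op in re.findall(r"(\d+)([MIDX=])", cigar):
        n = int(count)
        if op in "M=X":      # match/mismatch: consumes query and target
            query_len += n
            target_len += n
        elif op == "I":      # insertion: consumes query only
            query_len += n
        elif op == "D":      # deletion: consumes target only
            target_len += n
    return query_len, target_len


# Values copied from the alignment output above.
result = {
    "query_begin": 0,
    "query_end": 23,
    "target_begin": 7,
    "target_end_optimal": 33,
    "cigar": "24M",
}

# Spans implied by the coordinates (assuming inclusive ends).
query_span = result["query_end"] - result["query_begin"] + 1
target_span = result["target_end_optimal"] - result["target_begin"] + 1

cigar_query, cigar_target = cigar_lengths(result["cigar"])

print(query_span, cigar_query)    # 24 24
print(target_span, cigar_target)  # 27 24
```

Under these assumptions the coordinates cover 27 target positions while `24M` accounts for only 24, so three target positions have no corresponding operation in the CIGAR. That is consistent with the missing indels described in the quoted comment.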
In this example, I get the correct result when I use the longer sequence as the query.
I've found the following problem/error in scikit-bio 0.5.5 (Python 3.6.7): aligned_query_sequence and aligned_target_sequence have different lengths for some alignments (using StripedSmithWaterman). For example:
code
code + output