torognes / vsearch

Versatile open-source tool for microbiome analysis
Other
656 stars 122 forks source link

incorrect alignment output in --userfields #429

Closed dleopold closed 3 years ago

dleopold commented 3 years ago

The output reported for the aln and caln options for --usefields appears to misidentify terminal mismatches. For example, running --usearch_global on the following two sequences (which differ at the first and last 3 bases):

query GACTTAATTGGATTGAGCCTTGGTATGGAAACCTACTAAGTGGTAACTTTCAAATTCAGAGAAACCC target CTTTTAATTGGATTGAGCCTTGGTATGGAAACCTACTAAGTGGTAACTTTCAAATTCAGAGAAAGGG

correctly returns an id of 91%, but reports 67 matches (the length of the seqs) in the aln and caln fields. If the target sequence is longer that the query (or visa versa), the extra bases are correctly shown as insertions (or deletions), but terminal mismatches to the query are still improperly reported as matches. For example, adding 3 additional bases to the target sequence above will return:

caln = 3I67M

Unless I am misinterpreting things, I think this is a bug.

I noticed this because I was trying to calculate a "query-centric" % id that ignores terminal gaps in the target but not the query (ie, global to local). I was surprised to find that such a definition of % id is not a standard option. Perhaps there is a reason it is not implemented?

dleopold commented 3 years ago

I just checked the output --userfields aln+caln with USEARCH and it is the same as I described with VSEARCH. So, maybe I am misinterpreting these fields. But it is counterintuitive to me that the --alignout would return :

Qry  1 + GACTTAATTGGATTGAGCCTTGGTATGGAAACCTACTAAGTGGTAACTTTCAAATTCAGAGAAACCC 67
            |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||   
Tgt  1 + TTTTTAATTGGATTGAGCCTTGGTATGGAAACCTACTAAGTGGTAACTTTCAAATTCAGAGAAATTT 67

67 cols, 61 ids (91.0%), 0 gaps (0.0%)

but --userfields calnwould return 67M

torognes commented 3 years ago

Sometimes the M in these alignment strings (CIGARs) indicate either a Match or a Mismatch, as opposed to a D for Deletion or I for Insertion. I think that is the case here. Then you cannot use it to find the number of matches. It may seem unfortunate here, but vsearch is similar to usearch for compatibility reasons. Also, it is quite common to do it like this.

dleopold commented 3 years ago

Got it. Thanks for the quick reply. I can use --userfields ids to get the number of matching positions in the aligned bases.

frederic-mahe commented 3 years ago

vsearch's manual defines the M, D and I symbols in CIGAR strings as such: M (match/mismatch), D (deletion) and I (insertion). I am the one who wrote it, but maybe it could be reworded to make it clearer that M captures everything that is not a gap? Suggestions are welcomed.

dleopold commented 3 years ago

I was led astray by the definition of--userfield aln, which reads: Print a string of M (match), D (delete, i.e. a gap in the query) and I (insert, i.e. a gap in the target) representing the pairwise alignment. I think changing to ...M (match/mismatch)..., consistent with the definition of --userfield caln, would have prevented my confusion.

frederic-mahe commented 3 years ago

@dleopold thanks a lot for your feedback! (fix in commit 2d4bcf81dc271294b1d6a304fccc04eca97d075d)