Closed dleopold closed 3 years ago
I just checked the output --userfields aln+caln
with USEARCH and it is the same as I described with VSEARCH. So, maybe I am misinterpreting these fields. But it is counterintuitive to me that the --alignout would return :
Qry 1 + GACTTAATTGGATTGAGCCTTGGTATGGAAACCTACTAAGTGGTAACTTTCAAATTCAGAGAAACCC 67
|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Tgt 1 + TTTTTAATTGGATTGAGCCTTGGTATGGAAACCTACTAAGTGGTAACTTTCAAATTCAGAGAAATTT 67
67 cols, 61 ids (91.0%), 0 gaps (0.0%)
but --userfields caln
would return 67M
Sometimes the M in these alignment strings (CIGARs) indicate either a Match or a Mismatch, as opposed to a D for Deletion or I for Insertion. I think that is the case here. Then you cannot use it to find the number of matches. It may seem unfortunate here, but vsearch is similar to usearch for compatibility reasons. Also, it is quite common to do it like this.
Got it. Thanks for the quick reply. I can use --userfields ids
to get the number of matching positions in the aligned bases.
vsearch's manual defines the M
, D
and I
symbols in CIGAR strings as such: M (match/mismatch), D (deletion) and I (insertion)
. I am the one who wrote it, but maybe it could be reworded to make it clearer that M
captures everything that is not a gap? Suggestions are welcomed.
I was led astray by the definition of--userfield aln
, which reads: Print a string of M (match), D (delete, i.e. a gap in the query) and I (insert, i.e. a gap in the target) representing the pairwise alignment.
I think changing to ...M (match/mismatch)...
, consistent with the definition of --userfield caln
, would have prevented my confusion.
@dleopold thanks a lot for your feedback! (fix in commit 2d4bcf81dc271294b1d6a304fccc04eca97d075d)
The output reported for the aln and caln options for --usefields appears to misidentify terminal mismatches. For example, running --usearch_global on the following two sequences (which differ at the first and last 3 bases):
correctly returns an id of 91%, but reports 67 matches (the length of the seqs) in the aln and caln fields. If the target sequence is longer that the query (or visa versa), the extra bases are correctly shown as insertions (or deletions), but terminal mismatches to the query are still improperly reported as matches. For example, adding 3 additional bases to the target sequence above will return:
caln = 3I67M
Unless I am misinterpreting things, I think this is a bug.
I noticed this because I was trying to calculate a "query-centric" % id that ignores terminal gaps in the target but not the query (ie, global to local). I was surprised to find that such a definition of % id is not a standard option. Perhaps there is a reason it is not implemented?