torognes / vsearch

Versatile open-source tool for microbiome analysis
Other
643 stars 123 forks source link

how to detect matches containing many ambiguous symbols? #538

Closed frederic-mahe closed 8 months ago

frederic-mahe commented 8 months ago

A user submitted this case on the vsearch forum:

Qry  1 + nnnnnnnnnnnnnnnnnnnnnGG 23
         +++++++++++++++++++++||
Tgt  1 + GGCATGAACGATACCGATTAAGG 23

23 cols, 23 ids (100.0%), 0 gaps (0.0%)

How to avoid or detect this kind of matches?

When aligning sequences, identical symbols will receive a positive match score (default +2). Aligning a pair of symbols where at least one of them is an ambiguous symbol (BDHKMNRSVWY) will always result in a score of zero.

So the alignment score should be low when compared to the alignment length for N-rich queries. With the --userout output option, it is possible to access these alignment parameters:

vsearch \
    --usearch_global <(printf ">query1\nNNNNNNNNNNNNNNNNNNNNNGG\n") \
    --db <(printf ">target1\nGGCATGAACGATACCGATTAAGG\n") \
    --quiet \
    --minseqlength 23 \
    --id 1.0 \
    --userfields query+alnlen+ids+raw \
    --userout - 
query1  23  23  4

Indeed, the alignment length is 23, the number of matches is 23, and yet the raw score is only 2, indicating an alignment with 21 ambiguous symbols.

frederic-mahe commented 8 months ago

test added to our test suite https://github.com/frederic-mahe/vsearch-tests/commit/2ab142ba68971bdad9164bbb35307725b669f8fa