how to detect matches containing many ambiguous symbols?

A user submitted this case on the vsearch forum:

Qry  1 + nnnnnnnnnnnnnnnnnnnnnGG 23
         +++++++++++++++++++++||
Tgt  1 + GGCATGAACGATACCGATTAAGG 23

23 cols, 23 ids (100.0%), 0 gaps (0.0%)

How can this kind of match be avoided or detected?

Masking has no effect, and neither does --minwordmatches (k-mer pre-filtering).

When aligning sequences, identical symbols will receive a positive match score (default +2). Aligning a pair of symbols where at least one of them is an ambiguous symbol (BDHKMNRSVWY) will always result in a score of zero.
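The scoring rule above can be illustrated with a small sketch. `score_pair` is a hypothetical helper, not part of vsearch, and it only handles gap-free alignments of equal length. The penalty of -4 for a mismatch between two unambiguous symbols is an assumption (believed to be the default value of --mismatch); it plays no role in the example from the forum, which contains no mismatch.

```shell
# Hypothetical helper (not a vsearch command): score two aligned,
# gap-free sequences of equal length.
#   identical unambiguous symbols         -> +2 (default match score)
#   any pair with an ambiguous symbol     ->  0
#   different unambiguous symbols         -> -4 (assumed default mismatch score)
score_pair() {
    # $1: aligned query, $2: aligned target
    awk -v q="$1" -v t="$2" 'BEGIN {
        q = toupper(q); t = toupper(t)
        score = 0
        for (i = 1; i <= length(q); i++) {
            a = substr(q, i, 1); b = substr(t, i, 1)
            # pairs involving an ambiguous symbol contribute nothing
            if (index("BDHKMNRSVWY", a) || index("BDHKMNRSVWY", b)) continue
            score += (a == b) ? 2 : -4
        }
        print score
    }'
}

score_pair nnnnnnnnnnnnnnnnnnnnnGG GGCATGAACGATACCGATTAAGG
```

Only the two final G/G pairs contribute, so the call prints 4, whereas 23 unambiguous matches would give 46.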

So for N-rich queries, the alignment score should be low relative to the alignment length. With the --userout output option, it is possible to access these alignment parameters:

vsearch \
    --usearch_global <(printf ">query1\nNNNNNNNNNNNNNNNNNNNNNGG\n") \
    --db <(printf ">target1\nGGCATGAACGATACCGATTAAGG\n") \
    --quiet \
    --minseqlength 23 \
    --id 1.0 \
    --userfields query+alnlen+ids+raw \
    --userout -

query1  23  23  4

Indeed, the alignment length is 23, the number of matches is 23, and yet the raw score is only 4 (two G/G matches at +2 each, out of a possible 46), indicating an alignment with 21 ambiguous symbols.
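One way to act on this is to compare the raw score with the maximum possible score (alignment length times match score) and flag hits with a low ratio. The sketch below assumes tab-separated input with the fields query, alnlen, ids and raw, in that order. `flag_ambiguous_hits` is a hypothetical helper, and the 0.5 threshold is an arbitrary choice to adapt to your data. Mismatches and gaps also lower the raw score, so the ratio is easiest to interpret for high-identity hits such as those obtained with --id 1.0.

```shell
# Hypothetical helper (not a vsearch command): flag hits whose raw score
# is small relative to the best possible score for that alignment length.
# Expects tab-separated lines: query, alnlen, ids, raw
flag_ambiguous_hits() {
    # $1: match score (default 2), $2: minimum accepted ratio (assumed 0.5)
    awk -F '\t' -v match_score="${1:-2}" -v min_ratio="${2:-0.5}" '
        $2 > 0 {
            ratio = $4 / ($2 * match_score)
            status = (ratio < min_ratio) ? "suspect" : "ok"
            printf "%s\t%.2f\t%s\n", $1, ratio, status
        }'
}

# the hit reported above: alnlen 23, ids 23, raw 4
printf "query1\t23\t23\t4\n" | flag_ambiguous_hits
```

For the hit above the ratio is 4 / 46, about 0.09, so it is reported as suspect. The same function can read the output of the vsearch command directly through a pipe.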

torognes / vsearch

how to detect matches containing many ambiguous symbols? #538