Closed NkaziN closed 7 months ago
A related observation is that, for the example above, the behavior disappears when the --maxaccepts option is above 1. The program correctly identifies both sequences as matches and sorts them by decreasing raw score.
vsearch --usearch_global query_sequence.txt --db target_database.txt --userout sample.out --userfields query+target+id+alnlen+mism+gaps+qilo+qihi+tilo+tihi+evalue+bits+raw+pairs+qrow+trow --id 0.65 --maxaccepts 2
Hi
The search algorithm is heuristic, and it is not guaranteed to find the best match. It just finds the first match that is good enough, given an identity threshold, i.e. 65% in your example, and other criteria specified. The heuristics are based on the number of shared kmers between the sequences, so usually the match is quite good, but occationally they are not that good. If you increase maxaccepts
from 1, it will consider more matches, and choose the one with the highest identity. That's why the results are better for you in that case. If you set both maxaccepts
and maxrejects
to zero, it will scan the entire database for the best match, but it will take much more time.
Indeed, this is the way usearch behaves too: kmer profile filtering can sometime favor longer sequences.
As mentioned, increasing --maxaccepts
solves the issue. I've created a toy-example where t2
has two mismatches with the query Q
, and t1
only one, but t2
is longer and has more kmers in common with the query, so it is placed at the top of the list of possible matches. If maxaccepts
is set to 1 (default value), then t2
is selected.
t1 AGATAGGGACGTGTACCAATCAGCGTTGTTCTGCCTCGTGAATCCGAACATAGGCACTTATTTCGAATCCAGGATAAGGCTAGATGCGCCCTGGGTCCCGGAGTA
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| |||||||||||||||||||||||||||||||||||||
Q AGATAGGGACGTGTACCAATCAGCGTTGTTCTGCCTCGTGAATCCGAACATAGGCACTTATTTCGAAACCAGGATAAGGCTAGATGCGCCCTGGGTCCCGGAGTA
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| |||||||||||||| ||||||||||||||||||||||
t2 AAAGATAGGGACGTGTACCAATCAGCGTTGTTCTGCCTCGTGAATCCGAACATAGGCACTTATTTCGAATCCAGGATAAGGCTACATGCGCCCTGGGTCCCGGAGTAG
Q="AGATAGGGACGTGTACCAATCAGCGTTGTTCTGCCTCGTGAATCCGAACATAGGCACTTATTTCGAAACCAGGATAAGGCTAGATGCGCCCTGGGTCCCGGAGTA"
t1="AGATAGGGACGTGTACCAATCAGCGTTGTTCTGCCTCGTGAATCCGAACATAGGCACTTATTTCGAATCCAGGATAAGGCTAGATGCGCCCTGGGTCCCGGAGTA"
t2="AAAGATAGGGACGTGTACCAATCAGCGTTGTTCTGCCTCGTGAATCCGAACATAGGCACTTATTTCGAATCCAGGATAAGGCTACATGCGCCCTGGGTCCCGGAGTAG"
vsearch \
--usearch_global <(printf ">q1\n%s\n" "${Q}") \
--db <(printf ">t1\n%s\n>t2\n%s\n" "${t1}" "${t2}") \
--wordlength 3 \
--id 0.9 \
--quiet \
--userfields query+target+id \
--userout -
(see https://github.com/frederic-mahe/vsearch-tests/commit/9a875b04660f3f0d6ac04afdfc51618a81d6fe76 for more details on the selection of Q
and more tests)
This makes sense, thanks for the responsiveness!
I’ve recently noticed an odd behavior with how the program matches sequences for global alignments. I’m running the usearch_global command below and realized that the program preferentially selects the longer match unless a sequence length filter (in parenthesis) is implemented. This is unusual since the raw score, mismatches, and percent identity seem worse.
Below is a test case alongside the query/database/code used to generate it.
vsearch --usearch_global query_sequence.txt --db target_database.txt --userout sample.out --userfields query+target+id+alnlen+mism+gaps+qilo+qihi+tilo+tihi+evalue+bits+raw+pairs+qrow+trow --id 0.65 (--maxseqlength 2000)
Do you have any insight into why the program preferentially selects the sequence with a lower raw score?
Vsearch Matches.xlsx target_database.txt query_sequence.txt