torognes / vsearch

Versatile open-source tool for microbiome analysis
Other
643 stars 123 forks source link

Issue related to usearch_global match #547

Closed NkaziN closed 7 months ago

NkaziN commented 7 months ago

I’ve recently noticed an odd behavior with how the program matches sequences for global alignments. I’m running the usearch_global command below and realized that the program preferentially selects the longer match unless a sequence length filter (in parenthesis) is implemented. This is unusual since the raw score, mismatches, and percent identity seem worse.

Below is a test case alongside the query/database/code used to generate it.

vsearch --usearch_global query_sequence.txt --db target_database.txt --userout sample.out --userfields query+target+id+alnlen+mism+gaps+qilo+qihi+tilo+tihi+evalue+bits+raw+pairs+qrow+trow --id 0.65 (--maxseqlength 2000)

Do you have any insight into why the program preferentially selects the sequence with a lower raw score?

Vsearch Matches.xlsx target_database.txt query_sequence.txt

NkaziN commented 7 months ago

A related observation is that, for the example above, the behavior disappears when the --maxaccepts option is above 1. The program correctly identifies both sequences as matches and sorts them by decreasing raw score.

vsearch --usearch_global query_sequence.txt --db target_database.txt --userout sample.out --userfields query+target+id+alnlen+mism+gaps+qilo+qihi+tilo+tihi+evalue+bits+raw+pairs+qrow+trow --id 0.65 --maxaccepts 2

torognes commented 7 months ago

Hi

The search algorithm is heuristic, and it is not guaranteed to find the best match. It just finds the first match that is good enough, given an identity threshold, i.e. 65% in your example, and other criteria specified. The heuristics are based on the number of shared kmers between the sequences, so usually the match is quite good, but occationally they are not that good. If you increase maxaccepts from 1, it will consider more matches, and choose the one with the highest identity. That's why the results are better for you in that case. If you set both maxaccepts and maxrejects to zero, it will scan the entire database for the best match, but it will take much more time.

frederic-mahe commented 7 months ago

Indeed, this is the way usearch behaves too: kmer profile filtering can sometime favor longer sequences.

As mentioned, increasing --maxaccepts solves the issue. I've created a toy-example where t2 has two mismatches with the query Q, and t1 only one, but t2 is longer and has more kmers in common with the query, so it is placed at the top of the list of possible matches. If maxaccepts is set to 1 (default value), then t2 is selected.

t1   AGATAGGGACGTGTACCAATCAGCGTTGTTCTGCCTCGTGAATCCGAACATAGGCACTTATTTCGAATCCAGGATAAGGCTAGATGCGCCCTGGGTCCCGGAGTA
     ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| |||||||||||||||||||||||||||||||||||||
Q    AGATAGGGACGTGTACCAATCAGCGTTGTTCTGCCTCGTGAATCCGAACATAGGCACTTATTTCGAAACCAGGATAAGGCTAGATGCGCCCTGGGTCCCGGAGTA
     ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| |||||||||||||| ||||||||||||||||||||||
t2 AAAGATAGGGACGTGTACCAATCAGCGTTGTTCTGCCTCGTGAATCCGAACATAGGCACTTATTTCGAATCCAGGATAAGGCTACATGCGCCCTGGGTCCCGGAGTAG
Q="AGATAGGGACGTGTACCAATCAGCGTTGTTCTGCCTCGTGAATCCGAACATAGGCACTTATTTCGAAACCAGGATAAGGCTAGATGCGCCCTGGGTCCCGGAGTA"
t1="AGATAGGGACGTGTACCAATCAGCGTTGTTCTGCCTCGTGAATCCGAACATAGGCACTTATTTCGAATCCAGGATAAGGCTAGATGCGCCCTGGGTCCCGGAGTA"
t2="AAAGATAGGGACGTGTACCAATCAGCGTTGTTCTGCCTCGTGAATCCGAACATAGGCACTTATTTCGAATCCAGGATAAGGCTACATGCGCCCTGGGTCCCGGAGTAG"

vsearch \
    --usearch_global <(printf ">q1\n%s\n" "${Q}") \
    --db <(printf ">t1\n%s\n>t2\n%s\n" "${t1}" "${t2}") \
    --wordlength 3 \
    --id 0.9 \
    --quiet \
    --userfields query+target+id \
    --userout -

(see https://github.com/frederic-mahe/vsearch-tests/commit/9a875b04660f3f0d6ac04afdfc51618a81d6fe76 for more details on the selection of Q and more tests)

NkaziN commented 7 months ago

This makes sense, thanks for the responsiveness!