torognes / vsearch

Versatile open-source tool for microbiome analysis
Other
643 stars 123 forks source link

global alignment and semi-global alignment #482

Closed jianshu93 closed 2 years ago

jianshu93 commented 2 years ago

Hello vsearch tem,

In the second phase of searching in the vsearch paper, I quote "The alignment is performed using a multi- threaded and vectorised full dynamic programming algorithm", is this global alignment or semi-global alignment? Because for a majority cases, we tend to search short queries against longer database sequences. I think in those cases, semi-global alignment will be more useful (no gap penalty at beginning and ends of both sequences, for example, amplicon v3-v4 against full length 16S).

Thanks,

Jianshu

torognes commented 2 years ago

The alignment performed is a global alignment, not a semi-global alignment. However, the gap penalties are a bit special. The terminal gap penalties are by default very low. In vsearch, you can set individual penalties for the left, right and internal gaps, as well as for the query and target sequence. This makes it somewhat similar to semi-global alignment.

jianshu93 commented 2 years ago

Thanks!

It is clear to me. And it is similar to usearch global even though usearch does not use a full dynamic programming.

Thanks, Jianshu

jianshu93 commented 2 years ago

hello team:

vsearch --usearch_global T4AerOil.CoupledReads.fasta --db SILVA_138.1_SSURef_tax_silva_prok_nr.fasta --id 0.97 --blast6out vsearch_T4AerOil.blast6.txt --threads 24 --maxaccepts 1

usearch -usearch_global T4AerOil.CoupledReads.fasta -db SILVA_138.1_SSURef_tax_silva_prok_nr.fasta -id 0.97 -maxaccepts 1 -threads 24 -strand both -blast6out usearch.T4AerOil.blast6.new.txt

for the same input, I have 12 times different for output. Vsearch is 12 time more results found (usearch default maxaccept is 1)

Do I miss something?

Thanks,

Jianshu

frederic-mahe commented 2 years ago

@jianshu93 could you please provide a small fasta file so we can replicate your issue?

Here is a test for --maxaccepts:

vsearch \
    --usearch_global <(printf ">q\nAAAA\n") \
    --db <(printf ">s1\nAAAA\n>s2\nAAAAA\n") \
    --minseqlength 1 \
    --id 0.50 \
    --maxaccepts 1 \
    --quiet \
    --blast6out -

With --maxaccepts 1, only s1 is matched. With --maxaccepts 2, both s1 and s2 are matched.