torognes / vsearch

Versatile open-source tool for microbiome analysis
Other
674 stars 125 forks source link

vsearch --usearch_global not showing "full alignment" instead only the segment pair #545

Closed gbbio closed 11 months ago

gbbio commented 1 year ago

Hello,

If I use usearch_global to search a short primer sequence against a reference (or preferably search the reference against the primer) I don't get the "full alignment" back. For example:

primer query:   ACAGTGACATGGGGACGTAT
reference:       CAGTGACATGGGGACGTAT...

Qry    2 + CAGTGACATGGGGACGTAT 20
           |||||||||||||||||||
Tgt    1 + CAGTGACATGGGGACGTAT 19

19 cols, 19 ids (100.0%), 0 gaps (0.0%)

The first base of the primer is not there and the identity is 100%. With a global alignment I was expecting something like:

Qry    1 + ACAGTGACATGGGGACGTAT 20
            |||||||||||||||||||
Tgt    1 + -CAGTGACATGGGGACGTAT 19

20 cols, 19 ids (95.0%), 0 gaps (0.0%)

Is this intended?

How can I achieve the above result like the second example? I want to align 1 primer sequence against many references and also detect multiple matches per reference. So preferably using --usearch_global where my file with references is the query. And get the alignment as qrow and trow columns.

Many thanks in advance.

torognes commented 1 year ago

Hi!

Yes, this is intended. VSEARCH uses global alignments, but the terminal gap penalties are very low by default, and terminal gaps are not shown in the alignments. It may therefore look like local alignments. This is similar to USEARCH.

By default, the identity between the query and target is calculated as the percentage of matching nucleotides in the aligned region, excluding gaps at the ends (in the terminal regions). If you want to, you can choose to use a different definition of identity, for instance the CD-HIT definition, which is the number of matching columns divided by the shortest sequence length. In your case, when aligning primers, the primer length will almost always be the shortest sequence. I think that will give you the result you want. Use the option --iddef 0 to use this identity definition. In the example above, the result will be 95%.

You can use the userout and userfields options to get a tab-separated text file with the information you may need. For instance, using --userfields query+target+id0+alnlen+qilo+qihi+tilo+tihi+qrow+trow could be useful in this case. Like this:

vsearch --usearch_global primer.fasta --db ref.fasta --alnout alnout.txt --id 0.8 --iddef 0 --userfields query+target+id0+alnlen+qilo+qihi+tilo+tihi+qrow+trow --userout userout.tsv

Please see the manual for more info about these options.

Some remarks: You cannot get multiple matches within the same database sequence for one query, so you may have to reverse the search by having the primers in the database (as you indicate). If you are using short sequences (less than 32 nucleotides) in the database you need to use the --minseqlength 1 option. You will also need to use the --maxaccepts 0 option to find all matches to each query, not just the best match.

I hope this works for you.

gbbio commented 1 year ago

Thanks for the reply.

The command I am using now is:

vsearch --usearch_global genomes.fa --db primer --id 0.5 --qmask none --strand both --minwordmatches 0 --wordlength 3 --maxaccepts 0 --maxrejects 0 -maxgaps 15 -mincols 5 -userout {output.userout} -qmask none -userfields query+target+id+alnlen+mism+gaps+qilo+qihi+tilo+tihi+qstrand+qrow+trow+ql+tl

I will check the CD-HIT definition and maybe I can do some parsing myself with the help of the aln or caln column. A tseq or qseq column would maybe also give some more options but I see they are not available.

Thanks again for the reply and thanks for this tool.

frederic-mahe commented 11 months ago

Thanks @gbbio I've created a new issue regarding the missing --userfields options (see issue #548).

Also, regression tests added to our test-suite (see https://github.com/frederic-mahe/vsearch-tests/commit/bdb1a505593e19240acdc86a569402b698aaa42a)

Please close the issue if you consider it solved.