torognes / vsearch

Versatile open-source tool for microbiome analysis
Other
643 stars 123 forks source link

vsearch - usearch_global - maxhits only returns one hit #513

Closed vincenthahaut closed 1 year ago

vincenthahaut commented 1 year ago

Hi!

I have been using vsearch --usearch_global to find a short sequence in a fasta file. This sequence can be repeated several times in the text with some variations. My goal was to retrieve all the positions in that text of this sequence (identity >0.7). Unfortunately vsearch --usearch_global keeps returning only the top sequence even though there are several of them with the same score.

For instance:

cat test_db.fa

>TEST
CATGAGGCTGGTGTAAAGCGG
# test_actb.fa contains 4 times the TEST sequence with one letter of difference.
cat test_actb.fa
>NG_007992.1:5001-8454
ACCGCCGAGACCGCGTCCGCCCCGCGAGCCCATGAGGTTGGTGTAAAGCGGACAGAGCCTCGCCTTTGCCGATCCGCCGCCCGTCCACACCCGCGCTCGGGAGGCGCGCTCCGGGGGTGCCCATGAGGGTGGTGTAAAGCGGCGCTCTCGGGGCGGGGGCAACCGGCGGGGTCTTTGTCTGAGCTTCCTTCCCAGGGCGTGATGGTGGGCATGCCATGAGGCTGGTGTACAGCGGGGTCAGAAGGATTCCTATGTGGGCGACGAGGCCCAGAGCAACTCGTGTGACAAGGCCATGAGACTGGTGTAAAGCGGCCTTGGAGTGTGTATTAAGTAGGTGCACAGTAGG

vsearch --usearch_global test_actb.fa --minseqlength 15 --maxaccepts 5 --maxhits 0 --strand plus --wordlength 3 --minwordmatches 10 --output_no_hits --mincols 9 --alnout test.txt --db test_db.fa --id 0.75

Only returns one hit:

vsearch v2.22.1_linux_x86_64, 204.4GB RAM, 48 cores

Query >NG_007992.1:5001-8454
 %Id   TLen  Target
 95%     21  TEST

Query 346nt >NG_007992.1:5001-8454
Target  21nt >TEST

Qry 292 + CATGAGACTGGTGTAAAGCGG 312
          |||||| ||||||||||||||
Tgt   1 + CATGAGGCTGGTGTAAAGCGG 21

21 cols, 20 ids (95.2%), 0 gaps (0.0%)

I tried to use different --maxhits (0, 5, 10) as well as to vary --maxaccepts / --maxrejets. I also used different output methods ( --userfields 'query+target+id+alnlen+mism+opens+qilo+qihi+qstrand+tilo+tihi+ql+tl+qrow+trow' --userout test.txt). Vsearch keeps only returning one hit.

My understanding was that maxhits determines the number of hits to return with an identity > --id. Am I wrong or is there a bug with --usearch_global.

Thank you in advance for your answer!

torognes commented 1 year ago

Hi

Unfortunately, VSEARCH will never return more than one hit per database sequence for each query. If the hits were in different database sequences, you could have obtained more hits.

It is --maxaccepts that really determines the number of hits considered; --maxhits can only be used to limit the number of hits shown.

If you are able to split the database sequences into parts that contain one copy of the pattern each, you may be able to use VSEARCH for this. Otherwise, I think you need to use a different tool.

vincenthahaut commented 1 year ago

Thanks a lot for your fast answer! :-)

That's a pity, it is a really great tool to use.

frederic-mahe commented 1 year ago

As the name suggest vsearch --usearch_global performs (semi-)global alignments. It tries to find the single best possible alignment of your query with the target sequence. Terminal gaps are free, allowing alignments of a short sequence against a long one, hence the term semi-global.

A local alignment tool (blastn?) might allow you to find multiple hits of your query in a single target sequence.