torognes / vsearch

Versatile open-source tool for microbiome analysis
Other
643 stars 123 forks source link

vsearch --top_hits_only --maxaccepts 1 returns sometimes 2 values #546

Closed greboul closed 6 months ago

greboul commented 7 months ago

Hi,

I am comparing long fragments to a database and I only want one hit, the "best/top" one. It works for most of my sequences but for a few of them I have 2 hits reported in the "blast like" output file. And yes I have seen that this is the case when the percentage of identity is the same but according to the manual: For a given query, if several top hits present exactly the same percentage of identity, the number of hits reported is controlled by the --maxaccepts value (1 by default).

Here is the command line I used (vsearch 2.23 used): vsearch --threads 30 --usearch_global in.fasta --top_hits_only --maxaccepts 1 --id 0.80 --userfields query+target+id2+ql+tl+alnlen+qcov+mism+opens --userout blast-like-output.tab --db db.fasta --sizeout --output_no_hits --strand both --log vsearch.log

Here is some of my results with two examples of duplicates highlighted in bold.

<html xmlns:v="urn:schemas-microsoft-com:vml" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:x="urn:schemas-microsoft-com:office:excel" xmlns="http://www.w3.org/TR/REC-html40">

query | target | id2 | ql | tl | alnlen | qcov | mism | opens -- | -- | -- | -- | -- | -- | -- | -- | -- b513c635-a578-4e9c-a675-82fa85e03376 | Cryptococcus_neoformans_taxid_235443_seq_261 | 96.4 | 2696 | 5644 | 2703 | 99.4 | 76 | 16 d720ae56-542d-402c-9bf2-1c819ff840a4 | Penicillium_chrysogenum_taxid_5076_seq_435 | 88.3 | 2864 | 5642 | 2890 | 96.7 | 217 | 26 2a462e70-3320-4327-bf1a-0cc3accffd18 | Fusarium_keratoplasticum_ATCC_36031_145013a6c8bb4f4b_18_[1:4050] | 94.9 | 5072 | 4049 | 4079 | 79.4 | 155 | 31 a287ead9-af31-4752-934a-659799f7b040 | Cryptococcus_neoformans_var_grubii_H99_ATCC_208821_feb5a1a086704361_8_[1873380:1879020] | 85.9 | 5103 | 5640 | 5160 | 98.7 | 605 | 54 b0f27bca-1386-47f9-9cce-23b57ba5b721 | Saccharomyces_cerevisiae_taxid_4932_seq_1053 | 96.4 | 4391 | 5935 | 4447 | 99.6 | 85 | 37 **335ab136-5469-4466-9653-c02580ef5885** | Cryptococcus_neoformans_taxid_235443_seq_261 | 89.7 | 5060 | 5644 | 5123 | 99.5 | 441 | 33 **335ab136-5469-4466-9653-c02580ef5885** | Cryptococcus_neoformans_var_grubii_H99_ATCC_208821_feb5a1a086704361_8_[1873380:1879020] | 89.7 | 5060 | 5640 | 5123 | 99.4 | 438 | 38 30ae1505-696b-404c-a499-b4f8f60329b1 | Penicillium_chrysogenum_taxid_5076_seq_435 | 96.8 | 5088 | 5642 | 5126 | 99.4 | 92 | 33 e167fb70-73fd-40f0-8a28-c080f2ce6dd5 | Cryptococcus_neoformans_taxid_235443_seq_261 | 91.8 | 5086 | 5644 | 5133 | 99.3 | 338 | 46 f0ff7852-8fb7-4233-b9b4-5533829da687 | Cryptococcus_neoformans_taxid_235443_seq_261 | 97.8 | 5096 | 5644 | 5123 | 99.5 | 61 | 30 894d6ac6-0dbf-43b5-9cbf-f401ae29fe29 | Saccharomyces_sp._taxid_559292_seq_246 | 98.4 | 5371 | 5936 | 5388 | 99.7 | 54 | 22 ba647e85-c41a-4c22-85a1-25815bec110c | Cryptococcus_neoformans_taxid_235443_seq_261 | 93.2 | 5072 | 5644 | 5115 | 99.6 | 287 | 38 817fadb6-5056-4f59-b080-f10b3405dd47 | Candida_albicans_ATCC_10231_aac24ef60e6c48b8_33_[401829:407410] | 98.6 | 5033 | 5581 | 5045 | 99.7 | 42 | 19 105e1c2b-567d-45c3-bab5-90ce8ab145f1 | Cryptococcus_neoformans_taxid_235443_seq_261 | 93.2 | 4660 | 5644 | 4695 | 99.5 | 264 | 32 8d7eff1b-646b-4f01-b3a1-0f329058ab16 | Candida_albicans_ATCC_10231_aac24ef60e6c48b8_33_[401829:407410] | 95.6 | 5034 | 5581 | 5067 | 99.3 | 155 | 30 22977b43-01e3-47b3-b0ce-a3bb76d8f85f | Cutaneotrichosporon_arboriformis_taxid_387937_seq_691 | 95.7 | 5043 | 5622 | 5084 | 99.8 | 168 | 32 4dd3c954-6bd9-4bd5-9f51-ce177dd28baf | Penicillium_chrysogenum_taxid_5076_seq_435 | 96.2 | 5054 | 5642 | 5118 | 99.5 | 108 | 49 **d79811c2-4d03-4bda-b52e-d66fa4118b9b** | Candida_glabrata_taxid_5478_seq_225 | 92.9 | 3816 | 5987 | 3854 | 99.6 | 218 | 28 **d79811c2-4d03-4bda-b52e-d66fa4118b9b** | Candida_glabrata_ATCC_2001_821f6f754dfb42b5_12_[34129:40109] | 92.9 | 3816 | 5980 | 3854 | 99.4 | 212 | 32 68154556-0f9e-4dd4-8dc8-fc73701684fb | Saccharomyces_sp._taxid_559292_seq_246 | 95.8 | 5386 | 5936 | 5397 | 99.6 | 192 | 23 9bae201f-e2ac-4f6c-881e-1e4784adf809 | Saccharomyces_sp._taxid_559292_seq_246 | 97.5 | 5379 | 5936 | 5408 | 99.3 | 71 | 28 c51a3b81-cefb-4cbe-b362-4be46c7a1a23 | Cutaneotrichosporon_arboriformis_taxid_387937_seq_691 | 92.5 | 5086 | 5622 | 5117 | 99.2 | 308 | 37 b325186a-67c6-4ada-9441-fb43eff8b957 | Penicillium_chrysogenum_taxid_5076_seq_435 | 93.8 | 4951 | 5642 | 5102 | 99.8 | 156 | 34

Any help and/or explanations is welcome!

Thanks!

torognes commented 7 months ago

Thanks for reporting this issue.

Have you tried the --maxhits 1 option?

Could it be that the two hits you sometimes find are on different strands?

frederic-mahe commented 7 months ago

Hi @greboul

in its current version, the manual is a bit misleading, sorry about that. I will update it to reflect the fact that for a given query, --maxaccepts represents the maximum number of matching target sequences to accept before stopping the search. It is set to 1 by default.

Independently, if both strands of a target match a query, then two hits to the same target are reported (one for each strand). The option that controls the overall number of hits reported per query is --maxhits.

Here is a toy-example,with a single query and a single target, but with a 100% match on both strands (--maxaccepts 1 is implicit):

vsearch \
    --usearch_global <(printf ">q1\nACGT\n") \
    --db <(printf ">t1\nACGT\n") \
    --minseqlength 1 \
    --id 1.00 \
    --quiet \
    --strand both \
    --top_hits_only \
    --output_no_hits \
    --userfields query+target+qstrand \
    --userout -

Two hits are reported:

q1  t1  +
q1  t1  -

With the option --maxhits 1:

vsearch \
    --usearch_global <(printf ">q1\nACGT\n") \
    --db <(printf ">t1\nACGT\n") \
    --minseqlength 1 \
    --id 1.00 \
    --quiet \
    --strand both \
    --top_hits_only \
    --output_no_hits \
    --maxhits 1 \
    --userfields query+target+qstrand \
    --userout -

Only one hit is reported:

q1  t1  +

I've also added regression tests to cover that particular set of options (see https://github.com/frederic-mahe/vsearch-tests/commit/d5f97e15a62f02031a971d39c18bfdb66cf1d4ce)

frederic-mahe commented 7 months ago

attempt to clarify the manpage in commit https://github.com/torognes/vsearch/commit/cec128b969d2fbf547c653bc47e63add462483d5

greboul commented 7 months ago

Thank you both for your replies and indeed --maxhits what I was looking for and not maxaccept.

The edit looks good, thanks. Though there is maybe a misspelling for "it zero" that is probably "is zero" in the --maxhits manpage paragraph.

Guillaume

frederic-mahe commented 6 months ago

The edit looks good, thanks. Though there is maybe a misspelling for "it zero" that is probably "is zero" in the --maxhits manpage paragraph.

Thanks! fixed in https://github.com/torognes/vsearch/commit/9ae2c8553f40ab1da9dbbcb145d6a3c37ae691d7

frederic-mahe commented 6 months ago

similar to (part of) issue #388