torognes / vsearch

Versatile open-source tool for microbiome analysis
Other
643 stars 123 forks source link

--weak_id not working #554

Closed dleopold closed 2 months ago

dleopold commented 3 months ago

I am testing the --weak_id parameter for usearch_global, but it does not appear to be working. When I run a query with --id 0.85 I get ~10,000 of my query sequences matching my database. When I use --id 0.95 and --weak_id 0.85 I only get ~8,000 matches, which is the same as using just --id 0.95 without the weak_id parameter. I may be misinterpreting the way this parameter is supposed to work, but this does not seem right to me.

frederic-mahe commented 3 months ago

@dleopold I think you are right, --weak_id does not seem to behave as it should. Sorry about that. Please be patient, the issue will be investigated after the Easter break.

My notes so far:

To illustrate the issue, here is a simple example. With q1 (query 1), search should report a weak hit with s2 (98.5% match) and a strong hit with s1 (100% match):

q1="AAACAAGAATACCACGACTAGCAGGAGTATCATGATTCCCGCCTCGGCGTCTGCTTGGGTGTTTAA"
s1="AAACAAGAATACCACGACTAGCAGGAGTATCATGATTCCCGCCTCGGCGTCTGCTTGGGTGTTTAA"  # perfect match
s2="AAACAAGAATACCACGACTAGCAGGAGTATCATGCTTCCCGCCTCGGCGTCTGCTTGGGTGTTTAA"  # weaker match (98.5%)
#                        substitution ^

## expect a weak hit with s2 and a strong hit with s1
vsearch \
    --usearch_global <(printf ">q1\n%s\n" "${q1}") \
    --db <(printf ">s2\n%s\n>s1\n%s\n" "${s2}" "${s1}") \
    --id 1.0 \
    --weak_id 0.98 \
    --quiet \
    --uc -

Instead, it only reports a hit with s1 (perfect match):

H   1   66  100.0   +   0   0   =   q1  s1

For future reference, here is the current documentation for --weak_id:

--weak_id real Show hits with percentage of identity of at least real, without terminating the search. A normal search stops as soon as enough hits are found (as defined by --maxaccepts, --maxrejects, and --id). As --weak_id reports weak hits that are not deduced from --maxaccepts, high --id values can be used, hence preserving both speed and sensitivity. Logically, real must be smaller than the value indicated by --id.

torognes commented 3 months ago

Yes, there seems to be something wrong here. I'm looking into it. it seems like the weak hits found, but not shown. Thanks for reporting the issue.

torognes commented 3 months ago

There is something wrong here, but be aware that the weak_id option may not work as expected. It is described here:

https://www.drive5.com/usearch/manual/weak_hits.html

It should only report weak hits as long as it is still scanning for true hits. Scanning will be terminated when the specified maximum number of accepted or rejected hits (specified with maxaccepts and maxrejects) has been reached . The results of using weak_id are therefore somewhat unpredictable. It's probably only useful to see if there are may be any weak hits as long as there are no true hits.

In the example above there is a true hit, and as long as maxaccepts is not more than 1, the weak hit will usually not be reported (unless rare cases where it is found before the true hit).

torognes commented 3 months ago

It does not seem like usearch supports the weak_id option anymore, at least it seems to be ignored in version 11.0.667.

torognes commented 3 months ago

I think it should work as intended with the latest commit (768da2f), but note that one needs to remove s1 or increase maxaccepts in the example above to see the weak hit.

dleopold commented 3 months ago

Thanks for working on this! For clarity, do hits with a %id below --id but above --weak_id count towards --maxrejects? In practice, using the --weak_id parameter to effectively manage search efficiency would seem to require something like a --max_weak_hits parameter. I would imagine most users of this feature want to stop searching quickly if a hit above --id is found, but would not want maxrejects to be the only stopping heuristic for weak hits since maxrejects is usually set significantly higher than maxaccepts.

torognes commented 3 months ago

Yes, hits below id counts towards maxrejects.

frederic-mahe commented 2 months ago

Yes, hits below id counts towards maxrejects.

corresponding tests have been added to our test-suite (see https://github.com/frederic-mahe/vsearch-tests/commit/865e95f8555218ce98603e10046afd7c5ebbae5e), and I've also modified a bit the description of --weak_id in the manpage https://github.com/torognes/vsearch/commit/452a5e1dd03b6af8bb677391d00df8b98f6d9a09

I am closing that issue, @dleopold thanks for reporting that issue.