Closed dleopold closed 6 months ago
@dleopold I think you are right, --weak_id
does not seem to behave as it should. Sorry about that. Please be patient, the issue will be investigated after the Easter break.
My notes so far:
To illustrate the issue, here is a simple example. With q1
(query 1), search should report a weak hit with s2
(98.5% match) and a strong hit with s1
(100% match):
q1="AAACAAGAATACCACGACTAGCAGGAGTATCATGATTCCCGCCTCGGCGTCTGCTTGGGTGTTTAA"
s1="AAACAAGAATACCACGACTAGCAGGAGTATCATGATTCCCGCCTCGGCGTCTGCTTGGGTGTTTAA" # perfect match
s2="AAACAAGAATACCACGACTAGCAGGAGTATCATGCTTCCCGCCTCGGCGTCTGCTTGGGTGTTTAA" # weaker match (98.5%)
# substitution ^
## expect a weak hit with s2 and a strong hit with s1
vsearch \
--usearch_global <(printf ">q1\n%s\n" "${q1}") \
--db <(printf ">s2\n%s\n>s1\n%s\n" "${s2}" "${s1}") \
--id 1.0 \
--weak_id 0.98 \
--quiet \
--uc -
Instead, it only reports a hit with s1
(perfect match):
H 1 66 100.0 + 0 0 = q1 s1
For future reference, here is the current documentation for --weak_id
:
--weak_id real Show hits with percentage of identity of at least real, without terminating the search. A normal search stops as soon as enough hits are found (as defined by --maxaccepts, --maxrejects, and --id). As --weak_id reports weak hits that are not deduced from --maxaccepts, high --id values can be used, hence preserving both speed and sensitivity. Logically, real must be smaller than the value indicated by --id.
Yes, there seems to be something wrong here. I'm looking into it. it seems like the weak hits found, but not shown. Thanks for reporting the issue.
There is something wrong here, but be aware that the weak_id
option may not work as expected. It is described here:
https://www.drive5.com/usearch/manual/weak_hits.html
It should only report weak hits as long as it is still scanning for true hits. Scanning will be terminated when the specified maximum number of accepted or rejected hits (specified with maxaccepts
and maxrejects
) has been reached . The results of using weak_id
are therefore somewhat unpredictable. It's probably only useful to see if there are may be any weak hits as long as there are no true hits.
In the example above there is a true hit, and as long as maxaccepts
is not more than 1, the weak hit will usually not be reported (unless rare cases where it is found before the true hit).
It does not seem like usearch supports the weak_id
option anymore, at least it seems to be ignored in version 11.0.667.
I think it should work as intended with the latest commit (768da2f), but note that one needs to remove s1 or increase maxaccepts
in the example above to see the weak hit.
Thanks for working on this! For clarity, do hits with a %id below --id but above --weak_id count towards --maxrejects? In practice, using the --weak_id parameter to effectively manage search efficiency would seem to require something like a --max_weak_hits parameter. I would imagine most users of this feature want to stop searching quickly if a hit above --id is found, but would not want maxrejects to be the only stopping heuristic for weak hits since maxrejects is usually set significantly higher than maxaccepts.
Yes, hits below id counts towards maxrejects.
Yes, hits below id counts towards maxrejects.
corresponding tests have been added to our test-suite (see https://github.com/frederic-mahe/vsearch-tests/commit/865e95f8555218ce98603e10046afd7c5ebbae5e), and I've also modified a bit the description of --weak_id
in the manpage https://github.com/torognes/vsearch/commit/452a5e1dd03b6af8bb677391d00df8b98f6d9a09
I am closing that issue, @dleopold thanks for reporting that issue.
I am testing the --weak_id parameter for usearch_global, but it does not appear to be working. When I run a query with --id 0.85 I get ~10,000 of my query sequences matching my database. When I use --id 0.95 and --weak_id 0.85 I only get ~8,000 matches, which is the same as using just --id 0.95 without the weak_id parameter. I may be misinterpreting the way this parameter is supposed to work, but this does not seem right to me.