Closed billzt closed 11 months ago
@billzt I am trying to understand your issue. Here is a minimal, reproducible example for the --usearch_global
command:
vsearch \
--usearch_global <(printf ">q1;size=5\nAAAA\n") \
--db <(printf ">s1;size=4\nAAAA\n") \
--id 0.97 \
--quiet \
--minseqlength 1 \
--sizein \
--sizeout \
--dbmatched /dev/stdout
The --dbmatched
output is:
>s1;size=1
AAAA
which is weird, as I was expecting >s1;size=4
, not >s1;size=1
. I am not sure where the size=1
is coming from (@torognes any idea?). @billzt what were you expecting when using --sizein --sizeout
? size=4
, size=5
or size=9
?
When use --sizein --sizeout
, I expect size=5
, since the result should reflect the size of query.
I expect it should produce the same result as the following example:
vsearch \
--usearch_global <(printf ">q1\nAAAA\n>q2\nAAAA\n>q3\nAAAA\n>q4\nAAAA\n>q5\nAAAA\n") \
--db <(printf ">s1;size=4\nAAAA\n") \
--id 0.97 \
--quiet \
--minseqlength 1 \
--sizeout \
--dbmatched /dev/stdout
You're right, the result should be 5 in the last example. It should be the total of the abundances of the matching queries.
Thanks for reporting this issue!
I'll fix it soon.
Thanks @billzt things are clearer now. The documentation says:
--dbmatched filename Write database target sequences matching at least one query sequence to filename, in fasta format. If the option --sizeout is used, the number of queries that matched each target sequence is indicated using the pattern ";size=integer;".
--sizein
is not mentioned.
The short answer is that, as of now, --sizein
has no effect on --dbmatched
. And if an abundance value is already present in the subject header, it is replaced with the number of matching queries.
I'll try with simple examples (queries are named q
, and database subjects are named s
).
No ;size=n
, a single query, expect a single match ;size=1
:
vsearch \
--usearch_global <(printf ">q1\nAAAA\n") \
--db <(printf ">s1\nAAAA\n") \
--id 0.50 \
--quiet \
--minseqlength 1 \
--sizeout \
--dbmatched /dev/stdout
>s1;size=1
AAAA
No ;size=n
, two queries, expect two matches ;size=2
:
vsearch \
--usearch_global <(printf ">q1\nAAAA\n>q2\nAAAT\n") \
--db <(printf ">s1\nAAAA\n") \
--id 0.50 \
--quiet \
--minseqlength 1 \
--sizeout \
--dbmatched /dev/stdout
>s1;size=2
AAAA
This time, our subject s1
has an abundance value ;size=4
. With two queries, expect two matches (;size=4
is replaced by ;size=2
):
vsearch \
--usearch_global <(printf ">q1\nAAAA\n>q2\nAAAT\n") \
--db <(printf ">s1;size=4\nAAAA\n") \
--id 0.50 \
--quiet \
--minseqlength 1 \
--sizeout \
--dbmatched /dev/stdout
>s1;size=2
AAAA
Last example, our subject s1
has an abundance value ;size=4
, and we use the --sizein
option. With two queries, expect two matches (;size=4
is replaced by ;size=2
):
vsearch \
--usearch_global <(printf ">q1\nAAAA\n>q2\nAAAT\n") \
--db <(printf ">s1;size=4\nAAAA\n") \
--id 0.50 \
--quiet \
--minseqlength 1 \
--sizein \
--sizeout \
--dbmatched /dev/stdout
>s1;size=2
AAAA
If both our subject s1
and queries have abundance values, and we use the --sizein
option. The original abundances are not taken into account, only the number of queries that matched each target sequence is reported by --dbmatched
:
vsearch \
--usearch_global <(printf ">q1;size=3\nAAAA\n>q2;size=2\nAAAT\n") \
--db <(printf ">s1;size=4\nAAAA\n") \
--id 0.50 \
--quiet \
--minseqlength 1 \
--sizein \
--sizeout \
--dbmatched /dev/stdout
>s1;size=2
AAAA
So it seems that --sizein
has no effect on --dbmatched
.
I tested with usearch 11 and it does take the sizein
option into account, but its manual is not clear.
The vsearch test suite now has 32 additional tests covering all possible combinations of parameters for --dbmatched
and abundance values (see https://github.com/frederic-mahe/vsearch-tests/commit/431bdf0d9c0c3ad8cc9d2126ac305af8ec8760da). These tests should all pass once the bug is fixed.
Fixed in commit e54b218. Both for usearch_global
and search_exact
.
The first fix was not correct when not using sizein
. Second fix in commit 6160b58.
Fixed in version 2.23.0.
My commands:
Then this is right
But this is wrong. The abundance value in the FASTA head of the result file is wrong
Why
--sizein
has no effect tovsearch.derep.fasta
?