torognes / vsearch

Versatile open-source tool for microbiome analysis
Other
643 stars 123 forks source link

--sizein seems having no effect in vsearch --usearch_global #521

Closed billzt closed 11 months ago

billzt commented 1 year ago

My commands:

vsearch --fastx_uniques vsearch.fa.gz --sizeout --relabel Uniq --fastaout vsearch.derep.fasta
vsearch --cluster_unoise vsearch.derep.fasta --sizein --sizeout --sizeorder --centroids vsearch.centroids.fasta
vsearch --uchime3_denovo vsearch.centroids.fasta --sizein --sizeout --nonchimeras vsearch.zotus.fasta --relabel Zotu

Then this is right

vsearch --usearch_global vsearch.fa.gz --db vsearch.zotus.fasta --id 0.97 --sizeout --dbmatched vsearch.zotus.size.fasta

But this is wrong. The abundance value in the FASTA head of the result file is wrong

vsearch --usearch_global vsearch.derep.fasta --sizein --db vsearch.zotus.fasta --id 0.97 --sizeout --dbmatched vsearch.zotus.size.fasta

Why --sizein has no effect to vsearch.derep.fasta?

frederic-mahe commented 1 year ago

@billzt I am trying to understand your issue. Here is a minimal, reproducible example for the --usearch_global command:

vsearch \
    --usearch_global <(printf ">q1;size=5\nAAAA\n") \
    --db <(printf ">s1;size=4\nAAAA\n") \
    --id 0.97 \
    --quiet \
    --minseqlength 1 \
    --sizein \
    --sizeout \
    --dbmatched /dev/stdout

The --dbmatched output is:

>s1;size=1
AAAA

which is weird, as I was expecting >s1;size=4, not >s1;size=1. I am not sure where the size=1 is coming from (@torognes any idea?). @billzt what were you expecting when using --sizein --sizeout ? size=4, size=5 or size=9?

billzt commented 1 year ago

When use --sizein --sizeout, I expect size=5, since the result should reflect the size of query.

I expect it should produce the same result as the following example:

vsearch \
    --usearch_global <(printf ">q1\nAAAA\n>q2\nAAAA\n>q3\nAAAA\n>q4\nAAAA\n>q5\nAAAA\n") \
    --db <(printf ">s1;size=4\nAAAA\n") \
    --id 0.97 \
    --quiet \
    --minseqlength 1 \
    --sizeout \
    --dbmatched /dev/stdout
torognes commented 1 year ago

You're right, the result should be 5 in the last example. It should be the total of the abundances of the matching queries.

Thanks for reporting this issue!

I'll fix it soon.

frederic-mahe commented 1 year ago

Thanks @billzt things are clearer now. The documentation says:

--dbmatched filename Write database target sequences matching at least one query sequence to filename, in fasta format. If the option --sizeout is used, the number of queries that matched each target sequence is indicated using the pattern ";size=integer;".

--sizein is not mentioned.

The short answer is that, as of now, --sizein has no effect on --dbmatched. And if an abundance value is already present in the subject header, it is replaced with the number of matching queries.


I'll try with simple examples (queries are named q, and database subjects are named s).

No ;size=n, a single query, expect a single match ;size=1:

vsearch \
    --usearch_global <(printf ">q1\nAAAA\n") \
    --db <(printf ">s1\nAAAA\n") \
    --id 0.50 \
    --quiet \
    --minseqlength 1 \
    --sizeout \
    --dbmatched /dev/stdout
>s1;size=1
AAAA

No ;size=n, two queries, expect two matches ;size=2:

vsearch \
    --usearch_global <(printf ">q1\nAAAA\n>q2\nAAAT\n") \
    --db <(printf ">s1\nAAAA\n") \
    --id 0.50 \
    --quiet \
    --minseqlength 1 \
    --sizeout \
    --dbmatched /dev/stdout
>s1;size=2
AAAA

This time, our subject s1 has an abundance value ;size=4. With two queries, expect two matches (;size=4 is replaced by ;size=2):

vsearch \
    --usearch_global <(printf ">q1\nAAAA\n>q2\nAAAT\n") \
    --db <(printf ">s1;size=4\nAAAA\n") \
    --id 0.50 \
    --quiet \
    --minseqlength 1 \
    --sizeout \
    --dbmatched /dev/stdout
>s1;size=2
AAAA

Last example, our subject s1 has an abundance value ;size=4, and we use the --sizein option. With two queries, expect two matches (;size=4 is replaced by ;size=2):

vsearch \
    --usearch_global <(printf ">q1\nAAAA\n>q2\nAAAT\n") \
    --db <(printf ">s1;size=4\nAAAA\n") \
    --id 0.50 \
    --quiet \
    --minseqlength 1 \
    --sizein \
    --sizeout \
    --dbmatched /dev/stdout
>s1;size=2
AAAA

If both our subject s1 and queries have abundance values, and we use the --sizein option. The original abundances are not taken into account, only the number of queries that matched each target sequence is reported by --dbmatched:

vsearch \
    --usearch_global <(printf ">q1;size=3\nAAAA\n>q2;size=2\nAAAT\n") \
    --db <(printf ">s1;size=4\nAAAA\n") \
    --id 0.50 \
    --quiet \
    --minseqlength 1 \
    --sizein \
    --sizeout \
    --dbmatched /dev/stdout
>s1;size=2
AAAA

So it seems that --sizein has no effect on --dbmatched.

torognes commented 1 year ago

I tested with usearch 11 and it does take the sizein option into account, but its manual is not clear.

frederic-mahe commented 1 year ago

The vsearch test suite now has 32 additional tests covering all possible combinations of parameters for --dbmatched and abundance values (see https://github.com/frederic-mahe/vsearch-tests/commit/431bdf0d9c0c3ad8cc9d2126ac305af8ec8760da). These tests should all pass once the bug is fixed.

torognes commented 1 year ago

Fixed in commit e54b218. Both for usearch_global and search_exact.

torognes commented 1 year ago

The first fix was not correct when not using sizein. Second fix in commit 6160b58.

torognes commented 11 months ago

Fixed in version 2.23.0.