torognes / vsearch

Versatile open-source tool for microbiome analysis
Other
643 stars 123 forks source link

is there a major vote fraction parameter of a vsearch clustered consensus sequence? #557

Closed ShannonDaddy closed 2 months ago

ShannonDaddy commented 2 months ago

Hi, thanks for the powerful tool vsearch. when I use vsearch cluster_size to make a consensus sequence, I can't find a major vote fraction parameter for controlling the quality of consensus bases from a multiple alignment result. Would you elaborate the algorithm of making consensus sequences, is there a default major vote fraction for calling the consensus bases? Thank you for your support.

torognes commented 2 months ago

Hi, no there is no such option. It is not sophisticated. It simply chooses the most common base in each position, so it could be almost down to 25%. If there are two equally common bases, it chooses the first in the alphabet of A, C, G, or T. If there are no ordinary bases, but at least one N, it uses N. If there are more gap symbols (-) than bases in a column, it uses a gap symbol.

ShannonDaddy commented 2 months ago

Hi, no there is no such option. It is not sophisticated. It simply chooses the most common base in each position, so it could be almost down to 25%. If there are two equally common bases, it chooses the first in the alphabet of A, C, G, or T. If there are no ordinary bases, but at least one N, it uses N. If there are more gap symbols (-) than bases in a column, it uses a gap symbol.

Thanks, I think the present algorithm is quite reasonable.

frederic-mahe commented 2 months ago

I've implemented tests covering --consout (see https://github.com/frederic-mahe/vsearch-tests/commit/1812353ee4b10098dd9f9c882e0309bcf461023d). However, I was not able to reproduce this statement:

If there are more gap symbols (-) than bases in a column, it uses a gap symbol.

Gaps in 5' or 3' do not seem to appear in the consensus sequence. Here is a toy example:

printf ">q1\nCGTA\n>q2\nCGT\n>q3\nCGT\n" | \
    vsearch \
        --cluster_size - \
        --minseqlength 1 \
        --id 0.5 \
        --quiet \
        --consout -

Note: I did not try to make an example with a gap in the middle of the sequence.

torognes commented 2 months ago

You're right, there will never be gaps in the consensus sequence. They only appear in the multiple alignment (--msaout).

frederic-mahe commented 2 months ago

You're right, there will never be gaps in the consensus sequence. They only appear in the multiple alignment (--msaout).

Noted, thanks. Corresponding tests added to our test-suite (https://github.com/frederic-mahe/vsearch-tests/commit/6427a2b25f8f610ab57cb633fdf6f966fa2d25fa)