Unexpected behavior when clustering short sequences

vineeth-s commented 4 months ago

We have 3 sequences we want to cluster:

>2
AGCCGGTAGGACTGAACGTA
>1
AGCCGGTAGGACTGAACATA
>1b
AGCCGGTAGGACTGAACATA

The command to cluster is

vsearch --cluster_size test.fasta --id 0.8 --iddef 4 --consout test_cons.fa --minseqlength 20 --clusterout_sort --clusterout_id --clusters clusters/testcluster

Given the sequences are 20 basepairs long, and the id is specified as 0.8, these sequences should form one cluster, but we get 2 clusters

>centroid=1;seqs=2;clusterid=0
AGCCGGTAGGACTGAACATA
>centroid=2;seqs=1;clusterid=1
AGCCGGTAGGACTGAACGTA

vineeth-s commented 4 months ago

When we now extend the sequence by say 25 bases then this works as intended

Input

>2
AGCCGGTAGGACTGAACGTAACTTCGTACGTACGGCGTCTTATAC
>1
AGCCGGTAGGACTGAACATAACTTCGTACGTACGGCGTCTTATAC
>1b
AGCCGGTAGGACTGAACATAACTTCGTACGTACGGCGTCTTATAC

Output

>centroid=1;seqs=3;clusterid=0
AGCCGGTAGGACTGAACATAACTTCGTACGTACGGCGTCTTATAC

frederic-mahe commented 4 months ago

hello @vineeth-s

the behavior you report is linked to the k-mer prefiltering step used by usearch and re-implemented in vsearch. The vsearch manual states:

The search process sorts target sequences by decreasing number of k-mers they have in common with the query sequence, using that information as a proxy for sequence similarity. That efficient pre-filtering also prevents pairwise alignments with weakly matching targets, as there needs to be at least 6 shared k-mers to start the pairwise alignment, and at least one out of every 16 k-mers from the query needs to match the target.

That statement is not entirely correct, as when wordlength = 8 (default value), there needs to be at least 12 shared 8-mers to start the pairwise alignment (and not 6). I will fix the documentation as such for the next release:

That efficient pre-filtering also prevents pairwise alignments with very short, or with weakly matching targets, as there needs to be by default at least 12 shared k-mers to start the pairwise alignment, and at least one out of every 16 k-mers from the query needs to match the target.

I took the liberty to further simplify your example to illustrate k-mer prefiltering. In your original example, the two 20-bp sequences have ten 8-mers in common:

(
    printf ">2\nAGCCGGTAGGACTGAACGTA\n"
    printf ">1\nAGCCGGTAGGACTGAACATA\n"
) | \
    vsearch \
        --cluster_size - \
        --minseqlength 20 \
        --id 0.8 \
        --iddef 4 \
        --quiet \
        --consout -

Not enough 8-mers in common, no further attempt to align them.

If the mismatch position is the last one, then there are twelve 8-mers in common:

(
    printf ">2\nAGCCGGTAGGACTGAACATG\n"
    printf ">1\nAGCCGGTAGGACTGAACATA\n"
) | \
    vsearch \
        --cluster_size - \
        --minseqlength 20 \
        --id 0.8 \
        --iddef 4 \
        --quiet \
        --consout -

Sequences are aligned, similarities are computed, and sequences are grouped in the same cluster.

As you've shown in your example using 25-bp sequences, k-mer prefiltering does not get in the way when sequences get longer. If you need to compute similarities among very short sequences, I invite you to look at the vsearch --allpairs_global, as it does not rely on k-mer prefiltering.

vineeth-s commented 4 months ago

hi @frederic-mahe, thanks for this detailed explanation, this makes sense now and thanks for integrating this into the documentation

we will try the allpairs_global and see what happens

cheers, vineeth

frederic-mahe commented 4 months ago

(I tend to forget that option exists) You can also try --minwordmatches 10 to lower the common k-mer threshold required to trigger a sequence alignment. Note that doing so should slow down clustering, as vsearch will compute more pairwise sequence alignments.

torognes / vsearch

Unexpected behavior when clustering short sequences #568

Input

Output