Closed vineeth-s closed 4 months ago
When we now extend the sequences by, say, 25 bases, then this works as intended:
>2
AGCCGGTAGGACTGAACGTAACTTCGTACGTACGGCGTCTTATAC
>1
AGCCGGTAGGACTGAACATAACTTCGTACGTACGGCGTCTTATAC
>1b
AGCCGGTAGGACTGAACATAACTTCGTACGTACGGCGTCTTATAC
>centroid=1;seqs=3;clusterid=0
AGCCGGTAGGACTGAACATAACTTCGTACGTACGGCGTCTTATAC
hello @vineeth-s
the behavior you report is linked to the k-mer prefiltering step used by usearch and re-implemented in vsearch. The vsearch manual states:
The search process sorts target sequences by decreasing number of k-mers they have in common with the query sequence, using that information as a proxy for sequence similarity. That efficient pre-filtering also prevents pairwise alignments with weakly matching targets, as there needs to be at least 6 shared k-mers to start the pairwise alignment, and at least one out of every 16 k-mers from the query needs to match the target.
That statement is not entirely correct, as when wordlength = 8 (default value), there needs to be at least 12 shared 8-mers to start the pairwise alignment (and not 6). I will fix the documentation as such for the next release:
That efficient pre-filtering also prevents pairwise alignments with very short, or with weakly matching targets, as there needs to be by default at least 12 shared k-mers to start the pairwise alignment, and at least one out of every 16 k-mers from the query needs to match the target.
I took the liberty to further simplify your example to illustrate k-mer prefiltering. In your original example, the two 20-bp sequences have ten 8-mers in common:
(
printf ">2\nAGCCGGTAGGACTGAACGTA\n"
printf ">1\nAGCCGGTAGGACTGAACATA\n"
) | \
vsearch \
--cluster_size - \
--minseqlength 20 \
--id 0.8 \
--iddef 4 \
--quiet \
--consout -
Not enough 8-mers in common, no further attempt to align them.
If the mismatch position is the last one, then there are twelve 8-mers in common:
(
printf ">2\nAGCCGGTAGGACTGAACATG\n"
printf ">1\nAGCCGGTAGGACTGAACATA\n"
) | \
vsearch \
--cluster_size - \
--minseqlength 20 \
--id 0.8 \
--iddef 4 \
--quiet \
--consout -
Sequences are aligned, similarities are computed, and sequences are grouped in the same cluster.
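The two counts quoted above (ten and twelve shared 8-mers) can be checked with a naive k-mer count. This is a minimal sketch, not vsearch's own code: shared_kmers is a hypothetical helper name, and it simply counts the distinct k-mers present in both sequences, which is assumed to be close enough to what the prefilter counts for these short examples.

```shell
# Count distinct k-mers shared by two sequences.
# Illustrative helper only (not part of vsearch).
# Usage: shared_kmers SEQ1 SEQ2 [K]   (K defaults to 8)
shared_kmers() {
    awk -v a="$1" -v b="$2" -v k="${3:-8}" 'BEGIN {
        # record every k-mer of the first sequence
        for (i = 1; i + k - 1 <= length(a); i++)
            seen[substr(a, i, k)] = 1
        # count each distinct k-mer of the second sequence also found in the first
        n = 0
        for (i = 1; i + k - 1 <= length(b); i++) {
            w = substr(b, i, k)
            if ((w in seen) && !(w in counted)) {
                n++
                counted[w] = 1
            }
        }
        print n
    }'
}

shared_kmers AGCCGGTAGGACTGAACGTA AGCCGGTAGGACTGAACATA   # mismatch at position 18: prints 10
shared_kmers AGCCGGTAGGACTGAACATG AGCCGGTAGGACTGAACATA   # mismatch at position 20: prints 12
```

A 20-bp sequence contains 13 overlapping 8-mers. A mismatch at position 18 falls inside the last three of them, leaving ten, whereas a mismatch at the very last position only affects the final 8-mer, leaving twelve.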
As you've shown in your example with sequences extended by 25 bp, k-mer prefiltering does not get in the way when sequences get longer. If you need to compute similarities among very short sequences, I invite you to look at the vsearch --allpairs_global command, as it does not rely on k-mer prefiltering.
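A possible invocation, as a sketch only: the file names short.fasta and pairs.tsv are placeholders, and the choice of --acceptall with a three-column --userout report is an assumption about what is useful here, not a recommendation from the thread.

```shell
# Work in a scratch directory so nothing is left in the current one.
tmpdir=$(mktemp -d)

# The 20-bp pair from the first example (one mismatch at position 18).
printf ">2\nAGCCGGTAGGACTGAACGTA\n>1\nAGCCGGTAGGACTGAACATA\n" > "$tmpdir/short.fasta"

# Run only when vsearch is installed; otherwise just report and move on.
if command -v vsearch >/dev/null 2>&1; then
    # Align every sequence against every other one and report the identity
    # of each pair, without any identity cutoff (--acceptall).
    vsearch \
        --allpairs_global "$tmpdir/short.fasta" \
        --acceptall \
        --userfields query+target+id \
        --userout "$tmpdir/pairs.tsv" \
        --quiet
    cat "$tmpdir/pairs.tsv"
else
    echo "vsearch not found, skipping" >&2
fi
```

Every pair is aligned, so the running time grows quadratically with the number of sequences. That is fine for a handful of short sequences but is assumed to be impractical for large datasets.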
hi @frederic-mahe, thanks for this detailed explanation, this makes sense now and thanks for integrating this into the documentation
we will try the allpairs_global and see what happens
cheers, vineeth
(I tend to forget that option exists) You can also try --minwordmatches 10 to lower the common k-mer threshold required to trigger a sequence alignment. Note that doing so should slow down clustering, as vsearch will compute more pairwise sequence alignments.
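Applied to the first 20-bp example from earlier in the thread, that suggestion could look like the sketch below. It reuses the command shown above and only adds --minwordmatches 10; short_pair is a hypothetical helper name, and no output is shown because the result has not been verified here.

```shell
# The first 20-bp pair from above: one mismatch at position 18, ten shared 8-mers.
short_pair() {
    printf ">2\nAGCCGGTAGGACTGAACGTA\n"
    printf ">1\nAGCCGGTAGGACTGAACATA\n"
}

# Run only when vsearch is installed; otherwise just report and move on.
if command -v vsearch >/dev/null 2>&1; then
    # Same clustering command as before, with the shared k-mer threshold
    # lowered from its default to 10.
    short_pair | \
        vsearch \
            --cluster_size - \
            --minseqlength 20 \
            --minwordmatches 10 \
            --id 0.8 \
            --iddef 4 \
            --quiet \
            --consout -
else
    echo "vsearch not found, skipping" >&2
fi
```

If the shared k-mer threshold is the only obstacle, the two sequences would be expected to pass the prefilter and be aligned, since they share ten 8-mers.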
We have 3 sequences we want to cluster:
The command to cluster is
Given that the sequences are 20 base pairs long and the id is set to 0.8, these sequences should form one cluster, but we get 2 clusters.