Closed AlessioMilanese closed 9 months ago
I found out what's going on:
The default k-mer size that we use in linclust is 17, but spaced with a total of 26 informative and non-informative positions.
So the shortest match that linclust can find is 26.
The second chance would have been the normal prefilter of the clustering workflow, which uses a k-mer size of 15 with a spaced k-mer span of 23. Because it requires a double k-mer match, the shortest match it can find is 24.
--spaced-kmer-mode 0 fixes your issue.
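The arithmetic behind those numbers can be sketched in a few lines of Python. This is only an illustration, not MMseqs2 code: the function name is hypothetical, and it assumes that each additional required k-mer hit sits on the same diagonal shifted by at least one position, and that with spacing disabled the pattern span equals the k-mer size.

```python
def shortest_detectable_match(span: int, required_hits: int = 1) -> int:
    """Shortest ungapped match (in residues) that can hold `required_hits`
    distinct k-mer hits whose pattern covers `span` positions.

    Assumption (not taken from the MMseqs2 source): extra hits lie on the
    same diagonal, each shifted by at least one position, so every
    additional hit adds one residue to the minimum length.
    """
    if span < 1 or required_hits < 1:
        raise ValueError("span and required_hits must be positive")
    return span + (required_hits - 1)


# Default (spaced) values quoted in the comment above.
linclust_spaced = shortest_detectable_match(span=26, required_hits=1)
prefilter_spaced = shortest_detectable_match(span=23, required_hits=2)

# With spacing disabled, assume the span shrinks to the k-mer size itself.
linclust_unspaced = shortest_detectable_match(span=17, required_hits=1)
prefilter_unspaced = shortest_detectable_match(span=15, required_hits=2)

print("linclust, spaced:   ", linclust_spaced)
print("prefilter, spaced:  ", prefilter_spaced)
print("linclust, unspaced: ", linclust_unspaced)
print("prefilter, unspaced:", prefilter_unspaced)
```

Under these assumptions, a substring shorter than 24 residues cannot be picked up by either stage with the default spaced patterns, while disabling spacing lowers the limit to 16 or 17 residues.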
I get the correct result now. Thanks for checking.
Expected Behavior
I want to remove a sequence that is a substring of a longer sequence.
Looking at https://github.com/soedinglab/MMseqs2/issues/104, I can use --min-seq-id 1.0 -c 0.9 --cov-mode 1.
So given a test file like:
Where >b is a substring of >a, it should be removed.
Current Behavior
If I run:
And check the clusters, I still have two clusters:
Steps to Reproduce (for bugs)
You can execute:
Your Environment
I am using a conda installation in a conda env (on a Linux server), with version: