soedinglab / MMseqs2

MMseqs2: ultra fast and sensitive search and clustering suite
https://mmseqs.com
MIT License

Clustering substrings #794

Closed. AlessioMilanese closed this issue 9 months ago

AlessioMilanese commented 10 months ago

Expected Behavior

I want to remove a sequence that is a substring of a longer sequence.

According to https://github.com/soedinglab/MMseqs2/issues/104, I can use --min-seq-id 1.0 -c 0.9 --cov-mode 1.

So, given a test file like:

>a
ATTGCATCGAGCAGCGACGAGCTATCGACGATCGATCGATCGATCGATGCATCGATGCATCGATCGATCGATCGTACGATGCATTTTTACGAGCATCGGA
>b
ATTGCATCGAGCAGCGACGAGCTAT

Here b is a substring of a, so it should be removed.
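As a sanity check that does not involve MMseqs2 at all, the containment can be confirmed with plain POSIX shell. This is only a sketch: the helper name is_substring is made up for illustration and is not part of any tool.

```shell
# The two test sequences from the FASTA file above.
a="ATTGCATCGAGCAGCGACGAGCTATCGACGATCGATCGATCGATCGATGCATCGATGCATCGATCGATCGATCGTACGATGCATTTTTACGAGCATCGGA"
b="ATTGCATCGAGCAGCGACGAGCTAT"

# Hypothetical helper: succeeds when $2 occurs anywhere inside $1.
is_substring() {
    case "$1" in
        *"$2"*) return 0 ;;
        *)      return 1 ;;
    esac
}

if is_substring "$a" "$b"; then
    echo "b (${#b} nt) is contained in a (${#a} nt)"
else
    echo "b is not contained in a"
fi
```

The length printed for b matters later in the thread, because it determines whether a k-mer based search can seed a match on it.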

Current Behavior

If I run:

mmseqs easy-cluster test t_OUT t_tmp --min-seq-id 1 --cov-mode 1 -c 0.9

And check the clusters, I still have two clusters:

$ cat t_OUT_cluster.tsv
a       a
b       b

$ cat t_OUT_rep_seq.fasta
>a
ATTGCATCGAGCAGCGACGAGCTATCGACGATCGATCGATCGATCGATGCATCGATGCATCGATCGATCGATCGTACGATGCATTTTTACGAGCATCGGA
>b
ATTGCATCGAGCAGCGACGAGCTAT
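
A quick way to count the clusters without reading the file by eye is sketched below. It assumes the cluster TSV is tab-separated with the representative in the first column and a member in the second, which matches the output shown above; count_clusters is a hypothetical helper name.

```shell
# Count distinct cluster representatives in a cluster TSV.
# Assumption: column 1 is the representative, column 2 is a member.
count_clusters() {
    cut -f1 "$1" | sort -u | awk 'END { print NR }'
}

# Only report when the output file from the run above is present.
if [ -f t_OUT_cluster.tsv ]; then
    echo "clusters: $(count_clusters t_OUT_cluster.tsv)"
fi
```

If b were absorbed into the cluster of a, both rows would share the same first column and the count would drop to one.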

Steps to Reproduce (for bugs)

You can execute:

echo ">a" > test
echo "ATTGCATCGAGCAGCGACGAGCTATCGACGATCGATCGATCGATCGATGCATCGATGCATCGATCGATCGATCGTACGATGCATTTTTACGAGCATCGGA" >> test
echo ">b" >> test
echo "ATTGCATCGAGCAGCGACGAGCTAT" >> test

mmseqs easy-cluster test t_OUT t_tmp --min-seq-id 1 --cov-mode 1 -c 0.9

Your Environment

I am using a conda installation in a conda environment (on a Linux server), with version:

MMseqs Version: 15.6f452

milot-mirdita commented 10 months ago

I found out what's going on:

The default k-mer size that we use in linclust is 17, but the k-mers are spaced: the pattern spans 26 positions in total, counting both informative and non-informative ones.

So the shortest match that linclust can find is 26 positions long, and your sequence b is only 25.

The second chance would have been the normal prefilter of the clustering workflow. It uses a k-mer size of 15 with a spaced k-mer size of 23 and requires a double k-mer match, so the shortest match it can find is 24.

Adding --spaced-kmer-mode 0 (consecutive instead of spaced k-mers) fixes your issue.
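
The numbers in this explanation can be laid out as a small shell sketch. The spans 26 and 23 are taken from the comment; the variable names are illustrative, and reading the prefilter minimum of 24 as "span plus one position for the second k-mer" is an interpretation of the comment, not something confirmed from the MMseqs2 source.

```shell
# Short sequence from the report.
b="ATTGCATCGAGCAGCGACGAGCTAT"
b_len=${#b}

linclust_span=26    # spaced k-mer span in linclust (17 informative positions)
prefilter_span=23   # spaced k-mer span in the prefilter (15 informative positions)

# Assumed reading of "double k-mer match": a second k-mer shifted by one position.
prefilter_min=$((prefilter_span + 1))

echo "length of b:                  $b_len"
echo "linclust minimum match span:  $linclust_span"
echo "prefilter minimum match span: $prefilter_min"

if [ "$b_len" -lt "$linclust_span" ]; then
    echo "b is shorter than one linclust spaced k-mer, so linclust cannot seed a match on it"
fi

# Rerun of the original command with the suggested flag appended (not executed here):
# mmseqs easy-cluster test t_OUT t_tmp --min-seq-id 1 --cov-mode 1 -c 0.9 --spaced-kmer-mode 0
```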

AlessioMilanese commented 9 months ago

I get the correct result now. Thanks for checking.