soedinglab / MMseqs2

MMseqs2: ultra fast and sensitive search and clustering suite
https://mmseqs.com
GNU General Public License v3.0

Easy-cluster results in homology between clusters #354

Open gustavlindved opened 4 years ago

gustavlindved commented 4 years ago

Expected Behavior

I am trying to partition 25,000 sequences by homology so that similar sequences (for instance, >30% identity and >70% coverage, ideally even lower coverage if possible) end up in the same partition. Hence, I expect easy-cluster to group sequences that share this much similarity into the same cluster.
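The requirement above (every pair of sequences passing the thresholds must land in the same partition) amounts to a single-linkage, connected-component grouping over the pairwise hits. Below is a minimal Python sketch of that idea using union-find. The function name `partition_by_homology` and the `(query, target, pident, cov)` tuple layout are assumptions made for illustration, not part of MMseqs2.

```python
from collections import defaultdict


def partition_by_homology(ids, hits, min_ident=30.0, min_cov=0.7):
    """Group ids so that any pair passing the thresholds shares a partition.

    hits: iterable of (query, target, pident, cov) tuples.
    This tuple layout is hypothetical; adapt it to your hit table.
    """
    parent = {i: i for i in ids}

    def find(x):
        # Walk to the root, halving the path along the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for query, target, pident, cov in hits:
        if pident > min_ident and cov > min_cov:
            root_q, root_t = find(query), find(target)
            if root_q != root_t:
                parent[root_q] = root_t  # merge the two components

    groups = defaultdict(set)
    for i in ids:
        groups[find(i)].add(i)
    return sorted(sorted(group) for group in groups.values())
```

Note that this grouping is transitive: if A matches B and B matches C, all three share a partition even when A and C do not pass the thresholds directly.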

Current Behavior

After clustering, I build the partitions by putting all sequences assigned to the same cluster into the same partition. However, when I BLAST two partitions against each other, I see that quite a lot of sequences assigned to different clusters share more similarity than the criteria allow.
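The check described above can be sketched in Python as follows. It assumes the cluster assignments are in a two-column, tab-separated file (representative, then member), which is the layout of the `_cluster.tsv` file written by easy-cluster, and that the BLAST hits have already been reduced to `(query, target, pident, cov)` tuples. That tuple layout and both function names are hypothetical.

```python
import csv


def load_clusters(tsv_path):
    """Map each member ID to its representative from a two-column TSV."""
    member_to_rep = {}
    with open(tsv_path, newline="") as handle:
        for row in csv.reader(handle, delimiter="\t"):
            if len(row) < 2:
                continue  # skip blank or malformed lines
            rep, member = row[0], row[1]
            member_to_rep[member] = rep
    return member_to_rep


def cross_cluster_hits(member_to_rep, hits, min_ident=30.0, min_cov=0.7):
    """Return hits above the thresholds whose two ends sit in different clusters.

    hits: iterable of (query, target, pident, cov) tuples (assumed layout).
    Sequences missing from the cluster table are ignored.
    """
    violations = []
    for query, target, pident, cov in hits:
        rep_q = member_to_rep.get(query)
        rep_t = member_to_rep.get(target)
        if rep_q is None or rep_t is None or rep_q == rep_t:
            continue
        if pident > min_ident and cov > min_cov:
            violations.append((query, target, pident, cov))
    return violations
```

An empty result would mean that no pair above the thresholds is split across clusters.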

Steps to Reproduce (for bugs)

Currently I run the following command. I have tried many different settings, and this one seems to give the best clustering, but it is still far from optimal:

mmseqs easy-cluster sequences.faa test.mm tmp -s 7.5 --threads 12 -c 0.7 --cov-mode 1 --alignment-mode 3 --max-seqs 25000 --min-ungapped-score 0 --mask 0 --add-self-matches -e 20000 --cluster-mode 1 --max-iterations 10000 --cluster-steps 7

Am I missing some crucial setting? Any input is greatly appreciated - Thanks!

martin-steinegger commented 3 years ago

@gustavlindved is MMseqs2 not clustering deeply enough, meaning it misses sequences at > 30% sequence identity and > 70% coverage? How many of the representative sequences can still be aligned at > 30% id. and > 70% cov.?
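One way to put a number on that question is to search the representative sequences against themselves and count how many still have a non-self hit above the thresholds. The Python sketch below assumes a tab-separated hit table with the columns query, target, percent identity, query coverage and target coverage, in that order. That column order is an assumption about how the search output was configured, and the function names are hypothetical.

```python
def parse_hits(lines):
    """Yield (query, target, pident, qcov, tcov) from tab-separated lines.

    The five-column order is an assumption about the hit table layout.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue
        query, target, pident, qcov, tcov = line.split("\t")[:5]
        yield query, target, float(pident), float(qcov), float(tcov)


def linked_representatives(hits, min_ident=30.0, min_cov=0.7):
    """Return representatives with a non-self hit above both thresholds.

    Coverage is checked on the target, mirroring the --cov-mode 1 setting
    used in the command from the original post.
    """
    linked = set()
    for query, target, pident, qcov, tcov in hits:
        if query == target:
            continue  # ignore self matches
        if pident > min_ident and tcov > min_cov:
            linked.update((query, target))
    return linked
```

The size of the returned set, relative to the total number of representatives, indicates how much detectable similarity remains between clusters.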