Open gustavlindved opened 4 years ago
@gustavlindved is MMseqs2 not clustering deeply enough, meaning it misses sequences at > 30% sequence identity and > 70% coverage? How many of the representative sequences can still be aligned at > 30% identity and > 70% coverage?
Expected Behavior
I am trying to partition 25000 sequences by homology so that similar sequences (for instance >30% identity and >70% coverage, ideally even lower coverage if possible) are grouped into the same partition. Hence, I expect easy-cluster to put sequences that share this much similarity into the same cluster.
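To make the expectation concrete, here is a minimal sketch of the partitioning described above: every pair of sequences that meets the similarity criteria ends up in the same partition, directly or through a chain of intermediate sequences (connected components of the similarity graph). This is an illustration only, not how MMseqs2 clusters internally. The function name `partition_by_homology` and the toy sequence IDs are hypothetical, and the list of similar pairs is assumed to come from some all-against-all search.

```python
from collections import defaultdict


def partition_by_homology(seq_ids, similar_pairs):
    """Group sequences into connected components of the similarity graph.

    seq_ids: iterable of sequence identifiers.
    similar_pairs: iterable of (a, b) pairs that already passed the
        identity/coverage thresholds (assumed to be precomputed).
    """
    seq_ids = list(seq_ids)
    parent = {s: s for s in seq_ids}

    def find(x):
        # Follow parent links to the root, compressing the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in similar_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two components

    groups = defaultdict(set)
    for s in seq_ids:
        groups[find(s)].add(s)
    return list(groups.values())


# Toy example: A~B and B~C, so A and C share a partition even though
# they were never found similar to each other directly.
pairs = [("A", "B"), ("B", "C"), ("D", "E")]
partitions = partition_by_homology(["A", "B", "C", "D", "E", "F"], pairs)
```

With this definition no above-threshold pair can be split across two partitions, at the cost of possibly merging many sequences into one large partition.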
Current Behavior
After clustering I partition the data by putting all sequences assigned to the same cluster into the same partition. But when I BLAST two partitions against each other, I see quite a lot of sequences that are assigned to different clusters yet share more similarity than the criteria allow.
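The check described above can be sketched roughly as follows. It assumes the cluster assignment is available as a two-column, tab-separated table (representative, then member), which is how the `_cluster.tsv` file from easy-cluster is laid out, and that each search hit has already been reduced to a `(query, target, percent_identity, coverage)` tuple. The helper names and the toy data are hypothetical.

```python
def read_cluster_tsv(lines):
    """Map each member ID to its cluster representative.

    lines: iterable of 'representative<TAB>member' strings.
    """
    cluster_of = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        rep, member = line.split("\t")
        cluster_of[member] = rep
    return cluster_of


def cross_cluster_hits(cluster_of, hits, min_ident=30.0, min_cov=0.7):
    """Return hits above the thresholds whose two sequences sit in different clusters.

    hits: iterable of (query, target, percent_identity, coverage) tuples,
        with coverage given as a fraction between 0 and 1.
    """
    violations = []
    for query, target, pident, cov in hits:
        if query == target:
            continue  # ignore self hits
        different = cluster_of[query] != cluster_of[target]
        if different and pident > min_ident and cov > min_cov:
            violations.append((query, target))
    return violations


# Toy data: seq3 is clustered apart from seq1 but still aligns to it
# above both thresholds, so the pair is reported.
cluster_lines = ["seq1\tseq1", "seq1\tseq2", "seq3\tseq3"]
hits = [
    ("seq1", "seq2", 45.0, 0.90),  # same cluster: fine
    ("seq1", "seq3", 35.0, 0.80),  # different clusters, above thresholds
    ("seq2", "seq3", 25.0, 0.95),  # different clusters, identity too low
]
violations = cross_cluster_hits(read_cluster_tsv(cluster_lines), hits)
```

Counting the reported pairs, and how many of them involve two representatives, gives a direct measure of how far the clustering is from the intended criteria.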
Steps to Reproduce (for bugs)
Currently I run the following (I've played around with many different settings and this seems to give the best clustering, but it is still far from optimal):

mmseqs easy-cluster sequences.faa test.mm tmp -s 7.5 --threads 12 -c 0.7 --cov-mode 1 --alignment-mode 3 --max-seqs 25000 --min-ungapped-score 0 --mask 0 --add-self-matches -e 20000 --cluster-mode 1 --max-iterations 10000 --cluster-steps 7
Am I missing some crucial setting? Any input is greatly appreciated - Thanks!