soedinglab / MMseqs2

MMseqs2: ultra fast and sensitive search and clustering suite
https://mmseqs.com
GNU General Public License v3.0

(Re)producing stable representative clusters across linclust runs #663

Open hmms117 opened 1 year ago

hmms117 commented 1 year ago

Hi

Is MMseqs2 deterministic? When running linclust on a large FASTA file of proteins, I would expect to get very similar clusters when rerunning the same command on the same FASTA file (default linclust parameters, with --min-seq-id 0.95 -c 0.8).

Input: FASTA file with ~500 million nearly identical sequences (the file grows slowly over time, and the order of sequences may change). I also tested with exactly the same sequences in a different order.

Current Behavior

About 10-20% of clusters change after each run.
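For anyone who wants to quantify this, here is a minimal sketch of how the change between two runs could be measured. It assumes each run was exported to a two-column `representative<TAB>member` TSV (the layout `mmseqs createtsv` writes for a clustering result) and compares clusters by member set only, so a cluster that merely picked a different representative is not counted as changed. The function names `load_clusters` and `changed_fraction` are made up for this example.

```python
from collections import defaultdict

def load_clusters(tsv_lines):
    """Group members by representative from 'rep<TAB>member' lines."""
    groups = defaultdict(set)
    for line in tsv_lines:
        line = line.rstrip("\n")
        if not line:
            continue
        rep, member = line.split("\t")[:2]
        groups[rep].add(member)
    # Drop the representative label: a cluster is identified by its members.
    return {frozenset(members) for members in groups.values()}

def changed_fraction(run_a, run_b):
    """Fraction of clusters in run_a with no identical member set in run_b."""
    if not run_a:
        return 0.0
    return len(run_a - run_b) / len(run_a)

# Toy data standing in for two cluster TSV files (hypothetical IDs).
run1 = ["A\tA", "A\tB", "C\tC", "D\tD"]
run2 = ["B\tB", "B\tA", "C\tC", "C\tD"]
clusters1 = load_clusters(run1)
clusters2 = load_clusters(run2)
print(changed_fraction(clusters1, clusters2))
```

In the toy data, the cluster {A, B} counts as unchanged even though its representative differs between runs, while {C} and {D} count as changed because run 2 merges them. With real files, pass an open file handle instead of a list.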

Version: latest daily, Ubuntu 20.04, 96-core AMD server

Are there any tricks to produce stable clusters, such as adjusting k-mers per sequence or sorting the sequences?
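To make the "sorting the sequences" idea concrete, below is a rough sketch of putting a FASTA file into a canonical order before clustering, so that two files with the same records in a different order become byte-identical input. The sort key (longest first, then sequence, then header) is an arbitrary choice for illustration, and nothing in this thread confirms that it makes linclust output stable; it only removes input order as a variable. All function names are hypothetical.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)

def canonical_order(records):
    """Sort longest first, breaking ties by sequence and then by header."""
    return sorted(records, key=lambda r: (-len(r[1]), r[1], r[0]))

def write_fasta(records):
    """Serialize records with one sequence line per record."""
    return "".join(f">{h}\n{s}\n" for h, s in records)

# Two toy inputs with the same records in a different order.
first = [">s1", "MKV", ">s2", "MKVLA", ">s3", "MKT"]
second = [">s3", "MKT", ">s1", "MKV", ">s2", "MKVLA"]
sorted_first = write_fasta(canonical_order(read_fasta(first)))
sorted_second = write_fasta(canonical_order(read_fasta(second)))
print(sorted_first == sorted_second)
```

This version sorts in memory, which would not be practical for ~500 million sequences; at that scale an external sort on disk would be needed instead.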

Many thanks!

SimonKitSangChu commented 1 year ago

I have a similar issue. It would be helpful to have reproducibility support on clustering.