(Re)producing stable representative clusters across linclust runs

Is MMseqs2 deterministic? When running linclust on a large FASTA file of proteins, one would expect to get very similar clusters when rerunning the same command on the same FASTA file (default linclust parameters, plus --min-seq-id 0.95 -c 0.8).

Input: FASTA file with ~500 million nearly identical sequences (the file grows slowly between runs, and the order of sequences may change). Also tested with the exact same set of sequences where only the order of sequences changed.

Current Behavior

About 10-20% of clusters change after each run.
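For anyone trying to reproduce this, here is a minimal sketch of one way to quantify how many clusters changed between two runs. It assumes each run has been exported to the two-column cluster TSV (representative, then member, tab-separated); the function names `load_clusters` and `changed_fraction` are made up for illustration. Clusters are compared by their member sets, so a cluster whose representative changed but whose members did not is counted as unchanged.

```python
from collections import defaultdict


def load_clusters(tsv_lines):
    """Return the clustering as a set of frozensets of member IDs.

    Each input line is expected to be "representative<TAB>member".
    """
    members_by_rep = defaultdict(set)
    for line in tsv_lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        rep, member = line.split("\t")[:2]
        members_by_rep[rep].add(member)
    return {frozenset(members) for members in members_by_rep.values()}


def changed_fraction(clusters_a, clusters_b):
    """Fraction of clusters in run A with no identical member set in run B."""
    if not clusters_a:
        return 0.0
    return len(clusters_a - clusters_b) / len(clusters_a)
```

Usage would look like `changed_fraction(load_clusters(open("run1_cluster.tsv")), load_clusters(open("run2_cluster.tsv")))`, where both file names are placeholders. At ~500 million sequences the sets may not fit in memory, so this is probably only practical on a subsample.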

Version: latest daily build, Ubuntu 20.04, 96-core AMD server

Are there any tricks to produce stable clusters? For example, adjusting the k-mers per sequence, sorting the sequences, etc.?
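To make the "sorting the sequences" idea concrete, here is a rough sketch of putting a FASTA file into a canonical order (by header, then sequence) so that every run sees the same input order. Whether this actually stabilizes linclust output is exactly the open question here, not something this sketch demonstrates. The helper names `read_fasta` and `canonical_fasta` are made up for illustration.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)  # multi-line sequences are joined
    if header is not None:
        yield header, "".join(chunks)


def canonical_fasta(lines):
    """Return FASTA text with records sorted by header, then sequence."""
    records = sorted(read_fasta(lines))
    return "".join(f">{header}\n{seq}\n" for header, seq in records)
```

This version holds all records in memory, which is unlikely to work for ~500 million sequences on most machines; an on-disk (external) sort would be needed at that scale.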

Many thanks!

soedinglab / MMseqs2

(Re)producing stable representative clusters across linclust runs #663
