Hi
Is MMseqs2 deterministic? When running linclust on a large FASTA file of proteins, I would expect to get very similar clusters when rerunning the same command on the same FASTA file (default linclust parameters, plus --min-seq-id 0.95 -c 0.8).
Input: FASTA file with ~500 million nearly identical sequences (the file grows slowly between runs, and the order of sequences may change). I also tested with exactly the same sequences in a different order.
Current Behavior
About 10-20% of the clusters change from one run to the next.
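For anyone who wants to reproduce the measurement, here is a minimal sketch of one way to compare two runs. It assumes each run's result was exported as a two-column, tab-separated file (representative, then member), and it counts a cluster as changed when its exact member set does not appear in the other run. The function names are made up for illustration, and this is not necessarily the exact comparison behind the number above. Holding everything in memory is only practical on a subsample, not on the full ~500 million sequences.

```python
from collections import defaultdict


def load_clusters(lines):
    """Build a set of clusters from 'representative<TAB>member' lines.

    Each cluster is stored as a frozenset of its member IDs, so a cluster
    whose representative changed but whose members did not still matches.
    """
    groups = defaultdict(set)
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        rep, member = line.split("\t")[:2]
        groups[rep].add(rep)      # make sure the representative is a member
        groups[rep].add(member)
    return {frozenset(members) for members in groups.values()}


def changed_fraction(run_a, run_b):
    """Fraction of clusters in run_a with no identical member set in run_b."""
    if not run_a:
        return 0.0
    return len(run_a - run_b) / len(run_a)


# Usage (file names are placeholders):
# with open("run1_cluster.tsv") as f1, open("run2_cluster.tsv") as f2:
#     print(changed_fraction(load_clusters(f1), load_clusters(f2)))
```

Comparing member sets rather than representative IDs keeps a run from being counted as different just because another sequence of the same cluster was picked as its representative.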
Version: latest daily build, Ubuntu 20.04, 96-core AMD server
Are there any tricks to produce stable clusters, such as adjusting the k-mers per sequence or sorting the sequences?
Many thanks!