soedinglab / MMseqs2

MMseqs2: ultra fast and sensitive search and clustering suite
https://mmseqs.com
MIT License

how is the representative sequence determined when using the `easy-cluster` command? #712

Open taylorreiter opened 1 year ago

taylorreiter commented 1 year ago

If I use the command:

mmseqs easy-cluster {input} {params.outprefix} tmp/ --min-seq-id 0.9

How will mmseqs determine the representative sequence of each cluster? I've tried reading through the documentation, issues, and papers, but it wasn't clear to me how the representative is selected for easy-cluster. I'm sorry if I missed something! I think the Linclust paper mentions that the longest sequence is selected as the representative, but I couldn't find a similar statement for easy-cluster.

martin-steinegger commented 1 year ago

@taylorreiter By default, we use the greedy set cover algorithm for clustering. It picks the sequence with the most alignments as the cluster center, and that sequence becomes the representative. See the figure below. If you switch to uni-directional coverage with --cov-mode 1, we use the longest sequence instead.
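
To make the "most alignments" rule concrete, here is a minimal Python sketch of greedy set cover on a toy alignment graph. It is only an illustration of the general technique, not the MMseqs2 implementation: the function name `greedy_set_cover`, the `alignments` dictionary, and the alphabetical tie-breaking are all assumptions made for this example.

```python
def greedy_set_cover(neighbors):
    """Cluster sequences with a greedy set cover.

    neighbors maps each sequence id to the set of ids it aligns to
    (hypothetical input; assumed to already satisfy the identity and
    coverage thresholds). Returns {representative: sorted members}.
    """
    unassigned = set(neighbors)
    clusters = {}
    while unassigned:
        # Pick the unassigned sequence that covers the most sequences
        # still unassigned. Ties go to the alphabetically first id
        # (an arbitrary choice made for this sketch).
        rep = max(
            sorted(unassigned),
            key=lambda s: len((neighbors[s] | {s}) & unassigned),
        )
        members = (neighbors[rep] | {rep}) & unassigned
        clusters[rep] = sorted(members)
        unassigned -= members
    return clusters


# Toy alignment graph: A aligns to B, C and D; D also aligns to E.
alignments = {
    "A": {"B", "C", "D"},
    "B": {"A"},
    "C": {"A"},
    "D": {"A", "E"},
    "E": {"D"},
    "F": set(),
}

clusters = greedy_set_cover(alignments)
for rep, members in clusters.items():
    print(rep, members)
```

In this toy graph, A has the most alignments, so it becomes the first representative and takes B, C and D with it. E is left without its only partner D and ends up as a singleton, as does F.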

milot-mirdita commented 1 year ago

To slightly expand on this, easy-cluster just calls cluster plus a few modules that produce the clustering TSV file and the FASTA files. It doesn't do anything special itself. However, generating the cluster FASTA files is a bit tricky with the currently available MMseqs2 modules, so the easy-cluster workflow does that for you.

The clustering procedure should be mostly unchanged since the Linclust paper (the combined Linclust+MMseqs2 clustering). Two things are not covered there. The --cluster-reassign parameter, which fixes hits whose sequence identity or coverage is below the given thresholds but that were allowed because of the cascading, is not described in the paper. Nucleotide clustering was also not available at the time.

taylorreiter commented 1 year ago

Thank you both so much, these were incredibly helpful explanations!