soedinglab / MMseqs2

MMseqs2: ultra fast and sensitive search and clustering suite
https://mmseqs.com
GNU General Public License v3.0

(Question) leverage mmseqs for clustering with defined number of clusters? #801

Open paoslaos opened 9 months ago

paoslaos commented 9 months ago

Dear developers,

Apologies if this is a naive question. Are there any recommended approaches, mmseqs settings, or output files that would make it possible to cluster the input sequences into a user-defined number of clusters?

Thank you!

milot-mirdita commented 9 months ago

We don't implement any clustering like that, as it's usually not very meaningful biologically.

You can compute a sparse all-vs-all search and cluster based on the scores with whatever clustering algorithm you prefer, e.g. one that scikit-learn implements. You might want to increase --max-seqs in this case though, to report more than the top 300 alignments per query.
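As an illustration of clustering sparse pairwise scores into a fixed number of groups, here is a minimal sketch in plain Python. It swaps in single-linkage agglomerative clustering (merge the highest-scoring pairs first, stop once k groups remain) rather than any particular scikit-learn algorithm. The function name `cluster_into_k` and the toy scores are hypothetical and not part of MMseqs2; `ids` is assumed to list every sequence that appears in `edges`.

```python
def cluster_into_k(ids, edges, k):
    """Single-linkage clustering on sparse scores.

    ids   : iterable of all sequence identifiers
    edges : iterable of (score, id_a, id_b); higher score = more similar
    k     : desired number of clusters
    Returns a sorted list of sorted clusters. If the similarity graph has
    more than k connected components, more than k clusters are returned.
    """
    parent = {s: s for s in ids}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    n_groups = len(parent)
    # Visit the most similar pairs first and merge their groups.
    for score, a, b in sorted(edges, key=lambda e: e[0], reverse=True):
        if n_groups <= k:
            break
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[ra] = rb
            n_groups -= 1

    clusters = {}
    for s in ids:
        clusters.setdefault(find(s), []).append(s)
    return sorted(sorted(c) for c in clusters.values())


# Toy input: scores are made-up bit scores between five sequences.
ids = ["a", "b", "c", "d", "e"]
edges = [(250.0, "a", "b"), (40.0, "b", "c"), (180.0, "d", "e"), (35.0, "c", "d")]
result = cluster_into_k(ids, edges, 2)
print(result)
```

Single linkage tends to chain loosely related sequences together, so other linkage criteria or a graph-based method may suit some data sets better. Pairs that were never reported by the search are treated here as having no similarity at all.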

paoslaos commented 5 months ago

Thanks for your answer. This is an interesting problem for many machine learning applications, where the goal is to avoid homology leakage between data splits. Biological meaning is not so important here (for me at least); we want the splits to be as fair as possible.

So, if I understand correctly, this will do some prefiltering and then return sparse similarity values, which is indeed something that can be used for this purpose.

Is the following recipe from the user guide still the recommended way to do this?

fake_pref qdb tdb allvsallpref
mmseqs align qdb tdb allvsallpref allvsallaln
mmseqs convertalis qdb tdb allvsallaln allvsall.m8

Thank you! Sincerely, P.
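For the homology-leakage use case, one possible way to consume a file like allvsall.m8 is sketched below in plain Python. It assumes the default 12-column tab-separated layout with query in column 1, target in column 2, and bit score in column 12. The helper names `parse_m8` and `homology_aware_folds`, the `min_bits` cutoff of 50, and the example rows are all hypothetical choices for illustration. Sequences linked by a hit above the cutoff are kept in the same fold, and whole groups are assigned greedily to the currently smallest fold.

```python
from collections import defaultdict


def parse_m8(lines):
    """Yield (query, target, bit_score) from tab-separated alignment rows."""
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        cols = line.split("\t")
        yield cols[0], cols[1], float(cols[11])


def homology_aware_folds(ids, hits, n_folds, min_bits=50.0):
    """Split ids into n_folds so that linked sequences share a fold."""
    # Undirected similarity graph over hits that pass the cutoff.
    adj = defaultdict(set)
    for q, t, bits in hits:
        if q != t and bits >= min_bits:
            adj[q].add(t)
            adj[t].add(q)

    # Connected components via depth-first search.
    seen, components = set(), []
    for s in ids:
        if s in seen:
            continue
        seen.add(s)
        stack, comp = [s], []
        while stack:
            x = stack.pop()
            comp.append(x)
            for y in adj[x]:
                if y not in seen:
                    seen.add(y)
                    stack.append(y)
        components.append(sorted(comp))

    # Largest components first, each into the currently smallest fold.
    folds = [[] for _ in range(n_folds)]
    for comp in sorted(components, key=len, reverse=True):
        min(folds, key=len).extend(comp)
    return folds


# Made-up rows standing in for the contents of an .m8 file.
rows = [
    "s1\ts1\t1.000\t100\t0\t0\t1\t100\t1\t100\t1e-50\t200",
    "s1\ts2\t0.900\t100\t10\t0\t1\t100\t1\t100\t1e-40\t150",
    "s3\ts4\t0.800\t100\t20\t0\t1\t100\t1\t100\t1e-30\t120",
    "s2\ts5\t0.300\t50\t35\t0\t1\t50\t1\t50\t1e-02\t20",
]
ids = ["s1", "s2", "s3", "s4", "s5", "s6"]
folds = homology_aware_folds(ids, parse_m8(rows), n_folds=2)
print(folds)
```

With real data, the rows would come from iterating over the opened .m8 file instead of the `rows` list. Fold sizes are only approximately balanced, and a single very large component can make an even split impossible.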