soedinglab / MMseqs2

MMseqs2: ultra fast and sensitive search and clustering suite
https://mmseqs.com
GNU General Public License v3.0

Set cascaded clustering sensitivity levels #409

Open bbuchfink opened 3 years ago

bbuchfink commented 3 years ago

I would like to be able to run cascaded clustering with explicitly defined sensitivity levels, and also disable the first Linclust step. Can this be done somehow?

martin-steinegger commented 3 years ago

Hi Benjamin, we currently have no option to turn off Linclust or to set user-defined sensitivity levels for each step. Having flags for both might be useful. Could you please explain your use case a bit more? Maybe we already have some mechanism that might solve some of the issues.

bbuchfink commented 3 years ago

Hi Martin, my use case is a bachelor student who wants to compare clustering with Diamond and MMseqs2. We already did runs with Linclust enabled, but since Diamond unfortunately does not have a Linclust-like feature, we also want to run a comparison with Linclust disabled (and sensitivity levels that match those of Diamond).

martin-steinegger commented 3 years ago

You could try comparing against single-step clustering (--single-step-clustering), but the regular Linclust + cascaded clustering workflow is much faster. For benchmarking you could do these two things: (1) Just hardcode your sensitivity levels in src/workflows/Cluster.cpp line 195 for now. (2) Remove the linclust call in data/cascaded_clustering.sh. But we might add this feature in the next few days.

bbuchfink commented 3 years ago

Thanks, I tried to hack your script and it looks like it's working. Let me know in case you add the feature.

joelb123 commented 3 years ago

I'll second the idea that being able to scan identity levels is useful. Logarithmic steps in (1 - identity) are generally the right spacing. Log-log plots of the deltas in cluster sizes are very informative, with peaks at any genome duplication events.
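
For illustration, a minimal sketch of what "log steps in (1 - identity)" could look like when picking thresholds for a scan. The function name `log_spaced_identities` and its parameters are hypothetical and not part of MMseqs2; the resulting values would presumably be passed one at a time as the minimum sequence identity (`--min-seq-id`) of separate clustering runs.

```python
import math


def log_spaced_identities(min_identity, max_identity, steps):
    """Return identity thresholds whose (1 - identity) values are
    evenly spaced on a log10 scale, from max_identity down to min_identity.

    Hypothetical helper for illustration only; not an MMseqs2 API.
    """
    if not (0.0 <= min_identity < max_identity < 1.0):
        raise ValueError("need 0 <= min_identity < max_identity < 1")
    if steps < 2:
        raise ValueError("need at least two steps")

    # Work in divergence space: d = 1 - identity.
    log_lo = math.log10(1.0 - max_identity)  # smallest divergence
    log_hi = math.log10(1.0 - min_identity)  # largest divergence

    thresholds = []
    for i in range(steps):
        # Interpolate linearly in log10(d), then map back to identity.
        log_d = log_lo + (log_hi - log_lo) * i / (steps - 1)
        thresholds.append(1.0 - 10.0 ** log_d)
    return thresholds


if __name__ == "__main__":
    # Example scan between 50% and 99% identity in five log-spaced steps.
    for identity in log_spaced_identities(0.5, 0.99, 5):
        print(f"{identity:.4f}")
```

With this spacing, the thresholds are dense near high identity and sparse near low identity, so each step multiplies the allowed divergence by a constant factor.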