soedinglab / MMseqs2

MMseqs2: ultra fast and sensitive search and clustering suite
https://mmseqs.com
GNU General Public License v3.0

Feature request: cluster size histogram #404

Open joelb123 opened 3 years ago

joelb123 commented 3 years ago

Anybody who has to deal with clusters will need the size distribution at some point. It's not hard to write a script that turns the cluster file into a cluster size histogram, but it would be better done once in mmseqs itself. Other useful columns for the histogram would be the percent of genes and the percent of clusters in each bin, plus cumulative values of both, starting from size 1 (singletons).
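
For illustration, here is a minimal sketch of the kind of script meant above. It is not part of mmseqs. It assumes the cluster file is the usual two-column TSV (representative ID, then member ID, one member per line), and the function name and column names are made up for this example.

```python
from collections import Counter
import csv


def cluster_size_histogram(tsv_lines):
    """Build a cluster size histogram from representative<TAB>member lines.

    Returns one dict per cluster size, sorted by size ascending, with
    percentages and cumulative percentages starting from singletons.
    """
    # Count members per representative; the representative normally
    # appears as a member of its own cluster, so a singleton has count 1.
    members_per_cluster = Counter()
    for row in csv.reader(tsv_lines, delimiter="\t"):
        if len(row) < 2:
            continue  # skip blank or malformed lines
        members_per_cluster[row[0]] += 1

    # Number of clusters observed at each size.
    size_counts = Counter(members_per_cluster.values())
    total_clusters = sum(size_counts.values())
    total_members = sum(size * n for size, n in size_counts.items())

    rows = []
    cum_clusters = 0
    cum_members = 0
    for size in sorted(size_counts):
        n = size_counts[size]
        cum_clusters += n
        cum_members += size * n
        rows.append({
            "size": size,
            "clusters": n,
            "pct_clusters": 100 * n / total_clusters,
            "pct_members": 100 * size * n / total_members,
            "cum_pct_clusters": 100 * cum_clusters / total_clusters,
            "cum_pct_members": 100 * cum_members / total_members,
        })
    return rows


# Tiny in-memory example: one cluster of three genes and two singletons.
sample = ["A\tA", "A\tB", "A\tC", "D\tD", "E\tE"]
for row in cluster_size_histogram(sample):
    print(row)
```

For a real run, pass an open file handle instead of the `sample` list, since the function only needs an iterable of lines.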

It would also be useful to calculate the following summary statistics and log them as well as writing them to a JSON file:

Thanks for listening.

milot-mirdita commented 3 years ago

I like the idea. We already have a summarizing module for taxonomy, taxonomyreport, so doing something analogous for clustering sounds useful.