Feature request: cluster size histogram

Anyone who works with clusters will need the cluster size distribution at some point. It's not hard to write a script that turns the cluster file into a cluster size histogram, but it would be better to have this done once, in mmseqs itself. Besides the number of clusters of each size, useful columns would be the percentage of genes and the percentage of clusters in that bin, plus cumulative values of both, starting from size 1 (singletons).
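To make the requested columns concrete, here is a minimal sketch of the kind of script I mean. It assumes the cluster file is the two-column TSV (representative, then member, tab-separated, one member per line); the function names and column names are my own and not anything mmseqs provides.

```python
from collections import Counter


def cluster_sizes(lines):
    """Count members per cluster from 'representative<TAB>member' lines (assumed format)."""
    sizes = Counter()
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        rep, _member = line.split("\t")[:2]
        sizes[rep] += 1
    return sizes


def size_histogram(sizes):
    """Build one row per cluster size, with percentages and running cumulatives."""
    n_clusters = len(sizes)
    n_seqs = sum(sizes.values())
    by_size = Counter(sizes.values())  # cluster size -> number of clusters of that size
    rows = []
    cum_clusters = 0
    cum_seqs = 0
    for size in sorted(by_size):  # ascending, so cumulatives start at singletons
        clusters = by_size[size]
        seqs = size * clusters
        cum_clusters += clusters
        cum_seqs += seqs
        rows.append({
            "size": size,
            "clusters": clusters,
            "pct_clusters": 100.0 * clusters / n_clusters,
            "pct_seqs": 100.0 * seqs / n_seqs,
            "cum_pct_clusters": 100.0 * cum_clusters / n_clusters,
            "cum_pct_seqs": 100.0 * cum_seqs / n_seqs,
        })
    return rows


if __name__ == "__main__":
    import sys

    # Usage (hypothetical file name): python cluster_hist.py < clusters.tsv
    for row in size_histogram(cluster_sizes(sys.stdin)):
        print("\t".join(str(v) for v in row.values()))
```

Reading from standard input keeps the sketch independent of any particular output file name.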

It would also be useful to calculate the following summary statistics and log them as well as writing them to a JSON file:

Number of sequences in
Total number of characters in the sequences
Average sequence length
Number of singletons
Fraction of genes in singletons
Size of largest cluster
Fraction of genes in largest cluster
Modal cluster size (peak of size distribution)
Fraction of genes in modal cluster
Which space was used (NA or AA)
Run parameters
Processors used
Run times
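As an illustration of the cluster-related items in this list, here is a rough sketch that computes them from a list of cluster sizes and serializes them as JSON. The key names are made up for this example, and I am assuming "modal cluster size" means the size shared by the most clusters, with "fraction of genes in modal cluster" being the fraction of sequences in clusters of that size. The sequence-length, alphabet, parameter, and timing items are left out because they need information only mmseqs has at run time.

```python
import json
from collections import Counter


def summarize(cluster_sizes):
    """Summary statistics from a non-empty list of cluster sizes (one entry per cluster)."""
    n_seqs = sum(cluster_sizes)
    by_size = Counter(cluster_sizes)  # cluster size -> number of clusters of that size
    largest = max(cluster_sizes)
    # Modal size: the size with the most clusters; ties go to the smaller size (assumption).
    modal = min(by_size, key=lambda s: (-by_size[s], s))
    return {
        "n_sequences": n_seqs,
        "n_clusters": len(cluster_sizes),
        "n_singletons": by_size[1],  # Counter gives 0 when there are no singletons
        "frac_seqs_in_singletons": by_size[1] / n_seqs,
        "largest_cluster_size": largest,
        "frac_seqs_in_largest_cluster": largest / n_seqs,
        "modal_cluster_size": modal,
        "frac_seqs_in_modal_size_clusters": modal * by_size[modal] / n_seqs,
    }


if __name__ == "__main__":
    # Toy input: four clusters of sizes 3, 1, 1 and 2.
    print(json.dumps(summarize([3, 1, 1, 2]), indent=2))
```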

Thanks for listening.
