Anybody who works with clusters will need the cluster size distribution at some point. It's not hard to write a script that turns the cluster file into a size-distribution histogram, but it would be better done once in mmseqs itself. Besides the number of clusters at each size, useful columns would be the percent of genes and the percent of clusters in that bin, plus cumulative percentages of both starting from size 1 (singletons).
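To make the request concrete, here is a rough Python sketch of the table I have in mind. It assumes the cluster file is the usual two-column TSV (representative, then member, one pair per line). The function names and column names are made up for illustration and are not anything mmseqs provides.

```python
from collections import Counter


def read_cluster_tsv(path):
    """Yield (representative, member) pairs from a two-column TSV.

    Assumption: one tab-separated pair per line, representative first.
    """
    with open(path) as handle:
        for line in handle:
            line = line.rstrip("\n")
            if line:
                rep, member = line.split("\t")[:2]
                yield rep, member


def cluster_size_distribution(pairs):
    """Return one row per cluster size, smallest size (singletons) first."""
    # Number of members per cluster, keyed by representative.
    cluster_sizes = Counter(rep for rep, _member in pairs)
    # Number of clusters at each size.
    size_counts = Counter(cluster_sizes.values())
    n_clusters = len(cluster_sizes)
    n_genes = sum(cluster_sizes.values())

    rows = []
    cum_clusters = 0
    cum_genes = 0
    for size in sorted(size_counts):
        clusters = size_counts[size]
        genes = size * clusters
        cum_clusters += clusters
        cum_genes += genes
        rows.append({
            "size": size,
            "clusters": clusters,
            "genes": genes,
            "pct_clusters": 100.0 * clusters / n_clusters,
            "pct_genes": 100.0 * genes / n_genes,
            "cum_pct_clusters": 100.0 * cum_clusters / n_clusters,
            "cum_pct_genes": 100.0 * cum_genes / n_genes,
        })
    return rows


if __name__ == "__main__":
    # Toy input: one cluster of size 2 and two singletons.
    demo = [("a", "a"), ("a", "b"), ("c", "c"), ("d", "d")]
    for row in cluster_size_distribution(demo):
        print(row)
```

With a real file, `cluster_size_distribution(read_cluster_tsv("clusters.tsv"))` would produce the same rows, where `clusters.tsv` is a placeholder path.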
It would also be useful to calculate the following summary statistics, log them, and write them to a JSON file:
Thanks for listening.