ratan-lab / sumo

Subtyping tool for multi-omic data
https://pypi.org/project/python-sumo
MIT License

Metrics to decide on the number of clusters #22

Open aakrosh opened 3 years ago

aakrosh commented 3 years ago

SUMO generates plots of the cophenetic correlation coefficient and the proportion of ambiguously clustered pairs to help determine the optimal number of clusters. The following additional metrics can be helpful in certain scenarios and should also be generated:

  1. Jaccard index: In some cases, going from k to k+1 clusters assigns only a tiny number of samples to the new cluster. In such a scenario, k+1 clusters may add little classification information over k clusters. Let a be the number of pairs of samples that share a subgroup in both the k and the k+1 solutions, and let b be the number of pairs that share a subgroup in exactly one of the two solutions (same group for k but different for k+1, or same group for k+1 but different for k). The index is then a / (a+b); a value close to 1 indicates that the k+1 solution differs little from the k solution.

  2. Silhouette score: can be computed from the H matrix obtained in each run of the solver, and the final score can be derived from these per-run scores.

  3. Agreement score: the number of pairs of samples in each run of the solver whose assigned labels agree with the consensus labels.
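
To make the pair-counting concrete, here is a minimal sketch of the Jaccard index (item 1) and one possible reading of the agreement score (item 3). None of these function names exist in SUMO; `co_clustered_pairs`, `pairwise_jaccard`, and `pairwise_agreement` are hypothetical helpers. The agreement score is interpreted here as the fraction of sample pairs whose "same cluster / different cluster" status matches between a run and the consensus, which is an assumption and has the side benefit of not depending on how cluster labels are numbered.

```python
from itertools import combinations


def co_clustered_pairs(labels):
    """Return the set of sample-index pairs (i, j), i < j, sharing a label."""
    return {
        (i, j)
        for i, j in combinations(range(len(labels)), 2)
        if labels[i] == labels[j]
    }


def pairwise_jaccard(labels_k, labels_k1):
    """Jaccard index a / (a + b) between two labelings of the same samples."""
    if len(labels_k) != len(labels_k1):
        raise ValueError("both labelings must cover the same samples")
    pairs_k = co_clustered_pairs(labels_k)
    pairs_k1 = co_clustered_pairs(labels_k1)
    a = len(pairs_k & pairs_k1)  # pairs together in both solutions
    b = len(pairs_k ^ pairs_k1)  # pairs together in exactly one solution
    if a + b == 0:
        return 1.0  # convention chosen here: no co-clustered pairs at all
    return a / (a + b)


def pairwise_agreement(run_labels, consensus_labels):
    """Fraction of pairs whose co-membership matches the consensus (assumed definition)."""
    if len(run_labels) != len(consensus_labels):
        raise ValueError("both labelings must cover the same samples")
    total = 0
    agree = 0
    for i, j in combinations(range(len(consensus_labels)), 2):
        same_in_run = run_labels[i] == run_labels[j]
        same_in_consensus = consensus_labels[i] == consensus_labels[j]
        agree += same_in_run == same_in_consensus
        total += 1
    return agree / total if total else 1.0


# Hypothetical labelings: k=2 versus k=3, where one sample splits off.
labels_k = [0, 0, 0, 1, 1, 1]
labels_k1 = [0, 0, 0, 1, 1, 2]
jaccard = pairwise_jaccard(labels_k, labels_k1)      # a = 4, b = 2
agreement = pairwise_agreement(labels_k1, labels_k)  # 13 of 15 pairs match
```

In this toy case a single sample moving to a new cluster leaves a = 4 pairs shared and b = 2 pairs changed, giving 4 / 6. Enumerating all pairs is quadratic in the number of samples, which is usually acceptable for cohort-sized data but could be replaced by a contingency-table computation if needed.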