Currently, SUMO provides little guidance in selecting the optimal number of clusters. This remains a challenging problem, but we provide the cophenetic correlation plot as one guide that can assist the user in selecting a "good" number of clusters. Another plot we should include is the proportion of ambiguous clustering (PAC) plot, which shows the fraction of values in the consensus matrix that fall in the interval (0.1, 0.9). This plot shows more separation than the cophenetic correlation plot. For example, here is the plot for LGG:
The following code can be used to create this plot. It takes the SUMO result directory as its only command-line argument.
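#!/usr/bin/env python
# Plot the proportion of ambiguous clustering (PAC) for each k in a SUMO result directory.
from sys import argv
from os import listdir
import numpy as np
import pandas as pd
import seaborn as sns

files = listdir(argv[1])
data = []
# The range covers k = 2 .. len(files), so it expects a k<i> subdirectory for
# each of those values; adjust the bounds if the directory is laid out differently.
for i in range(2, len(files)+1):
    filename = "%s/k%d/sumo_results.npz" % (argv[1], i)
    sumo = np.load(filename, allow_pickle=True)
    consensus = sumo['unfiltered_consensus']
    num_samples = consensus.shape[0]
    # Number of off-diagonal entries in the consensus matrix.
    den = (num_samples * num_samples) - num_samples
    # Number of entries strictly inside the ambiguous interval (0.1, 0.9).
    num = len(consensus[(consensus > 0.1) & (consensus < 0.9)])
    data.append([i, num*1./den])

df = pd.DataFrame(np.array(data), columns=["K", "PAC"])
g = sns.relplot(x="K", y="PAC", kind="line", data=df)
g.savefig("pac.png")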
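To make the PAC definition concrete, here is a minimal sketch that computes it on two toy consensus matrices. The function name `pac`, its `lower`/`upper` parameters, and the example matrices are made up for illustration and are not part of SUMO. Unlike the script above, the sketch masks the diagonal explicitly rather than relying on the diagonal values falling outside the interval.

```python
import numpy as np

def pac(consensus, lower=0.1, upper=0.9):
    """Fraction of off-diagonal entries strictly between lower and upper.

    Hypothetical helper for illustration; not a SUMO function.
    """
    consensus = np.asarray(consensus, dtype=float)
    n = consensus.shape[0]
    # Keep only off-diagonal entries (a sample always co-clusters with itself).
    off_diag = ~np.eye(n, dtype=bool)
    values = consensus[off_diag]
    # NaN entries compare as False here, so they count as unambiguous.
    ambiguous = (values > lower) & (values < upper)
    return ambiguous.sum() / (n * n - n)

# Two clean blocks: every pair is always together (1) or never together (0).
clean = np.array([
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
])

# Noisy assignments: 4 of the 6 sample pairs have intermediate consensus values.
noisy = np.array([
    [1.0, 0.5, 0.5, 0.0],
    [0.5, 1.0, 0.2, 0.5],
    [0.5, 0.2, 1.0, 0.95],
    [0.0, 0.5, 0.95, 1.0],
])

print(pac(clean))  # 0.0
print(pac(noisy))  # 8 ambiguous entries out of 12
```

A lower PAC means fewer sample pairs with an ambiguous co-clustering frequency, so under this criterion values of k with low PAC are the candidates to look at first.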