ratan-lab / sumo

Subtyping tool for multi-omic data
https://pypi.org/project/python-sumo
MIT License

selecting the number of clusters #8

Closed aakrosh closed 4 years ago

aakrosh commented 4 years ago

Currently, SUMO provides little guidance on selecting the optimal number of clusters. This remains a challenging problem, but we provide the cophenetic correlation plot as one guide that can help the user select a "good" number of clusters. Another plot we should include is the proportion of ambiguous clustering (PAC) plot, which shows the fraction of values in the consensus matrix that fall in the interval (0.1, 0.9). This plot shows more separation than the cophenetic correlation plot. For example, here is the plot for LGG:

[Figure: PAC plot for LGG]

The following code can be used to create this plot. The result directory from SUMO is the only input provided in this case.

#!/usr/bin/env python

from sys import argv
from os import listdir
import numpy as np
import pandas as pd
import seaborn as sns

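# Find the per-k result directories (k2, k3, ...) rather than assuming that
# the number of entries in the result directory equals the largest k.
ks = sorted(int(name[1:]) for name in listdir(argv[1])
            if name.startswith("k") and name[1:].isdigit())
data = []
for k in ks:
    filename = "%s/k%d/sumo_results.npz" % (argv[1], k)
    sumo = np.load(filename, allow_pickle=True)
    consensus = sumo['unfiltered_consensus']
    num_samples = consensus.shape[0]
    # denominator: number of off-diagonal entries in the consensus matrix
    den = (num_samples * num_samples) - num_samples
    # numerator: entries strictly inside (0.1, 0.9), i.e. ambiguous ones
    num = np.count_nonzero((consensus > 0.1) & (consensus < 0.9))
    data.append([k, num*1./den])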

df = pd.DataFrame(np.array(data), columns=["K", "PAC"])
g = sns.relplot(x="K", y="PAC", kind="line", data=df)
g.savefig("pac.png")
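
To sanity-check the PAC calculation without a SUMO result directory, here is a minimal sketch of the same computation on a toy matrix. The helper name `pac`, its `lower`/`upper` parameters, and the matrix values are made up for illustration and are not part of SUMO. Unlike the script above, this version masks the diagonal explicitly, which gives the same result whenever the diagonal is 1.

```python
import numpy as np


def pac(consensus, lower=0.1, upper=0.9):
    """Fraction of off-diagonal consensus values strictly inside (lower, upper).

    Hypothetical helper for illustration only; not part of the SUMO API.
    """
    consensus = np.asarray(consensus, dtype=float)
    n = consensus.shape[0]
    # Exclude the diagonal from the count as well as from the denominator.
    off_diagonal = ~np.eye(n, dtype=bool)
    ambiguous = (consensus > lower) & (consensus < upper) & off_diagonal
    return np.count_nonzero(ambiguous) / (n * n - n)


# Toy 4-sample consensus matrix (made-up values, not SUMO output).
toy = np.array([
    [1.00, 0.95, 0.50, 0.05],
    [0.95, 1.00, 0.20, 0.00],
    [0.50, 0.20, 1.00, 0.85],
    [0.05, 0.00, 0.85, 1.00],
])

# Three of the six sample pairs (0.50, 0.20, 0.85) are ambiguous.
print(pac(toy))  # 0.5
```

A consensus matrix containing only values near 0 or 1 gives a PAC of 0, so lower values in the plot indicate a cleaner separation at that K.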