pkimes / sigclust2

tests for statistical significance of clustering

Accept pre-defined cluster #5

Closed mebbert closed 6 years ago

mebbert commented 6 years ago

Hi, it would be great if sigclust2 could accept a pre-defined cluster (e.g., from pheatmap).

pkimes commented 6 years ago

Hi @mebbert, Can you clarify what you mean by "pre-defined cluster"? Do you mean a flat set of cluster labels for the samples, e.g. if we have 5 samples, something like c("cluster1", "cluster1", "cluster2", "cluster3", "cluster3")? Or a hierarchical clustering of samples (generated from elsewhere)? Or something completely separate?

mebbert commented 6 years ago

Thank you for your prompt response.

I'd like to run SigClust on the exact hierarchical cluster I generated in pheatmap, but I'm struggling to generate the same cluster directly in SigClust. So, it would be nice to be able to pass the pheatmap object into SigClust.

I'm probably missing something. Here are my pheatmap settings:

pheatmap(adj_contrast(sig.heat.symp.ctrl, 0.5),
         clustering_distance_rows="correlation",
         clustering_distance_cols=dist((1-cor(sig.heat.symp.ctrl, method = "pearson"))),
         clustering_method="complete",
         cluster_cols = TRUE,
         cluster_rows = TRUE)

pkimes commented 6 years ago

@mebbert, sorry for the delay.

Unfortunately, the shc function needs access to the original data matrix (sig.heat.symp.ctrl), and I don't think this information is available in the output of pheatmap. (Let me know if I'm wrong.)

Fortunately, it shouldn't be too hard to run shc on your data set, although the analysis might take a while if your matrix is very large.

If you want to test for significance of clustering in the rows using "complete" linkage and Pearson correlation, as in the code you've posted, we just need to pass metric = "cor" and linkage = "complete" to the shc function. (The other parameters, null_alg= and ci=, have to be set to non-default values because correlation-based clustering violates some assumptions of the default algorithm.)

shc(adj_contrast(sig.heat.symp.ctrl, 0.5),  
    metric="cor", linkage="complete", 
    null_alg = "2means", ci = "2CI")

Similarly, to test for significance of clustering in the columns, we can run:

shc(t(adj_contrast(sig.heat.symp.ctrl, 0.5)),  
    metric="cor", linkage="complete", 
    null_alg = "2means", ci = "2CI")

(We simply need to transpose the data matrix with t() because shc tests for significance in the rows of the input matrix.)
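To make the row/column point concrete, here is a minimal base R sketch that does not call shc itself. It assumes that correlation-based clustering of rows amounts to hclust on as.dist(1 - cor(t(x))), which is an assumption about the general recipe and not a statement about sigclust2's internals. The toy matrix and the helper name cor_hclust are hypothetical.

```r
# Toy data standing in for the real matrix (hypothetical values and names).
set.seed(1)
mat <- matrix(rnorm(6 * 8), nrow = 6,
              dimnames = list(paste0("gene", 1:6), paste0("sample", 1:8)))

# Hypothetical helper: cluster the ROWS of x using Pearson correlation
# distance and complete linkage. cor() correlates columns, so the matrix
# is transposed before computing correlations between rows.
cor_hclust <- function(x) {
  row_dissim <- 1 - cor(t(x), method = "pearson")
  hclust(as.dist(row_dissim), method = "complete")
}

row_tree <- cor_hclust(mat)     # clusters the 6 rows
col_tree <- cor_hclust(t(mat))  # transposing clusters the 8 columns instead

row_tree$labels
col_tree$labels
```

The same function handles both cases; only the orientation of the input changes, which mirrors the role of t() in the shc call above.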

Hope this is helpful. Let me know if you have any more questions.

mebbert commented 6 years ago

@pkimes, thank you. Looks like I had a mistake in my pheatmap parameters. I was using dist instead of as.dist for the correlation metric, so I was recalculating distances from the correlation-based dissimilarities rather than simply converting them to a dist object. That's why I couldn't reproduce the same cluster in shc.

Thanks for your help. Sorry for the confusion.
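For anyone who runs into the same dist versus as.dist mix-up described above, the following base R sketch illustrates the difference on a small random matrix. The data and variable names are hypothetical; only cor, dist, as.dist, and hclust from base R are used.

```r
# Toy data: 5 samples (rows) by 10 features (hypothetical values).
set.seed(42)
x <- matrix(rnorm(5 * 10), nrow = 5,
            dimnames = list(paste0("s", 1:5), NULL))

# Pairwise dissimilarity between rows: 1 - Pearson correlation (5 x 5 matrix).
cor_dissim <- 1 - cor(t(x), method = "pearson")

# as.dist() only converts the matrix to a dist object; values are unchanged.
d_convert <- as.dist(cor_dissim)

# dist() treats each row of cor_dissim as a point and computes NEW Euclidean
# distances between those rows, so the values are no longer 1 - correlation.
d_recompute <- dist(cor_dissim)

# Largest absolute difference between the converted object and the matrix.
max(abs(as.matrix(d_convert) - cor_dissim))

# The two dist objects hold different values, so the resulting trees can differ.
tree_convert   <- hclust(d_convert,   method = "complete")
tree_recompute <- hclust(d_recompute, method = "complete")
```

Passing d_convert to the clustering step keeps the correlation-based dissimilarities intact, whereas d_recompute clusters on distances between rows of the dissimilarity matrix.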