pkimes / sigclust2

tests for statistical significance of clustering

Use SigClust to identify significant gene clusters? #9

Open mebbert opened 9 months ago

mebbert commented 9 months ago

Hello,

We'd also like to use SigClust to identify significant gene clusters. Is there any fundamental reason we can't do that, e.g., by simply providing the genes as rows?

Any specific recommendations on parameters?

Really appreciate it.

pkimes commented 9 months ago

Hi Mark - no, there's no fundamental reason you wouldn't be able to use SigClust. However, I want to point out that sigclust / shc are post hoc methods that sit on top of common clustering methods (k-means or hierarchical clustering).

Maybe this will seem like semantics, but I view sigclust not as a method for identifying significant clusters, but rather as a method for assessing the significance of clusters that have already been identified (i.e. sigclust/shc are not actually clustering methods). Because of this, it's hard for me to make any specific recommendations on parameters beyond using whatever you would use to cluster in the first place (based on the known distribution/characteristics of the original data). If this doesn't align with the assumptions / supported parameters of sigclust / shc, it may not make sense to apply the method directly. (This is one reason others have built on the approach for specific applications, e.g. for clustering single-cell data.)

Hope this information is useful.

mebbert commented 9 months ago

Thanks, @pkimes! Very helpful. And thanks for pointing to the single-cell adaptation. That will be super helpful for some of our other work.

I guess the reason I think of SigClust as identifying significant clusters is that it provides a systematic (and statistical) way of determining whether/where we can say one cluster starts and ends rather than saying "this cutoff fits my story really well." :-)

I have two related questions that may merit opening separate issues. Our current cluster contains ~14,000 genes, which I believe comes to ~98M pairwise correlations. SigClust has been running for days and still hasn't finished (using dist(1-cor(t(x), method = "pearson"))). I may be mistaken, but I think it's cor that's taking so long.
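As a quick sanity check on the ~98M figure, here is a minimal Python sketch of the arithmetic. The gene count is the approximate value quoted above, and the memory estimate assumes a dense matrix of 8-byte doubles, which is an assumption for illustration rather than something measured on this dataset.

```python
from math import comb

n_genes = 14_000  # approximate gene count quoted in the comment above

# Number of unordered gene pairs, i.e. the entries in the upper triangle
# of the gene-by-gene correlation matrix.
n_pairs = comb(n_genes, 2)

# Size of the full dense n x n matrix, assuming 8-byte doubles.
full_matrix_gb = n_genes * n_genes * 8 / 1e9

print(n_pairs)         # 97993000
print(full_matrix_gb)
```

So a single correlation matrix of this size is large but manageable, on the order of a couple of gigabytes under the assumption above.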

Questions:

  1. It doesn't take pheatmap days to do the original clustering. Do you know what the difference is?
  2. Could you modify SigClust to allow us to input a pre-calculated cluster (e.g., from pheatmap) so it doesn't have to re-generate the cluster?

Thanks!

pkimes commented 7 months ago

Hi Mark - sorry, I missed this. This is taking so long because of the way sigclust and shc work. They're hypothesis testing methods that use simulation from a null distribution to calculate the p-value. In the case of shc, the test is performed at each subtree of the hierarchical clustering.

So, if using the default n_sim = 100, after the initial correlation-based hierarchical clustering, sigclust simulates 100 null datasets of size ~14,000 and performs hierarchical clustering on each of them, i.e. 100 more times. Then, at the subsequent subtrees, e.g. if the 14,000 is broken into clusters of size 5000 and 9000, another 100 null datasets of size 5000 are simulated and clustered, likewise for the cluster of size 9000, and on and on down the tree. SigClust and SHC are not closed-form methods - they depend on empirical reference distributions obtained by simulation. This takes extra time.
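The workload described above can be roughed out with a short Python sketch. It only counts work for the three nodes mentioned (14,000, then 5000 and 9000) and assumes a distance-matrix-based hierarchical clustering, so each null dataset of n rows needs n(n-1)/2 pairwise dissimilarities. The function name and the use of pairwise counts as a cost proxy are assumptions for illustration, not part of the package.

```python
def simulation_cost(node_sizes, n_sim=100):
    """Rough workload proxy for simulation-based testing at several tree nodes.

    node_sizes: number of observations at each tested node.
    n_sim: null datasets simulated (and clustered) per node.
    Returns (null datasets clustered, pairwise dissimilarities computed).
    """
    n_datasets = n_sim * len(node_sizes)
    # Each null dataset of n rows needs n*(n-1)/2 pairwise dissimilarities
    # before it can be hierarchically clustered.
    n_pairwise = n_sim * sum(n * (n - 1) // 2 for n in node_sizes)
    return n_datasets, n_pairwise


# Root node plus the two child nodes used as an example above.
nodes = [14_000, 5_000, 9_000]
n_datasets, n_pairwise = simulation_cost(nodes)

print(n_datasets)  # 300
print(n_pairwise)  # 15098600000
```

Even for just these three nodes, that is roughly 150 times the ~98M pairwise values needed for the original clustering, which is consistent with a one-off heatmap clustering finishing quickly while the test runs for days.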

I hope this clarifies the issue and also makes it clear why the proposal in Q2 wouldn't work.