igrabski / sc-SHC

Significance analysis for clustering single-cell RNA-sequencing data

Discrepancy SHC and testClusters #20

Closed browaeysrobin closed 7 months ago

browaeysrobin commented 7 months ago

Hi @igrabski

I have a question about how scSHC::testClusters works compared to scSHC::scSHC. I would expect that applying the significance test implemented in scSHC::testClusters to clusters obtained with scSHC::scSHC would return the same clusters, without merging any of them. However, for two datasets I tried it on, some scSHC-defined clusters were merged after running testClusters. Could you help me understand how this is possible?

I used the following code to run this test:

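# Cluster the cells with scSHC
clusters_scSHC <- scSHC::scSHC(seuratObj@assays$RNA@counts, alpha = 0.05, num_features = 2500, num_PCs = 30, parallel = TRUE, cores = 3)

# Re-test the scSHC clusters with testClusters, using the same settings
cluster_significance_test <- scSHC::testClusters(seuratObj@assays$RNA@counts, cluster_ids = clusters_scSHC[[1]], alpha = 0.05, num_features = 2500, num_PCs = 30, parallel = TRUE, cores = 3)

# Cross-tabulate the re-tested labels (rows) against the scSHC labels (columns)
table(cluster_significance_test[[1]], clusters_scSHC[[1]])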
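
To see at a glance which original clusters ended up merged, the cross-tabulation can be summarized per re-tested label. The following is a minimal base-R sketch on made-up labels: `original` and `retested` are hypothetical stand-ins for `clusters_scSHC[[1]]` and `cluster_significance_test[[1]]`, not output from a real run.

```r
# Hypothetical labels standing in for the two clusterings (one entry per cell).
original <- c("a", "a", "b", "b", "c", "c", "d", "d")
retested <- c("1", "1", "1", "1", "2", "2", "3", "3")

# For each re-tested label, collect the distinct original clusters it contains.
members <- lapply(split(original, retested), unique)

# Keep only re-tested labels that absorb more than one original cluster.
merged <- members[lengths(members) > 1]
print(merged)
```

In this toy example the only merged group is re-tested label "1", which contains original clusters "a" and "b".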

igrabski commented 7 months ago

There are three main reasons why this might happen.

The first is that both scSHC and testClusters give stochastic results, because of the randomness involved in simulating the null distribution, so minor differences in output are possible depending on the seed.

The more likely reasons, however, both come from testClusters being a more approximate procedure than scSHC. Whereas scSHC performs hierarchical clustering on cells and then tests clusters by proceeding down the tree, testClusters begins by performing hierarchical clustering on pseudobulked profiles of the clusters. This can result in testing clusters in a different order from scSHC, which can in turn yield different results.

Finally, when generating the empirical null distribution in scSHC, we apply the same hierarchical clustering procedure to each simulated null dataset. In testClusters, because we don't know what clustering procedure was used to create the original clusters, we instead use a nearest-neighbors approach to define clusters in each null dataset. This can produce somewhat different clusters and therefore a different empirical null distribution of the clustering test statistic.

So, while testClusters and scSHC would be perfectly consistent in an ideal world, the approximations required for testClusters can lead to discrepancies like the ones you observed.
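
As a rough illustration of the second point, the sketch below builds one tree over pseudobulked cluster profiles and another over individual cells, using base R on simulated counts. It is not the package's implementation: the toy data, the `log1p` transform, the Euclidean distance, and the Ward linkage are all assumptions chosen only to make the example self-contained.

```r
set.seed(1)

# Toy count matrix: 50 genes x 60 cells, with three hypothetical cluster labels.
n_genes <- 50
labels <- rep(c("A", "B", "C"), each = 20)
counts <- matrix(
  rpois(n_genes * length(labels), lambda = 5),
  nrow = n_genes,
  dimnames = list(paste0("g", seq_len(n_genes)),
                  paste0("cell", seq_along(labels)))
)

# Pseudobulk: sum counts over the cells in each cluster (genes x clusters).
pseudobulk <- t(rowsum(t(counts), group = labels))

# Tree over cluster-level profiles (three leaves, one per cluster).
tree_clusters <- hclust(dist(t(log1p(pseudobulk))), method = "ward.D2")

# Tree over individual cells (sixty leaves, one per cell).
tree_cells <- hclust(dist(t(log1p(counts))), method = "ward.D2")

# The merge order of each tree determines which splits would be visited first.
print(tree_clusters$merge)
```

Because the two trees are built from different inputs, their top-level splits need not group the same cells, which is one way the order of tests can differ between the two procedures.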

browaeysrobin commented 7 months ago

Hi @igrabski

Thank you for the clear explanation!