igrabski / sc-SHC

Significance analysis for clustering single-cell RNA-sequencing data

Cluster Number is Very Large #16

Closed DarioS closed 10 months ago

DarioS commented 11 months ago

I applied scSHC to the counts in the file HNSCC_fibro_tumour.rds, contained in scRNA-seq_dataobjects.zip from a recent journal article, and it estimated 116 clusters with default parameters, which is far too many.

igrabski commented 10 months ago

Hi Dario, do these data have any batch effect / multi-sample structure? If so, did you run clustering with scSHC or with testClusters?

DarioS commented 10 months ago

I don't think so. Supplementary Figure 1e is "UMAP of CAF type assignment from raw, uncorrected scRNA-Seq data." and the clusters seem homogeneous to me, suggesting a lack of batch effect. I used function scSHC. Can you reproduce the result?

igrabski commented 10 months ago

Hmm, I was unable to reproduce your result. If you share your code (and especially your seed if you set one), I can take a look? I attached a screenshot of what I ran and the output:

[Image: Screenshot 2023-09-14 at 4 22 34 PM]

DarioS commented 10 months ago

Sorry for the confusion. It happens with BREAST_fibro_tumour.rds, not with HNSCC_fibro_tumour.rds.

igrabski commented 10 months ago

Thanks for the clarification! I was able to reproduce your result. The number of clusters is reduced when specifying the batch label, but still seems higher than would be expected. I have not yet fully investigated these data, but the large number of clusters makes me suspect that perhaps the assumptions of the method do not hold here. In particular, we assume that within clusters, genes follow a unimodal distribution (specifically, Poisson-log normal); however, if there are cells where genes actually follow a multimodal distribution within a true cluster, then we will find too many clusters.
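
To make the assumption concrete, here is a minimal sketch in Python (not scSHC code, and not how the package tests clusters). It simulates one gene within a single cluster under a Poisson-log normal model, then simulates a gene whose expression within the "cluster" is a mixture of two such distributions. The parameter values, the helper names `simulate_pln` and `bimodality_coefficient`, and the use of Sarle's bimodality coefficient as a rough diagnostic are all assumptions chosen for illustration.

```python
import numpy as np


def simulate_pln(rng, n_cells, mu, sigma):
    """Draw counts for one gene: rate = exp(Normal(mu, sigma)), count ~ Poisson(rate)."""
    rates = np.exp(rng.normal(mu, sigma, size=n_cells))
    return rng.poisson(rates)


def bimodality_coefficient(values):
    """Sarle's bimodality coefficient (population moments, no small-sample correction).

    A uniform distribution gives 5/9; larger values hint at more than one mode.
    This is only a heuristic, not a formal test.
    """
    x = np.asarray(values, dtype=float)
    z = (x - x.mean()) / x.std()
    skewness = np.mean(z ** 3)
    kurtosis = np.mean(z ** 4)  # plain kurtosis, not excess kurtosis
    return (skewness ** 2 + 1.0) / kurtosis


rng = np.random.default_rng(0)  # illustrative seed

# Case 1: the gene follows a single Poisson-log normal within the cluster.
unimodal = simulate_pln(rng, 2000, mu=2.0, sigma=0.5)

# Case 2: within what is meant to be one cluster, half the cells express
# the gene at a low level and half at a high level (hypothetical values).
bimodal = np.concatenate([
    simulate_pln(rng, 1000, mu=0.0, sigma=0.5),
    simulate_pln(rng, 1000, mu=4.0, sigma=0.5),
])

bc_unimodal = bimodality_coefficient(np.log1p(unimodal))
bc_bimodal = bimodality_coefficient(np.log1p(bimodal))

print(f"bimodality coefficient, unimodal gene: {bc_unimodal:.3f}")
print(f"bimodality coefficient, bimodal gene:  {bc_bimodal:.3f}")
```

A gene like the second one violates the unimodal within-cluster assumption, so a method that tests splits against a unimodal null could keep splitting such a cluster.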