igrabski / sc-SHC

Significance analysis for clustering single-cell RNA-sequencing data
87 stars 10 forks source link

testClusters function ends abruptly and gives a single cluster assignment to all cells #24

Closed amruta2612 closed 3 weeks ago

amruta2612 commented 3 weeks ago

I am analysing a few publicly available scRNA-seq datasets. I use graph-based community detection (kNN + Leiden) to get clusters at different resolution values, and use these assignments as input to the testClusters function. I have cluster assignments from a particular resolution value saved as a vector "clusters". The raw counts data is stored in a matrix "counts", and list of genes of interest as a vector "list". I run the following command:

scSHC::testClusters(data=counts, cluster_ids=clusters, var.genes=list, parallel=T, cores=1)

Problem: the output is dependent on the initial cluster assignments/number of clusters provided. It occasionally works, but sometimes the algorithm ends within a few seconds and outputs the same cluster assignment 'new1' to all cells in the dataset. I then try different resolution values until I get a result of >1 cluster assignments. Sometimes this does not work at all. And for all these datasets, when I run the scSHC function (allowing it to perform its own clustering), it gives a reasonable number of statistically significant clusters.

My theory: this is because sometimes the graph-based clustering algorithm gives varied cluster sizes; lowest ~5-6 cells and highest ~7000 cells. Since permutation testing is sensitive to cluster sizes, I suspect smaller clusters would not be as robust.

Has anybody else encountered this? Any explanations? Thank you :)

igrabski commented 3 weeks ago

Hi, thanks for your question!

First, just to double-check about the genes of interest -- the parameter var.genes is intended to represent the full set of features on which clustering is based (for instance, by default, this is a set of 2,500 genes), so if that list is too small or otherwise not representative enough of variation in the data, there may be odd results.

Second, it may indeed be possible that the very small cluster sizes in particular could cause issues, and the testClusters function may not be as robust for evaluating super tiny clusters. In general, however, I would be somewhat skeptical about clusters that are as small as 5-6 cells.

Finally, it is expected that testClusters could give different outputs depending on the initial clusters, because it is intended only to evaluate the provided clusters. As an extreme example, if a dataset contains several true different populations, but completely meaningless cluster labels were provided, then testClusters ideally should collapse them all into a single cluster, because that's the best description based on the provided clusters.

I hope that helps and please let me know if you have further questions!

amruta2612 commented 3 weeks ago

Thank you for the response!

Typically, my gene list is a list of ~3000 genes corresponding to the top 50 PCA dimensions, so this isn't likely to be the issue. I agree that it is more likely to be the cluster sizes and cluster assignments themselves.

However, while varying the 'k' value to get bigger communities, I encountered the following error when I ran testClusters:

Error: TridiagEigen: eigen decomposition failed

Any suggestions on how to resolve this?

Thank you!

igrabski commented 3 weeks ago

Got it, your gene list makes sense then!

Hm, can you confirm that you are passing in raw counts rather than log-normalized data into testClusters (e.g., if taking data from a Seurat object, the counts assay should be passed rather than data)? If so and you are still get this error, would you be able to share your data and cluster labels (e.g. you could email igrabski[at]nygenome[dot]org, or whatever is easiest), and I can take a look!

amruta2612 commented 3 weeks ago

Yes, I pass the raw counts matrix.

Sounds good, I should be able to email it to you- would you like the counts matrix as a .rds file (can be opened in R)?

igrabski commented 3 weeks ago

Yes, an .rds file works! Please also include the cluster labels.

igrabski commented 3 weeks ago

Oh also please include var.genes!

amruta2612 commented 3 weeks ago

I have just sent everything to your email address. Please let me know if there's any other information I can give that helps.

Thanks!