Closed by AlasdairCUPEI 3 years ago
BuildClusterTree is meant to perform hierarchical clustering on the pseudobulk averages of the different clusters, to help understand the potential hierarchical relationships between them. We do not run hierarchical clustering on the single cells.
You can use this function as a shortcut to calculating pseudobulk cluster averages, and then running hclust.
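As a rough illustration of that shortcut, here is a minimal base-R sketch on simulated data. The matrix `expr`, the label vector `cluster`, and all sizes are made up for the example. This is not Seurat's internal code, and BuildClusterTree's own choices (features used, distance, linkage) may differ.

```r
set.seed(1)

# Hypothetical toy data: 200 genes x 300 cells, genes in rows.
n_genes <- 200
n_cells <- 300
expr <- matrix(rpois(n_genes * n_cells, lambda = 2),
               nrow = n_genes,
               dimnames = list(paste0("gene", seq_len(n_genes)),
                               paste0("cell", seq_len(n_cells))))

# Hypothetical cluster labels, one per cell (five clusters of 60 cells).
cluster <- factor(rep(paste0("c", 1:5), each = 60))

# Pseudobulk: average each gene over the cells of each cluster.
# The result is a genes x clusters matrix.
pseudobulk <- sapply(levels(cluster), function(k) {
  rowMeans(expr[, cluster == k, drop = FALSE])
})

# dist() compares rows, so transpose to get one row per cluster.
d <- dist(t(pseudobulk))

# Hierarchical clustering on 5 cluster averages instead of 300 cells.
tree <- hclust(d, method = "complete")
```

Calling `plot(tree)` afterwards draws the dendrogram. Because the distance matrix has one entry per pair of clusters rather than per pair of cells, this step stays fast regardless of how many cells the dataset contains.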
Hi everyone, I have been trying to use hclust to perform hierarchical clustering on my scRNA-seq data, but I've been running into some issues:
1. Creating a distance matrix takes several hours because of the size of my dataset (~2,000 cells, ~17,000 genes in total).
2. When I then try to run hclust, an error states that my data is too large and the clustering cannot be performed. The same happens with the "fastcluster" package. I was therefore wondering: would it be acceptable to split the data into subsets, run hierarchical clustering on each subset, and compare the results?
I subsequently came across the BuildClusterTree function while searching the Seurat documentation. Its description reads: "Constructs a phylogenetic tree relating the 'average' cell from each identity class. Tree is estimated based on a distance matrix constructed in either gene expression space or PCA space". The command ran in a matter of seconds and produced a hierarchy that generally made sense to me. I have the following questions:
As mentioned, I am aware of hclust, and I am currently trying to set up Monocle. Are there any other recommendations for good programs written in R that perform hierarchical clustering?
Finally, I was wondering about using a statistical metric such as a Pearson (or Bayesian) correlation to compare how closely related some clusters are. Would a Pearson or Bayesian correlation be an acceptable statistic for scRNA-seq data, or should a different one be used instead? I would also like to know the proper procedure for testing this with scRNA-seq data.
Thank you very much in advance! I apologize for the length of this post.
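On the correlation question, one option is to correlate the pseudobulk profiles of the clusters rather than individual cells. Below is a minimal base-R sketch on simulated data. The objects `expr` and `cluster` and the planted signal (clusters A and B sharing a profile) are hypothetical. Whether Pearson is the right statistic for a given dataset is a judgment call that this example does not settle.

```r
set.seed(42)

# Hypothetical toy data: 100 genes, three clusters of 50 cells each.
n_genes <- 100
cluster <- factor(rep(c("A", "B", "C"), each = 50))

# Clusters A and B share one baseline profile; C gets an independent one.
base_ab <- rexp(n_genes, rate = 0.2)
base_c  <- rexp(n_genes, rate = 0.2)
means <- cbind(A = base_ab, B = base_ab, C = base_c)

# Simulate counts: each cell draws Poisson counts around its cluster profile.
# The result is a genes x cells matrix.
expr <- sapply(as.character(cluster), function(k) rpois(n_genes, means[, k]))

# Pseudobulk average per cluster on log-transformed counts (genes x clusters).
log_expr <- log1p(expr)
pseudobulk <- sapply(levels(cluster), function(k) {
  rowMeans(log_expr[, cluster == k, drop = FALSE])
})

# Pearson correlation between every pair of cluster profiles.
r <- cor(pseudobulk, method = "pearson")
print(round(r, 2))

# 1 - r can serve as a dissimilarity for hierarchical clustering.
tree <- hclust(as.dist(1 - r), method = "average")
```

In this toy setup the A-B correlation should come out well above the A-C correlation, because A and B were simulated from the same profile. Swapping `method = "pearson"` for `"spearman"` in `cor()` gives a rank-based alternative that is less sensitive to a few highly expressed genes.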