igrabski / sc-SHC

Significance analysis for clustering single-cell RNA-sequencing data

High memory usage #5

Closed mihem closed 1 year ago

mihem commented 1 year ago

@igrabski Thanks for this great and very much needed package and congratulations on the publication.

When I ran this on my machine (64 GB RAM) with my dataset (172,049 samples and 32,786 genes), I quickly ran out of memory. I disabled parallelization, because it often increases memory usage a lot, but even then I ran out of memory.
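For a sense of scale, here is a back-of-the-envelope estimate of what a single dense copy of a matrix with these dimensions would occupy. This is only a sketch: whether sc-SHC densifies the count matrix at any step is an assumption, not something confirmed in this thread, and `dense_matrix_gib` is a hypothetical helper written for illustration.

```python
def dense_matrix_gib(n_rows: int, n_cols: int, bytes_per_value: int = 8) -> float:
    """Size in GiB of a dense n_rows x n_cols matrix (8 bytes = one double)."""
    return n_rows * n_cols * bytes_per_value / 2**30


# Dimensions reported above; dense storage is an ASSUMPTION for illustration.
n_samples, n_genes = 172_049, 32_786

full = dense_matrix_gib(n_samples, n_genes)
print(f"one dense double-precision copy: {full:.1f} GiB")  # 42.0 GiB
```

Under that assumption, a single dense copy already takes up most of 64 GB, and any intermediate copy would exceed it. A sparse representation would be far smaller, which may be why other tools handle the same data comfortably.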

I downsampled my Seurat object drastically to 100 cells per cluster with 19 clusters (roughly 2,000 cells). testClusters then ran successfully, but the result was not plausible: the "corrected version" proposed merging glial cells with immune cells.

I then downsampled my Seurat object to 1,000 cells per cluster (roughly 18,000 cells, because a few clusters have fewer than 1,000). testClusters also ran successfully, and this time the result was much more plausible, with 17 clusters.

-> I think this method is difficult to use for larger datasets with many clusters. If you downsample, the significance drops drastically; if you don't downsample, you run into memory issues.
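The per-cluster downsampling described above can be sketched as follows. This is a language-agnostic illustration in plain Python, not the Seurat or sc-SHC code path; `downsample_per_cluster` and the toy labels are hypothetical. Clusters smaller than the cap are kept whole, which is why a cap of 1,000 over 19 clusters gives fewer than 19,000 cells.

```python
import random
from collections import defaultdict


def downsample_per_cluster(labels, max_per_cluster, seed=0):
    """Return sorted cell indices, keeping at most max_per_cluster per label."""
    rng = random.Random(seed)  # fixed seed so the subset is reproducible
    by_cluster = defaultdict(list)
    for idx, label in enumerate(labels):
        by_cluster[label].append(idx)

    kept = []
    for members in by_cluster.values():
        # Only sample when the cluster exceeds the cap; small clusters stay whole.
        if len(members) > max_per_cluster:
            members = rng.sample(members, max_per_cluster)
        kept.extend(members)
    return sorted(kept)


# Toy labels, made up for illustration: two large clusters and one small one.
labels = ["a"] * 5000 + ["b"] * 1500 + ["c"] * 300
kept = downsample_per_cluster(labels, max_per_cluster=1000)
print(len(kept))  # 1000 + 1000 + 300 = 2300
```

Trying several caps (for example 100, 500, 1,000) and checking whether the merged clusters stabilize is one way to pick a subset size that still fits in memory.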

I tried to debug this: the crash occurs here (generate_null_statistic): https://github.com/igrabski/sc-SHC/blob/2c7fb82e8a190732a7fe933bae914d1969811348/R/clustering.R#L145C3-L145C8

Are there any ideas for tackling these problems, such as refactoring the code or using C++? Of note, I can easily process this dataset with all Seurat functions (normalization, clustering, UMAP, DE analysis) on my machine.

Thank you.

igrabski commented 1 year ago

Thank you for trying our package and for the useful feedback! As you noted, we have found that downsampling to very small numbers of cells tends to give overly conservative results; performance then generally improves with the number of cells up to a point, after which it tends to stabilize. However, it is difficult to give precise guidelines on downsampling, because the number of cells needed for strong performance depends a lot on how different the clusters are, how much noise is present in the data, etc. So, for larger datasets, we usually recommend running the analysis on a cluster or RStudio server if possible. In future iterations of this package, though, we do plan to investigate options to reduce memory usage in the more memory-intensive steps.