romanhaa / Cerebro

Visualization of scRNA-seq data.
MIT License

Cholmod error in performGeneSetEnrichmentAnalysis(); large dataset #31

Closed mAGLAVE closed 2 years ago

mAGLAVE commented 3 years ago

Hello, thank you for your amazing work! Our biologists greatly appreciate your application!

I am currently working on a large dataset (around 60,000 cells × 40,000 genes), and I get an error during the GSEA.

sobj <- cerebroApp::performGeneSetEnrichmentAnalysis(object = sobj, assay = "RNA", GMT_file = gmt.file, parallel.sz = 4)
[16:03:25] Loading gene sets...
[16:03:25] Loaded 50 gene sets from GMT file.
[16:03:25] Extracting transcript counts...
Error in asMethod(object) :
  Cholmod error 'problem too large' at file ../Core/cholmod_dense.c, line 105

I don't have this problem with a small dataset. I tried with more memory (I work on a computing cluster), but the problem persists (I allocated 150 GB, but it uses only 70 GB). Do you have a solution?

romanhaa commented 3 years ago

Hi @mAGLAVE! Thank you for the nice words! It's always nice to hear that people find Cerebro useful. Regarding the issue you described, I have a few questions:

1) Which version of cerebroApp are you using?
2) What is the memory footprint of the matrix (object.size(<matrix>))?
3) Do you think you could reduce the number of genes, e.g. by removing very lowly expressed genes?
4) I don't expect this to resolve the issue, but have you tried different values for parallel.sz?

I have performed that analysis on similarly sized data sets in terms of cells, but usually with about half as many genes. Finally, you could try doing the analysis manually as I have described here: https://romanhaa.github.io/projects/scrnaseq_workflow/#gene-set-enrichment-analysis
As a little bonus, it'll show you how to make a nice plot out of it :)
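For reference, a minimal sketch of what such a manual run could look like with the GSVA and GSEABase packages, roughly in the spirit of the linked workflow; the file path, object names, and the choice to average expression per cluster are placeholders/assumptions, not values from this thread, and the call signature matches older GSVA releases:

```r
# Manual gene set enrichment sketch (assumptions: 'sobj' is a Seurat object,
# "gene_sets.gmt" is a placeholder path, parameters are illustrative only)
library(GSVA)
library(GSEABase)
library(Seurat)

# load gene sets from a GMT file
gene_sets <- GSEABase::getGmt("gene_sets.gmt")

# average expression per cluster to keep the matrix small;
# 'expr' must be a genes x samples numeric matrix
expr <- as.matrix(Seurat::AverageExpression(sobj, assays = "RNA")$RNA)

# run GSVA; a small cluster-level matrix avoids the memory issues
# that a dense 60k-cell matrix runs into
enrichment <- GSVA::gsva(expr, gene_sets, method = "gsva", parallel.sz = 4)
```

Working on cluster averages instead of the full cell-level matrix is what keeps this tractable on large datasets.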

mAGLAVE commented 3 years ago

Hi, Thank you for answering so quickly!

  1. I use version 1.2.2. I haven't updated my pipeline to version 1.3 yet. But there is no change between version 1.2 and 1.3, is there?
  2. > object.size(sobj)
     7898679920 bytes
     > object.size(sobj@assays$RNA@counts)
     1968065264 bytes
  3. It is 10X technology, so all genes are lowly expressed. And I don't have the biological knowledge to really make this selection.
  4. I just tested it, but nothing changes.

Thank you for your tips, I will test it ;)

romanhaa commented 3 years ago

I use 1.2.2 version. I haven't uploaded my pipeline with version 1.3 yet. But there is no update between version 1.2 and 1.3, is there?

Actually, there was an update to that function. Before v1.3, I unnecessarily converted the matrix into dense format before performing the analysis. In the newer version, it stays sparse, dramatically reducing the memory footprint (given that the matrix is largely sparse). If you don't mind, try again with v1.3 and see whether that solves the issue.
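To illustrate the point about the dense conversion, here is a generic sketch with the Matrix package (synthetic data, not the matrix from this thread). As far as I know, the Cholmod dense path is also limited to 2^31 − 1 entries, which a 40,000 × 60,000 matrix exceeds, which would explain why adding RAM did not help:

```r
# Sparse vs. dense memory footprint (synthetic example; the dimensions
# and density are illustrative, chosen to mimic scRNA-seq sparsity)
library(Matrix)

set.seed(1)
# 10,000 x 2,000 matrix with ~2% non-zero entries
m_sparse <- rsparsematrix(10000, 2000, density = 0.02)

# the sparse format stores only the non-zero entries ...
print(object.size(m_sparse), units = "Mb")

# ... while the dense conversion allocates every single cell,
# which is the step that blew up at full scale in v1.2
m_dense <- as.matrix(m_sparse)
print(object.size(m_dense), units = "Mb")
```

At this toy scale the dense copy is already an order of magnitude larger; at 60,000 cells the conversion fails outright.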

You're right about most genes being expressed at very low levels in 10X data. In my opinion, a solid way to filter genes is based on how many cells actually express the gene (at least 1 transcript). Most analysis frameworks have an option for that built in. In Seurat, when you create the Seurat object with CreateSeuratObject(), you can just pass min.cells in there, e.g. with a value of 10. Not only will that help reduce the dimensions of the matrix, it might also remove some noise from the data.
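As a sketch, the filter described above can be applied either at object creation or manually; the object and matrix names here ('counts', 'sobj') are placeholders:

```r
library(Seurat)
library(Matrix)

# keep only genes detected (>= 1 transcript) in at least 10 cells;
# 'counts' is a raw gene x cell sparse count matrix (placeholder name)
sobj <- CreateSeuratObject(counts = counts, min.cells = 10)

# the same filter applied manually to an existing sparse matrix
keep <- Matrix::rowSums(counts > 0) >= 10
counts_filtered <- counts[keep, ]
```

Both variants keep the matrix sparse throughout, so the filtering itself adds no meaningful memory cost.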

mAGLAVE commented 3 years ago

Ooh, I'll update the pipeline quickly then! That should solve the problem, thanks! Is it possible to convert Cerebro objects from version 1.2.2 to version 1.3 without regenerating them from Seurat? (We cannot read Cerebro objects generated with version 1.2.2 in the Shiny app of version 1.3.)

Actually, I already apply this filter (a gene must be expressed in at least 5 cells), but here the data are integrated (15 datasets).