statOmics / tradeSeq

TRAjectory-based Differential Expression analysis for SEQuencing data

Memory limitation for large dataset #159

Closed: ppaxisa closed this issue 3 years ago

ppaxisa commented 3 years ago

Hi, I am trying to use tradeSeq on a large dataset: ~250K cells by ~10K genes. I am also adding a model matrix U with ~70 indicator variables to account for various batch effects. Running evaluateK with a dgCMatrix of counts (GEX) and a pseudotime matrix with a single lineage (one column):

icMat <- evaluateK(GEX, pseudotime = pseudotime, U = U, parallel = F)

I get the following error:

Error in h(simpleError(msg, call)) : error in evaluating the argument 'counts' in selecting a method for function 'evaluateK': Cholmod error 'problem too large' at file ../Core/cholmod_dense.c, line 102

What would be the best approach here? Should I downsample the number of cells or use fewer genes? If I understand correctly, the default here is 500 genes to test K, so the problem is more about the number of cells?

Thanks

koenvandenberge commented 3 years ago

Hi @ppaxisa,

Thank you for sharing this.

Note that internally we convert the sparse matrix to a dense matrix, as we've previously had issues with full sparse-matrix support. It could be that this step is already what's causing the memory problems.

Your dataset is very large indeed, and we haven't yet tried running tradeSeq on datasets with 250k cells. I'm therefore not quite sure where the error is coming from, i.e., whether it's the conversion to a dense matrix, mgcv, or something else. Here are a few suggestions that could get us going:

ppaxisa commented 3 years ago

Hi @koenvandenberge,

Thanks for the reply. I noticed the conversion to a dense matrix afterwards; that's probably the issue.
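
As a rough sanity check (just a sketch based on the ~250K x ~10K dimensions above), the dense coercion alone already looks problematic:

n_genes <- 10000                          # ~10K genes (rows of GEX)
n_cells <- 250000                         # ~250K cells (columns of GEX)
n_genes * n_cells                         # 2.5e9 entries in the dense matrix
n_genes * n_cells > .Machine$integer.max  # TRUE: over the 2^31 - 1 limit reported by cholmod_dense.c
n_genes * n_cells * 8 / 1024^3            # ~18.6 GiB for a dense double matrix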

When running evaluateK on a small number of genes (2, 10, or 100), I get the following error:

> res100 <- evaluateK(GEX[1:100,],
                    pseudotime = pseudotime,
                    cellWeights = matrix(1, nrow = nrow(pseudotime)), # only one lineage
                    U = U,
                    parallel = F)
Error in sample.int(length(x), size, replace, prob) : 
  cannot take a sample larger than the population when 'replace = FALSE'
> 
> 
> 
> traceback()
9: sample.int(length(x), size, replace, prob)
8: sample(seq_len(nrow(counts)), nGenes)
7: .evaluateK(counts = counts, k = k, U = U, pseudotime = pseudotime, 
       cellWeights = cellWeights, plot = plot, nGenes = nGenes, 
       weights = weights, offset = offset, verbose = verbose, conditions = conditions, 
       parallel = parallel, BPPARAM = BPPARAM, control = control, 
       family = family, gcv = gcv, ...)
6: .local(counts, ...)
5: evaluateK(counts = as.matrix(counts), k = k, nGenes = nGenes, 
       sds = sds, pseudotime = pseudotime, cellWeights = cellWeights, 
       plot = plot, U = U, weights = weights, offset = offset, aicDiff = aicDiff, 
       verbose = verbose, conditions = conditions, parallel = parallel, 
       BPPARAM = BPPARAM, control = control, family = family, gcv = gcv, 
       ...)
4: evaluateK(counts = as.matrix(counts), k = k, nGenes = nGenes, 
       sds = sds, pseudotime = pseudotime, cellWeights = cellWeights, 
       plot = plot, U = U, weights = weights, offset = offset, aicDiff = aicDiff, 
       verbose = verbose, conditions = conditions, parallel = parallel, 
       BPPARAM = BPPARAM, control = control, family = family, gcv = gcv, 
       ...)
3: .local(counts, ...)
2: evaluateK(GEX[1:100, ], pseudotime = pseudotime, cellWeights = matrix(1, 
       nrow = nrow(pseudotime)), U = U, parallel = F)
1: evaluateK(GEX[1:100, ], pseudotime = pseudotime, cellWeights = matrix(1, 
       nrow = nrow(pseudotime)), U = U, parallel = F)

I think this is related to the library normalization from edgeR::calcNormFactors? I'm guessing there is not enough data in each cell to properly calculate the normalization factors.

When running with 500 genes as below, it works fine.

res500 <- evaluateK(GEX[1:500,],
                    pseudotime = pseudotime,
                    cellWeights = matrix(1, nrow = nrow(pseudotime)), # only one lineage
                    U = U,
                    parallel = F)

When running with nGenes = 2:

> res <- evaluateK(GEX,
                    pseudotime = pseudotime,
                    cellWeights = matrix(1, nrow = nrow(pseudotime)), # only one lineage
                    U = U,
                    parallel = F,
                    nGenes = 2)
Error in h(simpleError(msg, call)) : 
  error in evaluating the argument 'counts' in selecting a method for function 'evaluateK': Cholmod error 'problem too large' at file ../Core/cholmod_dense.c, line 102
> traceback()
10: h(simpleError(msg, call))
9: .handleSimpleError(function (cond) 
   .Internal(C_tryCatchHelper(addr, 1L, cond)), "Cholmod error 'problem too large' at file ../Core/cholmod_dense.c, line 102", 
       base::quote(asMethod(object)))
8: asMethod(object)
7: as(x, "matrix")
6: as.matrix.Matrix(counts)
5: as.matrix(counts)
4: evaluateK(counts = as.matrix(counts), k = k, nGenes = nGenes, 
       sds = sds, pseudotime = pseudotime, cellWeights = cellWeights, 
       plot = plot, U = U, weights = weights, offset = offset, aicDiff = aicDiff, 
       verbose = verbose, conditions = conditions, parallel = parallel, 
       BPPARAM = BPPARAM, control = control, family = family, gcv = gcv, 
       ...)
3: .local(counts, ...)
2: evaluateK(GEX, pseudotime = pseudotime, cellWeights = matrix(1, 
       nrow = nrow(pseudotime)), U = U, parallel = F, nGenes = 2)
1: evaluateK(GEX, pseudotime = pseudotime, cellWeights = matrix(1, 
       nrow = nrow(pseudotime)), U = U, parallel = F, nGenes = 2)
> 

I again get the memory error.

When running with offset = rep(log(1), ncol(GEX)), I get the same error and traceback.

Am I right that the full matrix is only needed for edgeR::calcNormFactors, since all subsequent modeling steps are done on each gene individually?

I was wondering if I could calculate the normalization factors beforehand with another method, and then partition my sparse matrix into small chunks of a few hundred genes before running evaluateK and fitGAM on them, specifying the normalization factors with the offset argument (see the sketch below)? That would also be practical for parallelization, as I could submit each chunk to a separate node. Would that make sense?
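
Something like this untested sketch is what I have in mind; the simple library-size offsets here are just a placeholder for whatever normalization method ends up being used:

libSize <- Matrix::colSums(GEX)                   # per-cell totals, computed on the sparse matrix
off     <- log(libSize)                           # one offset per cell, reused for every chunk
chunks  <- split(seq_len(nrow(GEX)), ceiling(seq_len(nrow(GEX)) / 500))
fits <- lapply(chunks, function(idx) {
  fitGAM(as.matrix(GEX[idx, ]),                   # only a few hundred genes are densified at a time
         pseudotime  = pseudotime,
         cellWeights = matrix(1, nrow = nrow(pseudotime)),
         U = U, offset = off, nknots = 6, parallel = FALSE)
})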

Thanks again for the help!


> sessionInfo()
R version 4.1.0 (2021-05-18)
Platform: x86_64-conda-linux-gnu (64-bit)
Running under: Red Hat Enterprise Linux Server 7.9 (Maipo)

Matrix products: default
BLAS/LAPACK: libopenblasp-r0.3.15.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] BiocParallel_1.26.1 tradeSeq_1.6.0      speedglm_0.3-3     
 [4] MASS_7.3-54         Matrix_1.3-4        future.apply_1.7.0 
 [7] future_1.21.0       harmony_0.1.0       Rcpp_1.0.7         
[10] ggsci_2.9           SeuratObject_4.0.2  Seurat_4.0.3       
[13] data.table_1.14.0   gtable_0.3.0        cowplot_1.1.1      
[16] gplots_3.1.1        RColorBrewer_1.1-2  circlize_0.4.13    
[19] ggrepel_0.9.1       scales_1.1.1        viridis_0.6.1      
[22] viridisLite_0.4.0   broom_0.7.8         ggExtra_0.9        
[25] ggplot2_3.3.5       stringr_1.4.0       tibble_3.1.3       
[28] tidyr_1.1.3         dplyr_1.0.7         readr_1.4.0        
[31] knitr_1.33          gridExtra_2.3      

loaded via a namespace (and not attached):
  [1] backports_1.2.1             VGAM_1.1-5                 
  [3] plyr_1.8.6                  igraph_1.2.6               
  [5] lazyeval_0.2.2              splines_4.1.0              
  [7] densityClust_0.3            listenv_0.8.0              
  [9] scattermore_0.7             fastICA_1.2-2              
 [11] GenomeInfoDb_1.28.1         digest_0.6.27              
 [13] htmltools_0.5.1.1           fansi_0.4.2                
 [15] magrittr_2.0.1              tensor_1.5                 
 [17] cluster_2.1.2               ROCR_1.0-11                
 [19] limma_3.48.1                globals_0.14.0             
 [21] matrixStats_0.59.0          docopt_0.7.1               
 [23] spatstat.sparse_2.0-0       princurve_2.1.6            
 [25] colorspace_2.0-2            xfun_0.24                  
 [27] sparsesvd_0.2               crayon_1.4.1               
 [29] RCurl_1.98-1.3              jsonlite_1.7.2             
 [31] spatstat.data_2.1-0         survival_3.2-11            
 [33] zoo_1.8-9                   glue_1.4.2                 
 [35] polyclip_1.10-0             zlibbioc_1.38.0            
 [37] XVector_0.32.0              leiden_0.3.8               
 [39] DelayedArray_0.18.0         shape_1.4.6                
 [41] SingleCellExperiment_1.14.1 BiocGenerics_0.38.0        
 [43] abind_1.4-5                 pheatmap_1.0.12            
 [45] edgeR_3.34.0                DBI_1.1.1                  
 [47] miniUI_0.1.1.1              TrajectoryUtils_1.0.0      
 [49] xtable_1.8-4                reticulate_1.20            
 [51] spatstat.core_2.3-0         stats4_4.1.0               
 [53] htmlwidgets_1.5.3           httr_1.4.2                 
 [55] FNN_1.1.3                   ellipsis_0.3.2             
 [57] ica_1.0-2                   pkgconfig_2.0.3            
 [59] uwot_0.1.10                 deldir_0.2-10              
 [61] locfit_1.5-9.4              utf8_1.2.2                 
 [63] tidyselect_1.1.1            rlang_0.4.11               
 [65] reshape2_1.4.4              later_1.2.0                
 [67] munsell_0.5.0               tools_4.1.0                
 [69] cli_3.0.1                   generics_0.1.0             
 [71] ggridges_0.5.3              fastmap_1.1.0              
 [73] goftest_1.2-2               fitdistrplus_1.1-5         
 [75] DDRTree_0.1.5               caTools_1.18.2             
 [77] purrr_0.3.4                 RANN_2.6.1                 
 [79] pbapply_1.4-3               nlme_3.1-152               
 [81] mime_0.11                   slam_0.1-48                
 [83] monocle_2.20.0              rstudioapi_0.13            
 [85] compiler_4.1.0              plotly_4.9.4.1             
 [87] png_0.1-7                   spatstat.utils_2.2-0       
 [89] stringi_1.7.3               lattice_0.20-44            
 [91] HSMMSingleCell_1.12.0       vctrs_0.3.8                
 [93] pillar_1.6.1                lifecycle_1.0.0            
 [95] combinat_0.0-8              spatstat.geom_2.2-2        
 [97] lmtest_0.9-38               GlobalOptions_0.1.2        
 [99] RcppAnnoy_0.0.18            bitops_1.0-7               
[101] irlba_2.3.3                 httpuv_1.6.1               
[103] patchwork_1.1.1             GenomicRanges_1.44.0       
[105] R6_2.5.0                    promises_1.2.0.1           
[107] KernSmooth_2.23-20          IRanges_2.26.0             
[109] parallelly_1.27.0           codetools_0.2-18           
[111] gtools_3.9.2                assertthat_0.2.1           
[113] SummarizedExperiment_1.22.0 withr_2.4.2                
[115] qlcMatrix_0.9.7             sctransform_0.3.2          
[117] S4Vectors_0.30.0            GenomeInfoDbData_1.2.6     
[119] mgcv_1.8-36                 parallel_4.1.0             
[121] hms_1.1.0                   grid_4.1.0                 
[123] rpart_4.1-15                MatrixGenerics_1.4.0       
[125] slingshot_2.0.0             Rtsne_0.15                 
[127] Biobase_2.52.0              shiny_1.6.0  
koenvandenberge commented 3 years ago

Hi @ppaxisa

The first code chunk errored because nGenes defaults to 500, while we only give 100 genes to evaluateK. I forgot to change the nGenes default in the code I suggested, my bad. This also explains why the second chunk works.
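
For illustration, the same error can be reproduced on its own (a minimal standalone example, not tradeSeq-specific):

sample(seq_len(100), 500)   # what evaluateK does internally with nGenes = 500 but only 100 genes
# Error in sample.int(length(x), size, replace, prob) :
#   cannot take a sample larger than the population when 'replace = FALSE'

Passing nGenes = 100 (or fewer) together with the 100-gene subset should avoid it.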

The memory error indeed comes from trying to convert the sparse matrix into a dense matrix. edgeR::calcNormFactors can actually handle sparse matrices.

I have updated evaluateK so that it now accepts large sparse matrices as input without converting the full matrix to a dense one. However, within fitGAM the matrix of the randomly selected genes (hence 500 genes by default) will still be converted to a dense matrix. That shouldn't be an issue on your system, since your second chunk ran fine. The only difference between the current update and your second code chunk is that the update normalizes using all genes, not just the subset you were using.
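
Concretely, something along these lines should work once you reinstall (assuming you pick up the development version from this repo's default branch):

devtools::install_github("statOmics/tradeSeq")    # development version with the updated evaluateK
library(tradeSeq)

icMat <- evaluateK(GEX,                           # sparse dgCMatrix passed directly, no as.matrix()
                   pseudotime = pseudotime,
                   cellWeights = matrix(1, nrow = nrow(pseudotime)),
                   U = U,
                   parallel = FALSE)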

Let me know if this works for you.

ppaxisa commented 3 years ago

Hi @koenvandenberge,

Thanks, I think that works for me. I would just point out that I am not sure edgeR::calcNormFactors will actually handle the sparse matrix; from looking at the source code, it appears to convert it to a dense matrix internally. Is that right? Obviously there's nothing you could do about it if that's the case.

koenvandenberge commented 3 years ago

Hi @ppaxisa

I had just checked whether edgeR::calcNormFactors accepts a sparse matrix, in order to identify the root of the error. You are right that it does convert the sparse matrix to a dense matrix. But then I find it surprising that you're running into memory issues when evaluateK converts the matrix to a dense matrix, but not when calcNormFactors does the same...

If you're planning to fit all genes and you're still having the problem when running fitGAM, it would indeed be useful to do what you previously proposed: (a) normalize using all genes, and (b) run fitGAM on chunks of your matrix, providing the normalization factors computed on the entire data matrix.

ppaxisa commented 3 years ago

Hi @koenvandenberge,

Thanks for looking into this and thanks again for the package. I think that closes the issue.