Closed ppaxisa closed 3 years ago
Hi @ppaxisa,
Thank you for sharing this.
Note that internally we're converting the sparse matrix to a dense matrix, as we've previously had issues with full sparse matrix support. It could be that this step alone is already giving us memory issues.
Your dataset is very large indeed and we haven't yet tried running tradeSeq
on datasets with 250k cells. I'm not quite sure therefore where the error is coming from, i.e., from converting to a dense matrix, mgcv
or somewhere else.
Here are a few suggestions that could get us going:

1. Can you run traceback() after getting the error?
2. Can you run icMat <- evaluateK(as.matrix(GEX[1:10,]), pseudotime = pseudotime, U = U, parallel = F), and similarly (if the previous works) icMat <- evaluateK(as.matrix(GEX[1:500,]), pseudotime = pseudotime, U = U, parallel = F), to check if it's the conversion to a dense matrix (which seems likely)?
3. Can you try nGenes = 2 to make sure it's only a problem of the number of cells?
4. Can you try offset = rep(log(1), ncol(GEX)) in evaluateK to make sure it's not because of the normalization?

Hi @koenvandenberge,
Thanks for the reply. I noticed the conversion to a dense matrix afterwards; that's probably the issue.
When running evaluateK on a small number of genes (2, 10, 100) I get the following error:
> res100 <- evaluateK(GEX[1:100,],
pseudotime = pseudotime,
cellWeights = matrix(1, nrow = nrow(pseudotime)), # only one lineage
U = U,
parallel = F)
**Error in sample.int(length(x), size, replace, prob) :
cannot take a sample larger than the population when 'replace = FALSE'**
> traceback()
9: sample.int(length(x), size, replace, prob)
8: sample(seq_len(nrow(counts)), nGenes)
7: .evaluateK(counts = counts, k = k, U = U, pseudotime = pseudotime,
cellWeights = cellWeights, plot = plot, nGenes = nGenes,
weights = weights, offset = offset, verbose = verbose, conditions = conditions,
parallel = parallel, BPPARAM = BPPARAM, control = control,
family = family, gcv = gcv, ...)
6: .local(counts, ...)
5: evaluateK(counts = as.matrix(counts), k = k, nGenes = nGenes,
sds = sds, pseudotime = pseudotime, cellWeights = cellWeights,
plot = plot, U = U, weights = weights, offset = offset, aicDiff = aicDiff,
verbose = verbose, conditions = conditions, parallel = parallel,
BPPARAM = BPPARAM, control = control, family = family, gcv = gcv,
...)
4: evaluateK(counts = as.matrix(counts), k = k, nGenes = nGenes,
sds = sds, pseudotime = pseudotime, cellWeights = cellWeights,
plot = plot, U = U, weights = weights, offset = offset, aicDiff = aicDiff,
verbose = verbose, conditions = conditions, parallel = parallel,
BPPARAM = BPPARAM, control = control, family = family, gcv = gcv,
...)
3: .local(counts, ...)
2: evaluateK(GEX[1:100, ], pseudotime = pseudotime, cellWeights = matrix(1,
nrow = nrow(pseudotime)), U = U, parallel = F)
1: evaluateK(GEX[1:100, ], pseudotime = pseudotime, cellWeights = matrix(1,
nrow = nrow(pseudotime)), U = U, parallel = F)
I think this is related to the library normalization from edgeR::calcNormFactors? I'm guessing there is not enough data in each cell to properly calculate the normalization factors.
When running with 500 genes as below, it works fine.
res500 <- evaluateK(GEX[1:500,],
pseudotime = pseudotime,
cellWeights = matrix(1, nrow = nrow(pseudotime)), # only one lineage
U = U,
parallel = F)
When running with nGenes = 2:
> res <- evaluateK(GEX,
pseudotime = pseudotime,
cellWeights = matrix(1, nrow = nrow(pseudotime)), # only one lineage
U = U,
parallel = F,
nGenes = 2)
Error in h(simpleError(msg, call)) :
error in evaluating the argument 'counts' in selecting a method for function 'evaluateK': Cholmod error 'problem too large' at file ../Core/cholmod_dense.c, line 102
> traceback()
10: h(simpleError(msg, call))
9: .handleSimpleError(function (cond)
.Internal(C_tryCatchHelper(addr, 1L, cond)), "Cholmod error 'problem too large' at file ../Core/cholmod_dense.c, line 102",
base::quote(asMethod(object)))
8: asMethod(object)
7: as(x, "matrix")
6: as.matrix.Matrix(counts)
5: as.matrix(counts)
4: evaluateK(counts = as.matrix(counts), k = k, nGenes = nGenes,
sds = sds, pseudotime = pseudotime, cellWeights = cellWeights,
plot = plot, U = U, weights = weights, offset = offset, aicDiff = aicDiff,
verbose = verbose, conditions = conditions, parallel = parallel,
BPPARAM = BPPARAM, control = control, family = family, gcv = gcv,
...)
3: .local(counts, ...)
2: evaluateK(GEX, pseudotime = pseudotime, cellWeights = matrix(1,
nrow = nrow(pseudotime)), U = U, parallel = F, nGenes = 2)
1: evaluateK(GEX, pseudotime = pseudotime, cellWeights = matrix(1,
nrow = nrow(pseudotime)), U = U, parallel = F, nGenes = 2)
I again get the memory error.
When running with offset = rep(log(1), ncol(GEX))
I get the same error and traceback.
Do I understand correctly that the full matrix is only needed for edgeR::calcNormFactors, since all subsequent modeling steps are done on each gene individually?
I was wondering whether I could calculate the normalization factors beforehand with another method, then partition my sparse matrix into small chunks of a few hundred genes before running evaluateK and fitGAM on them, specifying the normalization factors with the offset argument. That would also be practical for parallelization, as I could submit each chunk to a separate node. Would that make sense?
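As a sketch of what I have in mind (plain library-size normalization stands in here for "another method", and the toy sparse matrix only illustrates the shapes; the real GEX is ~10K genes by ~250K cells):

```r
library(Matrix)

# Toy sparse count matrix (genes x cells) standing in for GEX.
set.seed(1)
GEX <- rsparsematrix(50, 200, density = 0.1,
                     rand.x = function(n) rpois(n, 5))

# Precompute one offset per cell from the full sparse matrix; no
# densification is needed for column sums. TMM-style factors could be
# multiplied into lib_sizes in the same way if they can be computed.
lib_sizes <- Matrix::colSums(GEX)
offsets   <- log(lib_sizes + 1)   # +1 guards against empty cells

length(offsets) == ncol(GEX)      # TRUE: one offset per cell
```

The same offsets vector would then be passed to every per-chunk call.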
Thanks again for the help!
> sessionInfo()
R version 4.1.0 (2021-05-18)
Platform: x86_64-conda-linux-gnu (64-bit)
Running under: Red Hat Enterprise Linux Server 7.9 (Maipo)
Matrix products: default
BLAS/LAPACK: libopenblasp-r0.3.15.so
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] BiocParallel_1.26.1 tradeSeq_1.6.0 speedglm_0.3-3
[4] MASS_7.3-54 Matrix_1.3-4 future.apply_1.7.0
[7] future_1.21.0 harmony_0.1.0 Rcpp_1.0.7
[10] ggsci_2.9 SeuratObject_4.0.2 Seurat_4.0.3
[13] data.table_1.14.0 gtable_0.3.0 cowplot_1.1.1
[16] gplots_3.1.1 RColorBrewer_1.1-2 circlize_0.4.13
[19] ggrepel_0.9.1 scales_1.1.1 viridis_0.6.1
[22] viridisLite_0.4.0 broom_0.7.8 ggExtra_0.9
[25] ggplot2_3.3.5 stringr_1.4.0 tibble_3.1.3
[28] tidyr_1.1.3 dplyr_1.0.7 readr_1.4.0
[31] knitr_1.33 gridExtra_2.3
loaded via a namespace (and not attached):
[1] backports_1.2.1 VGAM_1.1-5
[3] plyr_1.8.6 igraph_1.2.6
[5] lazyeval_0.2.2 splines_4.1.0
[7] densityClust_0.3 listenv_0.8.0
[9] scattermore_0.7 fastICA_1.2-2
[11] GenomeInfoDb_1.28.1 digest_0.6.27
[13] htmltools_0.5.1.1 fansi_0.4.2
[15] magrittr_2.0.1 tensor_1.5
[17] cluster_2.1.2 ROCR_1.0-11
[19] limma_3.48.1 globals_0.14.0
[21] matrixStats_0.59.0 docopt_0.7.1
[23] spatstat.sparse_2.0-0 princurve_2.1.6
[25] colorspace_2.0-2 xfun_0.24
[27] sparsesvd_0.2 crayon_1.4.1
[29] RCurl_1.98-1.3 jsonlite_1.7.2
[31] spatstat.data_2.1-0 survival_3.2-11
[33] zoo_1.8-9 glue_1.4.2
[35] polyclip_1.10-0 zlibbioc_1.38.0
[37] XVector_0.32.0 leiden_0.3.8
[39] DelayedArray_0.18.0 shape_1.4.6
[41] SingleCellExperiment_1.14.1 BiocGenerics_0.38.0
[43] abind_1.4-5 pheatmap_1.0.12
[45] edgeR_3.34.0 DBI_1.1.1
[47] miniUI_0.1.1.1 TrajectoryUtils_1.0.0
[49] xtable_1.8-4 reticulate_1.20
[51] spatstat.core_2.3-0 stats4_4.1.0
[53] htmlwidgets_1.5.3 httr_1.4.2
[55] FNN_1.1.3 ellipsis_0.3.2
[57] ica_1.0-2 pkgconfig_2.0.3
[59] uwot_0.1.10 deldir_0.2-10
[61] locfit_1.5-9.4 utf8_1.2.2
[63] tidyselect_1.1.1 rlang_0.4.11
[65] reshape2_1.4.4 later_1.2.0
[67] munsell_0.5.0 tools_4.1.0
[69] cli_3.0.1 generics_0.1.0
[71] ggridges_0.5.3 fastmap_1.1.0
[73] goftest_1.2-2 fitdistrplus_1.1-5
[75] DDRTree_0.1.5 caTools_1.18.2
[77] purrr_0.3.4 RANN_2.6.1
[79] pbapply_1.4-3 nlme_3.1-152
[81] mime_0.11 slam_0.1-48
[83] monocle_2.20.0 rstudioapi_0.13
[85] compiler_4.1.0 plotly_4.9.4.1
[87] png_0.1-7 spatstat.utils_2.2-0
[89] stringi_1.7.3 lattice_0.20-44
[91] HSMMSingleCell_1.12.0 vctrs_0.3.8
[93] pillar_1.6.1 lifecycle_1.0.0
[95] combinat_0.0-8 spatstat.geom_2.2-2
[97] lmtest_0.9-38 GlobalOptions_0.1.2
[99] RcppAnnoy_0.0.18 bitops_1.0-7
[101] irlba_2.3.3 httpuv_1.6.1
[103] patchwork_1.1.1 GenomicRanges_1.44.0
[105] R6_2.5.0 promises_1.2.0.1
[107] KernSmooth_2.23-20 IRanges_2.26.0
[109] parallelly_1.27.0 codetools_0.2-18
[111] gtools_3.9.2 assertthat_0.2.1
[113] SummarizedExperiment_1.22.0 withr_2.4.2
[115] qlcMatrix_0.9.7 sctransform_0.3.2
[117] S4Vectors_0.30.0 GenomeInfoDbData_1.2.6
[119] mgcv_1.8-36 parallel_4.1.0
[121] hms_1.1.0 grid_4.1.0
[123] rpart_4.1-15 MatrixGenerics_1.4.0
[125] slingshot_2.0.0 Rtsne_0.15
[127] Biobase_2.52.0 shiny_1.6.0
Hi @ppaxisa,
The first code chunk errored because nGenes defaults to 500 and we only gave 100 genes to evaluateK. I forgot to change the nGenes default in the code I suggested, my bad. This also explains why the second chunk works.
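The error can be reproduced with base R alone, since evaluateK internally samples nGenes row indices without replacement:

```r
# evaluateK() internally runs sample(seq_len(nrow(counts)), nGenes).
# Drawing 500 genes without replacement from only 100 rows must fail:
res <- try(sample(seq_len(100), 500), silent = TRUE)
inherits(res, "try-error")          # TRUE
# Once nGenes is no larger than the number of rows, the draw succeeds:
length(sample(seq_len(100), 100))   # 100
```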
The memory error indeed comes from trying to convert the sparse matrix into a dense matrix. edgeR::calcNormFactors
can actually handle sparse matrices.
I have updated evaluateK, and it should now accept large sparse matrices as input without converting the full matrix to a dense one. Within fitGAM, however, it will still convert the matrix of the randomly selected genes (hence 500 genes by default) to a dense matrix. That shouldn't be an issue on your system, since your second chunk ran. The only difference from your second code chunk is that with the update you'll normalize using all the genes, not just the subset you were using there.
Let me know if this works for you.
Hi @koenvandenberge,
Thanks, I think that works for me. I would just point out that I am not sure edgeR::calcNormFactors will actually handle the sparse matrix; from looking at the source code, it appears to convert it internally to a dense matrix. Is that right? Obviously there's nothing you could do if that's the case.
Hi @ppaxisa,
I had just checked whether edgeR::calcNormFactors
accepts a sparse matrix, in order to identify the root of the error.
It is true, as you say, that it does convert the sparse matrix to a dense matrix. But then I do find it surprising that you're running into memory issues when evaluateK
converts the matrix to a dense matrix, but not when calcNormFactors
does this...
If you're planning to fit all genes and you're still having the problem when running fitGAM, it would indeed be useful to follow what you previously wrote: (a) normalize using all genes, and (b) run fitGAM on chunks of your matrix, providing the normalization factors computed on the entire data matrix.
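A rough sketch of that workflow; fit_all_chunks is a hypothetical helper, not part of tradeSeq, and the library-size offset stands in for whichever normalization succeeds on the full matrix:

```r
# (a) sparse-safe per-cell offsets computed once from all genes,
# (b) fitGAM run on each chunk of genes with that fixed offset.
# Assumes GEX (genes x cells, sparse), pseudotime, cellWeights, U
# as in the code chunks above.
fit_all_chunks <- function(GEX, pseudotime, cellWeights, U,
                           chunk_size = 500) {
  offsets <- log(Matrix::colSums(GEX))  # TMM factors could be folded in
  starts  <- seq(1, nrow(GEX), by = chunk_size)
  lapply(starts, function(s) {
    idx <- s:min(s + chunk_size - 1, nrow(GEX))  # last chunk may be short
    tradeSeq::fitGAM(counts = as.matrix(GEX[idx, , drop = FALSE]),
                     pseudotime = pseudotime,
                     cellWeights = cellWeights,
                     U = U, offset = offsets)
  })
}
```

Each element of the returned list could also be submitted to a separate node instead of run through lapply.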
Hi @koenvandenberge,
Thanks for looking into this and thanks again for the package. I think that closes the issue.
Hi, I am trying to use tradeSeq on a large dataset: ~250K cells by ~10K genes. I am also adding a model matrix U with ~70 indicator variables to account for various batch effects. Running evaluateK with a dgCMatrix GEX and a pseudotime with one lineage (one column):

icMat <- evaluateK(GEX, pseudotime = pseudotime, U = U, parallel = F)

I get the following error:
Error in h(simpleError(msg, call)) : error in evaluating the argument 'counts' in selecting a method for function 'evaluateK': Cholmod error 'problem too large' at file ../Core/cholmod_dense.c, line 102
What would be the best approach here? Should I downsample the number of cells or use fewer genes? If I understand correctly, the default here is 500 genes to test k, so the problem is more about the number of cells?
Thanks
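For scale: if I read the "problem too large" message correctly (this is an assumption about Cholmod's 32-bit index limit), the dense coercion fails before memory is even the constraint, because 250K x 10K already has more entries than a Cholmod dense matrix can index:

```r
n_cells   <- 250e3
n_genes   <- 10e3
n_entries <- n_cells * n_genes   # 2.5e9 entries in the dense matrix
n_entries > 2^31 - 1             # TRUE: past the 32-bit index limit
n_entries * 8 / 2^30             # ~18.6 GiB of doubles, even if it fit
```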