theislab / zellkonverter

Conversion between scRNA-seq objects
https://theislab.github.io/zellkonverter/

`writeH5AD`: high memory usage #64

Closed GabrielHoffman closed 2 years ago

GabrielHoffman commented 2 years ago

I have been using zellkonverter::writeH5AD() to convert text files from the Single Cell Portal to H5AD for downstream analysis. I import the single-cell counts from text as a sparseMatrix, and I noticed that writeH5AD() can use a huge amount of memory when writing to disk. When I need to convert large datasets, I have to use a high-memory machine since the conversion can use more than 128 GB.

Is this an issue on my end or in the backend?

Reproducible example:

```r
library(zellkonverter)
library(SingleCellExperiment)
library(Matrix)

# Simulate a dataset as a sparseMatrix
ngenes = 30000
ncells = 100000

counts = rsparsematrix(ngenes, ncells, density = 0.05)
rownames(counts) = paste0("gene_", 1:ngenes)
colnames(counts) = paste0("cell_", 1:ncells)

format(object.size(counts), "Gb")
# "1.7 Gb"

sce = SingleCellExperiment(assays = list(counts = counts))
format(object.size(sce), "Gb")
# "1.7 Gb"

# Uses ~15 GB of memory, measured by `top`
writeH5AD(sce, file = "test.h5ad", compression = "gzip")
```

sessionInfo

```r
> sessionInfo()
R version 4.1.0 (2021-05-18)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)

Matrix products: default
BLAS/LAPACK: /hpc/packages/minerva-centos7/intel/parallel_studio_xe_2019/compilers_and_libraries_2019.0.117/linux/mkl/lib/intel64_lin/libmkl_gf_lp64.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices datasets  utils
[8] methods   base

other attached packages:
 [1] Matrix_1.4-0                SingleCellExperiment_1.14.1
 [3] SummarizedExperiment_1.22.0 Biobase_2.52.0
 [5] GenomicRanges_1.44.0        GenomeInfoDb_1.28.4
 [7] IRanges_2.26.0              S4Vectors_0.30.2
 [9] BiocGenerics_0.38.0         MatrixGenerics_1.4.3
[11] matrixStats_0.62.0          zellkonverter_1.7.0

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.8             XVector_0.32.0         zlibbioc_1.38.0
 [4] here_1.0.1             lattice_0.20-44        tools_4.1.0
 [7] grid_4.1.0             png_0.1-7              cli_3.2.0
[10] basilisk_1.4.0         rprojroot_2.0.2        GenomeInfoDbData_1.2.6
[13] dir.expiry_1.0.0       bitops_1.0-7           basilisk.utils_1.4.0
[16] RCurl_1.98-1.3         glue_1.6.2             DelayedArray_0.18.0
[19] compiler_4.1.0         filelock_1.0.2         jsonlite_1.8.0
[22] reticulate_1.24
```
lazappi commented 2 years ago

Hi @GabrielHoffman

Thanks for giving {zellkonverter} a go! Memory usage is something we haven't really looked into much yet, so I'm not surprised there are some issues with big datasets. It would be useful to work out exactly which parts are using the most memory. Possibly it comes from somewhere we don't have much control over, but we will have to see.

I was able to run your example on my laptop with ~4-5 GB of memory usage, so potentially there are some system-specific things as well.
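If you get a chance, something like the CRAN package {peakRAM} could help narrow down which step is responsible. A rough sketch (note that {peakRAM} only tracks R-side allocations, so anything allocated by the embedded Python process will still only show up in `top`, and SCE2AnnData() assumes a Python environment with anndata is already available to {reticulate}):

```r
library(peakRAM)
library(zellkonverter)

# Measure peak R memory for each step separately:
# the R -> AnnData conversion vs. the full write to disk.
peakRAM(
  SCE2AnnData(sce),
  writeH5AD(sce, file = "test.h5ad", compression = "gzip")
)
```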

GabrielHoffman commented 2 years ago

I suspect it is due to

https://github.com/theislab/zellkonverter/blob/0cd62bcf5092cbeb6b444c409395a35eebed28be/R/SCE2AnnData.R#L58

Running gc() after this line frees up the memory, but for large datasets my R session crashes here due to insufficient memory.
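The overhead is reproducible outside {zellkonverter} too. A minimal sketch, assuming that line hands the assay matrix to Python via {reticulate} and that scipy is available in the Python environment:

```r
library(reticulate)
library(Matrix)

counts <- rsparsematrix(30000, 100000, density = 0.05)

# reticulate converts a dgCMatrix to a scipy.sparse matrix; the copy
# held on the Python side is the extra memory that shows up in `top`.
py_counts <- r_to_py(counts)

# Dropping the R reference and collecting lets the finalizer release
# the Python-side object.
rm(py_counts)
invisible(gc())
```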

That's all I have time for right now.

Gabriel

lazappi commented 2 years ago

Thanks. If this is the issue, there's probably not a lot we can do. The conversion is handled by {reticulate}, and I think there is always going to be some overhead moving objects between environments. We could see if forcing garbage collection helps, but I generally prefer not to mess with that.
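If we did try it, forcing collection on both sides would look something like this (just a sketch of the idea, not anything {zellkonverter} currently does):

```r
library(reticulate)

# Run R's garbage collector first so finalizers release Python
# references, then ask Python's collector to reclaim those objects.
invisible(gc())
py_gc <- import("gc")
py_gc$collect()
```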