waldronlab / MultiAssayExperiment

Bioconductor package for management of multi-assay data
https://waldronlab.io/MultiAssayExperiment/

Harmonization makes unnecessary copies of large datasets #299

Closed (LTLA closed 3 years ago)

LTLA commented 3 years ago

Consider this:

library(MultiAssayExperiment)
se <- SummarizedExperiment(list(counts=matrix(0, 20000, 20000)))
colnames(se) <- paste0("SAMPLE_", seq_len(ncol(se)))
mae <- MultiAssayExperiment(list(gene=se))

se is about 3.2 GB in size. Storing it in mae makes another copy because .harmonize() does a round of unnecessary subsetting, even though it just ends up keeping all the columns. Obviously, this is not good: these are big objects when you're dealing with single-cell datasets, and a data container like the MultiAssayExperiment should at least be smart enough to elide the copy. (SummarizedExperiment() manages to do so.) Now, that kind of stuff is bad enough in the constructor, but then:

mae2 <- mae
experiments(mae2) <- experiments(mae2)

also triggers another allocation, because experiments<- calls .harmonize(), which makes yet another copy. Hopefully you can imagine how frustrating this is when you're dealing with 5-10 GB SingleCellExperiment objects and you run out of memory (even on a node with 80 GB of RAM!) when you're trying to store them inside a MultiAssayExperiment.
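For a sense of scale, here is a back-of-the-envelope estimate of the memory involved. It assumes a dense double-precision matrix at 8 bytes per element and ignores R's small per-object header; `dense_gb` is a made-up helper name, and the copy counts simply follow the description above (one copy in the constructor, one more in the replacement).

```r
# Hypothetical helper: approximate size in GB of a dense double matrix,
# at 8 bytes per element (object header ignored).
dense_gb <- function(nrow, ncol) nrow * ncol * 8 / 1e9

one_copy <- dense_gb(20000, 20000)  # the 20000 x 20000 matrix from the example

# Assumed copy counts, taken from the description in this report:
# the original se, plus the copy made when constructing mae ...
after_constructor <- 2 * one_copy
# ... plus one more after experiments(mae2) <- experiments(mae2).
after_replacement <- 3 * one_copy

print(c(one_copy = one_copy,
        after_constructor = after_constructor,
        after_replacement = after_replacement))
```

With 5-10 GB objects, the same multiplication applies to every experiment stored in the container.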

This is another case of harmonization adding more problems than it solves. Low-level methods like [ and experiments<- should be as unsurprising as possible; in this case, we should just get a direct replacement with minimal overhead, erroring if the resulting object is invalid. Harmonization should be pushed up to higher-level functions where it won't do as much damage.
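To make "direct replacement, erroring if invalid" concrete, here is a minimal sketch on a toy S4 class. `ToyContainer` and `replace_experiments` are hypothetical stand-ins for illustration only; they are not MultiAssayExperiment code, and the validity rule is a simplified assumption.

```r
library(methods)

# Hypothetical toy container standing in for a real multi-assay class.
# Validity rule (an assumption): every experiment's columns must be known samples.
setClass("ToyContainer",
    representation(experiments = "list", samples = "character"),
    validity = function(object) {
        for (e in object@experiments) {
            if (!all(colnames(e) %in% object@samples)) {
                return("experiment columns must be a subset of 'samples'")
            }
        }
        TRUE
    }
)

# Direct replacement: swap the slot, then validate. Nothing is subset or
# reordered here; an invalid result raises an error instead of being "fixed".
replace_experiments <- function(x, value) {
    x@experiments <- value
    validObject(x)
    x
}

m <- matrix(0, 2, 2, dimnames = list(NULL, c("a", "b")))
toy <- new("ToyContainer", experiments = list(gene = m), samples = c("a", "b"))

# A valid replacement goes straight through.
ok <- replace_experiments(toy, list(gene = m))

# An invalid replacement errors; the error is caught here only for demonstration.
z <- matrix(0, 2, 1, dimnames = list(NULL, "z"))
bad <- tryCatch(replace_experiments(toy, list(gene = z)),
                error = function(e) "invalid")
```

A higher-level function could then call a harmonizing step explicitly before the replacement when the caller actually wants it.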

Session info

```
R Under development (unstable) (2021-01-24 r79876)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 18.04.5 LTS

Matrix products: default
BLAS:   /home/luna/Software/R/trunk/lib/libRblas.so
LAPACK: /home/luna/Software/R/trunk/lib/libRlapack.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets
[8] methods   base

other attached packages:
 [1] MultiAssayExperiment_1.17.13 SummarizedExperiment_1.21.1
 [3] Biobase_2.51.0               GenomicRanges_1.43.3
 [5] GenomeInfoDb_1.27.6          IRanges_2.25.6
 [7] S4Vectors_0.29.7             BiocGenerics_0.37.1
 [9] MatrixGenerics_1.3.1         matrixStats_0.58.0

loaded via a namespace (and not attached):
 [1] lattice_0.20-41        rhdf5filters_1.3.4     bitops_1.0-6
 [4] grid_4.1.0             zlibbioc_1.37.0        XVector_0.31.1
 [7] Matrix_1.3-2           Rhdf5lib_1.13.4        tools_4.1.0
[10] RCurl_1.98-1.2         HDF5Array_1.19.5       DelayedArray_0.17.9
[13] compiler_4.1.0         rhdf5_2.35.2           GenomeInfoDbData_1.2.4
```

LiNk-NY commented 3 years ago

It should be resolved in e1a6c793f19fbd3cfd6645e9f2aa99885d72fb24 for the case where the colnames in object and value are identical. If they're not, .harmonize() still needs to be run. Thanks for reporting. Best, Marcel
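The idea of skipping the subsetting when the column names already match can be sketched in base R as follows. This is an illustration of the general pattern only, not the code from the commit above; `subset_cols_if_needed` is a hypothetical name, and a plain matrix stands in for an experiment.

```r
# Hypothetical helper, not part of MultiAssayExperiment: subset or reorder the
# columns of `x` to `keep`, but hand `x` back untouched when nothing would change.
subset_cols_if_needed <- function(x, keep) {
    if (identical(colnames(x), keep)) {
        # Same names in the same order: skip `[`, which would otherwise
        # allocate a new matrix even though every column is kept.
        return(x)
    }
    x[, keep, drop = FALSE]
}

m <- matrix(0, 5, 3, dimnames = list(NULL, paste0("SAMPLE_", 1:3)))

# Identical colnames: the input is returned as-is.
same <- subset_cols_if_needed(m, colnames(m))

# Different colnames: real subsetting (and hence a new allocation) is needed.
fewer <- subset_cols_if_needed(m, c("SAMPLE_3", "SAMPLE_1"))
```

The `identical()` check only compares two character vectors, so it is cheap relative to copying a multi-gigabyte assay.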