lgatto / MSnbase

Base Classes and Functions for Mass Spectrometry and Proteomics
http://lgatto.github.io/MSnbase/

tic function memory leak? #509

Closed antonwnk closed 4 years ago

antonwnk commented 4 years ago

Hi! Apologies if this belongs to another repo.

I'm trying to extract TIC chromatograms from my 741 GC-MS samples (~32 GB on disk) from an OnDiskMSnExp object, using the tic function (initial = FALSE) and BiocParallel::MulticoreParam to parallelize (I am running on Linux). This fails either with `Error in result[[njob]]: attempt to select less than one element in OneIndex` or with `error writing to connection`, which corresponds to the process(es) being killed by the OOM handler when memory usage goes above 90 GB (the limit I set for the job). I can get around this by doing the following:

```r
by_file   = splitByFile(object, f = as.factor(fileNames(object)))
tic_split = bplapply(by_file, tic, initial = FALSE, BPPARAM = bpstart(mc.param))
bpstop(mc.param)
```
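(For context, mc.param here is the BiocParallel::MulticoreParam mentioned above; a minimal, hypothetical construction would look like this, with a placeholder worker count since the actual number isn't stated:)

```r
library(BiocParallel)
n_workers = 16                                  # placeholder: actual worker count not stated above
mc.param  = MulticoreParam(workers = n_workers)
```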
lgatto commented 4 years ago

Thank you for the report @antonwnk. It is surprising, as tic applies essentially the same strategy as your workaround.

I assume you got these errors by running

```r
tic(object, initial = FALSE)
```

Ping @jorainer

jorainer commented 4 years ago

The other question is how many parallel processes you are using. I'd suggest not using too large a number, because a) you need enough memory to keep the full data of as many files as there are processes in memory, and b) the disk I/O bottleneck will kick in.

Even for very large data sets running on a cluster I tend not to use more than 16 cores in parallel (mostly I use 8 cores).
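For reference, a minimal sketch of capping the worker count via the registered backend (the count of 8 is just an example, and this assumes tic() falls back to the default registered BiocParallel backend):

```r
library(BiocParallel)
## Limit the number of parallel workers; 8 is an example value
register(MulticoreParam(workers = 8))
tic(object, initial = FALSE)
```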

antonwnk commented 4 years ago

Hi, thanks for replying! object is the result of the following:

```r
object = readMSData(mzml_file_names, mode = "onDisk", msLevel = 1)
object = filterRt(object, rt_filter_range)
object = filterMz(object, mz_filter_range)
object = clean(object)
object = filterEmptySpectra(object)
```

and indeed, that's how I get the error. On second look, the alternative method of splitting by file and extracting individually has similar memory usage so I imagine it may also fail if given a little less memory.

Following @jorainer's suggestion and limiting the number of BiocParallel workers got the tic command to run successfully, with drastically lower memory usage and in less time than it took the previous configuration to run out of memory. I'm not sure I understand how the memory usage is supposed to scale with the number of processes. Is it ever supposed to exceed the size of the samples on disk?

sessionInfo()

```
R version 3.6.0 (2019-04-26)
Platform: x86_64-redhat-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)

Matrix products: default
BLAS/LAPACK: /usr/lib64/R/lib/libRblas.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets
[8] methods   base

other attached packages:
[1] MSnbase_2.12.0      ProtGenerics_1.18.0 S4Vectors_0.24.3
[4] mzR_2.20.0          Rcpp_1.0.4          Biobase_2.46.0
[7] BiocGenerics_0.32.0

loaded via a namespace (and not attached):
 [1] BiocManager_1.30.10   pillar_1.4.3          compiler_3.6.0
 [4] plyr_1.8.6            iterators_1.0.12      zlibbioc_1.32.0
 [7] digest_0.6.25         ncdf4_1.17            MALDIquant_1.19.3
[10] lifecycle_0.2.0       tibble_2.1.3          preprocessCore_1.48.0
[13] gtable_0.3.0          lattice_0.20-40       pkgconfig_2.0.3
[16] rlang_0.4.5           foreach_1.4.8         dplyr_0.8.5
[19] IRanges_2.20.2        grid_3.6.0            tidyselect_1.0.0
[22] glue_1.3.2            impute_1.60.0         R6_2.4.1
[25] XML_3.99-0.3          BiocParallel_1.20.1   limma_3.42.2
[28] ggplot2_3.3.0         purrr_0.3.3           magrittr_1.5
[31] scales_1.1.0          pcaMethods_1.78.0     codetools_0.2-16
[34] MASS_7.3-51.5         mzID_1.24.0           assertthat_0.2.1
[37] colorspace_1.4-1      affy_1.64.0           doParallel_1.0.15
[40] munsell_0.5.0         vsn_3.54.0            crayon_1.3.4
[43] affyio_1.56.0
```

jorainer commented 4 years ago

> Is it ever supposed to exceed the size of the samples on disk?

This depends. The data on disk is compressed; once the m/z and intensity values are in memory they need more space. And then there is R's tendency to copy objects instead of modifying them in place, which can result in an even larger memory demand than just having the data in memory once.
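To illustrate the copy-on-modify point with a generic base-R example (nothing MSnbase-specific here):

```r
x <- rnorm(1e6)   # ~8 MB of doubles
tracemem(x)       # report whenever x gets duplicated
y <- x            # no copy yet: x and y share the same memory
y[1] <- 0         # modifying y triggers a full copy, so the data
                  # briefly exists twice in memory
```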

antonwnk commented 4 years ago

I see, that makes sense. I think this should be closed then. No problems when running with fewer jobs.

Thanks!