Closed antonwnk closed 4 years ago
Thank you for the report @antonwnk. It is surprising, as `tic` applies essentially the same strategy.
I assume you got these errors by running `tic(object, initial = FALSE)`. How was `object` created? Could you also paste your `sessionInfo()` after loading MSnbase?
Ping @jorainer
The other question is how many parallel processes you are using. I'd suggest not using too large a number because (a) you need enough memory to hold the full data of as many files as you have processes, and (b) the disk I/O bottleneck will kick in.
Even for very large data sets and running on a cluster I tend to not use more than 16 cores in parallel (mostly I use 8 cores).
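For anyone landing here, the worker limit can be set globally so that `tic` and other MSnbase operations pick it up. A minimal sketch assuming BiocParallel is installed; the worker count of 8 mirrors the suggestion above:

```r
library(BiocParallel)

# Register a global parallel backend with a capped number of workers.
# MSnbase functions that parallelize will use this via bpparam().
register(MulticoreParam(workers = 8))

# Confirm the registered backend.
bpnworkers(bpparam())
```

Alternatively, most MSnbase functions accept a `BPPARAM` argument, so the cap can also be passed per call instead of registered globally.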
Hi, thanks for replying!
`object` is the result of the following:
```r
object <- readMSData(mzml_file_names, mode = "onDisk", msLevel = 1)
object <- filterRt(object, rt_filter_range)
object <- filterMz(object, mz_filter_range)
object <- clean(object)
object <- filterEmptySpectra(object)
```
and indeed, that's how I get the error. On second look, the alternative method of splitting by file and extracting individually has similar memory usage so I imagine it may also fail if given a little less memory.
Following @jorainer's suggestion and limiting the number of BiocParallel workers got the `tic` command to run successfully, with drastically lower memory usage, and in less time than it took the previous configuration to run out of memory.
I'm not sure I understand how the memory usage is supposed to scale with the number of processes. Is it ever supposed to exceed the size of the samples on disk?
```
R version 3.6.0 (2019-04-26)
Platform: x86_64-redhat-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)

Matrix products: default
BLAS/LAPACK: /usr/lib64/R/lib/libRblas.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets
[8] methods   base

other attached packages:
[1] MSnbase_2.12.0      ProtGenerics_1.18.0 S4Vectors_0.24.3
[4] mzR_2.20.0          Rcpp_1.0.4          Biobase_2.46.0
[7] BiocGenerics_0.32.0

loaded via a namespace (and not attached):
 [1] BiocManager_1.30.10   pillar_1.4.3          compiler_3.6.0
 [4] plyr_1.8.6            iterators_1.0.12      zlibbioc_1.32.0
 [7] digest_0.6.25         ncdf4_1.17            MALDIquant_1.19.3
[10] lifecycle_0.2.0       tibble_2.1.3          preprocessCore_1.48.0
[13] gtable_0.3.0          lattice_0.20-40       pkgconfig_2.0.3
[16] rlang_0.4.5           foreach_1.4.8         dplyr_0.8.5
[19] IRanges_2.20.2        grid_3.6.0            tidyselect_1.0.0
[22] glue_1.3.2            impute_1.60.0         R6_2.4.1
[25] XML_3.99-0.3          BiocParallel_1.20.1   limma_3.42.2
[28] ggplot2_3.3.0         purrr_0.3.3           magrittr_1.5
[31] scales_1.1.0          pcaMethods_1.78.0     codetools_0.2-16
[34] MASS_7.3-51.5         mzID_1.24.0           assertthat_0.2.1
[37] colorspace_1.4-1      affy_1.64.0           doParallel_1.0.15
[40] munsell_0.5.0         vsn_3.54.0            crayon_1.3.4
[43] affyio_1.56.0
```
> Is it ever supposed to exceed the size of the samples on disk?
This depends. The data on disk is compressed; once the m/z and intensity values are in memory they need more space. And then there is R's habit of copying objects instead of replacing them in place, which can result in an even larger memory demand than just holding the data in memory once.
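The copy-instead-of-in-place behaviour can be observed directly in base R. A minimal sketch; `tracemem` requires R built with memory profiling, which is the default for CRAN binaries:

```r
x <- runif(1e6)   # ~8 MB of doubles
y <- x            # no copy yet: x and y share the same memory
tracemem(y)       # print a message when y's memory is duplicated
y[1] <- 0         # modifying y now triggers a full ~8 MB copy
untracemem(y)
```

So a pipeline that "modifies" a large in-memory object a few times can transiently hold several copies of it, multiplying the per-worker memory footprint.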
I see, that makes sense. I think this should be closed then. No problems when running with fewer jobs.
Thanks!
Hi! Apologies if this belongs in another repo.
I'm trying to extract TIC chromatograms from my 741 GC-MS samples (~32 GB on disk) from an `OnDiskMSnExp` object using the `tic` function (`initial = FALSE`) and `BiocParallel::MulticoreParam` to parallelize (I am running on Linux). This fails either with `Error in result[[njob]]: attempt to select less than one element in OneIndex` or with `error writing to connection`, both corresponding to the process(es) being killed by the OOM handler when memory usage goes above 90 GB (the limit I set for the job). I can get around this by doing the following: