sneumann / xcms

This is the git repository matching the Bioconductor package xcms: LC/MS and GC/MS Data Analysis

Multicore processing not using registered cores? #769

Open CLUES-Emory opened 1 week ago

CLUES-Emory commented 1 week ago

Hello, I've noticed an interesting result when trying to use multiple cores to process MsExperiment objects with findChromPeaks. Regardless of the number of cores I register (using register(bpstart(MulticoreParam(num_cores)))), only two cores appear to be used for processing. I've replicated this on both a Mac M1 (using RStudio) and a Linux cluster. The number of cores being used for parallel processing is shown below. I also timed how long it takes to process 10 files with 4 and 8 registered cores on both systems, and the times are essentially the same. Both systems had 10 cores available.

Mac: 4 cores: 3.83 mins; 8 cores: 3.81 mins

Linux: 4 cores: 5.83 mins; 8 cores: 5.88 mins

[Image: cores in use with 4 cores registered]

[Image: cores in use with 8 cores registered]

Thank you in advance!

Session info is below.

R version 4.4.0 (2024-04-24)
Platform: x86_64-pc-linux-gnu
Running under: Rocky Linux 8.10 (Green Obsidian)

Matrix products: default
BLAS: /apps/R/4.4.0/lib64/R/lib/libRblas.so
LAPACK: /apps/R/4.4.0/lib64/R/lib/libRlapack.so; LAPACK version 3.12.0

locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

time zone: America/New_York
tzcode source: system (glibc)

attached base packages:
[1] stats4 parallel splines stats graphics grDevices utils
[8] datasets methods base

other attached packages:
[1] tibble_3.2.1 WaveICA_0.1.0 data.table_1.15.4
[4] msentropy_0.1.4 MsBackendMsp_1.8.0 Spectra_1.14.1
[7] S4Vectors_0.42.1 BiocGenerics_0.50.0 RAMClustR_1.3.0
[10] writexl_1.5.0 xMSanalyzer_2.0.6.1 WGCNA_1.72-5
[13] fastcluster_1.2.6 dynamicTreeCut_1.63-1 sva_3.52.0
[16] genefilter_1.86.0 mgcv_1.9-1 nlme_3.1-164
[19] doSNOW_1.0.20 RCurl_1.98-1.16 limma_3.60.4
[22] R2HTML_2.3.4 XML_3.99-0.17 apLCMS_6.6.9
[25] ROCS_1.3 poibin_1.5 ROCR_1.0-11
[28] randomForest_4.7-1.1 e1071_1.7-14 gbm_2.2.2
[31] snow_0.4-4 doParallel_1.0.17 iterators_1.0.14
[34] foreach_1.5.2 mzR_2.38.0 Rcpp_1.0.12
[37] rgl_1.3.1 MASS_7.3-60.2 gridExtra_2.3
[40] ggplot2_3.5.1 readxl_1.4.3 microbenchmark_1.4.10
[43] RColorBrewer_1.1-3 dplyr_1.1.4 MsExperiment_1.6.0
[46] ProtGenerics_1.36.0 xcms_4.2.2 BiocParallel_1.38.0

loaded via a namespace (and not attached):
[1] bitops_1.0-8 cellranger_1.1.0
[3] preprocessCore_1.66.0 pROC_1.18.5
[5] rpart_4.1.23 lifecycle_1.0.4
[7] edgeR_4.2.1 lattice_0.22-6
[9] MultiAssayExperiment_1.30.3 backports_1.4.1
[11] magrittr_2.0.3 rmarkdown_2.26
[13] Hmisc_5.1-3 plsdepot_0.2.0
[15] MsCoreUtils_1.16.1 DBI_1.2.2
[17] abind_1.4-5 zlibbioc_1.50.0
[19] GenomicRanges_1.56.1 purrr_1.0.2
[21] AnnotationFilter_1.28.0 JADE_2.0-4
[23] nnet_7.3-19 GenomeInfoDbData_1.2.12
[25] IRanges_2.38.1 MSnbase_2.30.1
[27] annotate_1.82.0 ncdf4_1.22
[29] codetools_0.2-20 DelayedArray_0.30.1
[31] tidyselect_1.2.1 UCSC.utils_1.0.0
[33] matrixStats_1.3.0 base64enc_0.1-3
[35] jsonlite_1.8.8 Formula_1.2-5
[37] survival_3.5-8 tools_4.4.0
[39] progress_1.2.3 glue_1.7.0
[41] SparseArray_1.4.8 xfun_0.43
[43] MatrixGenerics_1.16.0 ggfortify_0.4.17
[45] GenomeInfoDb_1.40.1 withr_3.0.0
[47] BiocManager_1.30.22 fastmap_1.1.1
[49] fansi_1.0.6 digest_0.6.35
[51] R6_2.5.1 colorspace_2.1-0
[53] GO.db_3.19.1 RSQLite_2.3.7
[55] waveslim_1.8.5 utf8_1.2.4
[57] tidyr_1.3.1 generics_0.1.3
[59] corpcor_1.6.10 class_7.3-22
[61] prettyunits_1.2.0 PSMatch_1.8.0
[63] httr_1.4.7 htmlwidgets_1.6.4
[65] S4Arrays_1.4.1 scatterplot3d_0.3-44
[67] pkgconfig_2.0.3 gtable_0.3.5
[69] blob_1.2.4 impute_1.78.0
[71] MassSpecWavelet_1.70.0 XVector_0.44.0
[73] htmltools_0.5.8.1 MALDIquant_1.22.2
[75] clue_0.3-65 scales_1.3.0
[77] Biobase_2.64.0 png_0.1-8
[79] knitr_1.46 MetaboCoreUtils_1.12.0
[81] rstudioapi_0.16.0 reshape2_1.4.4
[83] checkmate_2.3.2 proxy_0.4-27
[85] cachem_1.0.8 stringr_1.5.1
[87] foreign_0.8-86 AnnotationDbi_1.66.0
[89] mzID_1.42.0 vsn_3.72.0
[91] pillar_1.9.0 grid_4.4.0
[93] vctrs_0.6.5 MsFeatures_1.12.0
[95] pcaMethods_1.96.0 xtable_1.8-4
[97] cluster_2.1.6 htmlTable_2.4.3
[99] evaluate_0.23 cli_3.6.2
[101] locfit_1.5-9.10 compiler_4.4.0
[103] rlang_1.1.3 crayon_1.5.2
[105] fdrtool_1.2.17 multitaper_1.0-17
[107] QFeatures_1.14.2 affy_1.82.0
[109] plyr_1.8.9 fs_1.6.4
[111] stringi_1.8.3 munsell_0.5.1
[113] Biostrings_2.72.1 lazyeval_0.2.2
[115] Matrix_1.7-0 hms_1.1.3
[117] bit64_4.0.5 KEGGREST_1.44.1
[119] statmod_1.5.0 SummarizedExperiment_1.34.0
[121] igraph_2.0.3 memoise_2.0.1
[123] affyio_1.74.0 bit_4.0.5

jorainer commented 1 week ago

Thanks for the details and the sessionInfo() output. Could you please also provide the code that you used to set up the parallel processing and show how you called findChromPeaks()? And, just to confirm, you're already using the new MsExperiment/XcmsExperiment objects, right (not the older OnDiskMSnExp/XCMSnExp)?

CLUES-Emory commented 1 week ago

Thanks! Yes, we used the new MsExperiment objects. In fact, switching to the MsExperiment objects may be linked to this issue. When we originally built our workflow using the OnDiskMSnExp objects, we were able to use multiple cores (we saw performance improvements until we reached 20 cores).

Parallel processing was set up using the following: register(bpstart(MulticoreParam(8)))

Files were read in using the following: ms_data <- readMsExperiment(spectraFiles = mzML_files)

Peak detection was performed using these parameters and the code below.

  # Step 1: XCMS peak detection parameters
  xcms_params <- list()
  xcms_params$cwp_ppm <- 5
  xcms_params$cwp_peakwidth <- c(3, 20)
  xcms_params$cwp_snthr <- 5
  xcms_params$cwp_mzdiff <- -0.001
  xcms_params$cwp_noise <- 20000
  xcms_params$cwp_prefilter <- c(5, 20000)
  xcms_params$cwp_mzCenterFun <- "wMean"
  xcms_params$cwp_integrate <- 1
  xcms_params$cwp_fitgauss <- FALSE
  xcms_params$cwp_extendLengthMSW <- TRUE

  # Step 1: peak detection
  # Define CentWave parameters
  cwp <- CentWaveParam(
    ppm = xcms_params$cwp_ppm,
    peakwidth = xcms_params$cwp_peakwidth,
    snthr = xcms_params$cwp_snthr,
    mzdiff = xcms_params$cwp_mzdiff,
    noise = xcms_params$cwp_noise,
    prefilter = xcms_params$cwp_prefilter,
    mzCenterFun = xcms_params$cwp_mzCenterFun,
    integrate = xcms_params$cwp_integrate,
    fitgauss = xcms_params$cwp_fitgauss,
    extendLengthMSW = xcms_params$cwp_extendLengthMSW)

  # Record the start time
  t1 <- Sys.time()
  # Detect peaks using cwp
  step_1_res <- findChromPeaks(ms_data, param = cwp)

  # Elapsed time for peak detection
  Sys.time() - t1

We've also tried passing BPPARAM directly to findChromPeaks, but saw no difference. E.g. step_1_res <- findChromPeaks(ms_data, param = cwp, BPPARAM = MulticoreParam(8))
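
One way to narrow this down is to check the registered backend on its own, without xcms. The following is a minimal sketch, not something run in this thread. It assumes a Unix-like system where MulticoreParam can fork, and the names num_cores, param, pids and n_distinct are only for illustration. It registers the backend the same way as above, prints how many workers BiocParallel reports, and counts how many distinct processes actually handle a set of dummy tasks.

```r
library(BiocParallel)

# Hypothetical worker count for this sketch; the runs above used 4 and 8.
num_cores <- 4

# Start and register the backend the same way as in the report.
param <- bpstart(MulticoreParam(workers = num_cores))
register(param)

# This is what functions defaulting to BPPARAM = bpparam() will receive.
bp <- bpparam()
n_workers <- bpnworkers(bp)
message("Registered backend: ", class(bp), " with ", n_workers, " worker(s)")

# Each dummy task sleeps briefly and returns the PID of the process that ran it.
pids <- bplapply(seq_len(12), function(i) {
    Sys.sleep(0.2)
    Sys.getpid()
}, BPPARAM = bp)

# Number of distinct processes that actually did work.
n_distinct <- length(unique(unlist(pids)))
message("Distinct worker processes used: ", n_distinct)

# Shut the workers down again.
bpstop(param)
```

If the number of distinct processes here matches the registered worker count while findChromPeaks still only keeps two cores busy, the limit is more likely inside the findChromPeaks call than in how the backend was registered.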