sneumann / xcms

This is the git repository matching the Bioconductor package xcms: LC/MS and GC/MS Data Analysis

biocParallel error on cluster but not locally #299

Open tentrillion opened 6 years ago

tentrillion commented 6 years ago

I use xcms3 usually via a Jupyter notebook interface using R 3.4.1. All components are installed with conda, from either conda-forge or bioconductor channels.

Code and error message

On my laptop (Mac OS X), xcms3 works fine and is awesome (thanks!). I recently ported my setup over to my university's shared supercomputing cluster, and code that was working for me before no longer works.

>>> my_files
   'mzml_files/24_10min_labeled_c18_pos.mzML' 'mzml_files/25_30min_labeled_c18_pos.mzML' 
   'mzml_files/26_60min_labeled_c18_pos.mzML' 'mzml_files/27_90min_labeled_c18_pos.mzML' 
   'mzml_files/28_10min_unlab_c18_pos.mzML' 'mzml_files/29_30min_unlab_c18_pos.mzML' 
   'mzml_files/30_60min_unlab_c18_pos.mzML' 'mzml_files/31_90min_unlab_c18_pos.mzML'

>>> xset <- 
    xcmsSet(my_files, 
            method = 'centWave',
            ppm = 10,
            mzdiff = -0.001,
            peakwidth = c(7.5, 25)
           )
   Error in value[[3L]](cond): setting worker timeout:
     error reading from connection
   Traceback:

   1. xcmsSet(my_files, method = "centWave", ppm = 10, mzdiff = -0.001, 
    .     peakwidth = c(7.5, 25))
   2. bptry(bplapply(argList, findPeaksPar, BPPARAM = BPPARAM))
   3. tryCatch(expr, ..., bplist_error = bplist_error, bperror = bperror)
   4. tryCatchList(expr, classes, parentenv, handlers)
   5. tryCatchOne(tryCatchList(expr, names[-nh], parentenv, handlers[-nh]), 
    .     names[nh], parentenv, handlers[[nh]])
   6. doTryCatch(return(expr), name, parentenv, handler)
   7. tryCatchList(expr, names[-nh], parentenv, handlers[-nh])
   8. tryCatchOne(expr, names, parentenv, handlers[[1L]])
   9. doTryCatch(return(expr), name, parentenv, handler)
   10. bplapply(argList, findPeaksPar, BPPARAM = BPPARAM)
   11. bplapply(argList, findPeaksPar, BPPARAM = BPPARAM)
   12. bpstart(BPPARAM, length(X))
   13. bpstart(BPPARAM, length(X))
   14. .local(x, ...)
   15. tryCatch({
     .     timeout <- bptimeout(x)
     .     if (is.finite(timeout)) {
     .         parallel::clusterExport(bpbackend(x), "timeout", env = environment())
     .     }
     . }, error = function(err) {
     .     bpstop(x)
     .     stop("setting worker timeout:\n  ", conditionMessage(err))
     . })
   16. tryCatchList(expr, classes, parentenv, handlers)
   17. tryCatchOne(expr, names, parentenv, handlers[[1L]])
   18. value[[3L]](cond)
   19. stop("setting worker timeout:\n  ", conditionMessage(err))

sessionInfo()

Here's a result of sessionInfo() on the cluster:

R version 3.4.1 (2017-06-30)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)

Matrix products: default
BLAS: /home/groups/khosla/curtf/anaconda3/envs/xcms/lib/R/lib/libRblas.so
LAPACK: /home/groups/khosla/curtf/anaconda3/envs/xcms/lib/R/lib/libRlapack.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] parallel  stats     graphics  grDevices utils     datasets  methods  
[8] base     

other attached packages:
 [1] viridis_0.5.1            viridisLite_0.3.0        stringr_1.3.1           
 [4] dplyr_0.7.4              purrr_0.2.4              readr_1.1.1             
 [7] tidyr_0.8.1              tibble_1.4.2             ggplot2_2.2.1           
[10] tidyverse_1.1.1          parse.agilent_0.1.0.9000 xcms_3.0.0              
[13] MSnbase_2.4.0            ProtGenerics_1.10.0      mzR_2.12.0              
[16] Rcpp_0.12.15             BiocParallel_1.10.1      Biobase_2.38.0          
[19] BiocGenerics_0.24.0     

loaded via a namespace (and not attached):
 [1] nlme_3.1-131           lubridate_1.7.4        doParallel_1.0.11     
 [4] RColorBrewer_1.1-2     httr_1.3.1             repr_0.15.0           
 [7] tools_3.4.1            R6_2.2.2               affyio_1.48.0         
[10] lazyeval_0.2.1         colorspace_1.3-2       gridExtra_2.3         
[13] mnormt_1.5-5           compiler_3.4.1         MassSpecWavelet_1.44.0
[16] preprocessCore_1.40.0  rvest_0.3.2            xml2_1.2.0            
[19] scales_0.5.0           psych_1.8.4            affy_1.56.0           
[22] pbdZMQ_0.3-2           digest_0.6.15          foreign_0.8-67        
[25] base64enc_0.1-3        pkgconfig_2.0.1        htmltools_0.3.6       
[28] limma_3.34.9           rlang_0.2.1            readxl_1.1.0          
[31] impute_1.52.0          BiocInstaller_1.28.0   bindr_0.1.1           
[34] jsonlite_1.5           mzID_1.16.0            magrittr_1.5          
[37] MALDIquant_1.17        Matrix_1.2-14          IRkernel_0.8.12       
[40] munsell_0.5.0          S4Vectors_0.16.0       vsn_3.46.0            
[43] stringi_1.2.3          MASS_7.3-48            zlibbioc_1.24.0       
[46] plyr_1.8.4             grid_3.4.1             forcats_0.3.0         
[49] crayon_1.3.4           lattice_0.20-34        IRdisplay_0.4.4       
[52] haven_1.1.2            splines_3.4.1          multtest_2.34.0       
[55] hms_0.3                pillar_1.2.2           uuid_0.1-2            
[58] reshape2_1.4.3         codetools_0.2-15       stats4_3.4.1          
[61] XML_3.98-1.6           glue_1.2.0             evaluate_0.10.1       
[64] pcaMethods_1.70.0      modelr_0.1.2           foreach_1.4.4         
[67] cellranger_1.1.0       gtable_0.2.0           RANN_2.5.1            
[70] assertthat_0.2.0       broom_0.4.4            survival_2.40-1       
[73] iterators_1.0.9        IRanges_2.12.0         bindrcpp_0.2    

Speculation

The error message seems to be from https://github.com/Bioconductor/BiocParallel/blob/master/R/SnowParam-class.R

I don't know what bpstop and bptimeout do, and I don't know enough about how BiocParallel works to say any more. Is it possible to turn off parallelization as a quick-and-dirty fix? I couldn't find documentation for the v3.0.0 version of xcmsSet() to see if that is possible.

If this is really an issue with BiocParallel, let me know and I will try to file an issue over there.

tentrillion commented 6 years ago

I just forced an upgrade to newer versions of xcms, BiocParallel, and MSnbase by using standard Bioconductor installations rather than relying on conda. (This has caused lots of problems for me in the past, but once or twice it has also solved problems.)

The error message from the same code is now different:

New error message

Error: 'bplapply' receive data failed:
  error reading from connection
Traceback:

1. xcmsSet(my_files, method = "centWave", ppm = 10, mzdiff = -0.001, 
 .     peakwidth = c(7.5, 25))

New sessionInfo()

R version 3.4.1 (2017-06-30)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)

Matrix products: default
BLAS: /home/groups/khosla/curtf/anaconda3/envs/xcms/lib/R/lib/libRblas.so
LAPACK: /home/groups/khosla/curtf/anaconda3/envs/xcms/lib/R/lib/libRlapack.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] parallel  stats     graphics  grDevices utils     datasets  methods  
[8] base     

other attached packages:
 [1] viridis_0.5.1            viridisLite_0.3.0        stringr_1.3.1           
 [4] dplyr_0.7.4              purrr_0.2.4              readr_1.1.1             
 [7] tidyr_0.8.1              tibble_1.4.2             ggplot2_2.2.1           
[10] tidyverse_1.1.1          parse.agilent_0.1.0.9000 xcms_3.0.2              
[13] MSnbase_2.4.2            ProtGenerics_1.10.0      mzR_2.12.0              
[16] Rcpp_0.12.17             BiocParallel_1.12.0      Biobase_2.38.0          
[19] BiocGenerics_0.24.0     

loaded via a namespace (and not attached):
 [1] nlme_3.1-131           lubridate_1.7.4        doParallel_1.0.11     
 [4] RColorBrewer_1.1-2     httr_1.3.1             repr_0.15.0           
 [7] tools_3.4.1            R6_2.2.2               affyio_1.48.0         
[10] lazyeval_0.2.1         colorspace_1.3-2       gridExtra_2.3         
[13] mnormt_1.5-5           compiler_3.4.1         MassSpecWavelet_1.44.0
[16] preprocessCore_1.40.0  rvest_0.3.2            xml2_1.2.0            
[19] scales_0.5.0           psych_1.8.4            affy_1.56.0           
[22] pbdZMQ_0.3-2           digest_0.6.15          foreign_0.8-67        
[25] base64enc_0.1-3        pkgconfig_2.0.1        htmltools_0.3.6       
[28] limma_3.34.9           rlang_0.2.1            readxl_1.1.0          
[31] impute_1.52.0          BiocInstaller_1.28.0   bindr_0.1.1           
[34] jsonlite_1.5           mzID_1.16.0            magrittr_1.5          
[37] MALDIquant_1.17        Matrix_1.2-14          IRkernel_0.8.12       
[40] munsell_0.5.0          S4Vectors_0.16.0       vsn_3.46.0            
[43] stringi_1.2.3          MASS_7.3-48            zlibbioc_1.24.0       
[46] plyr_1.8.4             grid_3.4.1             forcats_0.3.0         
[49] crayon_1.3.4           lattice_0.20-34        IRdisplay_0.4.4       
[52] haven_1.1.2            splines_3.4.1          multtest_2.34.0       
[55] hms_0.3                pillar_1.2.2           uuid_0.1-2            
[58] reshape2_1.4.3         codetools_0.2-15       stats4_3.4.1          
[61] XML_3.98-1.6           glue_1.2.0             evaluate_0.10.1       
[64] pcaMethods_1.70.0      modelr_0.1.2           foreach_1.4.4         
[67] cellranger_1.1.0       gtable_0.2.0           RANN_2.5.1            
[70] assertthat_0.2.0       broom_0.4.4            survival_2.40-1       
[73] iterators_1.0.9        IRanges_2.12.0         bindrcpp_0.2          

jorainer commented 6 years ago

Hi @tentrillion, good that you find xcms version >= 3 useful!

Now, the error you see is related to the parallel processing setup. I suggest you try to initialize the parallel processing right after loading the libraries in all of your scripts. At least on macOS and Windows that helps to resolve similar problems. For Unix-based systems I suggest you use:

library(xcms)
register(bpstart(MulticoreParam(3)))

The important point seems to be to use bpstart, so that the same forked parallel processes are reused throughout your analysis. Without it, BiocParallel used to initialize new parallel processes for each call.

Note that on a high-performance cluster you might have a queuing system set up (such as SLURM), so there are also other possibilities. Have a look at the BiocParallel vignette for more details.
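
A minimal sketch of that setup on a SLURM node, assuming the job was submitted with `--cpus-per-task` so that SLURM sets the `SLURM_CPUS_PER_TASK` environment variable. The helper `slurm_workers()` is a hypothetical name, not part of xcms or BiocParallel; it only keeps the number of forked workers within the job's allocation.

```r
# Hypothetical helper (not part of xcms or BiocParallel): choose a worker
# count from the SLURM allocation, falling back to `default` when the
# variable is unset or not a positive integer.
slurm_workers <- function(default = 1L) {
    n <- suppressWarnings(as.integer(Sys.getenv("SLURM_CPUS_PER_TASK", "")))
    if (is.na(n) || n < 1L) default else n
}

# Guarded so the snippet also runs where BiocParallel is not installed.
if (requireNamespace("BiocParallel", quietly = TRUE)) {
    library(BiocParallel)
    # Start the forked workers once and register them as the default backend,
    # so that later xcms calls reuse the same processes.
    register(bpstart(MulticoreParam(workers = slurm_workers())))
}
```

Asking for more workers than the scheduler granted is one plausible way for workers to be killed mid-run, so tying the count to the allocation is a cautious default rather than a confirmed fix for the error above.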

curt-f commented 6 years ago

My cluster is indeed a SLURM cluster. I tried what you suggested, and the good news is that my code ran for much longer. The bad news is that it still errored out, this time with a different error message.

I've been reviewing the BiocParallel vignette and am grateful for the pointer. It does seem like this is a BiocParallel issue instead of an xcms issue. However I'm not quite sure how to follow the examples in the vignette, because xcms3 is already higher-level than e.g. the example 4.3.1 in the vignette. There is no function that I can write that would be mapped over my mzML files, for example, right?

In any case, I'm happy to close this issue for now but will leave it up to you all. Examples of successful use of xcms3 on a SLURM system would be very welcome, but that is really a separate issue.

jorainer commented 6 years ago

Eventually you might switch over to the new functions (see http://bioconductor.org/packages/release/bioc/vignettes/xcms/inst/doc/xcms.html#3_initial_data_inspection). The new workflow is that you load your mzML files with the readMSData function from the MSnbase package (make sure to set the parameter mode = "onDisk"). This enables you to get access to the raw mzML data in a better way than was possible with the xcmsSet and xcmsRaw functions.
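
A rough sketch of that newer workflow, reusing the centWave settings from the first post. The wrapper name `find_peaks_on_disk()` is hypothetical, and the sketch assumes an xcms 3.x installation that exports `CentWaveParam()` and `findChromPeaks()`.

```r
# Hypothetical wrapper around the newer xcms interface.
find_peaks_on_disk <- function(files) {
    # "onDisk" mode leaves the spectra in the mzML files and reads them on
    # demand instead of loading everything into memory.
    raw <- MSnbase::readMSData(files, mode = "onDisk")
    # Same centWave settings as in the original xcmsSet() call.
    cwp <- xcms::CentWaveParam(ppm = 10,
                               peakwidth = c(7.5, 25),
                               mzdiff = -0.001)
    xcms::findChromPeaks(raw, param = cwp)
}

# Usage, with `my_files` as defined in the first post:
# xdata <- find_peaks_on_disk(my_files)
```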

To sort out whether the error you get is related to parallel processing or to a problem in xcms, you can call register(SerialParam()) to disable parallel processing completely. Sometimes the actual error message is not reported when running in parallel.
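
As a sketch of that check, the wrapper below registers `SerialParam()` and reruns the original `xcmsSet()` call in the main R process. The name `find_peaks_serial()` is hypothetical. Passing `BPPARAM` explicitly rests on the assumption that `xcmsSet()` in xcms 3.x accepts that argument, which the traceback in the first post suggests, since it forwards `BPPARAM` to `bplapply`.

```r
# Hypothetical wrapper: run peak detection without any worker processes so
# that an underlying error, if there is one, is raised directly in this session.
find_peaks_serial <- function(files) {
    # Make serial evaluation the default backend for later calls as well.
    BiocParallel::register(BiocParallel::SerialParam())
    xcms::xcmsSet(files,
                  method = "centWave",
                  ppm = 10,
                  mzdiff = -0.001,
                  peakwidth = c(7.5, 25),
                  BPPARAM = BiocParallel::SerialParam())
}

# Usage, with `my_files` as defined in the first post:
# xset <- find_peaks_serial(my_files)
```

If the serial run succeeds, the problem most likely lies in how the workers are started or reached on the cluster rather than in the peak detection itself.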

trivedi-group commented 5 years ago

I am in almost the same situation with regard to the OS: on CentOS it fails with netCDF errors, while on macOS it works fine. The only difference is that I get the same error with or without parallel processing. Was this issue resolved for tentrillion?

tentrillion commented 5 years ago

While of course I would be thrilled if the XCMS development team came out with a detailed tutorial for using XCMS on Slurm-based systems, I don't think my issue was really a true "bug" in XCMS. The underlying issue seems to be with BiocParallel.

[Just noticed that I used two different GitHub accounts for comments here. curt-f and tentrillion are both me -- one is an academic account and one is general use. Apologies for the confusion.]

trivedi-group commented 5 years ago

Ah ok, thanks for responding! Unfortunately, for my data, running without parallel processing is not an option. Even the cluster runs out of its 1 TB of RAM when I try to use xcms without parallel processing.