sneumann / xcms

This is the git repository matching the Bioconductor package xcms: LC/MS and GC/MS Data Analysis

Problem when using chromatogram function with huge dataset #358

Open melpetera opened 5 years ago

melpetera commented 5 years ago

Hi there, I am new to posting issues, so please be kind if I do not provide enough information here.

We encountered a problem while using the 'chromatogram' function with a huge dataset. Initially I got the following error while using the W4M Galaxy module, based on XCMS, dedicated to plotting TIC and BIC:

Error in serialize(data, node$con, xdr = FALSE) : error writing to connection
Calls: chromatogram ... -> .local -> .extractMultipleChromatograms

Although I regularly use this module on various datasets with no problem, this was a particularly huge dataset: 1576 samples with 8439 peaks per sample on average (13,300,226 peaks identified in total).

It just so happened that other people encountered exactly the same problem on their data while I was investigating this: @sneumann and @MarrSue

Let me go over what I have tested and concluded so far, illustrating it with the dataset I got first; I guess we would reach the same conclusions with the data used by @MarrSue.

First attempt via the W4M Galaxy module dedicated to plotting TIC and BIC

The R conditions:

SESSION INFO
R version 3.4.1 (2017-06-30)
Main packages: RColorBrewer 1.1.2 batch 1.1.4 xcms 3.0.0 MSnbase 2.4.0 ProtGenerics 1.10.0 mzR 2.12.0 Rcpp 0.12.17 BiocParallel 1.12.0 Biobase 2.38.0 BiocGenerics 0.24.0
Other loaded packages: pillar 1.3.0 compiler 3.4.1 BiocInstaller 1.28.0 plyr 1.8.4 iterators 1.0.10 zlibbioc 1.24.0 MALDIquant 1.18 digest 0.6.16 preprocessCore 1.40.0 tibble 1.4.2 gtable 0.2.0 lattice 0.20.35 rlang 0.2.1 Matrix 1.2.14 foreach 1.4.4 S4Vectors 0.16.0 IRanges 2.12.0 multtest 2.34.0 stats4 3.4.1 grid 3.4.1 impute 1.52.0 survival 2.42.6 XML 3.98.1.16 RANN 2.6 limma 3.34.9 ggplot2 3.0.0 MASS 7.3.50 splines 3.4.1 scales 1.0.0 pcaMethods 1.70.0 codetools 0.2.15 MassSpecWavelet 1.44.0 mzID 1.16.0 colorspace 1.3.2 affy 1.56.0 lazyeval 0.2.1 munsell 0.5.0 doParallel 1.0.11 vsn 3.46.0 crayon 1.3.4 affyio 1.48.0

The error obtained is already given above. For information, the same error is obtained when trying to compute TIC and BIC via the module dedicated to retention time correction. This was expected, since that module also displays TIC and BIC.

I tried to check the validity of the data by doing other things with it. Finding peak groups is no problem (we get 10,909 groups). Exporting the data into a peak table also works. I concluded that the error was not due to corrupted data.

Second attempt: reproducing the error outside of Galaxy

I installed R on a Windows server at my workplace, since that machine has considerably more resources than my laptop (which might have struggled). Then I ran the code for the TIC and BIC plot, taken directly from the corresponding Galaxy module script.

Here is the line concerned, xdata being the MSn experiment data:

chromTIC <- chromatogram(xdata, aggregationFun = "sum")

Additional info - xdata:

MSn experiment data ("XCMSnExp")
Object size in memory: 1043.62 Mb
- - - Spectra data - - -
 MS level(s): 1
 Number of spectra: 4837618
 MSn retention times: 0:0 - 32:3 minutes
[...] mandatory privacy here
protocolData: none
featureData
  featureNames: F1.S0001 F1.S0002 ... F1576.S3194 (4837618 total)
  fvarLabels: fileIdx spIdx ... spectrum (28 total)
  fvarMetadata: labelDescription
experimentData: use 'experimentData(object)'
- - - xcms preprocessing - - -
Chromatographic peak detection:
 method: centWave
 13300226 peaks identified in 1576 samples.
 On average 8439 chromatographic peaks per sample.
Correspondence:
 method: chromatographic peak density
 10909 features identified.
 Median mz range of features: 0.0045739
 Median rt range of features: 80.827

Additional info - sessionInfo():

R version 3.4.1 (2017-06-30)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows Server x64 (build 14393)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C
[5] LC_TIME=English_United States.1252

attached base packages:
[1] parallel stats graphics grDevices utils datasets methods base

other attached packages:
[1] RColorBrewer_1.1-2 xcms_3.0.2 MSnbase_2.4.2 ProtGenerics_1.10.0 mzR_2.12.0
[6] Rcpp_0.12.12 BiocParallel_1.12.0 Biobase_2.38.0 BiocGenerics_0.24.0

loaded via a namespace (and not attached):
[1] pillar_1.3.1 compiler_3.4.1 BiocInstaller_1.28.0 plyr_1.8.4
[5] tools_3.4.1 iterators_1.0.10 zlibbioc_1.24.0 MALDIquant_1.18
[9] digest_0.6.18 tibble_2.0.1 preprocessCore_1.40.0 gtable_0.2.0
[13] lattice_0.20-35 pkgconfig_2.0.2 rlang_0.3.1 Matrix_1.2-10
[17] foreach_1.4.4 S4Vectors_0.16.0 IRanges_2.12.0 multtest_2.34.0
[21] stats4_3.4.1 grid_3.4.1 impute_1.52.0 survival_2.41-3
[25] XML_3.98-1.17 RANN_2.6.1 limma_3.34.9 ggplot2_3.1.0
[29] MASS_7.3-47 splines_3.4.1 scales_1.0.0 pcaMethods_1.70.0
[33] codetools_0.2-15 MassSpecWavelet_1.44.0 mzID_1.16.0 colorspace_1.4-0
[37] affy_1.56.0 lazyeval_0.2.1 munsell_0.5.0 doParallel_1.0.14
[41] vsn_3.46.0 crayon_1.3.4 affyio_1.48.0

And so, the error was reproduced, with a little additional info:

Error in serialize(data, node$con) : error writing to connection
Error: failed to stop ‘SOCKcluster’ cluster: error writing to connection

I did some quick research on it. It seems it could be a problem with parallel processing (as when you use the foreach R package), which in some circumstances could lead to memory usage problems or the like. The truth is that I am not too comfortable with this kind of R topic, and I am also not very good at researching this kind of information, so a little help here would be highly appreciated.

So, any idea what is happening here? I do not actually know how the chromatogram function works, so maybe you xcms developers will know better whether something looks suspicious.

Tagging my colleague: @lecorguille

sneumann commented 5 years ago

Very good investigation so far! Now I need an object to run chromTIC <- chromatogram(xdata, aggregationFun = "sum") locally. From Sue I got the Galaxy2633-[xset.merged.groupChromPeaks.RData].rdata.xcms.group which includes the xdata (see below). I'll try to reproduce here.

Yours, Steffen

> xdata
MSn experiment data ("XCMSnExp")
Object size in memory: 522.83 Mb
- - - Spectra data - - -
 MS level(s): 1 
 Number of spectra: 2342035 
 MSn retention times: 0:0 - 20:4 minutes
- - - Processing information - - -
Concatenated [Wed Feb 27 19:24:26 2019] 
 MSnbase version: 2.4.0 
- - - Meta data  - - -
phenoData
  rowNames: ./pos_128_2018_G_PHLPRA_A045_a_1-D,3_01_13482.mzML
    ./pos_128_2018_F_LEUVUL_A018_b_1-C,4_01_13505.mzML ...
    ./pos_QC_grass_2018_2-A,1_01_13512.mzML (655 total)
  varLabels: sample_name sample_group
  varMetadata: labelDescription
Loaded from:
  [1] pos_128_2018_G_PHLPRA_A045_a_1-D,3_01_13482.mzML...  [655] pos_QC_grass_2018_2-A,1_01_13512.mzML
  Use 'fileNames(.)' to see all files.
protocolData: none
featureData
  featureNames: F1.S0001 F1.S0002 ... F655.S3576 (2342035 total)
  fvarLabels: fileIdx spIdx ... spectrum (28 total)
  fvarMetadata: labelDescription
experimentData: use 'experimentData(object)'
- - - xcms preprocessing - - -
Chromatographic peak detection:
 method: centWave 
 10408519 peaks identified in 655 samples.
 On average 15891 chromatographic peaks per sample.
Correspondence:
 method: chromatographic peak density 
 16907 features identified.
 Median mz range of features: 0.004791
 Median rt range of features: 16.356

sneumann commented 5 years ago

Great, can reproduce locally:

Error in serialize(data, node$con, xdr = FALSE) : 
  error writing to connection
Calls: source ... <Anonymous> -> .local -> .extractMultipleChromatograms
Execution halted

on

> sessionInfo()
R version 3.5.1 (2018-07-02)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.4 LTS

Matrix products: default
BLAS: /usr/lib/libblas/libblas.so.3.6.0
LAPACK: /usr/lib/lapack/liblapack.so.3.6.0

locale:
[1] C

attached base packages:
[1] parallel  stats     graphics  grDevices utils     datasets  methods  
[8] base     

other attached packages:
[1] xcms_3.3.3          MSnbase_2.6.1       ProtGenerics_1.12.0
[4] mzR_2.14.0          Rcpp_0.12.17        BiocParallel_1.14.2
[7] Biobase_2.40.0      BiocGenerics_0.26.0

sneumann commented 5 years ago

... and indeed no issue when running serial:

> register(SerialParam())
> ...
> chromTIC <- chromatogram(xdata, aggregationFun = "sum")
> 

That's the first time I saw TB as unit for memory usage (albeit a small number ...).

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND     
   19 root      20   0 12.479g 0.010t   2820 R  81.4 17.1   7:04.94 R           
   20 root      20   0 12.444g 0.010t   2820 S  75.1 17.1   6:49.62 R           

On our system MulticoreParam(2) still works, while the failure above occurred when using all cores. I guess the maximum memory usage is a function of the number of samples that have to be loaded simultaneously (or of the number of features? Unsure).

Final object has

> print(object.size(xdata), units="MB")
522.8 Mb
> print(object.size(chromTIC), units="MB")
359.7 Mb

I am not sure whether memory usage could be reduced during parallel processing.

Yours, Steffen
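
To make the guess above concrete (peak memory growing with the number of parallel workers), here is a minimal sketch in base R. The helper `pick_workers` is hypothetical and not part of xcms or BiocParallel; the 48 GB budget and the 8 cores are placeholder assumptions, and the 10 GB per process is only read off the RES column of the top output above.

```r
# Hypothetical helper (not part of xcms or BiocParallel): cap the number of
# workers so that workers * per_worker_gb stays within mem_budget_gb.
pick_workers <- function(mem_budget_gb, per_worker_gb, max_workers) {
    stopifnot(mem_budget_gb > 0, per_worker_gb > 0, max_workers >= 1)
    n <- floor(mem_budget_gb / per_worker_gb)
    # Never go below one worker (serial) or above the available cores.
    as.integer(max(1, min(n, max_workers)))
}

# Placeholder numbers: about 10 GB resident per R process (as in the top
# output), an assumed 48 GB memory budget and an assumed 8 cores.
n_workers <- pick_workers(mem_budget_gb = 48, per_worker_gb = 10, max_workers = 8)
print(n_workers)
```

The result could then be passed as `workers` to `MulticoreParam()` or `SnowParam()`, falling back to `SerialParam()` when it is 1. Whether memory really scales linearly with the worker count for `chromatogram` is an assumption here, not something established in this thread.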

jorainer commented 5 years ago

This is interesting. If it were indeed due to a memory limit, I would however expect a different error message. I got these messages when I ran out of forks on macOS (I wasn't aware that there is such a limit, though). That's also a reason why I like to pre-register all cores before running anything with xcms: each parallel processing step then re-uses the same processes.

@sneumann , do you still get the same error if you do

register(bpstart(MulticoreParam()))

before the chromatogram extraction?

Regarding the memory usage: the chromatogram function will only read those spectra matching the EIC's retention time range from the original files. In the worst case it would read the full data of a file in each parallel process.
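
A minimal sketch of the pre-registration pattern described above, using only BiocParallel and a toy task instead of xcms. `SnowParam` with two workers is an arbitrary choice for illustration (it also runs on Windows); `MulticoreParam` could be used the same way on Linux or macOS.

```r
library(BiocParallel)

# Start the worker processes once and register them as the default backend.
param <- bpstart(SnowParam(workers = 2))
register(param)

# Both calls use the registered default, so they share the same two
# worker processes instead of launching new ones each time.
squares <- bplapply(1:4, function(i) i^2)
pids <- bplapply(1:4, function(i) Sys.getpid())

# Shut the workers down explicitly when finished.
bpstop(param)
```

With the workers started up front, any function that picks up the registered default (as `chromatogram` does in the calls shown in this thread) should run on those same processes; the number of distinct values in `pids` should not exceed the number of workers.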

melpetera commented 5 years ago

To confirm a quick fix for this issue, I also tested register(SerialParam()) and indeed obtained my results without problem, as @sneumann did. I could not test register(bpstart(MulticoreParam())) since I'm running R on Windows. I tested register(bpstart(SnowParam())) just to see, but still got the "error writing to connection" error.
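
Since the usable backends differ between platforms, a script shared between Galaxy (Linux) and a Windows machine could choose the backend at run time. The following is only a sketch; `choose_param` is a hypothetical helper, not a BiocParallel function.

```r
library(BiocParallel)

# Hypothetical helper: serial for one worker, socket-based workers on
# Windows, forked workers elsewhere.
choose_param <- function(workers = 1L) {
    if (workers <= 1L) {
        return(SerialParam())
    }
    if (.Platform$OS.type == "windows") {
        SnowParam(workers = workers)
    } else {
        MulticoreParam(workers = workers)
    }
}

# The serial setting is the one reported to work in this thread.
register(choose_param(1L))
```

Calling `register(choose_param(2L))` instead would request two workers, using whichever backend fits the platform.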

jorainer commented 5 years ago

Thanks for reporting @melpetera - and what happens if you use register(bpstart(SnowParam(2))) - just limiting to two parallel processes?

melpetera commented 5 years ago

Tested, and still got the main error

Error in serialize(data, node$con) : error writing to connection

but without the notice of

Error: failed to stop ‘SOCKcluster’ cluster: error writing to connection

jorainer commented 5 years ago

This sounds a little like problems with parallel processing on that particular Windows machine. AFAIK in snow/sock-based parallel processing the master process talks to the slave processes via sockets and might need to get network access. I've seen sometimes that the firewall or something is preventing this.
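
One way to check this independently of xcms might be to start a small socket cluster with base R's parallel package and send some data through it. This is a diagnostic sketch only: it assumes that a plain PSOCK cluster exercises a comparable socket mechanism, and a successful run does not prove that much larger objects will also go through.

```r
library(parallel)

# Start two local socket-based workers; this step already fails or hangs
# if local socket connections are blocked.
cl <- makePSOCKcluster(2)

# Round trip without data, then with an 8 MB numeric vector that has to be
# serialized and written to each worker's connection.
pids <- clusterCall(cl, function() Sys.getpid())
lens <- clusterCall(cl, function(x) length(x), numeric(1e6))

stopCluster(cl)
```

If this runs cleanly, basic socket communication on the machine works, and the size of the data sent to the workers becomes the more likely suspect.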

sneumann commented 5 years ago

Hi @melpetera , could you confirm that the original error was reported in W4M Ticket#2019030210000026 ? Because then it should be debugged on that infrastructure, since Windows brings in quite a bit of additional/other challenges, and solutions could be different. Yours, Steffen