sneumann / xcms

This is the git repository matching the Bioconductor package xcms: LC/MS and GC/MS Data Analysis

XCMS 1.46.0 - Dimensions of profile matrices do not match #194

Open eschen42 opened 7 years ago

eschen42 commented 7 years ago

Full disclosure: We have an instrument that may be having some issues.

That said, I have looked at this data, and I have not found zero-variance rows or columns in the dataMatrix (when substituting zero for missing values). At the retcor step, however, the run fails with the message:

Dimensions of profile matrices do not match

Below are the particulars (pasted from the Galaxy history). Please let me know if additional information would be helpful.
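For reference, the zero-variance check mentioned above can be sketched in R roughly as follows (a minimal sketch, not the exact code used; `dataMatrix` is assumed to be the numeric intensity matrix exported by the group step, with NA for missing values):

```r
## Sketch only: dataMatrix is assumed to hold intensities with NA for
## missing values, samples in columns and features in rows.
dm <- dataMatrix
dm[is.na(dm)] <- 0                        # substitute zero for missing values
zero_var_rows <- which(apply(dm, 1, var) == 0)
zero_var_cols <- which(apply(dm, 2, var) == 0)
length(zero_var_rows); length(zero_var_cols)
```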

XCMS version

1.46.0

Original mzML files

https://esch0041-drop.s3.msi.umn.edu/batch2_minipool.zip

Input to failing retcor step

https://esch0041-drop.s3.msi.umn.edu/batch2_minipool.xset.group.RData

Error message for retcor

Error in .local(object, ...) : Dimensions of profile matrices do not match ! Calls: do.call ... do.call -> retcor.obiwarp -> retcor.obiwarp -> .local Execution halted

Additional parameters

PACKAGE INFO parallel 3.3.2 BiocGenerics 0.20.0 Biobase 2.34.0 Rcpp 0.12.8 mzR 2.6.3 xcms 1.46.0 snow 0.4.1 batch 1.1.4

ARGUMENTS INFO image /export/galaxy-central/database/files/000/dataset_7.dat xfunction retcor xsetRdataOutput /export/galaxy-central/database/files/000/dataset_12.dat ticspdf /export/galaxy-central/database/files/000/dataset_13.dat bicspdf /export/galaxy-central/database/files/000/dataset_14.dat rplotspdf None method obiwarp profStep 0.9

INFILE PROCESSING INFO

ARGUMENTS PROCESSING INFO files_root_directory batch2_minipool Compute md5 checksum...

MAIN PROCESSING INFO COMPUTE center sample: pool_both02_GB1_01_639 Processing: pool_both01_GB1_01_637 pool_both03_GB1_01_645 pool_both04_GB1_01_652

Previous steps in pipeline

xcmsSet

PACKAGE INFO

parallel 3.3.2 BiocGenerics 0.20.0 Biobase 2.34.0 Rcpp 0.12.8 mzR 2.6.3 xcms 1.46.0 snow 0.4.1 batch 1.1.4

ARGUMENTS INFO zipfile /export/galaxy-central/database/files/000/dataset_1.dat xfunction xcmsSet xsetRdataOutput /export/galaxy-central/database/files/000/dataset_2.dat sampleMetadataOutput /export/galaxy-central/database/files/000/dataset_3.dat ticspdf /export/galaxy-central/database/files/000/dataset_4.dat bicspdf /export/galaxy-central/database/files/000/dataset_5.dat scanrange c(50, 1000) nSlaves 1 method centWave ppm 500 peakwidth c(2.5, 10) mzdiff 0.5 snthresh 10 integrate 1 noise 2000 prefilter c(3, 100)

xcmsGroup

PACKAGE INFO parallel 3.3.2 BiocGenerics 0.20.0 Biobase 2.34.0 Rcpp 0.12.8 mzR 2.6.3 xcms 1.46.0 snow 0.4.1 batch 1.1.4

ARGUMENTS INFO xfunction group image /export/galaxy-central/database/files/000/dataset_2.dat xsetRdataOutput /export/galaxy-central/database/files/000/dataset_7.dat rplotspdf /export/galaxy-central/database/files/000/dataset_8.dat method density minfrac 0 bw 4 mzwid 1.8 sleep 0.001 max 25 variableMetadataOutput /export/galaxy-central/database/files/000/dataset_9.dat dataMatrixOutput /export/galaxy-central/database/files/000/dataset_10.dat convertRTMinute FALSE numDigitsMZ 4 numDigitsRT 2 intval into

jorainer commented 7 years ago

I vaguely remember that there was something I found problematic in retcor.obiwarp that might have caused such an error. Obiwarp is performed on the profile matrix, and the way that matrix is calculated was changed in more recent versions of xcms. It would be nice to run the analysis on a more recent xcms - unfortunately I can't download the original files from the link you provided.

eschen42 commented 7 years ago

@jotsetung Thank you. I jiggled the permissions, and now I can download it. Sorry for not confirming that beforehand.

I have centroided data. Is obiwarp more appropriate for profile data than centroid, i.e., should I be using peakGroups?

jorainer commented 7 years ago

Thanks, I'll check. Re centroided data, that's fine for obiwarp.

jorainer commented 7 years ago

You could try a different profStep value. profStep = 1 or even profStep = 0.91 works on the data set you provided (using xcms 1.52.0 from the current Bioconductor version 3.5). My sessionInfo:

> sessionInfo()
R version 3.4.0 (2017-04-21)
Platform: x86_64-apple-darwin16.6.0/x86_64 (64-bit)
Running under: macOS Sierra 10.12.6

Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libLAPACK.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] parallel  stats     graphics  grDevices utils     datasets  methods  
[8] base     

other attached packages:
[1] xcms_1.52.0         MSnbase_2.2.0       ProtGenerics_1.8.0 
[4] mzR_2.10.0          Rcpp_0.12.12        BiocParallel_1.10.1
[7] Biobase_2.36.2      BiocGenerics_0.22.0

loaded via a namespace (and not attached):
 [1] compiler_3.4.0         BiocInstaller_1.26.0   RColorBrewer_1.1-2    
 [4] plyr_1.8.4             tools_3.4.0            iterators_1.0.8       
 [7] zlibbioc_1.22.0        MALDIquant_1.16.2      digest_0.6.12         
[10] preprocessCore_1.38.1  tibble_1.3.3           gtable_0.2.0          
[13] lattice_0.20-35        rlang_0.1.1            Matrix_1.2-10         
[16] foreach_1.4.3          S4Vectors_0.14.3       IRanges_2.10.2        
[19] stats4_3.4.0           multtest_2.32.0        grid_3.4.0            
[22] impute_1.50.1          survival_2.41-3        XML_3.98-1.9          
[25] RANN_2.5.1             limma_3.32.3           ggplot2_2.2.1         
[28] MASS_7.3-47            splines_3.4.0          scales_0.4.1          
[31] pcaMethods_1.68.0      codetools_0.2-15       MassSpecWavelet_1.42.0
[34] mzID_1.14.0            colorspace_1.3-2       affy_1.54.0           
[37] lazyeval_0.2.0         munsell_0.4.3          doParallel_1.0.10     
[40] vsn_3.44.0             affyio_1.46.0
mheiser-md commented 7 years ago

Might be related / helpful? https://github.com/rietho/IPO/issues/43#issuecomment-261530355 and #80

jorainer commented 7 years ago

This https://github.com/sneumann/xcms/issues/196 might also be related - so no excuse anymore to not look into it ;)

jorainer commented 7 years ago

OK, the problem here is different from #196: the scanrange parameter is the culprit. It seems the original obiwarp code did not handle xcmsSet objects with a scanrange properly.
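Until a fix lands, one possible workaround (a sketch; parameter values taken from the pipeline above, assuming the scanrange restriction is not essential for your data) would be to rerun the xcmsSet step without scanrange:

```r
## Workaround sketch: the original xcmsSet call but without
## scanrange = c(50, 1000), which triggers the obiwarp dimension mismatch.
xset <- xcmsSet(files, method = "centWave", ppm = 500,
                peakwidth = c(2.5, 10), mzdiff = 0.5, snthresh = 10,
                integrate = 1, noise = 2000, prefilter = c(3, 100))
```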

jorainer commented 7 years ago

In addition, depending on the value of profStep, the profile matrices are not correctly extended to have the same number of rows (i.e. the same mz range). I'll fix that.
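Roughly why the row counts can disagree (a back-of-the-envelope sketch, not the exact binning code internal to xcms): the number of profile-matrix rows depends on each file's m/z range divided by profStep, so files with slightly different m/z ranges yield matrices of different heights unless they are first extended to a common m/z range:

```r
## Illustrative only: approximate row count of a profile matrix for a
## given m/z range; the exact binning is internal to xcms.
approx_nrow <- function(mzrange, profStep) {
    ceiling(diff(mzrange) / profStep) + 1
}
approx_nrow(c(50.0, 999.6), profStep = 0.9)
approx_nrow(c(49.7, 1000.2), profStep = 0.9)  # one more row than above
```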

jorainer commented 7 years ago

Now the fix is in:

> library(xcms)
> register(SerialParam())
> load("./batch2_minipool.xset.group.RData")
> res <- retcor.obiwarp(xset, profStep = 0.9)
center sample:  pool_both02_GB1_01_639 
Processing: pool_both01_GB1_01_637  Create profile matrix with method 'bin' and step 0.9 ... OK
Create profile matrix with method 'bin' and step 0.9 ... OK
pool_both03_GB1_01_645  Create profile matrix with method 'bin' and step 0.9 ... OK
Create profile matrix with method 'bin' and step 0.9 ... OK
pool_both04_GB1_01_652  Create profile matrix with method 'bin' and step 0.9 ... OK
Create profile matrix with method 'bin' and step 0.9 ... OK

> ## OK. Now extract ion chromatograms to check.
> rtr <- c(80, 500)
> eics <- getEIC(res, mzrange = matrix(c(60, 1000), nrow = 1),
+          rtrange = matrix(rtr, nrow = 1), rt = "raw")
Create profile matrix with method 'bin' and step 0.1 ... OK
Create profile matrix with method 'bin' and step 0.1 ... OK
Create profile matrix with method 'bin' and step 0.1 ... OK
Create profile matrix with method 'bin' and step 0.1 ... OK
> eics_obi <- getEIC(res, mzrange = matrix(c(60, 1000), nrow = 1),
+          rtrange = matrix(rtr, nrow = 1), rt = "corrected")
Create profile matrix with method 'bin' and step 0.1 ... OK
Applying retention time correction to pool_both01_GB1_01_637.mzML.
Create profile matrix with method 'bin' and step 0.1 ... OK
No need to perform retention time correction, raw and corrected rt are identical for pool_both02_GB1_01_639.mzML.
Create profile matrix with method 'bin' and step 0.1 ... OK
Applying retention time correction to pool_both03_GB1_01_645.mzML.
Create profile matrix with method 'bin' and step 0.1 ... OK
Applying retention time correction to pool_both04_GB1_01_652.mzML.
> par(mfrow = c(2, 1))
> plot(eics)
> plot(eics_obi)

And the BPC with the raw (top) and adjusted retention times (bottom) for rt from 80 to 500: [plot not shown]

The fix is included in the current xcms version in the xcms3 branch. @eschen42 If you want to try that:

library(devtools)
install_github("sneumann/xcms", ref = "xcms3")
## Needs also more recent MSnbase:
install_github("lgatto/MSnbase")

@mheiser-md @rietho eventually this might also fix https://github.com/rietho/IPO/issues/43 (I hope). Note that the new user interface has a better implementation that avoids the quite complicated preparation of the matrices.

Please let me know so I can (hopefully) close the issue.

mheiser-md commented 7 years ago

Hi @jotsetung , thanks for the update. I grabbed xcms3 and tested it out:

> mzdatapath <- system.file("mzData", package = "mtbls2")
> mzdatafiles <- list.files(mzdatapath, recursive = TRUE, full.names=TRUE)
>
> xset <- xcmsSet(mzdatafiles[1:12],  method = "centWave",  peakwidth = c(6.65, 25), ppm = 25,
+   mzdiff = -0.001, snthresh = 10, noise = 10000, prefilter = c(3, 100)
+ )
Detecting mass traces at 25 ppm ... OK
Detecting chromatographic peaks in 466 regions of interest ... OK: 414 found.
Detecting mass traces at 25 ppm ... OK
Detecting chromatographic peaks in 465 regions of interest ... OK: 418 found.
Detecting mass traces at 25 ppm ... OK
Detecting chromatographic peaks in 466 regions of interest ... OK: 417 found.
Detecting mass traces at 25 ppm ... OK
Detecting chromatographic peaks in 470 regions of interest ... OK: 420 found.
Detecting mass traces at 25 ppm ... OK
Detecting chromatographic peaks in 559 regions of interest ... OK: 495 found.
Detecting mass traces at 25 ppm ... OK
Detecting chromatographic peaks in 532 regions of interest ... OK: 480 found.
Detecting mass traces at 25 ppm ... OK
Detecting chromatographic peaks in 543 regions of interest ... OK: 485 found.
Detecting mass traces at 25 ppm ... OK
Detecting chromatographic peaks in 522 regions of interest ... OK: 470 found.
Detecting mass traces at 25 ppm ... OK
Detecting chromatographic peaks in 555 regions of interest ... OK: 490 found.
Detecting mass traces at 25 ppm ... OK
Detecting chromatographic peaks in 568 regions of interest ... OK: 503 found.
Detecting mass traces at 25 ppm ... OK
Detecting chromatographic peaks in 543 regions of interest ... OK: 483 found.
Detecting mass traces at 25 ppm ... OK
Detecting chromatographic peaks in 541 regions of interest ... OK: 483 found.
> paramsRG <- getDefaultRetGroupStartingParams()
> resultRG <- optimizeRetGroup(xset, paramsRG, nSlaves=1)

starting new DoE with:

distFunc: cor_opt
gapInit: c(0, 0.4)
gapExtend: c(2.1, 2.7)
profStep: c(0.7, 1)
plottype: none
response: 1
factorDiag: 2
factorGap: 1
localAlignment: 0
retcorMethod: obiwarp
bw: c(22, 38)
minfrac: c(0.3, 0.7)
mzwid: c(0.015, 0.035)
minsamp: 1
max: 50
center: 10

center sample:  MSpos-Ex2-Col0-48h-Ag-2_1-A,3_01_9829
Processing: MSpos-Ex1-Col0-48h-Ag-1_1-A,1_01_9818  Create profile matrix with method 'bin' and step 0.7 ... OK
Create profile matrix with method 'bin' and step 0.7 ... OK
MSpos-Ex1-Col0-48h-Ag-2_1-A,1_01_9820  Create profile matrix with method 'bin' and step 0.7 ... OK
Create profile matrix with method 'bin' and step 0.7 ... OK
MSpos-Ex1-Col0-48h-Ag-3_1-A,1_01_9822  Create profile matrix with method 'bin' and step 0.7 ... OK
Create profile matrix with method 'bin' and step 0.7 ... OK
MSpos-Ex1-Col0-48h-Ag-4_1-A,1_01_9824  Create profile matrix with method 'bin' and step 0.7 ... OK
Create profile matrix with method 'bin' and step 0.7 ... OK
MSpos-Ex1-cyp79-48h-Ag-1_1-B,1_01_9819  Create profile matrix with method 'bin' and step 0.7 ... OK
Create profile matrix with method 'bin' and step 0.7 ... OK
Error in .local(object, ...) :
  Dimensions of profile matrices do not match !

Processing 102212 mz slices ... OK
center sample:  MSpos-Ex2-Col0-48h-Ag-2_1-A,3_01_9829
Processing: MSpos-Ex1-Col0-48h-Ag-1_1-A,1_01_9818  Create profile matrix with method 'bin' and step 0.7 ... OK
Create profile matrix with method 'bin' and step 0.7 ... OK
MSpos-Ex1-Col0-48h-Ag-2_1-A,1_01_9820  Create profile matrix with method 'bin' and step 0.7 ... OK
Create profile matrix with method 'bin' and step 0.7 ... OK
MSpos-Ex1-Col0-48h-Ag-3_1-A,1_01_9822  Create profile matrix with method 'bin' and step 0.7 ... OK
Create profile matrix with method 'bin' and step 0.7 ... OK
MSpos-Ex1-Col0-48h-Ag-4_1-A,1_01_9824  Create profile matrix with method 'bin' and step 0.7 ... OK
Create profile matrix with method 'bin' and step 0.7 ... OK
MSpos-Ex1-cyp79-48h-Ag-1_1-B,1_01_9819  Create profile matrix with method 'bin' and step 0.7 ... OK
Create profile matrix with method 'bin' and step 0.7 ... OK
Error in .local(object, ...) :
  Dimensions of profile matrices do not match !

Looks like it still occurs. Here is my session info:

Session info ------------------------------------------------------------------
 setting  value
 version  R version 3.4.1 (2017-06-30)
 system   x86_64, linux-gnu
 ui       X11
 language (EN)
 collate  en_US.UTF-8
 tz       Etc/UTC
 date     2017-08-11

Packages ----------------------------------------------------------------------
 package         * version  date       source
 CAMERA          * 1.32.0   2017-08-11 Bioconductor
 IPO             * 1.2.1    2017-08-11 Bioconductor
 MSnbase         * 2.3.8    2017-08-11 Github (lgatto/MSnbase@44c0356)
 mzR             * 2.10.0   2017-08-11 Bioconductor
 Rcpp            * 0.12.12  2017-07-15 cran (@0.12.12)
 xcms            * 2.99.6   2017-08-11 Github (sneumann/xcms@d3f47ca)
jorainer commented 7 years ago

bummer - @mheiser-md would it be possible for you to break that down to e.g. 2 or 3 of your files and share them with me so I can check?

mheiser-md commented 7 years ago

I used the files from the mtbls2 package on bioconductor:

source("https://bioconductor.org/biocLite.R")
biocLite("mtbls2")

mzdatapath <- system.file("mzData", package = "mtbls2")
mzdatafiles <- list.files(mzdatapath, recursive = TRUE, full.names=TRUE)
sneumann commented 7 years ago

Hi, I can confirm the issue with this snippet:

library(IPO)
xset <- xcmsSet(mzdatafiles[1:12],  method = "centWave",  peakwidth = c(6.65, 25), ppm = 25,mzdiff = -0.001, snthresh = 10, noise = 10000, prefilter = c(3, 100), nSlaves=4)
paramsRG <- getDefaultRetGroupStartingParams()
resultRG <- optimizeRetGroup(xset, paramsRG, nSlaves=1)

where some (about half) of the xsets created during optimizeRetGroup exhibit the error. Now we need to find a way to stop on such an error, since IPO uses tryCatch to ignore errors ...
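One hedged way to surface the error despite the tryCatch (a sketch, not IPO's actual code) is to match on the error message and re-throw only that one:

```r
## Sketch: re-throw the profile-matrix error so the optimization stops,
## while other errors remain ignored as before.
res <- tryCatch(
    retcor.obiwarp(xset, profStep = 0.9),
    error = function(e) {
        if (grepl("Dimensions of profile matrices", conditionMessage(e)))
            stop(e)      # re-throw: stop the run right here
        NULL             # all other errors are still swallowed
    })
```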

yufree commented 6 years ago

Recently I ran into the same error. It turned out that one of my samples had only ~900 s of data while all the other samples had ~2400 s. After removing that sample, the error disappeared. So this scenario can happen when someone (like me) doesn't check the samples' retention time ranges before the analysis. It might be helpful to throw a warning or error at the very beginning when a few files cover a much smaller retention time range than the others. Of course, such errors should be avoided by a careful check beforehand - I just mention it here in case someone makes a similar mistake.
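The check suggested above could look roughly like this (a sketch against an xcmsSet object; the 50% cutoff is an arbitrary choice for illustration):

```r
## Sketch: warn about samples whose retention time range is much shorter
## than the rest; xset is an xcmsSet, the 0.5 threshold is arbitrary.
rt_max <- sapply(xset@rt$raw, max)
short <- which(rt_max < 0.5 * median(rt_max))
if (length(short) > 0)
    warning("Samples with unusually short RT range: ",
            paste(sampnames(xset)[short], collapse = ", "))
```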

lauzikaite commented 6 years ago

Hi,

Has there been an update on this issue? I am using xcms3 and get the same error with all of my datasets (I also see it with xcms 2.99).

Here is my session info.

R version 3.4.0 (2017-04-21)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: SUSE Linux Enterprise Server 12 SP1

Matrix products: default
BLAS: /apps/R/3.4.0/lib64/R/lib/libRblas.so
LAPACK: /apps/R/3.4.0/lib64/R/lib/libRlapack.so

locale:
 [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_GB.UTF-8        LC_COLLATE=en_GB.UTF-8
 [5] LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_GB.UTF-8
 [7] LC_PAPER=en_GB.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] parallel  stats     graphics  grDevices utils     datasets  methods
[8] base

other attached packages:
[1] xcms_3.0.0          MSnbase_2.4.0       ProtGenerics_1.10.0
[4] mzR_2.12.0          Rcpp_0.12.13        BiocParallel_1.12.0
[7] Biobase_2.38.0      BiocGenerics_0.24.0

loaded via a namespace (and not attached):
 [1] compiler_3.4.0         BiocInstaller_1.28.0   RColorBrewer_1.1-2
 [4] plyr_1.8.4             iterators_1.0.8        zlibbioc_1.24.0
 [7] MALDIquant_1.16.4      digest_0.6.12          preprocessCore_1.40.0
[10] tibble_1.3.4           gtable_0.2.0           lattice_0.20-35
[13] rlang_0.1.4            Matrix_1.2-11          foreach_1.4.3
[16] S4Vectors_0.16.0       IRanges_2.12.0         stats4_3.4.0
[19] multtest_2.34.0        grid_3.4.0             impute_1.52.0
[22] survival_2.41-3        XML_3.98-1.9           RANN_2.5.1
[25] limma_3.34.0           ggplot2_2.2.1          MASS_7.3-47
[28] splines_3.4.0          scales_0.5.0           pcaMethods_1.70.0
[31] codetools_0.2-15       MassSpecWavelet_1.44.0 mzID_1.16.0
[34] colorspace_1.3-2       affy_1.56.0            lazyeval_0.2.1
[37] munsell_0.4.3          doParallel_1.0.11      vsn_3.46.0
[40] affyio_1.48.0

Previous steps:

xset <- xcmsSet(files = files,
                method = "centWave",
                peakwidth       = c(3, 10),
                ppm             = 20,
                noise           = 600,
                snthresh        = 5,
                mzdiff          = 0.005,
                prefilter       = c(10, 3000),
                mzCenterFun     = "wMean",
                integrate       = 2,
                fitgauss        = FALSE,
                verbose.columns = FALSE,
                BPPARAM = MulticoreParam(workers = bw))

gset <- group(xset,
              method = "density",
              minfrac = 0,
              minsamp = 0,
              bw = 2,
              mzwid = 0.01)

rset <- retcor(gset,
               method = "obiwarp",
               plottype = "none") 
sneumann commented 6 years ago

Hi lauzikaite, what is the traceback() after the error? Also, try to "debug" your xcmsSet:

library(xcms)
library(mtbls2)
data(mtbls2)
t(sapply(mtbls2Set@rt$raw, range))
table(peaks(mtbls2Set)[,"sample"])

to see if any "empty" files without peaks are there.

yours, Steffen

lauzikaite commented 6 years ago

Hi @sneumann,

Below is the traceback() for this error:

Error in .local(object, ...) : 
  Dimensions of profile matrices do not match !
Calls: tryCatch ... do.call -> retcor.obiwarp -> retcor.obiwarp -> .local
11: (function () 
    traceback(2))()
10: stop("Dimensions of profile matrices do not match !\n")
9: .local(object, ...)
8: retcor.obiwarp(object, ...)
7: retcor.obiwarp(object, ...)
6: do.call(method, alist(object, ...))
5: .local(object, ...)
4: retcor(gset, method = "obiwarp", plottype = "none")
3: retcor(gset, method = "obiwarp", plottype = "none")
2: tryCatchList(expr, classes, parentenv, handlers)
1: tryCatch({
       rset <- retcor(gset, method = "obiwarp", plottype = "none")
   })

There are no empty files without peaks, though the number of scans and peaks per sample are varied:

> gset <- group(xset,
              method = "density",
              minfrac = 0,
              minsamp = 0,
              bw = 2,
              mzwid = 0.01)
> rn <- t(sapply(gset@rt$raw, range))
> summary(rn) # range of retention times
       V1              V2       
 Min.   :1.357   Min.   :749.2  
 1st Qu.:1.357   1st Qu.:749.4  
 Median :1.357   Median :749.4  
 Mean   :1.357   Mean   :749.4  
 3rd Qu.:1.357   3rd Qu.:749.4  
 Max.   :1.359   Max.   :749.4  

> rn <- t(sapply(gset@rt$raw, length))
> range(rn) # range of number of scans
[1] 4313 4334

> tb <- table(peaks(gset)[,"sample"])
> all(names(tb) == 1:nrow(rn)) # do all 1326 samples have some peaks assigned
[1] TRUE
> min(tb)
[1] 1494

Thank you, Elzbieta

jorainer commented 6 years ago

I've re-implemented the obiwarp alignment for the new user interface, but did not change the function for the old interface. Could be that the bug was fixed for the new implementation. Could you please try the following:

library(xcms)
raw_data <- readMSData(file, mode = "onDisk")
cwp <- CentWaveParam(peakwidth = c(3, 10), ppm = 20, noise = 600, snthresh = 5,
    mzdiff = 0.005, prefilter = c(10, 3000), integrate = 2L)

xdata <- findChromPeaks(raw_data, param = cwp)
xdata <- adjustRtime(xdata, param = ObiwarpParam())
lauzikaite commented 6 years ago

Hi @jotsetung,

I can confirm that adjustRtime() from xcms3 works fine with small subsets of my data. However, the problem is that the full dataset comprises ~1,300 mzML files and I cannot get past the findChromPeaks() stage, since the memory requirement with just 20 bpnworkers exceeds the capacity of the Linux cluster I am using. I am still testing with fewer workers, but for some reason peak picking does not start at all with such a large OnDiskMSnExp object.

This is rather strange, since I can easily read this number of data files with the old interface using xcmsSet().

Can you maybe advise on the best approach forward? Is there any way to use xcmsSet objects with the new interface for retention time alignment?

Thank you!

jorainer commented 6 years ago

The obiwarp code for xcmsSet objects is rather tricky and unclear - I'm afraid I'd introduce even more bugs if I tried to fix it.

Regarding the performance: I try to keep the number of concurrent processes low, i.e. I never use more than 8 workers in parallel. The bottleneck with more workers is I/O: the data has to be read from the mzML files, and if too many tasks read from disk at the same time, everything slows down. Another reason is memory. To keep 20 objects in memory at a time you need a lot of memory - if you haven't got it, the OS starts swapping, and that is very slow.

Eventually you might try the following:

library(xcms)
library(doParallel)
registerDoParallel(8)
register(DoparParam(), default = TRUE)

This essentially allocates and registers the workers/threads. Any call in xcms that supports parallel processing will use these workers - no need to pass them with BPPARAM anymore. I had to do this on macOS since the repeated forking of processes (that's what happens if you don't pre-register/allocate them) at some point took ages. I haven't experienced that on Linux, but it might still be worth a try.

lauzikaite commented 6 years ago

Thank you for your reply @jotsetung

I have tried the suggested approach on the Linux cluster:

> registerDoParallel(8)
> register(DoparParam(), default = TRUE)
> print(DoparParam())
class: DoparParam
  bpisup: TRUE; bpnworkers: 8; bptasks: 0; bpjobname: BPJOB
  bplog: FALSE; bpthreshold: INFO; bpstopOnError: TRUE
  bptimeout: 2592000; bpprogressbar: FALSE

> cwp <- CentWaveParam(peakwidth = c(1.5, 14),
                       ppm = 25, 
                       noise = 500,
                       snthresh = 5,
                       mzdiff = 0.01, 
                       prefilter = c(10, 3000),
                       integrate = 2,
                       fitgauss = FALSE)
 > xset <- findChromPeaks(rdat, param = cwp)

However, running findChromPeaks() with ~1,300 mzML files takes ~2,200GB of memory and 20 hours. Is there a reason why memory usage gets so high in my case? Does it have anything to do with the system itself, or perhaps with the fact that our data files are very "information-dense"?

> sessionInfo()
R version 3.4.0 (2017-04-21)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: SUSE Linux Enterprise Server 12 SP1

Matrix products: default
BLAS: /apps/R/3.4.0/lib64/R/lib/libRblas.so
LAPACK: /apps/R/3.4.0/lib64/R/lib/libRlapack.so

locale:
 [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_GB.UTF-8        LC_COLLATE=en_GB.UTF-8    
 [5] LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_GB.UTF-8   
 [7] LC_PAPER=en_GB.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] parallel  methods   stats     graphics  grDevices utils     datasets 
[8] base     

other attached packages:
 [1] doParallel_1.0.11   iterators_1.0.8     foreach_1.4.3      
 [4] reshape2_1.4.2      xcms_3.0.0          MSnbase_2.4.0      
 [7] ProtGenerics_1.10.0 mzR_2.12.0          Rcpp_0.12.13       
[10] BiocParallel_1.12.0 Biobase_2.38.0      BiocGenerics_0.24.0

loaded via a namespace (and not attached):
 [1] compiler_3.4.0         BiocInstaller_1.28.0   RColorBrewer_1.1-2    
 [4] plyr_1.8.4             tools_3.4.0            zlibbioc_1.24.0       
 [7] MALDIquant_1.16.4      digest_0.6.12          preprocessCore_1.40.0 
[10] tibble_1.3.4           gtable_0.2.0           lattice_0.20-35       
[13] rlang_0.1.4            Matrix_1.2-11          stringr_1.2.0         
[16] S4Vectors_0.16.0       IRanges_2.12.0         stats4_3.4.0          
[19] multtest_2.34.0        grid_3.4.0             impute_1.52.0         
[22] survival_2.41-3        XML_3.98-1.9           RANN_2.5.1            
[25] limma_3.34.0           magrittr_1.5           ggplot2_2.2.1         
[28] MASS_7.3-47            splines_3.4.0          scales_0.5.0          
[31] pcaMethods_1.70.0      codetools_0.2-15       MassSpecWavelet_1.44.0
[34] mzID_1.16.0            colorspace_1.3-2       stringi_1.1.5         
[37] affy_1.56.0            lazyeval_0.2.1         munsell_0.4.3         
[40] vsn_3.46.0             affyio_1.48.0     
jorainer commented 6 years ago

Thanks for the update! Strange, memory demand should be relatively low. If you're using 8 workers, only 8 files are read at a time, so (unless your files are really huge) you should never get to 2,200GB. It could be that the memory, although freed by R (and no longer used), is not physically returned to the OS - it would only be returned if another process needed it, and how that is handled depends on the OS. Regarding the 20 hours run time: that actually sounds not too bad - you're basically processing 65 files per hour!

One note on the mzML files: reading from gzipped mzML files (i.e. mzML.gz) can be slow. We're usually compressing the binary data within the mzML files, but not the complete mzML files.

mheiser-md commented 6 years ago

@lauzikaite could it be that your peak picking parameters are causing this? E.g. with peakwidth = c(1.5, 14) I would expect to find a lot of peaks. And an snthresh of 5 also seems pretty low (but then again the prefilter should compensate for this). How many peaks does centWave find in a sample with those settings?

lauzikaite commented 6 years ago

Thanks for your suggestions @jotsetung and @mheiser-md

findChromPeaks(..., param = CentWaveParam(peakwidth = c(1.5, 14),
                       ppm = 25, 
                       noise = 500,
                       snthresh = 5,
                       mzdiff = 0.01, 
                       prefilter = c(10, 3000),
                       integrate = 2,
                       fitgauss = FALSE) )

Running with the above parameters (chosen based on the raw data) returns a relatively large number of peaks, but this is what we normally aim for with our datasets:

Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   1623    3208    3594    3790    4116   13683 

I've been trying to find a set of parameters that would reduce memory demand while retaining as many small peaks as possible. However, the memory demand is too large to peak-pick even 100 mzML files on macOS with 16GB RAM using just 4 workers.

readMSData() runs smoothly and without huge RAM overload, while findChromPeaks() crashes.

> registerDoParallel(4)
> register(DoparParam(), default = TRUE)
> print(DoparParam())
class: DoparParam
  bpisup: TRUE; bpnworkers: 4; bptasks: 0; bpjobname: BPJOB
  bplog: FALSE; bpthreshold: INFO; bpstopOnError: TRUE
  bptimeout: 2592000; bpprogressbar: FALSE

 > rdat <- readMSData(files,
                     msLevel. = 1,
                     mode = "onDisk",
                     verbose = T)
> cwp <- CentWaveParam(peakwidth = c(1.5, 14), 
                       ppm = 25, 
                       noise = 1000, # increased from 500 
                       snthresh = 5,
                       mzdiff = 0.01, 
                       prefilter = c(10, 10000),  # increased from 3000 
                       integrate = 2,
                       fitgauss = FALSE)
>  xset <- findChromPeaks(rdat, param = cwp)

It seems that our datafiles are too much for this method? What memory demand do you normally expect to see when working with UPLC-MS files?

mheiser-md commented 6 years ago

Did you have a look at IPO? For us, the xcms parameters generated by IPO typically yield between 3,000-10,000 peaks per sample. In addition, with these parameters xcms rarely misses any peaks and they are well enough integrated. Unless you have some crazy chromatography that is less than a minute long, I would expect a peakwidth of c(1.5, 14) to split your real peaks into several small ones. How many do you find in just one sample with these parameters?

See also: http://www.metabolomics-forum.com/index.php?topic=222.0

jorainer commented 6 years ago

@lauzikaite , thanks for providing feedback. One more thing, could you please provide the dimension of chromPeaks(xset)? And also the pryr::object_size(xset). Could be that this matrix gets so large that you run out of memory - haven't thought about that yet.

Also, could you please provide length(mz(rdat[1]))? It would be interesting to know how many data triplets you have in one file... one of our typical (centroided) MS files has about 1700 spectra and about 4,000,000 data triplets (i.e. m/z, intensity and rt values). So far I have run findChromPeaks on only 690 of these files and did not run into any memory problems - that was on macOS with 16GB of RAM.

@mheiser-md , the peakwidth of @lauzikaite is not so small. We also run a fast chromatography to speed up the measurements (because we had to measure 5000 samples) and I'm using a peakwidth = c(1.5, 8) - most of the chromatographic peaks we have are ~ 4 seconds long (in fact IPO did suggest a similar peak width).

lauzikaite commented 6 years ago

@jotsetung thanks for the update. Yes, our datafiles are rather large.

> dim(chromPeaks(xset))
3710789      12
> pryr::object_size(xset)
1.56 GB
> length(mz(rdat[1]))
4328
> sum(peaksCount(rdat[1]))
52049529 

@mheiser-md, I will look into IPO for more optimal parameters, however, preliminary tests with higher noise level and smaller peak-width windows suggest that memory demand cannot be solved just with this unfortunately.

jorainer commented 6 years ago

OK, the size of the xset does not seem so bad. We might be able to reduce it if we used e.g. a data.frame (or DataFrame) instead of a matrix, but it would not be much smaller...

What I find interesting is the number of data points you have in one file. 50,000,000 is quite large - is this in profile mode? Our profile mode mzML files also have ~60,000,000 peaks, but after centroiding they are down to ~2,000,000.

lauzikaite commented 6 years ago

Yes, these files are centroided. I wonder whether the way the data was recorded could influence this? Here is a sample raw file and the corresponding mzML.

jorainer commented 6 years ago

Hm, what I was wondering now: if I make a matrix with the same number of rows as your chromPeaks matrix, it sums up to only ~350MB, but you've got 1.6GB. Could you please also run object_size(fData(xset)) @lauzikaite? It could be that the featureData data.frame for your files is getting large. If that's the case: I'm currently working on a way to reduce this featureData data.frame to only the necessary columns.

jorainer commented 6 years ago

Had a look at your file, @lauzikaite. Within the <dataProcessingList> section of the mzML file only "Conversion to mzML" is listed, so you used ProteoWizard to convert from the Waters format to mzML. I assume centroiding was performed within the Waters software. We do not do the centroiding with the manufacturer's software (i.e. AB Sciex) but (up to now) with ProteoWizard; thus we have both "Conversion to mzML" and "peak picking" in the <dataProcessingList> of our mzML files.
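One quick way (an illustration, not code from the thread) to inspect the <dataProcessingList> entries of an mzML file is to scan the first part of the file as plain text; the file name is the one shared above:

```r
## The dataProcessingList sits in the mzML header, before the <run>
## element, so reading the first chunk of lines is usually enough.
hdr <- readLines("X_HPOS_ToF09_P41W53.mzML", n = 1000)
grep("dataProcessing|Conversion to mzML|peak picking", hdr, value = TRUE)
```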

library(xcms)

on_disk <- readMSData("X_HPOS_ToF09_P41W53.mzML", mode = "onDisk")

## Extracting one random spectrum
sp <- on_disk[[1000]]
plot(mz(sp), intensity(sp), type = "h")

[plot: full spectrum of the extracted scan]

In addition we call pickPeaks on the object (performs some simple centroiding) and zoom into one peak.

on_disk_c <- pickPeaks(on_disk)
## Extract the same spectrum from the centroided object
sp_c <- on_disk_c[[1000]]

## Zoom into a mz range and plot the data
mzr <- c(870, 890)
par(mfrow = c(2, 1))
plot(mz(filterMz(sp, mzr)), intensity(filterMz(sp, mzr)), type = "h",
     main = "unprocessed")
plot(mz(filterMz(sp_c, mzr)), intensity(filterMz(sp_c, mzr)), type = "h",
     main = "pickPeaks")

[plot: unprocessed vs. pickPeaks spectrum in the m/z 870-890 window]

In the upper plot (original file) there are many low-intensity signals between the bigger signals. We zoom further into this region, extract the m/z, intensity and retention time values, and plot them.

mzr <- c(878, 880)
msd <- extractMsData(on_disk, mz = mzr, rt = c(160, 180))
plotMsData(msd[[1]])

[plot: m/z vs. retention time for the unprocessed data, m/z 878-880, rt 160-180]

For the unprocessed data there seem to be several chromatographic peaks at roughly regular m/z distances (the yellowish horizontal lines). Between these are many low-intensity signals (dark blue). Below we plot the same region for the data after calling a simple pickPeaks.

mzr <- c(878, 880)
msd_c <- extractMsData(on_disk_c, mz = mzr, rt = c(160, 180))
plotMsData(msd_c[[1]])

[plot: same m/z 878-880, rt 160-180 region after pickPeaks]

The chromatographic peak signal seems to be retained. Performing (an additional?) centroiding on this data file reduces the object size (with the data loaded into memory) from 983 MB to 134 MB. You would most likely no longer run into memory issues, and processing time should also be reduced.

I'm not an expert in Waters data (and also have no clue about your experimental or machine setup), but to me it looks like your data is not centroided. I suggest you look at plots similar to the ones above for some compounds you know where to find (in terms of m/z and rt range) - internal standards or other known compounds. Eventually you might also give additional centroiding a try. You could re-run your analysis with and without an additional pickPeaks call on your raw object and compare the results (note that the peak picking also requires some time).
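The suggested with/without comparison could be sketched like this (the centWave parameters are placeholders, not recommendations):

```r
library(xcms)

raw <- readMSData("X_HPOS_ToF09_P41W53.mzML", mode = "onDisk")
cwp <- CentWaveParam(peakwidth = c(1.5, 8))

## Peak detection with and without the additional centroiding step.
res_raw  <- findChromPeaks(raw, param = cwp)
res_pick <- findChromPeaks(pickPeaks(raw), param = cwp)

## Compare the number of detected chromatographic peaks.
nrow(chromPeaks(res_raw))
nrow(chromPeaks(res_pick))
```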

SiggiSmara commented 6 years ago

Just to add a bit of confusion to this topic: looking at the data that @jotsetung produced, I would say the data is centroided, but there are a lot of low-signal centroids in the mzML file. A simple signal-to-noise filter on the spectra might also reduce the memory requirements for subsequent analyses if you save the resulting data to a new mzML file.
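One possible way to implement such a filter with MSnbase (not code from this thread; the intensity threshold of 100 is arbitrary and would need tuning per data set):

```r
library(MSnbase)

raw <- readMSData("X_HPOS_ToF09_P41W53.mzML", mode = "inMemory")

## removePeaks() sets intensities below the threshold to zero;
## clean() then drops the zero-intensity data points entirely.
flt <- clean(removePeaks(raw, t = 100), all = TRUE)

## Write the reduced data to a new, smaller mzML file.
writeMSData(flt, file = "X_HPOS_ToF09_P41W53_filtered.mzML")
```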

And also be aware that msconvert does not always behave the same way between versions. I currently have two 64-bit versions of msconvert, from ProteoWizard 3.0.9463 and 3.0.11626 (current as of today, 12/12/2017). Extracting the profile data with the two versions generates files of vastly different sizes: 3.0.9463 generates a 477 MB file, while the file extracted with 3.0.11626 is close to 898 MB. This is for the command-line tool; both GUI tools generated slightly larger files than their command-line counterparts, probably due to different default values that are not immediately clear to me. I have not looked at the data in these files.

I did, however, also investigate what happens if I convert with peak picking (centroiding) using ProteoWizard's cwt algorithm. In both cases the file sizes were identical to the profile data. If you do the conversion in the GUI msconvert tool, be aware that the 3.0.9463 version does not point out that vendor-specific centroiding is not available for this file, while both command-line tools give you a warning about this.
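For reference, hypothetical msconvert invocations for the two conversions discussed; the filter syntax follows ProteoWizard's documented "peakPicking" filter, but exact flags and defaults may differ between the versions mentioned:

```sh
# Profile data, straight conversion to mzML
msconvert sample.raw --mzML -o out_profile

# Centroiding with the cwt picker instead of the vendor algorithm
msconvert sample.raw --mzML -o out_cwt \
    --filter "peakPicking cwt snr=0.1 peakSpace=0.01 msLevel=1-"
```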

I also tried different signal-to-noise thresholds for the cwt centroiding, but there was no change in file size either.

Since the cwt centroiding does not seem to be doing much of anything, I think it is safe to assume that the original file is indeed centroided in some form, but also that ProteoWizard is not going to help you reduce the memory problem.

In my opinion, this means that you need to work on the data post-ProteoWizard: either re-centroid as @jotsetung suggested, or try a s/n filter or some other, cleverer way of getting rid of the low-intensity signal that is taking up most of your memory.

Or you could potentially do the s/n reduction on the Waters side prior to converting to mzML. It has been too long since I worked with Waters software, but I would be surprised if this were not possible.

trivedi-group commented 5 years ago

I am not adding anything new to this discussion, but it is very interesting to see that Waters files were already an issue back in 2017 and that in 2019 the problem remains. I am experiencing exactly the error described in the OP. The only difference is that my cluster is, thankfully, not running out of memory. But the file sizes are as big as described here, 1.6 to 2.1 GB, all centroided data. ProteoWizard does not change anything in terms of file size. I am able to read the files fine with the xcmsSet command, but retcor fails with the same error.

A "Waters-Filter" might be useful to get rid of so much redundant low intensity data before retcor.