sneumann / xcms

This is the git repository matching the Bioconductor package xcms: LC/MS and GC/MS Data Analysis
Other
178 stars 81 forks source link

Conflict between Waters lock mass and filling peaks #173

Closed yufree closed 7 years ago

yufree commented 7 years ago

Hi, I encountered a wired issues about Waters data.

I convert raw data from RAW to mzxml with msconvert( now they have lockmass refiner to correct the data). The mzxml data still have the lock mass gap. So I use lockMassFreq = T to fill the gap and it could find more peaks using stitch method. However, if I use this method, the fillpeaks would always return BiocParallel errors:

Error: BiocParallel errors
  element index: 1, 2, 3
  first error: 'fromIdx' has to be smaller than 'toIdx'!

If I set lockMassFreq = F, the errors disappeared.

I was confused by the conflict between those two function. Here are the dataset and codes I used:

demo dataset: https://figshare.com/articles/demo_data_for_the_error/5007965

code:

cdffiles <- list.files(path = 'D:/data/blk/', recursive = TRUE, full.names = TRUE)
xset <-
xcms::xcmsSet(
cdffiles,
BPPARAM = BiocParallel::SnowParam(workers = 12),
method = "centWave",
ppm = 15,
peakwidth = c(5, 20),
prefilter = c(0, 0)
)
# This is the one with correction
# xset <-
# xcms::xcmsSet(
# cdffiles,
# BPPARAM = BiocParallel::SnowParam(workers = 12),
# method = "centWave",
# ppm = 15,
# peakwidth = c(5, 20),
# prefilter = c(0, 0),
# lockMassFreq = T
# )
xset <- xcms::group(xset,
bw = 2,
mzwid = 0.015)
xset2 <- xcms::retcor(xset, method="obiwarp")
xset2 <-
xcms::group(xset2,
bw = 2,
mzwid = 0.015)
xset3 <- fillPeaks(xset2)

Here is my rsessioninfo:

R version 3.4.0 (2017-04-21)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1
Matrix products: default
locale:
[1] LC_COLLATE=English_Canada.1252 
[2] LC_CTYPE=English_Canada.1252   
[3] LC_MONETARY=English_Canada.1252
[4] LC_NUMERIC=C                   
[5] LC_TIME=English_Canada.1252    
attached base packages:
[1] parallel  stats     graphics  grDevices
[5] utils     datasets  methods   base     
other attached packages:
 [1] dplyr_0.5.0          
 [2] purrr_0.2.2          
 [3] readr_1.1.0          
 [4] tidyr_0.6.1          
 [5] tibble_1.3.0         
 [6] ggplot2_2.2.1        
 [7] tidyverse_1.1.1      
 [8] enviGCMS_0.2.1       
 [9] xsetplus_0.1.1       
[10] xcms_1.52.0          
[11] MSnbase_2.2.0        
[12] ProtGenerics_1.8.0   
[13] mzR_2.10.0           
[14] Rcpp_0.12.10         
[15] xMSannotator_1.2.1   
[16] Rdisop_1.36.0        
[17] RcppClassic_0.9.6    
[18] pcaMethods_1.68.0    
[19] Biobase_2.36.0       
[20] BiocGenerics_0.22.0  
[21] flashClust_1.01-2    
[22] KEGGREST_1.16.0      
[23] plyr_1.8.4           
[24] rjson_0.2.15         
[25] png_0.1-7            
[26] SSOAP_0.9-0          
[27] doSNOW_1.0.14        
[28] snow_0.4-2           
[29] iterators_1.0.8      
[30] foreach_1.4.3        
[31] RCurl_1.95-4.8       
[32] bitops_1.0-6         
[33] WGCNA_1.51           
[34] fastcluster_1.1.22   
[35] dynamicTreeCut_1.63-1
[36] R2HTML_2.3.2         
[37] XML_3.98-1.6         
[38] BiocParallel_1.10.0  
loaded via a namespace (and not attached):
 [1] colorspace_1.3-2      
 [2] qvalue_2.8.0          
 [3] htmlTable_1.9         
 [4] XVector_0.16.0        
 [5] base64enc_0.1-3       
 [6] affyio_1.46.0         
 [7] AnnotationDbi_1.38.0  
 [8] lubridate_1.6.0       
 [9] xml2_1.1.1            
[10] codetools_0.2-15      
[11] splines_3.4.0         
[12] mnormt_1.5-5          
[13] doParallel_1.0.10     
[14] impute_1.50.0         
[15] knitr_1.15.20         
[16] Formula_1.2-1         
[17] jsonlite_1.4          
[18] annotate_1.54.0       
[19] broom_0.4.2           
[20] cluster_2.0.6         
[21] vsn_3.44.0            
[22] GO.db_3.4.1           
[23] compiler_3.4.0        
[24] httr_1.2.1            
[25] backports_1.0.5       
[26] assertthat_0.2.0      
[27] Matrix_1.2-10         
[28] lazyeval_0.2.0        
[29] limma_3.32.2          
[30] acepack_1.4.1         
[31] htmltools_0.3.6       
[32] tools_3.4.0           
[33] gtable_0.2.0          
[34] affy_1.54.0           
[35] RANN_2.5              
[36] reshape2_1.4.2        
[37] MALDIquant_1.16.2     
[38] cellranger_1.1.0      
[39] Biostrings_2.44.0     
[40] multtest_2.32.0       
[41] preprocessCore_1.38.0 
[42] nlme_3.1-131          
[43] psych_1.7.5           
[44] stringr_1.2.0         
[45] rvest_0.3.2           
[46] zlibbioc_1.22.0       
[47] MASS_7.3-47           
[48] scales_0.4.1          
[49] BiocInstaller_1.26.0  
[50] hms_0.3               
[51] MassSpecWavelet_1.42.0
[52] RColorBrewer_1.1-2    
[53] memoise_1.1.0         
[54] gridExtra_2.2.1       
[55] rpart_4.1-11          
[56] latticeExtra_0.6-28   
[57] stringi_1.1.5         
[58] RSQLite_1.1-2         
[59] genefilter_1.58.0     
[60] S4Vectors_0.14.0      
[61] checkmate_1.8.2       
[62] matrixStats_0.52.2    
[63] mzID_1.14.0           
[64] lattice_0.20-35       
[65] htmlwidgets_0.8       
[66] magrittr_1.5          
[67] R6_2.2.0              
[68] IRanges_2.10.0        
[69] Hmisc_4.0-3           
[70] DBI_0.6-1             
[71] mgcv_1.8-17           
[72] haven_1.0.0           
[73] foreign_0.8-68        
[74] survival_2.41-3       
[75] nnet_7.3-12           
[76] modelr_0.1.0          
[77] grid_3.4.0            
[78] readxl_1.0.0          
[79] sva_3.24.0            
[80] data.table_1.10.4     
[81] XMLSchema_0.8-0       
[82] forcats_0.2.0         
[83] digest_0.6.12         
[84] xtable_1.8-2          
[85] stats4_3.4.0          
[86] munsell_0.4.3

I checked the code and doubt the error might come from the bin method for fillPeaks. However, I am not sure about the details. Such errors might be reproduce on Waters data if using lockMassFreq = T.

sneumann commented 7 years ago

Hi Miao YU,

thanks for the detailed reporting. I only get warnings, and couldn't check the resulting xset in detail. I also started a discussion in #174 if there is a future way to avoid the lockmass correction in first place. Your input there would be highly appreciated.

##This is the one with correction
xset <-
xcms::xcmsSet(
cdffiles,
BPPARAM = BiocParallel::SnowParam(workers = 12),
method = "centWave",
ppm = 15,
peakwidth = c(5, 20),
prefilter = c(0, 0),
lockMassFreq = T
)
[...]
In matrix(lockMass, ncol = 2, byrow = TRUE) :In matrix(lockMass, ncol = 2, byrow = TRUE) :In matrix(lockMass, ncol = 2, byrow = TRUE) :
  data length [11] is not a sub-multiple or multiple of the number of rows [6]
  data length [11] is not a sub-multiple or multiple of the number of rows [6]
  data length [11] is not a sub-multiple or multiple of the number of rows [6]
> xset xset 
An "xcmsSet" object with 3 samples

Time range: 3.6-1079.7 seconds (0.1-18 minutes)
Mass range: 104.1082-999.8066 m/z
Peaks: 3481 (about 1160 per sample)
Peak Groups: 0 
Sample classes: figshare 

Peak picking was performed on MS1.
Profile settings: method = bin
                  step = 0.1

Memory usage: 0.53 MB

Yours, Steffen


> sessionInfo()
R version 3.3.2 (2016-10-31)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 17.04

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=de_DE.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=de_DE.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=de_DE.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=de_DE.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] parallel  stats     graphics  grDevices utils     datasets  methods  
[8] base     

other attached packages:
[1] xcms_1.50.1         Biobase_2.34.0      ProtGenerics_1.6.0 
[4] BiocGenerics_0.20.0 mzR_2.8.1           Rcpp_0.12.10       

loaded via a namespace (and not attached):
 [1] RANN_2.5               lattice_0.20-34        codetools_0.2-15      
 [4] snow_0.4-2             MASS_7.3-45            MassSpecWavelet_1.40.0
 [7] grid_3.3.2             plyr_1.8.4             stats4_3.3.2          
[10] S4Vectors_0.12.2       Matrix_1.2-8           splines_3.3.2         
[13] BiocParallel_1.8.2     RColorBrewer_1.1-2     tools_3.3.2           
[16] compiler_3.3.2         survival_2.40-1        multtest_2.30.0  ````
yufree commented 7 years ago

Hi Steffen,

Thanks for your test!

Today I also test it on Ubuntu 16.10 and same error return. Also from my Mac. Based on this, I wonder something happened related to operation system compiler for BiocParallel.

Besides, the error only returned when I want to fill peaks using fillPeaks instead of xsetSet. This is the most wired thing, they should use the similar profile methods.

I would try another method to find the source of this error.

Update: Here is the error:

Error: BiocParallel errors
  element index: 1, 2, 3
  first error: 'fromIdx' has to be smaller than 'toIdx'! 
6.
stop(.error_bplist(res)) 
5.
bplapply(argList, fillPeaksChromPar, BPPARAM = BPPARAM) 
4.
bplapply(argList, fillPeaksChromPar, BPPARAM = BPPARAM) at methods-xcmsSet.R#1181
3.
.local(object, ...) 
2.
fillPeaks.chrom(xset2) at AllGenerics.R#105
1.
fillPeaks.chrom(xset2) 

I check the source code and I found the following code for fillPeaks (MPI.R) related to this issue:

lcraw <- xcmsRaw(arg$file, profmethod=params$prof$method, profstep = 0)

    if (length(params$dataCorrection) > 1) {
        ## Note: dataCorrection (as set in the xcmsSet function) is either
        ## 1 for all or for none.
        if (any(params$dataCorrection) == 1)
            lcraw <- stitch(lcraw, AutoLockMass(lcraw))
    }

I also saw stitch function in findpeak which looks similar to the code above:

xRaw <- xcmsRaw(arg$file, profmethod=params$profmethod,
                    profparam=params$profparam, profstep = 0,
                    includeMSn=params$includeMSn, mslevel=params$mslevel,
                    scanrange=params$scanrange)
    if(params$lockMassFreq == TRUE){
        xRaw <- stitch(xRaw, AutoLockMass(xRaw))
    }

The issue is the latter could work fine while the former from fillpeaks would return an error.

Thanks, Miao

jorainer commented 7 years ago

Thanks @yufree for looking into the code. What I find a little strange now is that both methods use a different parameter (lockMassFreq vs dataCorrection). I'll investigate.

hpbenton commented 7 years ago

The code was made to work with the snow framework and probably wasn't updated to the bioconductor parallel version. Datacorrection is essentially a binary and lockmassfreq is a found parameter as asking the user to find the lock mass frequency themselves is unrealistic. The whole trouble with this situation is you don't have access to the original waters raw file therefore it's really hard to tell where the lock mass scans are in a robust fashion across labs. I havnt looked at the method in R XCMS in a while so datacorrection I believe is a matrix of binary for the each file but I could be mistaken. Late here - maybe tomorrow.

Sent from my mobile

On May 18, 2017, at 10:38 PM, Johannes Rainer notifications@github.com wrote:

Thanks @yufree for looking into the code. What I find a little strange now is that both methods use a different parameter (lockMassFreq vs dataCorrection). I'll investigate.

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub, or mute the thread.

jorainer commented 7 years ago

Checked the code. The lockMassFreq and the dataCorrection are fine. The error is thrown by the new binning function binYonX after data stitching. While centWave does not require the profile matrix generation (i.e. the binning), the fillPeaks does, since peaks are filled-in from the profile matrix (and not the actual raw data).

I suspect it has something to do with the @scanindex vector.

jorainer commented 7 years ago

Now, a shorter test that shows the error is:

> library(xcms)
> cdffiles <- dir(pattern = "mzXML")
> lr <- xcmsRaw(cdffiles[1])
Create profile matrix with method 'bin' and step 1 ... OK
> lr2 <- stitch(lr, AutoLockMass(lr))
Warning message:
In matrix(lockMass, ncol = 2, byrow = TRUE) :
  data length [11] is not a sub-multiple or multiple of the number of rows [6]
> pm <- profMat(lr2, step = 1)
Create profile matrix with method 'bin' and step 1 ... Error in binYonX(mz, int, breaks = brks, fromIdx = fromIdx, toIdx = toIdx,  : 
  'fromIdx' has to be smaller than 'toIdx'!
jorainer commented 7 years ago

OK, fixed it:

> library(xcms)
> cdffiles <- dir(pattern = "mzXML")
> lr <- xcmsRaw(cdffiles[1])
Create profile matrix with method 'bin' and step 1 ... OK
> lr2 <- stitch(lr, AutoLockMass(lr))
> pm <- profMat(lr2, step = 1)
Create profile matrix with method 'bin' and step 1 ... OK

The stitch.xml method was looping from scan 1:length() -1 and thus missed to add the mz and intensity values from the last spectrum to the new, stitched, object (while still adding the information to the @scanindex vector). This caused this error.

See notes in the code of commit https://github.com/sneumann/xcms/commit/56c43ddfd7eece4fe58f1f81da66714bf23e32ef#diff-69d6894f08d72ebed9550250be653318R2376 for explanations and the fix.

jorainer commented 7 years ago

@yufree since it might take some time until we merge that into Bioc devel: to install the newest xcms from github (that includes the fix):

library(devtools)
install_github("sneumann/xcms", ref = "xcms3")
yufree commented 7 years ago

Now it works!

Thanks a lot! @jotsetung @hpbenton @sneumann