sneumann / xcms

This is the git repository matching the Bioconductor package xcms: LC/MS and GC/MS Data Analysis
Other
183 stars 80 forks source link

Problem with fillChromPeaks #653

Closed felipe-mansoldo closed 1 year ago

felipe-mansoldo commented 1 year ago

Dear all,

I'm working with low resolution spectra. I'm having problems when performing fillChromPeaks. Please check it:

register(SerialParam())
xcms.data <- readMSData(files = fls[1:3],
                      pdata = new("NAnnotatedDataFrame", pd[1:3,]),
                      mode = "onDisk")

mfp <- MatchedFilterParam(binSize = 1)
snthresh(mfp) <- 1
message("- findChromPeaks ... ")
xdata <- findChromPeaks(xcms.data, param=mfp)
fdp <- PeakDensityParam(sampleGroups = xcms.data$sample_group,
                        binSize = 1)
res <- groupChromPeaks(xdata, fdp)
sum(is.na(featureValues(res)))
res <- fillChromPeaks(res, param = FillChromPeaksParam())

as result we have:

> res <- groupChromPeaks(xdata, fdp)
Processing 1847 mz slices ... OK
> sum(is.na(featureValues(res)))
[1] 1056
> res <- fillChromPeaks(res, param = FillChromPeaksParam())
Defining peak areas for filling-in .... OK
Start integrating peak areas from original files
Requesting 515 peaks from spectra_emend.mzML ... Error: BiocParallel errors
  1 remote errors, element index: 1
  2 unevaluated and other errors
  first remote error:
Error in binYonX(mz, int, breaks = brks, fromIdx = fromIdx, toIdx = toIdx, : 'fromIdx' and 'toIdx' have to be >= 0!

SessionInfo

> sessionInfo()
R version 4.2.0 (2022-04-22 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19044)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.utf8  LC_CTYPE=English_United States.utf8   
[3] LC_MONETARY=English_United States.utf8 LC_NUMERIC=C                          
[5] LC_TIME=English_United States.utf8    

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] ggplot2_3.4.0       dplyr_1.0.10        xcms_3.20.0         MSnbase_2.24.0     
 [5] mzR_2.32.0          Rcpp_1.0.9          Biobase_2.58.0      Spectra_1.8.0      
 [9] ProtGenerics_1.30.0 BiocParallel_1.32.1 S4Vectors_0.36.0    BiocGenerics_0.44.0

loaded via a namespace (and not attached):
 [1] MatrixGenerics_1.10.0       tidyr_1.2.1                 vsn_3.66.0                 
 [4] foreach_1.5.2               BiocManager_1.30.19         affy_1.76.0                
 [7] cellranger_1.1.0            GenomeInfoDbData_1.2.9      ggrepel_0.9.2              
[10] robustbase_0.95-0           impute_1.72.1               factoextra_1.0.7           
[13] pillar_1.8.1                lattice_0.20-45             glue_1.6.2                 
[16] limma_3.54.0                digest_0.6.30               GenomicRanges_1.50.1       
[19] RColorBrewer_1.1-3          XVector_0.38.0              colorspace_2.0-3           
[22] preprocessCore_1.60.0       Matrix_1.5-3                plyr_1.8.8                 
[25] MALDIquant_1.22             XML_3.99-0.12               pkgconfig_2.0.3            
[28] zlibbioc_1.44.0             purrr_0.3.5                 scales_1.2.1               
[31] RANN_2.6.1                  affyio_1.68.0               tibble_3.1.8               
[34] generics_0.1.3              IRanges_2.32.0              withr_2.5.0                
[37] SummarizedExperiment_1.28.0 cli_3.4.1                   MassSpecWavelet_1.64.0     
[40] magrittr_2.0.3              readxl_1.4.1                fs_1.5.2                   
[43] ncdf4_1.19                  fansi_1.0.3                 doParallel_1.0.17          
[46] MASS_7.3-58.1               MsFeatures_1.6.0            tools_4.2.0                
[49] lifecycle_1.0.3             matrixStats_0.63.0          munsell_0.5.0              
[52] cluster_2.1.4               DelayedArray_0.24.0         pcaMethods_1.90.0          
[55] compiler_4.2.0              GenomeInfoDb_1.34.4         mzID_1.36.0                
[58] rlang_1.0.6                 grid_4.2.0                  RCurl_1.98-1.9             
[61] iterators_1.0.14            rstudioapi_0.14             MsCoreUtils_1.10.0         
[64] bitops_1.0-7                gtable_0.3.1                codetools_0.2-18           
[67] DBI_1.1.3                   R6_2.5.1                    utf8_1.2.2                 
[70] clue_0.3-62                 parallel_4.2.0              vctrs_0.5.0                
[73] DEoptimR_1.0-11             tidyselect_1.2.0

If you want to look at the raw files, you can download them here: https://drive.google.com/drive/folders/1zK14Mv6370A8wdYyBQtULDCF99hYwYbv?usp=share_link

thanks, my best

linlennypinawa commented 1 year ago

I downloaded your raw files and run your codes.

library(xcms)

register(SerialParam())

dest.filepath <- r"(D:\Projects_R\Project_32_Problem with fillChromPeaks #653\data)"

groups_list <- list.files(path =dest.filepath, pattern = ".mzML", full.name = T)

sample_names <- data.frame(sample_name = sub(basename(groups_list), pattern = ".mzML",
                                             replacement = "", fixed = TRUE),
                           sample_group = rep("samples", 3),
                           id = c(1:3),
                           stringsAsFactors = FALSE)

xcms.data <- readMSData(files = groups_list,
                        pdata = new("NAnnotatedDataFrame", sample_names),
                        mode = "onDisk")

mfp <- MatchedFilterParam(binSize = 1)
snthresh(mfp) <- 1
message("- findChromPeaks ... ")
xdata <- findChromPeaks(xcms.data, param=mfp)
fdp <- PeakDensityParam(sampleGroups = xcms.data$sample_group,
                        binSize = 1)
res <- groupChromPeaks(xdata, fdp)
sum(is.na(featureValues(res)))
res <- fillChromPeaks(res, param = FillChromPeaksParam())

I got the following messages:

...
> sum(is.na(featureValues(res)))
[1] 82
> res <- fillChromPeaks(res, param = FillChromPeaksParam())
Defining peak areas for filling-in .... OK
Start integrating peak areas from original files
Requesting 31 peaks from spectra_zao.mzML ... 
Requesting 51 peaks from spectra_emend.mzML ... 
Requesting 0 peaks from spectra_zaoZero.mzML ... FAIL: no MS level 1 data available.

Error: BiocParallel errors
  2 remote errors, element index: 1, 2
  0 unevaluated and other errors
  first remote error:
Error in binYonX(mz, int, breaks = brks, fromIdx = fromIdx, toIdx = toIdx, : 'fromIdx' and 'toIdx' have to be >= 0!
In addition: Warning messages:
1: In serialize(data, node$con) :
  'package:stats' may not be available when loading
2: In serialize(data, node$con) :
  'package:stats' may not be available when loading
3: In serialize(data, node$con) :
  'package:stats' may not be available when loading

First off, the outcome of sum(is.na(featureValues(res))) is 82, instead of 1056.

Secondly, you failed at Requesting 515 peaks from spectra_emend.mzML, but I was fine with processing this file.

Thirdly, it says this: Requesting 0 peaks from spectra_zaoZero.mzML ... FAIL: no MS level 1 data available.

Please check the MS level 1 data availability. I wish it would give you hint.

felipe-mansoldo commented 1 year ago

Dear @linlennypinawa , Thanks for your kind help. The problem persists and I will answer your questions:

My spectra base is bigger than these 3 that I shared, and this changed the: sum(is.na(featureValues(res)))

I ran the codes again, now with only the 3 spectra and also got the same number.

The spectra_zaoZero.mzML is fine, see the screenshot: screen1

So I don't know what could be wrong yet.


> xdata
MSn experiment data ("XCMSnExp")
Object size in memory: 1.3 Mb
- - - Spectra data - - -
 MS level(s): 1 
 Number of spectra: 4087 
 MSn retention times: 0:00 - 3:60 minutes
- - - Processing information - - -
Data loaded [Tue Jan  3 14:44:28 2023] 
 MSnbase version: 2.22.0 
- - - Meta data  - - -
phenoData
  rowNames: 1 2 3
  varLabels: sample_name sample_group id
  varMetadata: labelDescription
Loaded from:
  [1] spectra_emend.mzML...  [3] spectra_zaoZero.mzML
  Use 'fileNames(.)' to see all files.
protocolData: none
featureData
  featureNames: F1.S001 F1.S002 ... F3.S1592 (4087 total)
  fvarLabels: fileIdx spIdx ... spectrum (35 total)
  fvarMetadata: labelDescription
experimentData: use 'experimentData(object)'
- - - xcms preprocessing - - -
Chromatographic peak detection:
 method: matchedFilter 
 1104 peaks identified in 3 samples.
 On average 368 chromatographic peaks per sample.
> fdp <- PeakDensityParam(sampleGroups = xcms.data$sample_group,
+                         binSize = 1)
> res <- groupChromPeaks(xdata, fdp)
Processing 1847 mz slices ... OK
> sum(is.na(featureValues(res)))
[1] 82
> res <- fillChromPeaks(res, param = FillChromPeaksParam())
Defining peak areas for filling-in .... OK
Start integrating peak areas from original files
Requesting 51 peaks from spectra_emend.mzML ... Error: BiocParallel errors
  1 remote errors, element index: 1
  2 unevaluated and other errors
  first remote error:
Error in binYonX(mz, int, breaks = brks, fromIdx = fromIdx, toIdx = toIdx, : 'fromIdx' and 'toIdx' have to be >= 0!
linlennypinawa commented 1 year ago

I wonder if you could adjust the retention time then execute fillChromPeaks() and see if it makes difference.

I also run featureValues(res) in a scenario of one file and the scenario of two files.

for ...emend..., the length(featureValues(res)) = 179 for ...zao..., the length(featureValues(res)) = 123 for both ...emend... & ...zao..., the length(featureValues(res)) = 544

This is the details for further troubleshooting.

Another observation is the retention time ranges are different among three files.

$spectra_emend
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   0.01   31.35   62.67   62.67   94.00  125.33 

$spectra_zao
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
  0.0113  59.9172 119.8350 119.8297 179.7393 239.6553 

$spectra_zaoZero
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
  0.0099  58.2465 116.4894 116.4815 174.7107 232.9401 

So I put ...zao... & ...zaoZero... in the third scenario. I encountered the same problem.

felipe-mansoldo commented 1 year ago

Maybe the problem really is the different retention times. But this is the nature of data that comes from direct injection.

An alternative would be MzClustParam (https://www.bioconductor.org/packages/devel/bioc/vignettes/xcms/inst/doc/xcms-direct-injection.html), however the samples contain multiple spectra per sample.. and this algorithm does only work for single spectra files/samples.

linlennypinawa commented 1 year ago

The retention time would not be the problem, because I suspected it initially, so executed filterRt() to get the same time range. The problem is still there. I am looking into this: https://bioconductor.org/packages/devel/bioc/vignettes/BiocParallel/inst/doc/Errors_Logs_And_Debugging.pdf.

Please take a look at it, and see what we can find.

linlennypinawa commented 1 year ago

I am curious why there are no signals at some times? I expect signals, even noise.

felipe-mansoldo commented 1 year ago

Good idea about using filterRt(). I also can't say why the zeroed signals, it may have been in the mscovert conversion.

sneumann commented 1 year ago

Hi, without having looked at the data, we have experienced zero intensities if the mzmin/mzmax from the other samples give a rather narrow mz window to integrate from, and it gets worse if the sample with the NA has (slightly) different mz calibration. Then the fillPeaks will integrate nothing (zeros) from above or below where the wanted signal is. => Plot the raw data and mz/rt window for some of the peaks with zero intensity of the file with NA. Yours, Steffen

linlennypinawa commented 1 year ago

Hi, without having looked at the data, we have experienced zero intensities if the mzmin/mzmax from the other samples give a rather narrow mz window to integrate from, and it gets worse if the sample with the NA has (slightly) different mz calibration. Then the fillPeaks will integrate nothing (zeros) from above or below where the wanted signal is. => Plot the raw data and mz/rt window for some of the peaks with zero intensity of the file with NA. Yours, Steffen

Dear Steffen,

I wonder if there is pre-requistion of data quality for successful data analysis. This case triggles my question: How shall I explore data ensuring data meets certain criteria before taking to futher steps?

linlennypinawa commented 1 year ago

Good idea about using filterRt(). I also can't say why the zeroed signals, it may have been in the mscovert conversion.

Hi Felipe-mansoldo,

I can tell that you use Thermo Sci LTQ XL with radial ejection linear ion trap from the raw data. I wonder if you configured the setting of ion trap for releasing ions to the MS detector.

Are there zero signals in other files? Maybe you could group those files with no zero signals and try to run codes with them, and see what you get.

felipe-mansoldo commented 1 year ago

@sneumann @linlennypinawa thank you for your help. It really seems that the problem is in the zeros. I will show the plot.

xdata.chr <- chromatogram(xdata, aggregationFun = "max")
xdata.chr.data <- xdata.chr[1, 1]
xdata.chr.df <- data.frame(rt = xdata.chr.data@rtime, int= xdata.chr.data@intensity)
ggplot(xdata.chr.df, aes(rt, int, color = factor(c(1))))+
  geom_path(aes(group = 1),size=1) 

image

About the zero signals. I went to check and all the files have zeros in them due to the nature of how we got the data. I remember that in some moments, in the acquisition, the screen really turned white.

To explore the case, I don't know if it's a correct way, I thought of replacing the zeros and NAs with a small value like 1/100000 .

xcms.data@featureData@data <- xcms.data@featureData@data %>% mutate(across(where(is.numeric), ~replace(., is.na(.), 1/100000 ) ) )
xcms.data@featureData@data <- xcms.data@featureData@data %>% mutate(across(where(is.numeric), ~replace(., .==0, 1/100000 ) ) )
xcms.data@featureData@data[["polarity"]] <- 0

and got the same error:

> res <- groupChromPeaks(xdata, fdp)
Processing 1847 mz slices ... OK
> sum(is.na(featureValues(res)))
[1] 82
> res <- fillChromPeaks(res, param = FillChromPeaksParam())
Defining peak areas for filling-in .... OK
Start integrating peak areas from original files
Requesting 51 peaks from spectra_emend.mzML ... Error: BiocParallel errors
  1 remote errors, element index: 1
  2 unevaluated and other errors
  first remote error:
Error in binYonX(mz, int, breaks = brks, fromIdx = fromIdx, toIdx = toIdx, : 'fromIdx' and 'toIdx' have to be >= 0!
> 
jorainer commented 1 year ago

@felipe-mansoldo I strongly advice against changing the data directly with @featureData... etc. The actual MS data is NOT kept in memory but only retrieved on-the-fly from the original raw data files.

Also, I would suggest to use the ChromPeakAreaParam instead of the FillChromPeaksParam in fillChromPeaks. The FillChromPeaksParam is the old way xcms was doing the peak filling, and that was IMHO underestimating the real data since the MS area from which the signal was integrated was only considering the ranges of the peak apexes but not the actual m/z and retention time ranges of the detected chromatographic peaks. ChromPeaksAreaParam will integrate signal from the m/z and rt range that represents the ranges observed for the detected chromatographic peaks instead and delivers quite accurate signal.

Maybe that already fixes your problem. Otherwise I would need to dig into the code to see where the problem stems from.

felipe-mansoldo commented 1 year ago

@jorainer thanks for the help and attention. I followed your indication with ChromPeakAreaParam and unfortunately the problem persists, see:

code:

library(xcms)
register(SerialParam())
dest.filepath <- r"(./data.share)"
groups_list <- list.files(path =dest.filepath, pattern = ".mzML", full.name = T)
sample_names <- data.frame(sample_name = sub(basename(groups_list), pattern = ".mzML",
                                             replacement = "", fixed = TRUE),
                           sample_group = rep("samples", 3),
                           id = c(1:3),
                           stringsAsFactors = FALSE)
xcms.data <- readMSData(files = groups_list,
                        pdata = new("NAnnotatedDataFrame", sample_names),
                        mode = "onDisk")
mfp <- MatchedFilterParam(binSize = 1)
snthresh(mfp) <- 1
xdata <- findChromPeaks(xcms.data, param=mfp)
fdp <- PeakDensityParam(sampleGroups = xcms.data$sample_group,
                        binSize = 1)
res <- groupChromPeaks(xdata, fdp)
sum(is.na(featureValues(res)))
res <- fillChromPeaks(res, param = ChromPeakAreaParam())

output:

> res <- groupChromPeaks(xdata, fdp)
Processing 1847 mz slices ... OK
> sum(is.na(featureValues(res)))
[1] 82
> res <- fillChromPeaks(res, param = ChromPeakAreaParam())
Defining peak areas for filling-in ... OK
Start integrating peak areas from original files
Requesting 51 peaks from spectra_emend.mzML ... Error: BiocParallel errors
  1 remote errors, element index: 1
  2 unevaluated and other errors
  first remote error:
Error in binYonX(mz, int, breaks = brks, fromIdx = fromIdx, toIdx = toIdx, : 'fromIdx' and 'toIdx' have to be >= 0!
> 
jorainer commented 1 year ago

I had a first check with your data - what puzzled me a bit was that the retention time ranges from the individual files does not match:

rts <- split(rtime(xdata), fromFile(xdata))
lapply(rts, range)
$`1`
[1]   0.0100 125.3256

$`2`
[1]   0.0113 239.6553

$`3`
[1]   0.0099 232.9401

I first suspected this to be the problem, but even after filtering the dataset with filterRt(xcms.data, rt = c(0.5, 125)) I still got the same error. So, this is a little strange, but not the cause of the error.

There seem however to be quite some empty spectra in the dataset. I thus ran

xcms.data <- filterEmptySpectra(xcms.data)

and after that I was able to run the fillChromPeaks without errors. Now, removing empty spectra also increased the number of detected chromatographic peaks per sample - the matched filter method is AFAIK relatively sensitive regarding missing or 0 values. Could you maybe please if this would also fix the problem on the full data set and if removing empty spectra still results in valid (and expected) peaks @felipe-mansoldo ?

felipe-mansoldo commented 1 year ago

Dear @jorainer, Here it didn't work as expected, please see the script and output:

library(xcms)

register(SerialParam())
dest.filepath <- r"(./data.share)"
groups_list <- list.files(path =dest.filepath, pattern = ".mzML", full.name = T)
sample_names <- data.frame(sample_name = sub(basename(groups_list), pattern = ".mzML",
                                             replacement = "", fixed = TRUE),
                           sample_group = rep("samples", 3),
                           id = c(1:3),
                           stringsAsFactors = FALSE)

xcms.data <- readMSData(files = groups_list,
                        pdata = new("NAnnotatedDataFrame", sample_names),
                        mode = "onDisk")
length(xcms.data)
xcms.data <- filterEmptySpectra(xcms.data)
length(xcms.data)

mfp <- MatchedFilterParam(binSize = 1)
snthresh(mfp) <- 1
xdata <- findChromPeaks(xcms.data, param=mfp)
fdp <- PeakDensityParam(sampleGroups = xcms.data$sample_group,binSize = 1)
res <- groupChromPeaks(xdata, fdp)
sum(is.na(featureValues(res)))
res <- fillChromPeaks(res, param = ChromPeakAreaParam())

outputs:


> length(xcms.data)
[1] 4087
> xcms.data <- filterEmptySpectra(xcms.data)
> length(xcms.data)
[1] 1222
> 
> mfp <- MatchedFilterParam(binSize = 1)
> snthresh(mfp) <- 1
> xdata <- findChromPeaks(xcms.data, param=mfp)
> fdp <- PeakDensityParam(sampleGroups = xcms.data$sample_group,binSize = 1)
> res <- groupChromPeaks(xdata, fdp)
Processing 1848 mz slices ... OK
> sum(is.na(featureValues(res)))
[1] 211
> res <- fillChromPeaks(res, param = ChromPeakAreaParam())
Defining peak areas for filling-in ... OK
Start integrating peak areas from original files
Requesting 56 peaks from spectra_emend.mzML ... got 11.
Requesting 155 peaks from spectra_zao.mzML ... got 31.
Requesting 0 peaks from spectra_zaoZero.mzML ... FAIL: no MS level 1 data available.
Error in x$.self$finalize() : attempt to apply non-function
In addition: Warning messages:
1: In min(x) : no non-missing arguments to min; returning Inf
2: In max(x) : no non-missing arguments to max; returning -Inf
Error in x$.self$finalize() : attempt to apply non-function

error: "spectra_zaoZero.mzML ... FAIL: no MS level 1 data available" The reported error is not valid because there is MS1 data in spectra_zaoZero.mzML: image

thanks

jorainer commented 1 year ago

OK, I will dig into the code.

Btw: the no MS level 1 data available might be misleading. It does not mean that no MS1 data is there, but that no peaks were present for the selected m/z and rt range. I will also see if/how I can make that message clearer.

jorainer commented 1 year ago

Indeed, there is a problem with empty spectra when creating the profile matrix (i.e. the binned matrix of intensities). Thus the fillChromPeaks fails because the creation of this profile matrix fails.

Your data has many empty spectra and thus this call failed. I have now added a fix (the same that is used in the MatchedFilter peak detection) that adds a fake 0 intensity, 0 m/z peak to each empty spectrum. With that you should not get any error anymore.

I'm however not sure if maybe simply removing empty spectra before running peak detection would be the cleaner solution? In your example above filterEmptySpectra actually fixed your problem. You can ignore the Error in self$finalize messages as they are not really a problem - and the message about the MS level is simply misleading. But you would need to check if the results you get are comparable or even better than those with the empty spectra.

Anyway, @felipe-mansoldo , you can install the version with the fix using:

BiocManager::install("sneumann/xcms", ref = "jomain")

Can you maybe please let me know if this fixes the problem?

felipe-mansoldo commented 1 year ago

Dear @jorainer Thank you very much for your help and attention! I installed this new version of xcms. This time the script worked perfectly! My best regards,

jorainer commented 1 year ago

thanks for letting me know! Then I'll merge into master.