sneumann / xcms

This is the git repository matching the Bioconductor package xcms: LC/MS and GC/MS Data Analysis
Other
178 stars 81 forks source link

combineSpectra questions #575

Closed cbroeckl closed 2 years ago

cbroeckl commented 2 years ago

I have been playing with some DIA data, trying to reconstruct filtered MS/MS spectra. I walked through the MS/MS DDA tutorial, and have played with some of my own data.

Question 1: Is there a way to subtract spectra? i.e. is there a function to subtract the signal from MS/MS spectrum A from MS/MS spectrum B? Alternatively, take a ratio?

Question 2: The combineSpectra function does not seem to behaving as i would expect. In particular the intensityFun aggregation. I noticed this clearly on some in-house data, but rather than try to share that, you can you use the tutorial data. Once you get to the combineSpectra function for the 5 DDA spectra:

ex_spectrum.sum <- combineSpectra(ex_spectra, FUN = combinePeaks, ppm = 20, peaks = "intersect", minProp = 0.8, intensityFun = sum, mzFun = median, f = ex_spectra$peak_id)

ex_spectrum.mean <- combineSpectra(ex_spectra, FUN = combinePeaks, ppm = 20, peaks = "intersect", minProp = 0.8, intensityFun = mean, mzFun = median, f = ex_spectra$peak_id)

In theory, these two spectra, sum and mean, should be identical except for some scaling factor based on n = 5. However, they are not. There are small differences in the relative abundances.

image image

Look at the ratio of the two highest-mass ions, for example.

It is considerably worse in my in-house data, which is using waters SONAR - many more spectra are being combined here: image top is 'sum' and bottom is 'mean'

Any idea why this is the case?

Thanks! Corey


> sessionInfo()
R version 4.1.0 (2021-05-18)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19042)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252    LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C                           LC_TIME=English_United States.1252    

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] magrittr_2.0.1      pander_0.6.4        Spectra_1.2.0       BiocStyle_2.20.2    gridExtra_2.3       ggplot2_3.3.5      
 [7] RAMClustR_1.2.1     xcms_3.14.0         MSnbase_2.18.0      ProtGenerics_1.24.0 S4Vectors_0.30.0    mzR_2.26.1         
[13] Rcpp_1.0.7          Biobase_2.52.0      BiocGenerics_0.38.0 BiocParallel_1.26.1

loaded via a namespace (and not attached):
  [1] colorspace_2.0-2            ellipsis_0.3.2              class_7.3-19                dynamicTreeCut_1.63-1      
  [5] rprojroot_2.0.2             XVector_0.32.0              GenomicRanges_1.44.0        fs_1.5.0                   
  [9] clue_0.3-59                 rstudioapi_0.13             proxy_0.4-26                farver_2.1.0               
 [13] remotes_2.4.0               affyio_1.62.0               fansi_0.5.0                 codetools_0.2-18           
 [17] ncdf4_1.17                  doParallel_1.0.16           cachem_1.0.5                impute_1.66.0              
 [21] robustbase_0.93-8           knitr_1.33                  pkgload_1.2.1               jsonlite_1.7.2             
 [25] cluster_2.1.2               vsn_3.60.0                  BiocManager_1.30.16         compiler_4.1.0             
 [29] assertthat_0.2.1            Matrix_1.3-4                fastmap_1.1.0               limma_3.48.1               
 [33] cli_3.0.1                   htmltools_0.5.1.1           prettyunits_1.1.1           tools_4.1.0                
 [37] gtable_0.3.0                glue_1.4.2                  GenomeInfoDbData_1.2.6      affy_1.70.0                
 [41] RANN_2.6.1                  dplyr_1.0.7                 MALDIquant_1.19.3           vctrs_0.3.8                
 [45] preprocessCore_1.54.0       iterators_1.0.13            xfun_0.24                   fastcluster_1.2.3          
 [49] ps_1.6.0                    testthat_3.0.4              enviPat_2.4                 lifecycle_1.0.0            
 [53] gtools_3.9.2                devtools_2.4.2              XML_3.99-0.6                DEoptimR_1.0-9             
 [57] zlibbioc_1.38.0             MASS_7.3-54                 scales_1.1.1                pcaMethods_1.84.0          
 [61] MatrixGenerics_1.4.0        SummarizedExperiment_1.22.0 MassSpecWavelet_1.58.0      RColorBrewer_1.1-2         
 [65] yaml_2.2.1                  curl_4.3.2                  memoise_2.0.0               desc_1.3.0                 
 [69] foreach_1.5.1               e1071_1.7-7                 caTools_1.18.2              pkgbuild_1.2.0             
 [73] GenomeInfoDb_1.28.1         rlang_0.4.11                pkgconfig_2.0.3             matrixStats_0.60.0         
 [77] bitops_1.0-7                mzID_1.30.0                 evaluate_0.14               lattice_0.20-44            
 [81] InterpretMSSpectrum_1.2     purrr_0.3.4                 labeling_0.4.2              bit_4.0.4                  
 [85] processx_3.5.2              tidyselect_1.1.1            plyr_1.8.6                  R6_2.5.0                   
 [89] IRanges_2.26.0              gplots_3.1.1                generics_0.1.0              DelayedArray_0.18.0        
 [93] DBI_1.1.1                   pillar_1.6.2                withr_2.4.2                 MsCoreUtils_1.4.0          
 [97] RCurl_1.98-1.3              tibble_3.1.3                crayon_1.4.1                KernSmooth_2.23-20         
[101] utf8_1.2.2                  rmarkdown_2.9               usethis_2.0.1               grid_4.1.0                 
[105] callr_3.7.0                 digest_0.6.27               ff_4.0.4                    Rdisop_1.52.0              
[109] munsell_0.5.0               sessioninfo_1.1.1    
cbroeckl commented 2 years ago

One more comment. I was using combineSpectra from msnbase, not xcms. I am assuming they share code though!

jorainer commented 2 years ago

Hm, the problem is that both MSnbase and Spectra define a combineSpectra function - that actually work slightly different... to be really on the safe side to use the more powerful function from the Spectra package you could call it directly with Spectra::combineSpectra - otherwise it will depend on the order you load the packages which one will be used.

Next question, with the tutorial you mean this one, right? Just to be really sure which code to look for and where to add fixes or additional features... (again, I would prefer to do this in the Spectra package, which is also used in the LC-MS/MS xcms vignette)

cbroeckl commented 2 years ago

@jorainer - my apologies. i realized i was using two different functions late in the game. i had been using msnbase, observed the discrepancy between the performance of sum and mean on my data, then went to try to make a reproducible example from the Vignette. I received error messages trying to use the Spectra::combineSpectra function having parsed data with MSnBase, so i will need to back up to do this more correctly.

you did identify the correct vignette though. When you get to this code snippet: ex_spectrum <- combineSpectra(ex_spectra, FUN = combinePeaks, ppm = 20, peaks = "intersect", minProp = 0.8, intensityFun = median, mzFun = median, f = ex_spectra$peak_id)

just replace 'intensityFun = median' with 'intensityFun = mean' or 'intensityFun = sum'

More later, but the mean and sum from Spectra::combineSpectra do look slightly different in their relative abundances. maybe this is just rounding? I will have to see whether the larger differences that i observed on my own data are observed using the Spectra::combineSpectra function.

cbroeckl commented 2 years ago

Also - question does remain as to whether there are functions for subtracting a spectrum from another. Alternatively, is it possible to align spectra without combining them? I could then take the @mz and @intensity data from each, do whatever math i want, and recreate a spectra-class object. This way i would still be using the Spectra alignment tools, at least.

cbroeckl commented 2 years ago

@jorainer - i am a bit stuck. I can't seem to figure out how to get Spectra::combineSpectra to work using the filtering i am using, though i can get the MSnbase::combineSpectra to work. I use the SWATH demo data as an example:


> ## swath demo data to demonstrate issue
> swath_file <- system.file("TripleTOF-SWATH",
+                           "PestMix1_SWATH.mzML",
+                           package = "msdata")
> swath_data <- MSnbase::readMSData(swath_file, mode = "onDisk")
> swath_data <- Spectra::filterRt(swath_data, rt = c(310, 312))
> table(Spectra::msLevel(swath_data))

 1  2 
 2 18 
> 
> ms2 <- Spectra::filterMsLevel(swath_data, 2)
> ms2
MSn experiment data ("OnDiskMSnExp")
Object size in memory: 0.04 Mb
- - - Spectra data - - -
 MS level(s): 2 
 Number of spectra: 18 
 MSn retention times: 5:10 - 5:12 minutes
- - - Processing information - - -
Data loaded [Fri Jul 30 13:49:22 2021] 
Filter: select retention time [310-312] and MS level(s), 1 2 [Fri Jul 30 13:49:22 2021] 
Filter: select MS level(s) 2. [Fri Jul 30 13:49:22 2021] 
 MSnbase version: 2.18.0 
- - - Meta data  - - -
phenoData
  rowNames: PestMix1_SWATH.mzML
  varLabels: sampleNames
  varMetadata: labelDescription
Loaded from:
  PestMix1_SWATH.mzML 
protocolData: none
featureData
  featureNames: F1.S3100 F1.S3101 ... F1.S3119 (18 total)
  fvarLabels: fileIdx spIdx ... spectrum (35 total)
  fvarMetadata: labelDescription
experimentData: use 'experimentData(object)'
> 
> ## the combineSpectra function from the Spectra package doesn't work, apparently
> ms2.cmb <- Spectra::combineSpectra(ms2)
Error in Spectra::combineSpectra(ms2) : 
  length of 'f' and 'p' have to match length of 'x'
> 
> ## defining f and p factor vectors generates a new error message - not sure 
> ## what this signifies though.  
> ms2.cmb <- Spectra::combineSpectra(ms2[1:18],
+                                    f = as.factor(rep("A", 18)),
+                                    p = as.factor(rep("A", 18))
+ )
Error in (function (classes, fdef, mtable)  : 
  unable to find an inherited method for function ‘isReadOnly’ for signature ‘"character"’
> 
> 
> ## this is what i see when applying the MSnbase::combineSpectra - same as i observed
> ## for my internal data - vastly different spectra, when i would expect different
> ## (by a factor of nSpectra) y-axis values only. 
> ms2.cmb.mean <- MSnbase::combineSpectra(ms2, ppm = 25, intensityFun = mean)
> ms2.cmb.sum  <- MSnbase::combineSpectra(ms2, ppm = 25, intensityFun = sum)
> 
> pl.mean <- plot(ms2.cmb.mean[[1]], plot = FALSE) + ggtitle("mean intensity")
> pl.sum <-  plot(ms2.cmb.sum[[1]], plot = FALSE) + ggtitle("sum intensity")
> gridExtra::grid.arrange(pl.sum, pl.mean, nrow = 2)

image

I do realize that this may be better reported at MSnbase as it stands. However, I do want to figure out whether i see different behavior using the Spectra package, and would appreciate some guidance. i am sure this is a relatively simple issue. Thanks. Corey

jorainer commented 2 years ago

Thanks for the use case @cbroeckl ! I'll try to follow it and fix the errors below. Firstly, you can not apply the functions from the Spectra package if you read the data with the MSnbase readMSData function. You should either import them directly using Spectra or use, as in the xcms LC-MS/MS vignette, return.type = "Spectra" in the e.g. chromPeakSpectra function from xcms. That way you have the data as a Spectra object and you can use the Spectra::combineSpectra - if you have the data imported with MSnbase and have it as an MSnExp object or MSpectra, you can only use the MSnbase::combineSpectra function. If possible, I would use in future wherever possible the Spectra objects, as we have more powerful functions for them - and will most likely not add or update any functionality to MSnbase (unless its a bug fix).

jorainer commented 2 years ago

That said, I try to run your workflow above with Spectra:

library(Spectra)
swath_file <- system.file("TripleTOF-SWATH", "PestMix1_SWATH.mzML", package = "msdata")
swath_data <- Spectra(swath_file)
swath_data <- filterRt(swath_data, rt = c(310, 312))
ms2 <- filterMsLevel(swath_data)
ms2
MSn data (Spectra) with 20 spectra in a MsBackendMzR backend:
      msLevel     rtime scanIndex
    <integer> <numeric> <integer>
1           2   310.070      3100
2           2   310.167      3101
...       ...       ...       ...
19          2   311.870      3118
20          2   311.967      3119
 ... 33 more variables/columns.

file(s):
PestMix1_SWATH.mzML
Processing:
 Filter: select retention time [310..312] on MS level(s) 2 1 [Mon Aug  2 12:40:57 2021]
 Filter: select MS level(s)  [Mon Aug  2 12:41:21 2021] 

OK, and now let's use combineSpectra grouping all peaks from all spectra with each other that haven an m/z difference smaller 20ppm. We use either sum or mean to aggregate the intensities.

ms2_sum <- combineSpectra(ms2, intensityFun = sum, ppm = 20)
ms2_mean <- combineSpectra(ms2, intensityFun = mean, ppm = 20)

## Plot the results
par(mfrow = c(2, 1), mar = c(4, 4.3, 1, 0.5))
plotSpectra(ms2_sum)
grid()
plotSpectra(ms2_mean)
grid()
Untitled

to me that looks correct, all the main peaks seem to be there and the sum has higher intensities than the mean.

jorainer commented 2 years ago

Now, regarding your question with the difference of intensities... that should be doable. With intensityFun we can actually define any function we want to aggregate the intensities of the peaks from different spectra that are matched (based on their m/z). It's just important that a single value is returned for each grouped peak.

Just using diff would not work because of that reason:

## Subset to only 2 spectra:
ms2_2 <- ms2[1:2]
ms2_2diff <- combineSpectra(ms2_2, intensityFun = diff, ppm = 20)
Backend of the input object is read-only, will change that to an 'MsBackendDataFrame'
Error in vapply(X = X, FUN = FUN, FUN.VALUE = NA_real_, ..., USE.NAMES = USE.NAMES) : 
  values must be length 1,
 but FUN(X[[1]]) result is length 0

We could maybe create our own function that subtracts the intensities from each other and returns NA if there is a single intensity value (i.e. peak only found in one spectrum) or there are more than 2 intensity values (e.g. if two peaks from one spectrum and one or more from the other where grouped).

mydiff <- function(x) {
    if (length(x) == 2)
        x[2] - x[1]
    else NA_real_
}
ms2_2diff <- combineSpectra(ms2_2, intensityFun = mydiff, ppm = 20)

plotSpectra(ms2_2diff, ylim = range(intensity(ms2_2diff)[[1L]], na.rm = TRUE))

Note: there seems to be a bug in plotSpectra that does not expect negative intensity values - that's why we need to specify the ylim ourselfs.

Untitled

You see, in theory combineSpectra should be flexible enough to do what you suggest. Also, maybe have a look at the documentation of ?combinePeaks which is the function used internally to do the actual combining.

cbroeckl commented 2 years ago

@jorainer

Many thanks - this is a great series of responses. A few replies:

  1. i appreciate the guidance on MSnbase vs Spectra - i find it difficult to keep up with all the developments, and this helps me to refocus efforts where they should be focused.
  2. Thanks for the guidance on combineSpectra custom functions and i will look more into the combinePeaks documentation for clarity.
  3. I may be misunderstanding the details of 'sum' and 'mean' as they are being applied, but in my mind the two spectra for sum and mean should look identical with the exception of the y-axis values - sum being n-times higher than mean, with n spectra. The fact that there is a difference between sum and mean suggests to me that if there are n spectra, the mean is being calculated pairwise. That is, take the mean of spectra 1 and 2, then the mean of (mean1,2) and spectrum 3, then the mean of spectrum(1,2,3) and spectrum 4, etc. THis would probably result in a different result for each case where using the same n spectra, but in different order, the resulting spectrum differs. I can't say that this is a major problem, but i could see how it generates vastly different results in a somewhat random manner, which is rather less than intuitive. I will play with it some more, using the Spectra functions, and see what i can find.
    Thanks again.
jorainer commented 2 years ago

1) I know, that's tricky. I'll try to update the vignettes (like for xcms) whenever new functionality becomes available. Most development efforts go now into Spectra and alike but in the end I try to integrate as much as possible with the old packages as possible (or have a way to convert old data objects into the new data structures to allow usage of the new code).

2) if something is not working we could implement alternative functions too - we would just need a nice use case.

3) yes, the sum and mean spectra should identical, but only if the individual spectra are also identical. For the example above it seems that some peaks where present in most spectra, but some not, thus the mean and sum resulted in differences. What the function does is pretty simple: it gets all peaks from all spectra and orders them by m/z. Then it groups all peaks (based on m/z similarity considering the provided tolerance and ppm) and aggregates the m/z and intensity values in each such defined peak group. So, there is no pairwise calculation in there, so the result should be correct (I believe).

cbroeckl commented 2 years ago

Follow up on (3). So if we have three spectra, and each spectra has m/z = 195.088, the signal intensity for it is calculated as the sum of the signal intensities / three. However, if the 195.088 is present in only two of the three spectra, then the sum is calculated and divided by two, as there were only two measured values. Accurate?

I had been thinking of the divisor for the mean as always being the number of original spectra. I think you are treating the absent mass/intensity as a missing value to be ignored, while i was thinking about it as a 'zero' value intensity at that mass. If we treat missing values as zeros, then the divisor is always the number of spectra.

I am not sure whether one is more appropriate than the other - i imagine that it depends on the context, to some extent.

jorainer commented 2 years ago

correct, the way it is implemented we only take the mean on the present values and if in one or more spectra the peak was not present this is not included in the calculation. We could think of an alternative implementation if this would be important for you. I will however definitely update the documentation to explain that better. Thanks Corey!

cbroeckl commented 2 years ago

Thanks @jorainer - i think it is fine as is for now - if i need the mean calcluated otherwise i can always write the function to divide the sum by nSpectra for each mass. I just misunderstood the calculations.