sneumann / xcms

This is the git repository matching the Bioconductor package xcms: LC/MS and GC/MS Data Analysis

Fix peak mz #606

Closed jorainer closed 2 years ago

jorainer commented 2 years ago

This PR changes the way centWave calculates the m/z of a peak if, within the ROI, there is more than one mass peak per spectrum. In theory this should not happen for well-centroided data, but in practice it sometimes does.

The default was to simply calculate the mean m/z across all mass peaks per spectrum (retention time) within the ROI. This PR replaces that plain mean with an intensity-weighted mean, which should fix issue #590.
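To illustrate the difference, here is a toy base-R sketch (not the actual xcms internals): three mass peaks within one spectrum of an ROI, where one clearly dominates in intensity.

```r
## Toy example (not xcms code): three mass peaks in one spectrum
mz   <- c(200.090, 200.100, 200.130)   # m/z values; 200.100 is the "real" peak
into <- c(50, 5000, 50)                # intensities
mean(mz)                  # plain mean, pulled away by the low-intensity neighbours
weighted.mean(mz, into)   # intensity-weighted mean, close to 200.100
```

With the plain mean the two weak neighbours shift the reported m/z; the intensity-weighted mean stays essentially at the dominant mass peak.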

jorainer commented 2 years ago

@sneumann happy to discuss this further. @stanstrup , would be happy to know if this indeed fixes your issue #590.

sneumann commented 2 years ago

I would love to have a script snippet I can run across a number of files that produces a summary (mean, median, quantiles, possibly RMSD) with/without the change, plus a plot of the ROIs with both before/after m/z values drawn as lines for the "worst" changing ones. I am happy to run that on many of our instrument files. Yours, Steffen

jorainer commented 2 years ago

I'll update this PR to (temporarily) allow switching between the old and new m/z calculation. Then it should be straightforward to compare/evaluate.

jorainer commented 2 years ago

@sneumann I have now added the possibility to choose the m/z calculation function (global option MZFUN). An example workflow showing how the impact can be evaluated:

Load the test data and define the centWave parameters. Note: I've used ppm = 100 here to actually see an effect on the present data set; the larger the ppm parameter, the larger the impact of the weighted mean.

library(xcms)
## Use the first Sciex test file shipped with the msdata package
fls <- dir(system.file("sciex", package = "msdata"), full.names = TRUE)[1L]

data <- readMSData(fls, mode = "onDisk")
## Smooth and centroid the profile-mode data
data <- data |>
    smooth(method = "SavitzkyGolay", halfWindowSize = 6) |>
    pickPeaks()
cwp <- CentWaveParam(peakwidth = c(2, 10), ppm = 100, integrate = 2)

Run default centWave:

options(MZFUN = "getMZ")
data_old <- findChromPeaks(data, cwp)

Use the new weighted-average m/z calculation:

options(MZFUN = "getWeightedMZ")
data_new <- findChromPeaks(data, cwp)

Identify peaks with the largest impact:

mzdiff <- chromPeaks(data_old)[, "mz"] - chromPeaks(data_new)[, "mz"]

sum(mzdiff != 0)   # number of peaks with a changed m/z
idx <- head(order(abs(mzdiff), decreasing = TRUE))   # largest absolute differences
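To produce the kind of summary Steffen asked for (mean, median, quantiles, RMSD), one could, for example, express the differences in ppm; a sketch based on the mzdiff vector defined above:

```r
## Express the m/z shifts in ppm relative to the original values
ppm_shift <- 1e6 * mzdiff / chromPeaks(data_old)[, "mz"]
summary(ppm_shift)                                    # mean, median, quartiles
quantile(abs(ppm_shift), probs = c(0.5, 0.9, 0.99))   # tail of the absolute shifts
sqrt(mean(ppm_shift^2))                               # RMSD in ppm
```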

Compare the effect: plot the peak with type "XIC" and indicate the peak's m/z value based on the original calculation (red) and new one (blue).

i <- 1L

a <- data_old |>
    filterRt(chromPeaks(data_old)[idx[i], c("rtmin", "rtmax")] + c(-2, 2)) |>
    filterMz(chromPeaks(data_old)[idx[i], c("mzmin", "mzmax")] + c(-0.02, 0.02))

plot(a, type = "XIC")
abline(h = chromPeaks(data_old)[idx[i], c("mz")], col = "red")
abline(h = chromPeaks(data_new)[idx[i], c("mz")], col = "blue", lty = 2)
[plot A: XIC of the selected peak with the original (red) and weighted-mean (blue) m/z]

With the standard mean, the m/z of the peak is actually quite off (horizontal red line above). The blue dashed line represents the calculation using the weighted mean (closer to the real m/z values). Also, the peak region (red rectangle) seems to be too wide. The same plot, with the peak region now also defined based on the weighted mean:

b <- data_new |>
    filterRt(chromPeaks(data_new)[idx[i], c("rtmin", "rtmax")] + c(-2, 2)) |>
    filterMz(chromPeaks(data_new)[idx[i], c("mzmin", "mzmax")] + c(-0.02, 0.02))

plot(b, type = "XIC")
abline(h = chromPeaks(data_old)[idx[i], c("mz")], col = "red")
abline(h = chromPeaks(data_new)[idx[i], c("mz")], col = "blue", lty = 2)
[plot B: the same XIC with the peak region defined by the weighted-mean calculation]

The defined peak region is also better in this case.

So, summarizing: I think we should switch to the weighted mean m/z calculation. For nicely centroided data there should not be big differences, but with less well centroided data or a larger ppm the weighted mean should yield more accurate results.

jorainer commented 2 years ago

Just pinging you again @sneumann and @stanstrup - maybe you find the time to have a look into this ;)

stanstrup commented 2 years ago

Yes. I am so embarrassed for not getting back to you. I have just been slammed with running a course. But I am looking at it right now. Unfortunately I didn't keep track of my problematic file so I am trying to run an entire analysis on a dataset for the mass range for that example compound. Takes a while pulling the data at home. But I should be able to get back to you in a few days with something. I think your example already shows that it improves results.

Now that I think of it I will also try to find the orbitrap data we used for the artifact filter. I would be interested in seeing if this is a less aggressive solution to that problem too.

stanstrup commented 2 years ago

It looks to me like this is a very significant improvement. Here is an example run at 200 ppm where I have put the theoretical mass of the compound in green:

Old algorithm's windows: [image]

New algorithm's windows: [image]

The new algorithm gets it spot on and has a much more precise window.

I then checked across the entire dataset and it clearly shows that the new algorithm gets the mass right every time and the old one doesn't.

[image]

If I lower the ppm to something more normal like 30 ppm then I get practically identical results for this data with old and new algo but it really depends on the noise structure.

I took a look at the orbitrap file we use for the FFT filter example. I needed to increase the ppm to 50 ppm to get a difference for this peak, but here the old algorithm would be ~40 ppm off due to the noise. I imagine that with the new algorithm you might not need the heavy-handed FFT artifact filter but could do something faster with peak picking followed by filterChromPeaks (something like group by scpos and mz and keep max(maxo)) before grouping.

[image]

So, as far as I can see, I cannot think of a scenario where the weighted mean is not advantageous.

jorainer commented 2 years ago

Great! Thanks Jan for this extensive evaluation! Maybe if @sneumann could also have a look at it, we could get this merged in time for it to be included in the next BioC release?

sneumann commented 2 years ago

Hi, I now got around to checking ~17000 mzML files with our normal parameters (QTOF, 25 ppm) and found 34 files with differences; everything else was unchanged. Interestingly, all of them had just one differing peak; 25 had a negative difference and 10 a positive one. So the fix seems to work without major side effects.
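For reference, a batch run like this could be scripted roughly as follows. This is a hedged sketch only, reusing the MZFUN option and a CentWaveParam object cwp as in the examples above, and assuming both runs detect the same set of peaks per file (which was the case here):

```r
## Hypothetical helper: m/z differences (new - old) for a single mzML file
compare_file <- function(f, cwp) {
    x <- readMSData(f, mode = "onDisk")
    options(MZFUN = "getMZ")                       # original plain-mean m/z
    mz_old <- chromPeaks(findChromPeaks(x, cwp))[, "mz"]
    options(MZFUN = "getWeightedMZ")               # intensity-weighted m/z
    mz_new <- chromPeaks(findChromPeaks(x, cwp))[, "mz"]
    mz_new - mz_old
}
## e.g. res <- lapply(mzml_files, compare_file, cwp = cwp)
```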

[two screenshots from 2022-04-18]