Analysing LC-MS files with different mass windows

pablovgd commented 2 years ago

Dear all

TL,DR: XCMS seems to have trouble processing files that contain two different scan windows, workaround (splitting files per mass range) is a time sink, anybody got an idea on how to efficiently solve this?

I'm experiencing some trouble preprocessing LC-MS runs of lipidomics analyses (using both Thermo Q-Exactive and Thermo Exploris 120 instrumentation). The lipidomics method we are running contains two scan ranges:

mass range 67 - 1000, retention time 0 to 20 minutes
mass range 1000 - 2300, retention time 4.5 to 20 minutes The Thermo .raw files are converted into .mzXML files using MsConvert, with the "vendor filter" to create centroided data and subsetted per polarity. Raw chromatogram plot in xcms looks like this:

Raw chromatogram

After running centWave peak picking, only features with a mass < 1000 and a retention time < 4.5 are retained. I've tried playing around with centWave parameters, as well as optimizing them using the IPO package, but it seems to make no difference. Code used based on the xcms vignette (just example parameters here):

param = CentWaveParam(
ppm = 6, 
peakwidth = c(5,45), 
snthresh = 5,
noise = 2000, 
prefilter = c(3,1000))

xdata <- findChromPeaks(raw_data, param = param)
mpp <- MergeNeighboringPeaksParam(expandRt = 4)
xdata_pp <- refineChromPeaks(xdata, mpp)
xdata <- xdata_pp
xdata <- adjustRtime(xdata, param = ObiwarpParam(binSize = 0.6))
pdp <- PeakDensityParam(sampleGroups = groups,
                        minFraction = 0.4, bw = 30)
xdata <- groupChromPeaks(xdata, param = pdp)
xdata <- fillChromPeaks(xdata, param = ChromPeakAreaParam())

And result for an example sample : plotChromPeaks(xdata, file = 6) result features

Code based on older xcms version, written by a collegue of mine that is implemented in our data processing pipeline, which also yields the same problem:

xset <- xcmsSet(files = filenames, ppm=ppm, mzdiff=mzwid, noise=noise, snthresh=snthresh, 
            method="centWave", fitgauss=TRUE, nSlaves=8) #todo optimize param

xraw <- getXcmsRaw(xset)
xset2 <- retcor(xset, method="obiwarp", plottype = "deviation")
xset3 <- group(xset2, method="nearest")
xset4 <- fillPeaks(xset3)

reporttab <- diffreport(xset4, filebase = paste(path_data_out, "/diffreport_", name_project, sep=""), mzdec=4, eicmax=5000, metlin = FALSE)

A current workaround is splitting up the .raw files per polarity and mass range using a tool called "Recaloffline", but this is manual work and takes a lot of time: each .raw file needs to splitted up into four .mzXML files. Working with those files does seem to work (features over the complete RT time range: E.g. negative polarity, 67-1000 range: result features working

My excuses for the wall of text, I hope my explanation was clear enough to understand the problem. Thanks in advance for your time and help.

Kind regards Pablo

sneumann commented 2 years ago

Hi Pablo, great writeup, and a few comments:

Try to move on from mzXML to mzML at some stage, that should be the way forward
Do you have any of such files available ? Maybe in some metabolomics repository ? If so, a fully reproducible example of your above script working on that file would be great. Yours, Steffen

pablovgd commented 2 years ago

Hi Pablo, great writeup, and a few comments:

Try to move on from mzXML to mzML at some stage, that should be the way forward

Do you have any of such files available ? Maybe in some metabolomics repository ? If so, a fully reproducible example of your above script working on that file would be great. Yours, Steffen

Dear Steffen

Thanks for your quick response! We will definitly look into moving towards .mzML for our coming analyses, could it perhaps be part of the issue? I guess it is worth a try converting from .raw to .mzML and trying again. I've uploaded the files, with some explanation in the description to MassIVE: https://massive.ucsd.edu/ProteoSAFe/dataset.jsp?task=e39220de2586409baf2824529dba3a68

I've added the script used to quickly test processing / make the plots below as well.

Thanks again. Pablo

#Test lipidomics LC-Ms processing 
library(xcms)
library(RColorBrewer)
library(pander)
library(magrittr)
library(pheatmap)
library(SummarizedExperiment)
library(MSnbase)
library(IPO)

#Load Files
path_data_in <- "C:/Users/pvgeende/OneDrive - UGent/Pablo Vangeenderhuysen PhD 2021-2022/Massaspectrometrie/Processing/Spliced raw files QE lipidomics 220204/negative - 67 - 1000"
path_data_out <- "C:/Users/pvgeende/OneDrive - UGent/Pablo Vangeenderhuysen PhD 2021-2022/Massaspectrometrie/Processing"
setwd(path_data_in)
datafiles <- list.files(path = path_data_in, pattern = ".mzXML")

raw_data <- readMSData(files = datafiles,
                       mode = "onDisk")

#Optimize parameters (skip for now)
peakpickingParameters <- getDefaultXcmsSetStartingParams('centWave')

resultPeakpicking <- 
  optimizeXcmsSet(files = datafiles[2:3], 
                  params = peakpickingParameters, 
                  nSlaves = 6, 
                  subdir = NULL,
                  plot = TRUE)

optimizedXcmsSetObject <- resultPeakpicking$best_settings$xset

retcorGroupParameters <- getDefaultRetGroupStartingParams()
resultRetcorGroup <- optimizeRetGroup(xset = optimizedXcmsSetObject, params = retcorGroupParameters, subdir = NULL, nSlaves = 6)

 #inspect raw data

mzs <- mz(raw_data)

## Split the list by file
mzs_by_file <- split(mzs, f = fromFile(raw_data))

bpis <- chromatogram(raw_data, aggregationFun = "mean")

plot(bpis)

tc <- split(tic(raw_data), f = fromFile(raw_data))
boxplot(tc, 
        ylab = "intensity", main = "Total ion current")

  #Peak detection
param = CentWaveParam(
ppm = 6, #nstrum setup QE orbitrap ms: 10ppm
peakwidth = c(5,45), #instrum setup QE orbitrap ms: 15sec
snthresh = 5,
noise = 2000, #toek todo eval 10+4, 10+5, 10+6 for best balance speed/sensitivity

prefilter = c(3,100))

xdata <- findChromPeaks(raw_data, param = param)
mpp <- MergeNeighboringPeaksParam(expandRt = 4)
xdata_pp <- refineChromPeaks(xdata, mpp)
xdata <- xdata_pp

# 
# Summary_fun <- function(z)
#   c(peak_count = nrow(z), rt = quantile(z[, "rtmax"] - z[, "rtmin"]))
# 
# T <- lapply(split.data.frame(
#   chromPeaks(xdata), f = chromPeaks(xdata)[, "sample"]),
#   FUN = summary_fun)
# T <- do.call(rbind, T)
# rownames(T) <- basename(fileNames(xdata))
# pandoc.table(
#   T,
#   caption = paste0("Summary statistics on identified chromatographic",
#                    " peaks. Shown are number of identified peaks per",
#                    " sample and widths/duration of chromatographic ",
#                    "peaks."))

plotChromPeaks(xdata, file = 6)

xdata <- adjustRtime(xdata, param = ObiwarpParam(binSize = 0.6))

bpis_adj <- chromatogram(xdata, aggregationFun = "max", include = "none")
par(mfrow = c(2, 1), mar = c(4.5, 4.2, 1, 0.5))
plot(bpis_adj)
## Plot also the difference of adjusted to raw retention time.
plotAdjustedRtime(xdata)

pdp <- PeakDensityParam(sampleGroups = groups,
                        minFraction = 0.4, bw = 30)
xdata <- groupChromPeaks(xdata, param = pdp)

xdata <- fillChromPeaks(xdata, param = ChromPeakAreaParam())

res <- quantify(xdata, value = "into")

setwd(path_data_out)
write.table(featureDefinitions(xdata),"resultaat_xcms_new_neg_67_2300_QE.txt",sep = "\t")

jorainer commented 2 years ago

Thanks for providing the files - I will have a look into that.

jorainer commented 2 years ago

I had a look at the data now and think I found the reason for the failed peak detection after 4.5 minutes:

Looking at the m/z ranges for the scans after 4.5 minutes:

library(xcms)
fl <- "raw/Lipid_2_neg.mzXML"
data <- readMSData(fl, mode = "onDisk")

## Get m/z per spectra
mzs <- mz(data)

## Get the m/z range for each spectrum, display only the last 6
tail(lapply(mzs, range))
$F1.S1037
[1]  70.00452 974.29834

$F1.S1038
[1] 1153.137 2181.203

$F1.S1039
[1]  68.93566 910.40228

$F1.S1040
[1] 1016.067 2219.278

$F1.S1041
[1]  69.2184 784.0697

$F1.S1042
[1] 1017.196 2027.102

I assume the problem is that there are alternating spectra for low and high mass range (i.e. one scan with peaks from the low mass window followed by scan with data from the high mass window, then again one for low mass window and so on). centWave might thus fail to define ROIs because each time it starts defining one, the next scan does not have any peaks with similar m/z range.

I currently don't see a clean/good solution to fix/support such data directly in xcms. But what would be possible is to split the data directly in R - at least there would be no manual additional step required.

## Filter into low and high mass windows
data_low <- filterMz(data, mz = c(67, 1000))
data_high <- filterMz(data, mz = c(1000, 2300))

## Need to clean out empty spectra, otherwise we get the same problem
data_low <- filterEmptySpectra(data_low)
data_high <- filterEmptySpectra(data_high)

range(rtime(data_low))
[1]    0.975917 1199.760000
range(mz(data_low))
[1]  67.00109 999.77338
range(rtime(data_high))
[1]  272.473 1200.900
range(mz(data_high))
[1] 1000.006 2299.765

param <- CentWaveParam(ppm = 6, peakwidth = c(5,45), snthresh = 5, noise = 2000, prefilter = c(3,100))
xdata_low <- findChromPeaks(data_low, param)
xdata_high <- findChromPeaks(data_high, param)

plotChromPeaks(xdata_low)

plotChromPeaks(xdata_high)

Maybe that would not be the worst solution as I can imagine that also obiwarp could have issues with the alternating low/high mass window scans.

sneumann commented 2 years ago

... and mzML should not be a cause, data-wise it should be identical to mzXML. But it is the way forward :-)

pablovgd commented 2 years ago

Dear Johannes & Steffen

Thank you both for having a look at this, we have been trying to figure this out for quite a while! I think splitting the data directly in R seems like the best option for us for now, should be easy enough to "paste" the results/features together in a table afterwards.

Kind regards Pablo

jorainer commented 2 years ago

Great! I'm closing the issue now - feel free to re-open if needed.

sneumann / xcms

Analysing LC-MS files with different mass windows #608