Problem with centWave peak detection on peaks with discontinuous signal

sneumann / xcms

This is the git repository matching the Bioconductor package xcms: LC/MS and GC/MS Data Analysis

Other

181 stars 80 forks source link

Problem with centWave peak detection on peaks with discontinuous signal #315

Closed jorainer closed 5 years ago

jorainer commented 5 years ago

This is @zizekw original comment in issue #290:

@jotsetung, thanks for the insight within this thread, I've been able to learn a lot about some issues I was observing in my own work.

Building off of what you described, I had performed CentWave peak detection and upon batch visualizing dozens of runs, I noticed I had some discontinuous lines on my chromatograms as was shown exactly in above examples. However, going a little deeper, this discontinuity in my data is actually impacting my peak detection. Calling clean on my individual Chromatogram objects with na.rm = TRUE and then re-calling findChromPeaks solves my issue, but are you aware of any way to remove these NA values prior to an initial call of findChromPeaks? I would ideally like to minimize the amount of calls to findChromPeaks and just detect these peaks outright since I am dealing with large amounts of data. Calling clean on my MSnExp object does not seem to solve things.

My CentWaveParam settings work for ~95% of my peaks (hundreds of runs), but upon visualization with the aid of highlightChromPeaks, the discontinuous peaks can not be detected no matter how much fiddling I did with the parameters. Calling clean and saving as a new Chromatogram object and then re-calling findChromPeaks which does indeed detect the peaks this second time around. However, I am returned a matrix object and can not now utilize my existing custom batch plotting functions (including the use of highlightChromPeaks) as before. Further, because I have to call it per file, I would have to add extra annotation steps to reincorporate these separately detected peaks with the rest of the experimental data. Let me know if I can provide any further information for context or clarify my comments.

Apologies...is this question better suited for MSnbase @lgatto?

Originally posted by @zizekw in https://github.com/sneumann/xcms/issues/290#issuecomment-422902773

jorainer commented 5 years ago

@zizekw, I'm a little lost here: so you run centWave on an OnDiskMSnExp, for detected peaks you extract the XIC as a Chromatogram and you then re-run centWave on that?

Note that centWave on OnDiskMSnExp and centWave on Chromatogram objects are slightly different: the background signal detection for OnDiskMSnExp objects uses in most cases the full data slice (full retention time range), while the one on Chromatogram objects has to guess the background signal on only the signal it has within the Chromatogram object.

would be really nice if you could provide some examples here

lgatto commented 5 years ago

@jotsetung - feel free to tag me here if necessary.

zizekw commented 5 years ago

I have tailored my workflow to try and match that of the workflow as described in the vignette. Note that I am not a developer and my coding skills are rudimentary, I am more of an "application" user. Note also that my group is not using LC/GC for separation but are still using centroid TOF-MS data.

The current CentWave parameters I use are suitable for ~95% of my data. The following issues I describe are not present in the majority of my data. The following images represent when things do not work properly - likely due to the scans from when the data was acquired (NAs that are present?). This issue is present in my data for several compounds among different runs.

When I perform peak detection on all of my data, I simultaneously just wrote a simple function to batch plot all my runs for Internal Standards to make sure everything is behaving before proceeding.

Here is an example of an XIC of a problematic run:

example 3

For this compound peak detection went awry:

example 2

And without calling HighlightChromPeaks:

example 1

So downstream of course these peaks are missing from the experiment as a result of this. I am unable to resolve this issue by smoothing, calling clean on the data when I initially read it in to R, or adjusting my CentWave parameters.

However in the referenced thread, I read @jotsetung's comment and tried out something he described. So when I save the detected peaks thus far as a chromatogram object and call clean with na.rm = TRUE and re-plot the data I see the following:

example 4

This is a lot better clearly, but further to this it matches what I observe for the same data file within Agilent MassHunter. Awesome. Now again thanks to Johannes' comment, I called findChromPeaks with the same CentWave parameters as before and can now detect the peaks that previously were missed. (Plot not shown because this second peak detection stores the result as a matrix)

My question is that is it possible to perform this clearing of NA values (which I'm assuming are the suspect at this point?) at an earlier step so that I can just perform one call of findChromPeaks versus calling it twice as described above? To reiterate , calling clean on my OnDiskMSnExp did not have any impact versus the successful effect calling on the Chromatogram object.

Sorry for the initial confusion, hope this is a little more clear. Let me know if I can expand, thanks for the support. Really impressive work by the teams for xcms, MSnbase and others!

A note that might be relevant is that I do have a trimMZ step to filter out extreme points in the data and did notice getting several messages similar to this when plotting problematic XICs: Warning messages: 1: In trimMz_Spectrum(object, mzlim = mz, msLevel. = msLevel., ... : No data points between 141.1395 and 141.1475 for spectrum with acquisition number 650341. Returning empty spectrum.

jorainer commented 5 years ago

Thanks for the clarification! Some thoughts:

The scattering of m/z values in consecutive scans is quite large in the XIC plot above. I'd suggest that you use a relatively large ppm in centWave (something around ppm = 30, the scattering in the plot above is ~ 28ppm).
Along the lines - we have Sciex data and centroided our files with Proteowizard. The results were not that good and we had also lots of peaks with discontinuous signal, because the centroids had such a large scattering in m/z. Our solution: export the mzML in profile mode and do the centroiding using MSnbase: specifically, we are now first smoothing the signal in retention time dimension (using a moving window approach to average the signal from 3 consecutive spectra; this can be done with the combineSpectraMovingWindow function in MSnbase), then we smooth the data in m/z (per spectrum) with a Savitzky Golay filter (smooth function in MSnbase), and perform then the centroiding with MSnbase's pickPeaks function where we in addition refine the centroid's m/z by an intensity-weighted mean (parameters refineMz = "descendPeak", signalPercentage = 33). This is done on a per-file-basis and we save then the centroided data to an mzML file with writeMSData. You can also have a look at https://jotsetung.github.io/metabolomics2018/xcms-preprocessing.html for some information on the centroiding (the Rmd can be found here https://github.com/jotsetung/metabolomics2018).
Regarding the clean on the Chromatogram, I agree, the data looks better afterwards but is in my eyes not representing the raw data. Removing the NAs basically connects simply all data points, and the signal that is missing in the one spectrum get's imputed/interpolated. I have a bad gut feeling about that. Actually, in the respective spectrum there is a signal present, it's just at a m/z outside of the m/z range you are looking at. That's why I would rather suggest to increase the ppm to tell centWave to include also signal into the region of interest (where it looks for a peak) that is a little outside - hoping that it can detect the peak then.
The clean,OnDiskMSnExp is not working as you would like it to. clean,Chromatogram removes NAs in the data within the chromatogram. It doesn't know about the signal that is outside of the m/z range of the chromatogram you extracted. Effectively, removing NAs in the Chromatogram is like removing a spectrum from your data! And that's why it is not working on the OnDiskMSnExp, to get the same result, the clean would have to remove the whole spectrum, which it does not, because the spectrum contains data, just not for the m/z range you're looking at in your extracted chromatogram.

I hope I could clarify some things a little bit. If everything fails, you could still use the fillChromPeaks function to rescue the signal for the missed/failed peak detections...

zizekw commented 5 years ago

Thanks for taking the time for such a detailed response @jotsetung.

Spot on. When I ran IPO it suggested ppm = 27.5 but I have been using ppm = 30 lately.
That's interesting. I utilize Proteowizard for conversion from .d to .mzXML. Initially I was centroiding my data there as well but I wasn't fully sure what/how it was doing the process after poking around in some documentation. What I currently do is that with Agilent's data you can actually just go in and delete MSProfile.bin and it's the exact same result of centroiding using Proteowizard. Thanks for detailing the process for what your team does, I likely will try something similar as I feel that might have a positive impact on this issue. I am curious, is that process amenable for scaling to large metabolomics studies? Profile data of course has a lot large file sizes than centroid so that is why I initially had hesitations about doing something similar. Further to this, I was unable to apply combineSpectraMovingWindow to my data as it seemed to be quite computationally intensive. Yes, I wasn't able to attend that conference but my PI in fact attended and I did a lot of my learning about xcms over the past few months based on that .Rmd!
Fair point. For our purposes, this example would likely be acceptable, but I can imagine there might be cases where this would go quite wrong. It seems like more of a "band-aid" fix versus the above approach.
Ah, makes sense.
I definitely agree about the potential utility that fillChromPeaks would have in this instance. I wasn't able to get grouping to work in some early attempts, possibly due to the different nature of my data versus LC or GC separation. As you can see from the representative XIC, each of those peaks belong to the same compound so when I was grouping between several runs things went awry. If you have any suggestions for either getting grouping to work or customizing fillChromPeaks for my data, I would be happy to test them out as this is a pretty high priority for my lab. Right now undergraduate students manually integrate these peaks one at a time in MassHunter which is...not an ideal approach lol.

Learning more and more about xcms everyday, thanks for your ongoing support. The vignettes and this page have been tremendous resources.

zizekw commented 5 years ago

@jotsetung Things are looking much better and are getting a lot more computationally intensive haha. Here's some raw profile data with the same example data:

example profile

I then filtered RT and m/z specifically for the compound before calling combineSpectraMovingWindow:

example pickpeaks-descendpeak refinement on combinespectramovingwindow

Now, an average file size of one of my profile .mzXML data files is ~12 GB. In terms of workflow, I read in each of my files (did around 30) separately and tried running the following code snippet per data file before calling writeMSData.

temp <- get(file_name) %>% #combineSpectraMovingWindow() %>% smooth(method = "SavitzkyGolay") %>% pickPeaks(refineMz = "descendPeak", signalPercentage = 33)

Error: cannot allocate vector of size 5.1 Mb

Now, I have 16 GB of RAM (confirmed by memory.limit()) and am running 64-bit R. Based on some research I am gathering that this error message is due to lack of RAM availability when trying to call combineSpectraMovingWindow on a full data file with the solution being to get more RAM, can you confirm this? If simply buying more RAM will help resolve this it can be done but I wanted to get some more input. Initially, I thought I was getting this error because I was calling readMSData on all of my experiment files but I appear to get it even with individual files. Let me know what you think!

jorainer commented 5 years ago

Sorry for my late reply - have been busy but now I try to catch up - first with the previous post:

your last point in the previous post, regarding getting grouping to work: the plotChromPeakDensity function can be quite helpful there. You could use that and tune your bw setting so that the peaks are included into the same feature. Note however that you will have problems if other compounds do not generate such signals. Then, if you use featureValues you would like to use the option method = "sum" which sums the signal of all peaks assigned to a feature within the same sample - the default is, I think, to select only the abundance of the largest peak. Now, if that is no option for you, you would have to combine features intensities manually afterwards, e.g. if they are closer in retention time than a certain cutoff. There is no function (yet) in xcms to do that.

Now, regarding the feasibility of centroiding in R (which relates also to your other post): it should be doable. I'm about to do this on 10000 files and yes, the amount of disk space required is large - I'll need 50TB for the profile mzML data (but I will not keep them after centroiding). I'll use the following R script:

## Script to perform centroiding of profile-mode mzML files from a
## folder.

## Directory where the profile mzML files can be found (can contain sub-folders
## etc.
in_dir <- "~/data/test_data/"

## Pattern in the full file path from the original input files that should be
## replaced with "path_replace".
path_pattern <- "/data/"
path_replace <- "/tmp/"

## Log directory. Timings and output from the individual processes will be
## stored there.
log_dir <- "~/log/centroiding/"

## Number of CPUs to use in parallel. Takes by default the number of nodes
## specified for the slurm job.
ncores <- as.integer(Sys.getenv("SLURM_JOB_CPUS_PER_NODE", 4))

## Script starts here.
library(MSnbase)
library(BatchJobs)
register(bpstart(MulticoreParam(ncores, log = TRUE, logdir = log_dir)))

if (!dir.exists(log_dir))
    dir.create(log_dir)

fls_in <- dir(in_dir, pattern = "mzML", full.names = TRUE, recursive = TRUE)
fls_out <- sub(pattern = path_pattern, replacement = path_replace, fls_in)

## Remove those that do already exist.
fls_exst <- file.exists(fls_out)
fls_in <- fls_in[!fls_exst]

centroid_one_file <- function(z, pattern, replacement, fixed = TRUE) {
    outf <- sub(pattern = pattern, replacement = replacement, z, fixed = fixed)
    outd <- dirname(outf)
    if (!dir.exists(outd))
        dir.create(outd, recursive = TRUE)
    tmp <- combineSpectraMovingWindow(
        readMSData(z, mode = "inMem", msLevel = 1)
    )
    suppressWarnings(
        tmp <- pickPeaks(smooth(tmp, method = "SavitzkyGolay",
                                halfWindowSize = 6L),
                         refineMz = "descendPeak", signalPercentage = 33)
    )
    writeMSData(tmp, file = outf, copy = TRUE)
}

bplapply(fls_in, centroid_one_file, pattern = path_pattern,
         replacement = path_replace)

It essentially checks the input directory in_dir and finds all mzML files there, then checks if they do already exist in the output directory (simply in_dir with pattern path_replace replaced by path_replace) and removes them in case. Then it runs ncores calls in parallel, where I first read the profile mode mzML (into memory), combine the spectra, smooth the signal, do the centroiding and export the mzML file again. I use 8GB memory per job and that suffices for my files.

To check if you would need more memory, simply read one of your files with readMSData and mode = "inMem" and then call combineSpectraMovingWindow (or actually, run the full centroiding on one file). If you don't run into problems there you should have enough memory.

SiggiSmara commented 5 years ago

Just a note on the memory question, and I don't know if this is true in your case @zizekw but I have in the past run into memory problems when running xcms / MSnbase within RStudio that are not seen when R is run directly from the command line. But you should get a good feeling for the memory you are using if you follow the suggestion of @jotsetung and load one file in memory. Before running any other code if you look at how much memory R uses when one file is read with the "inMem" mode will give you a pretty good idea if this is the problem or not.

zizekw commented 5 years ago

@SiggiSmara True, good point. I will try that suggestion out and hope that is helps, thanks!

@jotsetung Super helpful, I appreciate you sharing that thanks. It's a lot cleaner compared to what I was attempting! Unfortunately the result is the same...

Error: cannot allocate vector of size 4.0 Mb

It seems simply calling memory.size() with any extraneous applications closed on my workstation and just RStudio running returned ~ 500 Mb.

Then, when I simply read in one .mzXML file (10 GB), memory.size() returns ~12-13 GB (is this expected?). Either way, a subsequent call of combineSpectraMovingWindow() then likely pushes the memory over the top (memory.limit() = 16268) resulting in that error message. Not an expert but this is what I suspect must be causing this error...

I think this might be a good excuse to submit a request to my supervisor for a more (likely over) powerful workstation that I can run Linux on versus Windows. I am gathering that Linux would also allow for some more speed in other aspects of the workflow.

All of this support has likely resolved my initial issue, as soon as I can get my hands on some more RAM and confirm this was the source of the error I think we can close this issue. Thanks again for the support.

jorainer commented 5 years ago

Yes, the size of one 10GB mzML/XML file can well require about 12-13GB of memory, remember that the intensity values in the mzML file format are usually compressed, while in memory they will be represented by floating point numbers.

Regarding Windows vs Linus - I would always go for Linux. More memory can be addressed and mapped by Linux and is simply more scalable.

zizekw commented 5 years ago

Still waiting to get my hands on a more powerful workstation. I managed to steal an extra 4 GB of RAM (20 GB total) but still hit the error.

Error: cannot allocate vector of size 3.0 Mb

Hopefully will have some solid state memory + a stronger CPU alongside a healthy amount of RAM sometime soon and will update further.

zizekw commented 5 years ago

Hi @jotsetung,

Apologies for the long overdue follow-up.

First of all, I think this issue can technically be closed as it is clear what the issue is and it is on my end: over-centroiding of the data. When centroiding, essentially too much of the signal is removed for some compounds that it impairs the quality of my data. I've learned a lot about xcms and related packages from trying to troubleshoot this, however I still need to resolve this so I would appreciate any insight despite narrowing down the root cause.

I now have a much better CPU and 32 GB of RAM with some solid state memory. Not an ideal hardware setup, but trending in the right direction. Also, I am now working with TOF-MS data versus QTOF-MS - so my profile .mzXML files are now ~3.5 GB each versus a burdensome ~12 GB each lol. I have now had the opportunity to evaluate your above suggestions on an experiment level versus just one data file.

I follow the workflows and guidance from the available vignettes + presentations. So I begin by centroiding and smoothing all my data (m/z dimension) for an experiment as per the above suggestions. Unfortunately, I can't smooth in the retention time dimension using combineSpectraMovingWindow() due to it being too computationally intensive for my workstation and running out of RAM.

pickPeaks(smooth(temp, method = "SavitzkyGolay", halfWindowSize = 3L, polynomialOrder = 6L))

Then, I detect peaks using CentWave.

CentWaveParam(ppm = 20, peakwidth = c(5,35), snthresh = 1, prefilter = c(2, 500), integrate = 2L, mzdiff = 0.002, fitgauss = FALSE, noise = 100, verboseColumns = TRUE)

In some cases I get perfect results: F-Phe

However, a common problem I run into with a significant amount of compounds is the following:

Upon investigating, the likely cause of this might not be the CentWave peak detection but in fact the initial centroiding.

Here is the XIC of the centroided file for the above data:

And the XIC of the profile file for the same data:

Any thoughts on the above? Specifically, do you have any advice on how to reduce the significant data loss of centroiding? Would running refineMz with "descendPeak", signalPercentage = 33 assist with this particular issue or does that only impact the reported m/z values? Otherwise the last-resort approach that remains is working with the profile data and using MatchedFilter detection - but I'd like to avoid that if possible.

I appreciate any further assistance and please let me know if I can clarify anything. Thanks!

P.S. The centroid data files by Agilent for this same data are even worse! So I'd definitely recommend centroiding using xcms and related packages versus trusting Agilent for future reference.

jorainer commented 5 years ago

Hi @zizekw ! Nice comprehensive description! And good to hear that also for Agilent centroiding can be improved.

Firstly, the signal intensity of the problematic peak is relatively low - so its quite common that you will have scattery signal there.

refineMz might actually help a little. Yes, it is just refining the m/z of the centroid based on an intensity-weighted average, but this can bring the individual data points for a peak closer together which helps centWave in the peak detection. In other words, in the XIC of the centroided data of the ~ 1200 signal intensity peaks: the m/z values for some of the signal of a peak are at ~ 166.170 and some are at ~ 166.165. refineMz can help to bring these closer together in m/z and the chance that centWave uses these in the peak detection is more likely. In addition, you could also increase the ppm in your centWave settings.

A note on running out of memory in centroiding. I would always do the centroiding file-by-file and export the results again as an mzML file. The script in https://github.com/EuracBiomedicalResearch/batch_centroid might help, I used this to centroid 25,000 profile-mode mzML files on our cluster, one file at a time (actually always 12 in parallel).

zizekw commented 5 years ago

Hi @jorainer, thanks again for all your feedback. It's been a slow process but I think I have everything worked out now for the graduate students in my lab from your assistance!

My last question on this matter would be the following: in cases like I describe above where CentWave misses a minor % of authentic peaks and without using the grouping functionality with fillChromPeaks(), is there a recommended way to do targeted "manual" detection where I can re-perform detection for troublesome parts of the data and merge with the existing peaks xcms object?

As an example, I was looking at the Insert Peak Function described in this publication so that if I had a target compound mz value I could rerun detection and merge the found peak on an as needed basis. Otherwise, would peaksWithCentWave() have utility in this case?

Let me know what you think! Please feel free to close this issue and attribute it to over-centroiding by Agilent. Thanks again for all your help and ongoing amazing work.

jorainer commented 5 years ago

There is no such functionality (yet) to merge/combine peak detection results - buuut, that sounds like a potential nice new functionality. Please add a new issue with this feature request - I will then work on that when I find some spare time.

zizekw commented 5 years ago

Hi @jorainer, just wanted to let you know please feel free to close this issue (you created it on my behalf so I can't).

We discussed above about how it seemed low intensity signals in my data were being "over-centroided".

Profile mode: Pregabalin_profile

Default centroiding parameters: Pregabalin_cent_default

Ultimately I found modulating the halfWindowSize for pickPeaks() to be most beneficial. Here's the same compound but with halfWindowSize = 1L.

Pregabalin_cent_halfwindow1

Not sure if this is the best solution to my problem but it certainly appears to work! Thank you for your help to date.