Re: the peak width parameters in XCMS

oliverververver commented 3 years ago

Hello,

I have a simple question about one of the parameters for feature extraction using XCMS, the peak width. Based on the description, the min and max peak width represent the narrowest and widest peaks that can be accepted in the subsequent feature extraction. So if this parameter is set to (5, 60), the number of features should be higher than when it is set to (5, 20), with all other parameters the same. However, the results sometimes contradict my expectation. I am wondering whether this parameter is used not only to filter the putative features by their peak widths but also somewhere else that I do not know of. I would appreciate it if you could enlighten me on this issue. Many thanks.

Best,

Felix

jorainer commented 3 years ago

You cannot take this parameter as a hard cutoff for peaks. In fact, centWave uses this parameter to select the scales for the continuous wavelet transform. I usually select a peakwidth that is not too wide (e.g. 2,20 instead of 2,50) and that is roughly the width of the expected peaks in the data (you can still find peaks wider than the upper peak width).
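To illustrate why peakwidth is not a hard cutoff, here is a minimal base-R sketch of how a peak-width range (in seconds) could be translated into wavelet scales (in scans). The function name `peakwidth_to_scales` and the rule "scale = half the peak width in scans, stepping by 2" are assumptions made for illustration, loosely modelled on centWave. They are not the xcms API, and the actual implementation may differ.

```r
# Hypothetical helper, not part of xcms: map a peak-width range (seconds)
# to a set of wavelet scales (in scans).
peakwidth_to_scales <- function(peakwidth, scan_interval) {
    stopifnot(length(peakwidth) == 2, scan_interval > 0)
    # Assumption: a scale is roughly half the peak width expressed in scans.
    scalerange <- round((peakwidth / scan_interval) / 2)
    # A scale of 0 scans is meaningless, so drop it.
    scalerange <- scalerange[scalerange > 0]
    if (length(scalerange) == 0)
        stop("no usable scales; check the peak width")
    if (length(scalerange) == 1)
        return(scalerange)
    seq(from = scalerange[1], to = scalerange[2], by = 2)
}

# One scan per second, peakwidth = c(2, 20)
peakwidth_to_scales(c(2, 20), scan_interval = 1)

# Three scans per second with a very small lower bound
peakwidth_to_scales(c(0.3, 30), scan_interval = 1 / 3)
```

Under these assumptions, the range only decides which scales are tried, and a lower bound that rounds to zero scans is dropped altogether. Changing the range therefore changes what the transform looks for, rather than filtering finished peaks by their width.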

It takes a little trial and error to select the best settings for your data, but don't try to overoptimize. I suggest checking random peaks in the data to see if they look OK, or checking how the settings performed on e.g. internal standards or other compounds you know are in the data.

gmhhope commented 3 years ago

[Image: XIC plot, glucose_185.0758298476_185.0765701524rtime_0_250-ppm_2]

I actually have an internal standard C13-glucose peak which is super narrow (maybe from 1-4). And I still cannot get this peak picked by centWave (probably too narrow?).

Question

Do you have any suggestions? In this experiment, we are running a relatively short LC method (5 min).

gmhhope commented 3 years ago

A test on how peak width range affects feature number

I have done an expanded test to understand how the minimum and maximum peak width affect the feature number (this is a 5-min HILICpos LC method; with high-resolution scans on an Orbitrap, we have ~3 scans/second). The pattern looks, as described by Oliver, counterintuitive. The main findings I can see are:
- The major factor is the minimum peak width, while changing the maximum peak width doesn't affect much. Of note, I have already set extendLengthMSW = TRUE.
- A minimum peak width <= 1 results in a dramatic reduction in feature number and thus also affects "subset" alignment.
- A minimum peak width of 2 gives a moderate number of features, yet it somehow affects the number of peak groups used for peak group alignment (also see #578).
- The others look much better and similar with a minimum peak width >= 3, no matter whether the maximum peak width is 20, 25 or 30.

Question

Why is there a watershed-like change in performance around a minimum peak width of ~2?
Based on this, which peak width range would you choose based on your experience?

output_name | peakwidth_min | peakwidth_max | peak_group4RTalign | featureNum_noGroupFiltering | Alignment_warnings
-- | -- | -- | -- | -- | --
run_Aug-16-2021-17-47_minp0.5_maxp30_noQstd | 0.5 | 30 | 444 | 13471 | Fitted retention time deviation curves exceed points by more than 2x
run_Aug-16-2021-16-27_minp1_maxp30_noQstd | 1 | 30 | 444 | 13471 | Fitted retention time deviation curves exceed points by more than 2x
run_Aug-16-2021-22-39_minp0.3_maxp30_noQstd | 0.3 | 30 | 33 | no_table_output | Too few peak groups for 'loess', reverting to linear method
run_Aug-16-2021-23-56_minp1_maxp60_noQstd | 1 | 60 | 450 | 14460 | Fitted retention time deviation curves exceed points by more than 2x
run_Aug-16-2021-23-44_minp2_maxp30_noQstd | 2 | 30 | 772 | 29307 | yes
run_Aug-16-2021-23-55_minp3_maxp30_noQstd | 3 | 30 | 1016 | 29200 | no
run_Aug-16-2021-23-55_minp4_maxp30_noQstd | 4 | 30 | 1104 | 28892 | no
run_Aug-16-2021-23-55_minp5_maxp30_noQstd | 5 | 30 | 1130 | 28379 | no
run_Aug-17-2021-14-01_minp2_maxp20_noQstd | 2 | 20 | 755 | 29569 | Adjusted retention times had to be re-adjusted for some files to ensure them being in the same order than the raw retention times. A call to 'dropAdjustedRtime' might thus fail to restore retention times of chromatographic peaks to their original values. Eventually consider to increase the value of the 'span' parameter.
run_Aug-17-2021-14-02_minp3_maxp20_noQstd | 3 | 20 | 1101 | 29624 | no
run_Aug-17-2021-14-01_minp5_maxp20_noQstd | 5 | 20 | 1101 | 28470 | no
run_Aug-17-2021-14-01_minp2_maxp20_noQstd | 2 | 25 | 755 | 29560 | Adjusted retention times had to be re-adjusted for some files to ensure them being in the same order than the raw retention times. A call to 'dropAdjustedRtime' might thus fail to restore retention times of chromatographic peaks to their original values. Eventually consider to increase the value of the 'span' parameter.
run_Aug-17-2021-14-01_minp5_maxp25_noQstd | 5 | 25 | 1120 | 28463 | no

jorainer commented 3 years ago

I'm a little puzzled by the very few data points you have in the plot(... , type = "XIC") above. It looks like you have a single high-intensity signal rather than a real chromatographic peak. Could you increase the m/z range a little to see if you get more peaks?

In general, any peak detection algorithm will have a hard time reliably detecting a peak if it has only a few data points available. Also, for centWave you might check the prefilter parameter: if you have so few higher-intensity signals per chromatographic peak, you might reduce that parameter.
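As a rough illustration of the prefilter idea, the base-R sketch below checks whether a mass trace has at least k data points with intensity >= I, which is how prefilter = c(k, I) is described for centWave. The helper `passes_prefilter` and the intensity values are made up for this example and are not part of xcms.

```r
# Hypothetical helper, not part of xcms: does a mass trace contain at least
# k data points with intensity >= I (the idea behind prefilter = c(k, I))?
passes_prefilter <- function(intensity, prefilter = c(3, 100)) {
    k <- prefilter[1]
    I <- prefilter[2]
    sum(intensity >= I, na.rm = TRUE) >= k
}

# Made-up trace: one intense point surrounded by low signal
trace <- c(20, 35, 5000, 40, 15)
passes_prefilter(trace)                         # FALSE: only 1 point >= 100
passes_prefilter(trace, prefilter = c(1, 100))  # TRUE
```

A trace like this one, with a single intense point, would be discarded before peak detection with k = 3, which is why lowering the first prefilter value may help for very narrow peaks.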

dahezhao commented 1 year ago

Hello, I have met the same issue. The output of detected chromatographic peaks depends on the peak width setting.

I tested one compound with the following commands:

mzr <- c(392.7, 393.7)
rtr <- c(480, 2520)
chr_raw <- chromatogram(raw_data, mz = mzr, rt = rtr)
cwp <- CentWaveParam(peakwidth = c(0.001, 8.000))
xdata <- findChromPeaks(chr_raw, param = cwp)
chromPeaks(xdata)

I can get two peaks.

When I set peakwidth = c(0.001, 7.000) or peakwidth = c(0.001, 6.000), I also get the two peaks; but when the maximum peakwidth is 9.000 or more, or 5.000 or less, I do not get any.

When peakwidth = c(0.01, 6.000) or peakwidth = c(0.1, 6.000), I also get the two peaks; but when the minimum peakwidth is 1, I get only one peak.

My questions are: Which setting do you think is better? Which setting could be used for other compounds in the same experiment? Should I optimize other arguments to make the settings robust for other compounds? If I want to detect as many peaks as possible, are there other functions or arguments I should use or optimize?

Thanks a lot.

Dahe Zhao

jorainer commented 1 year ago

I would suggest using peak width settings that are reasonable for your data. I usually set the lower value to about half the width of some peaks I checked beforehand in the data, and the upper value to something like 2 to 4 times the average width. I don't think it's good to use a value of 0.001 as the lower peak width. This lower value should make sense: I guess you would not trust any peak that consists of fewer than 3 data points, so the minimal peak width could be the difference in retention time between 3 consecutive spectra (i.e. rt of the third - rt of the first). I'm not saying you should use that value; again, I suggest using half of the generally observed average peak width.
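The rule of thumb above can be sketched in base R as follows. The helper `suggest_peakwidth`, its `upper_factor` argument (default 3, chosen from the 2 to 4 range mentioned) and the example numbers are assumptions for illustration only. In practice the retention times would come from your own data and the widths from peaks you inspected by eye.

```r
# Hypothetical helper, not part of xcms: suggest a peakwidth range from
# the retention times of the spectra and some measured peak widths.
suggest_peakwidth <- function(rtime, observed_widths, upper_factor = 3) {
    stopifnot(length(rtime) >= 3, length(observed_widths) >= 1)
    # Time spanned by 3 consecutive spectra (rt of the third - rt of the first)
    three_scan_span <- median(diff(sort(rtime), lag = 2))
    avg_width <- mean(observed_widths)
    list(
        three_scan_span = three_scan_span,
        # Lower value: half the average width; upper: a multiple of it
        peakwidth = c(avg_width / 2, avg_width * upper_factor)
    )
}

# Made-up example: 3 spectra per second, peak widths of 4, 6 and 8 seconds
rtime <- seq(0, 60, by = 1 / 3)
suggest_peakwidth(rtime, observed_widths = c(4, 6, 8))
```

The `three_scan_span` value is only a sanity check: a lower peak width below it would describe peaks with fewer than 3 data points.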

sneumann / xcms