sneumann / xcms

This is the git repository matching the Bioconductor package xcms: LC/MS and GC/MS Data Analysis
Other
177 stars 81 forks source link

ppm paramenter for ChromPeakAreaParam #756

Open stanstrup opened 3 days ago

stanstrup commented 3 days ago

Hi,

I was looking into the newish ChromPeakAreaParam and trying to understand how wide windows it actually creates compared to the old way. It is definitely better than the old way that would go all the way to zero. But it still goes a lot lower than I think is likely optimal. So to start with the conclusion I think it would still be advisable to have a minimum ppm window like in the old function.

So first I calculated what the windows will look like for each feature. The peak picking ppm was 30. But we see that some windows would go as low as 7 ppm for some features. From the analysis below I would argue that that could be too low.

peaks <- cbind(chromPeaks(xset_g_r_g_fill), chromPeakData(xset_g_r_g_fill)) %>% 
          as_tibble %>% 
          mutate(filled = if_else(is_filled, "filled", "not filled"))
feature_def <- featureDefinitions(xset_g_r_g_fill) %>% 
                as_tibble(rownames="feature_id") %>% 
                rename(f_mzmed = "mzmed", f_mzmin = "mzmin", f_mzmax = "mzmax", f_rtmed = "rtmed", f_rtmin = "rtmin", f_rtmax = "rtmax") %>% 
                unnest(peakidx)

peaks_sum <- peaks %>% 
                mutate(peakidx = 1:n()) %>% 
                full_join(feature_def, by = "peakidx") %>% 
                group_by(feature_id, filled) %>% 
                summarise(mz = quantile(mz, probs = 0.25, names = FALSE),
                          mzmin = quantile(mzmin, probs = 0.25, names = FALSE),
                          mzmax = quantile(mzmax, probs = 0.25, names = FALSE),
                          .groups = "drop"
                          )

peaks_sum_notfilled <- peaks_sum %>% 
  filter(filled == "not filled") %>% 
  mutate(ppm_dev = 1E6*(mzmax-mzmin)/mz)
ggplot(peaks_sum_notfilled, aes(x=ppm_dev)) +
  geom_histogram(position="identity", binwidth = 1,linewidth = NA) + 
  xlab("ppm") + ggtitle("Histogram of ppm window across matched peaks")+ 
  scale_x_continuous(breaks = seq(0,1000,10), lim = c(0,max(peaks_sum_notfilled$ppm_dev))) +
  theme_bw()

image


Then we can look at the range for the found peaks. We observe that for the peaks filled we more or less inherited the original ppm value, while the peak picked peaks go much higher, since that in the ROI definition the range can be more or less +/- the ppm setting.

ggplot(peaks, aes(x=1E6*(mzmax-mzmin)/mz)) +
  geom_histogram(position="identity", binwidth = 2,linewidth = NA) + 
  facet_grid(filled~., scales = "free_y") +
  xlab("ppm") + ggtitle("Histogram of ppm deviation for all peaks")+ 
  scale_x_continuous(breaks = seq(0,1000,10)) +
  theme_bw()

image

Finally I wanted to see how window size was correlated to intensity since I assume smaller peaks are less accurate and that gapfilling mainly will be look at small peaks.

ggplot(peaks, aes(x =into, y=1E6*(mzmax-mzmin)/mz)) +
  geom_point(size = 1, alpha = 0.5) + 
  facet_grid(filled~.) +
  xlab("intensity") + 
  ylab("ppm dev") + 
  ggtitle("Intensity, ppm deviation across matched peaks")+ 
  theme_bw()+ 
  scale_y_continuous(limits = c(0,NA))

image

My conclusion here would be that small peaks go as high as 60 ppm, but the windows we will be doing integrating in during gap-filling would be half that.


Bonus: if we set the prop to 1 the window sizes are still potentially too low.

image

So based on all this wouldn't a safely valve ppm paramenter make sense?