sneumann / xcms

This is the git repository matching the Bioconductor package xcms: LC/MS and GC/MS Data Analysis
Other
177 stars 81 forks source link

why do_groupChromPeaks_density processing peaks with fixed size of the overlapping slices in mz dimension #711

Closed WallFacerLR closed 5 months ago

WallFacerLR commented 5 months ago

Retrieved the source code in do_groupChromPeaks_density This function process peaks with fixed binSize; numeric(1) defining the size of the overlapping slices in mz dimension. As code show, it split mz only based on mz range and binSize and then process in each slice.

  mass <- seq(peaks[1, "mz"], peaks[nrow(peaks), "mz"] + binSize, 
    by = binSize/2) 

  masspos <- findEqualGreaterM(peaks[, "mz"], mass)
  densFrom <- rtRange[1] - 3 * bw
  densTo <- rtRange[2] + 3 * bw
  densN <- max(512, 2 * 2^(ceiling(log2(diff(rtRange)/(bw/2)))))
  endIdx <- 0
  message("Processing ", length(mass) - 1, " mz slices ... ", 
    appendLF = FALSE)
  resL <- vector("list", (length(mass) - 2))
  for (i in seq_len(length(mass) - 2)) {
    startIdx <- masspos[i]
    endIdx <- masspos[i + 2] - 1
    if (endIdx - startIdx < 0) 
      next
    resL[[i]] <- .group_peaks_density(peaks[startIdx:endIdx, 
      , drop = FALSE], bw = bw, densFrom = densFrom, densTo = densTo, 
      densN = densN, sampleGroups = sampleGroups, sampleGroupTable = sampleGroupTable, 
      minFraction = minFraction, minSamples = minSamples, 
      maxFeatures = maxFeatures, sleep = sleep)
  }

But the width of mz distribution is related to absolute mz value (such as 20 ppm), not a fixed value. The same problem exist in do_groupChromPeaks_nearest with param absMz Why not slice mz in a mz value dependant manner

jorainer commented 5 months ago

That's a good question. It was implemented that way in the original implementation - I would maybe also implement it in the way you suggested. For TOF instruments (that have a m/z dependent measurement error) having a fixed m/z bin size is indeed not ideal, but I also don't think it makes a huge difference. IMHO, by considering both m/z and rt it should be unlikely that signal from different compounds will be grouped together.

WallFacerLR commented 5 months ago

That's a good question. It was implemented that way in the original implementation - I would maybe also implement it in the way you suggested. For TOF instruments (that have a m/z dependent measurement error) having a fixed m/z bin size is indeed not ideal, but I also don't think it makes a huge difference. IMHO, by considering both m/z and rt it should be unlikely that signal from different compounds will be grouped together.

OK, I agree that doesn't matter in most conditions. We just traced this problems when develop some methods requiring extreme resolution and accuracy. In these conditions, the fixed windows lead to wrong peak assignment. We could implement this strategy in out method. And for general using, maybe you know more about which one is more suitable

sneumann commented 5 months ago

Hi, the grou pdensity goes back to 2006 for triple quad instruments. We found that only relative ppm would require high ppm tolerance for small masses to accommodate rounding issues, and only absolute deviation would require large values for instrument with m/z dependent errors and higher masses. In MetFrag we used mzabs + mzppm as approach to avoid that, and the boeckerlab similarly used min(absolute, relative) errors as similar approach. Yours, Steffen

sneumann commented 5 months ago

Ah, last two answers crossing while traveling the network :-) At some stage we might've pondered to run grouping twice with different m/z errors and combining the low mass peak groups from one and high masses from the other run. Not a solution, but maybe a workaround. Yours, Steffen

WallFacerLR commented 5 months ago

Ah, last two answers crossing while traveling the network :-) At some stage we might've pondered to run grouping twice with different m/z errors and combining the low mass peak groups from one and high masses from the other run. Not a solution, but maybe a workaround. Yours, Steffen

Thanks, that's helpful suggestion.

jorainer commented 5 months ago

I'll also check if it would be possible to include an additional ppm parameter to have m/z dependent bin sizes.

jorainer commented 5 months ago

Seems it should be relatively simple to add a parameter ppm and support m/z dependent bin (slice) sizes. Would be great if you could evaluate and test it once I've added it @WallFacerLR

WallFacerLR commented 5 months ago

Seems it should be relatively simple to add a parameter ppm and support m/z dependent bin (slice) sizes. Would be great if you could evaluate and test it once I've added it @WallFacerLR

Thanks, that's great. I'm looking forward to see it in future update.

jorainer commented 5 months ago

Firstly, I made a PR to the MsCoreUtils package (https://github.com/rformassspectrometry/MsCoreUtils/pull/121) to allow creation of m/z dependent bin sizes. In brief: the size of the bins will increase with the values (based on the ppm of the m/z value). This setup can then also be used in xcms to define the bins along the m/z axis for the peak density-based correspondence.

jorainer commented 5 months ago

The functionality is now implemented in the peak_density_ppm branch, in brief, the PeakDensityParam gains a parameter ppm with the default ppm = 0, which is equivalent to the original approach (i.e. fixed bin sizes along m/z). Setting a ppm larger than 0 will create m/z dependent bin sizes (increasing with m/z hence better representing the measurement precision/error of e.g. TOF instruments.

jorainer commented 5 months ago

An evaluation on one of our data sets (~600) samples:

Performing first the original correspondence:

param <- PeakDensityParam(sampleGroups = sampleData(nafld)$sample_type,
    minFraction = 0.3, binSize = 0.015, bw = 2)
nafld_a <- groupChromPeaks(nafld, param = param)
nrow(featureDefinitions(nafld_a))
[1] 11133

And then with the new setup enabling m/z dependent bins.

param <- PeakDensityParam(sampleGroups = sampleData(nafld)$sample_type,
                          minFraction = 0.3, binSize = 0.01, bw = 2,
                          ppm = 10)
nafld_b <- groupChromPeaks(nafld, param = param)
nrow(featureDefinitions(nafld_b))
[1] 11733

With the new approach we define slightly more features.

Next we compare the number of peaks per feature:

a <- lengths(featureDefinitions(nafld_a)$peakidx)
quantile(a)
  0%  25%  50%  75% 100% 
   8  123  179  269 1093 
b <- lengths(featureDefinitions(nafld_b)$peakidx)
quantile(b)
  0%  25%  50%  75% 100% 
   8  124  181  276 1086 

We have slightly more peaks per feature with the m/z dependent bin sizes.

Next we just compare the rt - m/z of the defined features to evaluate whether about the same features were defined. We plot the rt - m/z value of each feature, for the fixed-sized bins in red and the m/z relative bins in blue.

plot(featureDefinitions(nafld_a)$rtmed, featureDefinitions(nafld_a)$mzmed,
     pch = 16, col = "#ff000020", xlab = "rtime",
     ylab = "m/z", cex = 0.5)
points(featureDefinitions(nafld_b)$rtmed, featureDefinitions(nafld_b)$mzmed,
       pch = 16, col = "#0000ff20", cex = 0.5)
grid()

features-rt-mz

Mostly the same features were defined (purple colored points). Features defined only with the m/z-relative bin sizes are mostly found in the higher m/z range (blue points).

We next compare the m/z width of features between the two methods (m/z max - m/z min for each feature). a is for the fixed-sized bin approach, b for the m/z relative.

a <- featureDefinitions(nafld_a)$mzmax - featureDefinitions(nafld_a)$mzmin
quantile(a)
          0%          25%          50%          75%         100% 
0.0007820136 0.0101199285 0.0132723343 0.0144444768 0.0149955668 

b <- featureDefinitions(nafld_b)$mzmax - featureDefinitions(nafld_b)$mzmin
quantile(b)
          0%          25%          50%          75%         100% 
0.0007820136 0.0105889578 0.0154410666 0.0201860302 0.0299283892 

The m/z widths are thus larger for the m/z relative bin size setup. The m/z width for the constant bin size is capped at the defined binSize value (0.015). We next plot also the relationship between m/z width and (median) m/z for each feature.

plot(featureDefinitions(nafld_a)$mzmed, a, xlab = "m/z", ylab = "m/z width",
     pch = 16, col = "#ff000020", ylim = range(b), cex = 0.5)
points(featureDefinitions(nafld_b)$mzmed, b, pch = 16,
       col = "#0000ff20", cex = 0.5)
grid()

feature-mz-mzwidth

This clearly shows the impact of the m/z-dependent size of the m/z bins. The m/z width for the features defined with a constant binSize = 0.015 (red points) is capped at this binSize, while the m/z width for the m/z dependent bin size (binSize = 0.01, ppm = 10) increases with m/z. We did not check the impact on individual features yet, but the new approach seems more appropriate for e.g. TOF instruments.

UPDATE: in this evaluation the ppm was not used correctly and hence too large m/z bins were defined. In the updated code the max m/z width is 0.02 (instead of 0.03) which corresponds to 0.01 + MsCoreUtils::ppm(1000, ppm = 10).

jorainer commented 5 months ago

While IMHO the setup that considers m/z-dependent bin sizes makes sense it would be good to get some feedback/thoughts also on your data @sneumann @WallFacerLR

jorainer commented 5 months ago

Actually, with the current implementation the bin sizes are too large. The way how it's implemented in xcms I need to use ppm / 2 in the definition of the bins (because the function defined features always in two overlapping bins, and also binSize / 2 is used to determine the fixed size of the bin.

jorainer commented 5 months ago

This is now fixed. With this setting, the maximal m/z width of features in the example above is also 0.02 (instead of 0.03), i.e. binSize + MsCoreUtils::ppm(1000, 10), with 1000 being the max observed m/z value. With the used binSize of 0.01 this results in a max m/z width of 0.02.

jorainer commented 5 months ago

The updated plot from above:

feature-mz-mzwidth

WallFacerLR commented 5 months ago

Thanks for so quickly and efficiently solve this problem! The introduce of ppm is highly effective and perform well in our data.

Here I introduce part of our data. These data were acquired in Orbitrap instrument with 140000 resolution. We have some isotope-induced peaks with same RT and close mz, such as 581.0317 and 581.0336 below.

xcms.spectra.581 <-xcms.spectra%>%
  filterMzRange(c(581, 581.05))%>%
  filterRt(c(1020, 1060))
xcms.sp.data <- get_Spectra_data(xcms.spectra.581,
                                 var = c("rtime","dataOrigin"))%>%
  dplyr::mutate(dataOrigin = str_extract(dataOrigin,"Con|Glu"))
plot_spectra(xcms.sp.data)

image


ggplot(xcms.sp.data)+
  geom_point(aes(x = rtime, y = mz ,col = dataOrigin ))

image

To discriminate them, we have to set binSize down to 0.002

featureDefinitions(xcms.binsize.0.002)%>%as.data.frame()%>%
  dplyr::filter(mzmed <581.04,mzmed >581.03)

       feature_id       mzmed       mzmin       mzmax       rtmed       rtmin       rtmax
FT6569     FT6569 581.0316659 581.0315778 581.0317051 1045.777481 1045.681240 1046.093026
FT6570     FT6570 581.0335896 581.0335061 581.0337977 1046.287969 1045.568267 1046.941344

but this rising other problem, too small binSize will increase the probability of wrong split of mz distribution. This happened when the bin boundary occasionally located at the normal mz range.

xcms.fdf.wrong <- featureDefinitions(xcms.binsize.0.002)%>%
    as.data.frame()%>%
    dplyr::arrange(mzmed)%>%
    dplyr::mutate(group = MSdev:::cluster_ion(mzmed,rtmed,ppm = 5,rt.tol = 5))%>%
    dplyr::group_by(group)%>%
    dplyr::filter(n()>1)
  options(digits = 10)
  head(as.data.frame(xcms.fdf.wrong[,1:5]))

  feature_id       mzmed       mzmin       mzmax        rtmed        rtmin
1     FT0449 114.0241581 114.0241231 114.0242203  653.3663455  634.0284877
2     FT0450 114.0242626 114.0242397 114.0243065  655.5413471  634.3243351
3     FT0723 128.0811219 128.0809830 128.0812224  753.7084733  705.1643165
4     FT0724 128.0812941 128.0812309 128.0813383  754.1380020  752.1392251
5     FT1115 145.0950593 145.0941234 145.0951800 1062.2346730 1061.5156002
6     FT1116 145.0952629 145.0952316 145.0959241 1062.1155157 1061.3682617

nrow(xcms.fdf.wrong)

100

FT0449 and FT0450 are single peaks but splited to two features, a rough stat found 100 features maybe wrongly splited

xcms.spectra.114 <-xcms.spectra%>%
    filterMzRange(c(114.0235, 114.025))%>%
    filterRt(c(634.02849, 667.1469))
  xcms.sp.data <- get_Spectra_data(xcms.spectra.114,
                                   var = c("rtime","dataOrigin"))%>%
    dplyr::mutate(dataOrigin = str_extract(dataOrigin,"Con|Glu"))
  plot_spectra(xcms.sp.data)+
    geom_vline(xintercept = c(114.0241231, 114.0242203,
                              114.0242397, 114.0243065))

image

After add ppm, this has been greatly improve, the above peaks 581s are correctly assigned when ppm = 2. Meanwhile, the wrong features number decrease to 24 (This evaluation are not rigorous, the ppm to estimate two close feature are wrongly splited are set as 5 when binSize but 2 when our new ppm param. strictly, it should be accordant with binsize. IMHO, this will lead to underestimate binSize condition) and correctly assign ions 114.0242397

featureDefinitions(xcms.binsize.0.002)%>%as.data.frame()%>%
  dplyr::filter(mzmed <581.04,mzmed >581.03)
       feature_id       mzmed       mzmin       mzmax       rtmed       rtmin       rtmax
FT6569     FT6569 581.0316659 581.0315778 581.0317051 1045.777481 1045.681240 1046.093026
FT6570     FT6570 581.0335896 581.0335061 581.0337977 1046.287969 1045.568267 1046.941344
featureDefinitions(xcms.ppm.5)%>%as.data.frame()%>%
  dplyr::filter(mzmed <581.04,mzmed >581.03)
             mzmed       mzmin       mzmax       rtmed       rtmin       rtmax npeaks Sample
FT6373 581.0335061 581.0315778 581.0337977 1046.093026 1045.568267 1046.941344      9      6
featureDefinitions(xcms.ppm.2)%>%as.data.frame()%>%
  dplyr::filter(mzmed <581.04,mzmed >581.03)
               mzmed       mzmin       mzmax       rtmed       rtmin       rtmax npeaks Sample
FT6479 581.0316659 581.0315778 581.0317051 1045.777481 1045.681240 1046.093026      3      3
FT6480 581.0335896 581.0335061 581.0337977 1046.287969 1045.568267 1046.941344      6      3

xcms.fdf.wrong.ppm2 <- featureDefinitions(xcms.ppm.2)%>%
    as.data.frame()%>%
    dplyr::arrange(mzmed)%>%
    dplyr::mutate(group = MSdev:::cluster_ion(mzmed,rtmed,ppm = 2,rt.tol  = 5))%>%
    dplyr::group_by(group)%>%
    dplyr::filter(n()>1)
  nrow(xcms.fdf.wrong.ppm2)
24

featureDefinitions(xcms.ppm.2)%>%
    as.data.frame()%>%
    dplyr::filter(mzmed <114.025,mzmed >114.02)

             mzmed       mzmin       mzmax       rtmed       rtmin       rtmax npeaks Sample
FT0453 114.0242397 114.0241231 114.0243065 654.2255115 634.0284877 707.0990032     21      6

Hope these will be helpful for you

jorainer commented 5 months ago

Thanks for the information @WallFacerLR !