Closed WallFacerLR closed 5 months ago
That's a good question. It was implemented that way in the original implementation - I would maybe also implement it in the way you suggested. For TOF instruments (that have a m/z dependent measurement error) having a fixed m/z bin size is indeed not ideal, but I also don't think it makes a huge difference. IMHO, by considering both m/z and rt it should be unlikely that signal from different compounds will be grouped together.
That's a good question. It was implemented that way in the original implementation - I would maybe also implement it in the way you suggested. For TOF instruments (that have a m/z dependent measurement error) having a fixed m/z bin size is indeed not ideal, but I also don't think it makes a huge difference. IMHO, by considering both m/z and rt it should be unlikely that signal from different compounds will be grouped together.
OK, I agree that doesn't matter in most conditions. We just traced this problems when develop some methods requiring extreme resolution and accuracy. In these conditions, the fixed windows lead to wrong peak assignment. We could implement this strategy in out method. And for general using, maybe you know more about which one is more suitable
Hi, the grou pdensity goes back to 2006 for triple quad instruments. We found that only relative ppm would require high ppm tolerance for small masses to accommodate rounding issues, and only absolute deviation would require large values for instrument with m/z dependent errors and higher masses. In MetFrag we used mzabs + mzppm as approach to avoid that, and the boeckerlab similarly used min(absolute, relative) errors as similar approach. Yours, Steffen
Ah, last two answers crossing while traveling the network :-) At some stage we might've pondered to run grouping twice with different m/z errors and combining the low mass peak groups from one and high masses from the other run. Not a solution, but maybe a workaround. Yours, Steffen
Ah, last two answers crossing while traveling the network :-) At some stage we might've pondered to run grouping twice with different m/z errors and combining the low mass peak groups from one and high masses from the other run. Not a solution, but maybe a workaround. Yours, Steffen
Thanks, that's helpful suggestion.
I'll also check if it would be possible to include an additional ppm
parameter to have m/z dependent bin sizes.
Seems it should be relatively simple to add a parameter ppm
and support m/z dependent bin (slice) sizes. Would be great if you could evaluate and test it once I've added it @WallFacerLR
Seems it should be relatively simple to add a parameter
ppm
and support m/z dependent bin (slice) sizes. Would be great if you could evaluate and test it once I've added it @WallFacerLR
Thanks, that's great. I'm looking forward to see it in future update.
Firstly, I made a PR to the MsCoreUtils
package (https://github.com/rformassspectrometry/MsCoreUtils/pull/121) to allow creation of m/z dependent bin sizes. In brief: the size of the bins will increase with the values (based on the ppm
of the m/z value). This setup can then also be used in xcms
to define the bins along the m/z axis for the peak density-based correspondence.
The functionality is now implemented in the peak_density_ppm branch, in brief, the PeakDensityParam
gains a parameter ppm
with the default ppm = 0
, which is equivalent to the original approach (i.e. fixed bin sizes along m/z). Setting a ppm
larger than 0 will create m/z dependent bin sizes (increasing with m/z hence better representing the measurement precision/error of e.g. TOF instruments.
An evaluation on one of our data sets (~600) samples:
Performing first the original correspondence:
param <- PeakDensityParam(sampleGroups = sampleData(nafld)$sample_type,
minFraction = 0.3, binSize = 0.015, bw = 2)
nafld_a <- groupChromPeaks(nafld, param = param)
nrow(featureDefinitions(nafld_a))
[1] 11133
And then with the new setup enabling m/z dependent bins.
param <- PeakDensityParam(sampleGroups = sampleData(nafld)$sample_type,
minFraction = 0.3, binSize = 0.01, bw = 2,
ppm = 10)
nafld_b <- groupChromPeaks(nafld, param = param)
nrow(featureDefinitions(nafld_b))
[1] 11733
With the new approach we define slightly more features.
Next we compare the number of peaks per feature:
a <- lengths(featureDefinitions(nafld_a)$peakidx)
quantile(a)
0% 25% 50% 75% 100%
8 123 179 269 1093
b <- lengths(featureDefinitions(nafld_b)$peakidx)
quantile(b)
0% 25% 50% 75% 100%
8 124 181 276 1086
We have slightly more peaks per feature with the m/z dependent bin sizes.
Next we just compare the rt - m/z of the defined features to evaluate whether about the same features were defined. We plot the rt - m/z value of each feature, for the fixed-sized bins in red and the m/z relative bins in blue.
plot(featureDefinitions(nafld_a)$rtmed, featureDefinitions(nafld_a)$mzmed,
pch = 16, col = "#ff000020", xlab = "rtime",
ylab = "m/z", cex = 0.5)
points(featureDefinitions(nafld_b)$rtmed, featureDefinitions(nafld_b)$mzmed,
pch = 16, col = "#0000ff20", cex = 0.5)
grid()
Mostly the same features were defined (purple colored points). Features defined only with the m/z-relative bin sizes are mostly found in the higher m/z range (blue points).
We next compare the m/z width of features between the two methods (m/z max - m/z min for each feature). a is for the fixed-sized bin approach, b for the m/z relative.
a <- featureDefinitions(nafld_a)$mzmax - featureDefinitions(nafld_a)$mzmin
quantile(a)
0% 25% 50% 75% 100%
0.0007820136 0.0101199285 0.0132723343 0.0144444768 0.0149955668
b <- featureDefinitions(nafld_b)$mzmax - featureDefinitions(nafld_b)$mzmin
quantile(b)
0% 25% 50% 75% 100%
0.0007820136 0.0105889578 0.0154410666 0.0201860302 0.0299283892
The m/z widths are thus larger for the m/z relative bin size setup. The m/z width for the constant bin size is capped at the defined binSize
value (0.015). We next plot also the relationship between m/z width and (median) m/z for each feature.
plot(featureDefinitions(nafld_a)$mzmed, a, xlab = "m/z", ylab = "m/z width",
pch = 16, col = "#ff000020", ylim = range(b), cex = 0.5)
points(featureDefinitions(nafld_b)$mzmed, b, pch = 16,
col = "#0000ff20", cex = 0.5)
grid()
This clearly shows the impact of the m/z-dependent size of the m/z bins. The m/z width for the features defined with a constant binSize = 0.015
(red points) is capped at this binSize
, while the m/z width for the m/z dependent bin size (binSize = 0.01
, ppm = 10
) increases with m/z. We did not check the impact on individual features yet, but the new approach seems more appropriate for e.g. TOF instruments.
UPDATE: in this evaluation the ppm
was not used correctly and hence too large m/z bins were defined. In the updated code the max m/z width is 0.02 (instead of 0.03) which corresponds to 0.01 + MsCoreUtils::ppm(1000, ppm = 10)
.
While IMHO the setup that considers m/z-dependent bin sizes makes sense it would be good to get some feedback/thoughts also on your data @sneumann @WallFacerLR
Actually, with the current implementation the bin sizes are too large. The way how it's implemented in xcms
I need to use ppm / 2
in the definition of the bins (because the function defined features always in two overlapping bins, and also binSize / 2
is used to determine the fixed size of the bin.
This is now fixed. With this setting, the maximal m/z width of features in the example above is also 0.02
(instead of 0.03
), i.e. binSize + MsCoreUtils::ppm(1000, 10)
, with 1000 being the max observed m/z value. With the used binSize
of 0.01 this results in a max m/z width of 0.02.
The updated plot from above:
Thanks for so quickly and efficiently solve this problem! The introduce of ppm is highly effective and perform well in our data.
Here I introduce part of our data. These data were acquired in Orbitrap instrument with 140000 resolution. We have some isotope-induced peaks with same RT and close mz, such as 581.0317 and 581.0336 below.
xcms.spectra.581 <-xcms.spectra%>%
filterMzRange(c(581, 581.05))%>%
filterRt(c(1020, 1060))
xcms.sp.data <- get_Spectra_data(xcms.spectra.581,
var = c("rtime","dataOrigin"))%>%
dplyr::mutate(dataOrigin = str_extract(dataOrigin,"Con|Glu"))
plot_spectra(xcms.sp.data)
ggplot(xcms.sp.data)+
geom_point(aes(x = rtime, y = mz ,col = dataOrigin ))
To discriminate them, we have to set binSize down to 0.002
featureDefinitions(xcms.binsize.0.002)%>%as.data.frame()%>%
dplyr::filter(mzmed <581.04,mzmed >581.03)
feature_id mzmed mzmin mzmax rtmed rtmin rtmax
FT6569 FT6569 581.0316659 581.0315778 581.0317051 1045.777481 1045.681240 1046.093026
FT6570 FT6570 581.0335896 581.0335061 581.0337977 1046.287969 1045.568267 1046.941344
but this rising other problem, too small binSize will increase the probability of wrong split of mz distribution. This happened when the bin boundary occasionally located at the normal mz range.
xcms.fdf.wrong <- featureDefinitions(xcms.binsize.0.002)%>%
as.data.frame()%>%
dplyr::arrange(mzmed)%>%
dplyr::mutate(group = MSdev:::cluster_ion(mzmed,rtmed,ppm = 5,rt.tol = 5))%>%
dplyr::group_by(group)%>%
dplyr::filter(n()>1)
options(digits = 10)
head(as.data.frame(xcms.fdf.wrong[,1:5]))
feature_id mzmed mzmin mzmax rtmed rtmin
1 FT0449 114.0241581 114.0241231 114.0242203 653.3663455 634.0284877
2 FT0450 114.0242626 114.0242397 114.0243065 655.5413471 634.3243351
3 FT0723 128.0811219 128.0809830 128.0812224 753.7084733 705.1643165
4 FT0724 128.0812941 128.0812309 128.0813383 754.1380020 752.1392251
5 FT1115 145.0950593 145.0941234 145.0951800 1062.2346730 1061.5156002
6 FT1116 145.0952629 145.0952316 145.0959241 1062.1155157 1061.3682617
nrow(xcms.fdf.wrong)
100
FT0449 and FT0450 are single peaks but splited to two features, a rough stat found 100 features maybe wrongly splited
xcms.spectra.114 <-xcms.spectra%>%
filterMzRange(c(114.0235, 114.025))%>%
filterRt(c(634.02849, 667.1469))
xcms.sp.data <- get_Spectra_data(xcms.spectra.114,
var = c("rtime","dataOrigin"))%>%
dplyr::mutate(dataOrigin = str_extract(dataOrigin,"Con|Glu"))
plot_spectra(xcms.sp.data)+
geom_vline(xintercept = c(114.0241231, 114.0242203,
114.0242397, 114.0243065))
After add ppm, this has been greatly improve, the above peaks 581s are correctly assigned when ppm = 2. Meanwhile, the wrong features number decrease to 24 (This evaluation are not rigorous, the ppm to estimate two close feature are wrongly splited are set as 5 when binSize but 2 when our new ppm param. strictly, it should be accordant with binsize. IMHO, this will lead to underestimate binSize condition) and correctly assign ions 114.0242397
featureDefinitions(xcms.binsize.0.002)%>%as.data.frame()%>%
dplyr::filter(mzmed <581.04,mzmed >581.03)
feature_id mzmed mzmin mzmax rtmed rtmin rtmax
FT6569 FT6569 581.0316659 581.0315778 581.0317051 1045.777481 1045.681240 1046.093026
FT6570 FT6570 581.0335896 581.0335061 581.0337977 1046.287969 1045.568267 1046.941344
featureDefinitions(xcms.ppm.5)%>%as.data.frame()%>%
dplyr::filter(mzmed <581.04,mzmed >581.03)
mzmed mzmin mzmax rtmed rtmin rtmax npeaks Sample
FT6373 581.0335061 581.0315778 581.0337977 1046.093026 1045.568267 1046.941344 9 6
featureDefinitions(xcms.ppm.2)%>%as.data.frame()%>%
dplyr::filter(mzmed <581.04,mzmed >581.03)
mzmed mzmin mzmax rtmed rtmin rtmax npeaks Sample
FT6479 581.0316659 581.0315778 581.0317051 1045.777481 1045.681240 1046.093026 3 3
FT6480 581.0335896 581.0335061 581.0337977 1046.287969 1045.568267 1046.941344 6 3
xcms.fdf.wrong.ppm2 <- featureDefinitions(xcms.ppm.2)%>%
as.data.frame()%>%
dplyr::arrange(mzmed)%>%
dplyr::mutate(group = MSdev:::cluster_ion(mzmed,rtmed,ppm = 2,rt.tol = 5))%>%
dplyr::group_by(group)%>%
dplyr::filter(n()>1)
nrow(xcms.fdf.wrong.ppm2)
24
featureDefinitions(xcms.ppm.2)%>%
as.data.frame()%>%
dplyr::filter(mzmed <114.025,mzmed >114.02)
mzmed mzmin mzmax rtmed rtmin rtmax npeaks Sample
FT0453 114.0242397 114.0241231 114.0243065 654.2255115 634.0284877 707.0990032 21 6
Hope these will be helpful for you
Thanks for the information @WallFacerLR !
Retrieved the source code in
do_groupChromPeaks_density
This function process peaks with fixedbinSize; numeric(1) defining the size of the overlapping slices in mz dimension.
As code show, it split mz only based on mz range and binSize and then process in each slice.But the width of mz distribution is related to absolute mz value (such as 20 ppm), not a fixed value. The same problem exist in
do_groupChromPeaks_nearest
with paramabsMz
Why not slice mz in a mz value dependant manner