sneumann / xcms

This is the git repository matching the Bioconductor package xcms: LC/MS and GC/MS Data Analysis
Other
185 stars 80 forks source link

peak integration in centWave uses mz range of the ROI #135

Open jorainer opened 7 years ago

jorainer commented 7 years ago

The signal for peaks identified with centWave are integrated considering the mzrange of the ROI:

        mzrange <- c(feat$mzmin, feat$mzmax)
        eic <- .Call("getEIC", mz, int, scanindex, as.double(mzrange),
                     as.integer(sr), as.integer(length(scanindex)),
                     PACKAGE = "xcms")
        ## eic <- rawEIC(object,mzrange=mzrange,scanrange=sr)
        d <- eic$intensity

with feat being one ROI. The d variable is then later used to integrate the peak intensity:

                pwid <- (scantime[peakrange[2]] - scantime[peakrange[1]]) /
                    (peakrange[2] - peakrange[1])
                if (is.na(pwid))
                    pwid <- 1
                peaks[p, "into"] <- pwid * sum(d[lm[1]:lm[2]])

The mzrange reported in the peaks table is however refined and is in most instances not equal to the mzrange of the ROI. While the function ensures that the (refined) rtrange of the peak is correctly used (the sum(d[lm[1]:lm[2]])) line above), the values correspond still to the full mzrange of the ROI, and not of the peak.

jorainer commented 7 years ago

A reproducible example:

library(xcms)
library(msdata)
fl <- system.file("microtofq/MM14.mzML", package = "msdata")
raw <- readMSData2(fl)
tmp <- findChromPeaks(raw, param = CentWaveParam(peakwidth = c(1, 20)))

## Extract the data manually:
spctr <- spectra(tmp)
mzs <- lapply(spctr, mz)
valsPerSpect <- lengths(mzs)
scanindex <- xcms:::valueCount2ScanIndex(valsPerSpect) ## Index vector for C calls
ints <- unlist(lapply(spctr, intensity), use.names = FALSE)
mzs <- unlist(mzs, use.names = FALSE)
rtim <- rtime(tmp)

rtr <- chromPeaks(tmp)[1, c("rtmin", "rtmax")]
mzr <- chromPeaks(tmp)[1, c("mzmin", "mzmax")]
sr <- c(min(which(rtim >= rtr[1])), max(which(rtim <= rtr[2])))
eic <- .Call("getEIC", mzs, ints, scanindex,
                     as.double(mzr),
                     as.integer(sr), as.integer(length(scanindex)),
                     PACKAGE = "xcms")
peakI <- sum(eic$intensity) * ((rtr[2] - rtr[1]) / (sr[2] - sr[1]))
peakI
464.9048

chromPeaks(tmp)[1, "into"] 
537.9036 

Thus, the intensities are clearly different.

jorainer commented 7 years ago

Actually, this does affect also the peak post-processing (refinement of the min rt and max rt) as that is also performed on the same variable d.

jorainer commented 7 years ago

The standard centWave seems to be working nicely if the peak signal is integrated for the actual peak area, but, surprisingly, the centWave with addPredIsoROIs yields different results.

One reason could be that the minimum value for the mzrange can become 0 (e.g. if no signal was measured for one mz-rt pair). This causes the into to be integrated from mz being 0 up to max(mzrange) which then affects the joinOverlappingPeaks etc

jorainer commented 7 years ago

I suggest we stick for now (Bioconductor release 3.5) with the old implementation. I'll do then some more tests and compare both versions.

cbroeckl commented 6 years ago

I am working with some collaborators - they are running XCMS on a cluster. I have requested the sessionInfo() from them. I am seeing odd behavior in which I am getting 'shadow' features - separated by a fraction of an m/z value which cluster (ramclustR) with the main feature. This is xcms after feature grouping and retention time correction, but I am looking at the xset@peaks data table. If I pick one region where I see this 'shadow' feature in the output dataset you get a plot as displayed in the attached pdf.

odd_behavior.pdf

In the peak table, there is a collection of features which have no variance in mass -they are always reported at the same mass across many injections - in this example there is a line of features at 335.1 . Normally, we would expect there to be some mass error, so that the wide cluster you see at ~335.065 is more expected.
Here are some of the peaks from the distinct line at 335.1:

    mz    mzmin   mzmax       rt    rtmin    rtmax      into intb      maxo sn sample

[1,] 335.1 335.0666 335.069 202.4388 197.8425 202.4388 2505.8763 NA 692.8037 NA 351 [2,] 335.1 335.0666 335.069 200.9332 196.7415 201.7693 2803.0546 NA 932.7637 NA 743 [3,] 335.1 335.0666 335.069 199.5418 195.4836 200.6959 4369.8913 NA 2256.9473 NA 1572 [4,] 335.1 335.0666 335.069 195.5450 191.7441 195.8892 613.1164 NA 243.0317 NA 1906

This is probably a bit hard to read, but what you will notice is that the mz value in the first column is actually GREATER than the mzmax value, which I did not expect should ever happen. Here are some peaks from the 335.065 section:

       mz    mzmin    mzmax       rt    rtmin    rtmax      into      intb     maxo   sn sample

[1,] 335.0680 335.0671 335.0688 197.7636 196.3496 202.5895 237479.78 237356.40 74770.62 1711 9 [2,] 335.0676 335.0669 335.0681 198.6677 196.4094 201.3769 119672.15 117378.37 39937.69 52 645 [3,] 335.0687 335.0671 335.0698 196.8451 194.7718 198.9233 65738.36 64456.27 25053.91 34 3063

you can see that the mz value is always between mzmin and mzmax, which is the way I expected it to be.

This is happening frequently enough that I end up with two separate features, but am fairly certain, in looking at the mz range for the first table, that these represent the same data. I could certainly replace the mz value for all the offending peaks with the mean of the mzmin and mzmax, that would need to be done immediately after running centWave.

When I looked at the issue index, this came up as a possible issue, as you describe the ROI mass range being used. I am not sure if it is related, but given that the falls on a nice round mz value, I expect there may be a link. I would appreciate any guidance on whether this issue has been addressed and if so which version of XCMS we need be using. If it has not, can I expect replacing mz value that are either less than mzmin or greater than mzmax to not disrupt downstream processing? I will follow up with sessionInfo when I get it. thanks. Corey

jorainer commented 6 years ago

@cbroeckl thanks for digging into this! the code you are referring was implemented but never activated. You can activate it by setting options(originalCentWave = FALSE). centWave peak detection will then use the modified code. By changing back to options(originalCentWave = TRUE) you get the default behaviour. This should be available in xcms version 3.2.0 (the xcms version for Bioconductor release 3.7, R version 3.5).

cbroeckl commented 6 years ago

Thanks! I will try this.

Of note, it appears that all the features in the peaks table that have mz values that is out of the range of mzmin:mzmax have sn values of NA. Every one. Since sn = NA, intb = NA. However, it appears that sn = NA is not sufficient to create this issue, as about 3% of all my values that have NA sn values have mz > mzmin and mz < mzmax. The 3% of peaks which have an sn value of NA and also have an mz value in the range of mzmin:mzmax nearly universally have mz values which equal round(mz, digits = 1). I am using round(mz, digits = 1) as a surrogate for mz = ROI.... There are a few exceptions to this trend, and in those cases, the mz range is very close to the round(mz, digits = 1) value, but does not quote reach it.

xset4@peaks[sample(sn.and.in.range, 10),c(1:3, 7:11)] mz mzmin mzmax into intb maxo sn sample [1,] 478.2000 478.1953 478.2024 447.8157 NA 142.1371 NA 269 [2,] 195.1000 195.0998 195.1022 2984.2040 NA 1552.7041 NA 2211 [3,] 1016.4624 1016.4486 1016.4681 1820.6817 NA 1039.0000 NA 1320 [4,] 119.0506 119.0492 119.0510 4263.1994 NA 1386.7900 NA 1926 [5,] 955.5000 955.4961 955.5144 116.3307 NA 121.0000 NA 2278 [6,] 194.1000 194.0968 194.1014 397.0972 NA 280.0000 NA 636 [7,] 500.2485 500.2467 500.2534 1300.3988 NA 752.0742 NA 3063 [8,] 346.2236 346.2203 346.2270 0.0000 NA 0.0000 NA 466 [9,] 343.1000 343.0961 343.1025 3387.1611 NA 1504.1719 NA 1927 [10,] 317.2490 317.2487 317.2535 5028.5997 NA 2884.1113 NA 123

So from what I can tell, the sn = NA value is necessary for an assignment of mz = ROI, but it is not sufficient. However, my diagnostic for which mz = ROI is imperfect, and those peaks which have sn = NA values and mz in range of the mzmin:mzmax generally happen to have mzrange values that span the ROI value. There are a percentage of values - see rows 3, 4, 7,8, and 10 above - where the mz value assigned is not an ROI mz. In these cases, most of the values are near the ROI in mzrange - that is the mass tolerance setting provides sufficient overlap to cover the ROI. However, this isn't universal. I will keep looking into this to see if I can find a pattern, but clearly the sn = NA is a necessary first step in creating the problem.

cbroeckl commented 6 years ago

One more observation to add: all the peaks in the peak table that have sn values = NA are at the end of the peak table, but it does not correspond to the filled peak range.

sn.na <- which(is.na(xset4@peaks[,"sn"])) sn.val <- which(!is.na(xset4@peaks[,"sn"]))

range(sn.na) [1] 7384657 23279826

range(sn.val) [1] 1 7384656

range(xset4@filled) [1] 16170350 23279826

jorainer commented 6 years ago

It seems you are using the old functions and objects - we are phasing them out in favor of the new function and the objects that are more stable and allow a better handling/accessing the MS data. You should consider also using them (i.e. read the data with readMSData( , mode = "onDisk"), do peak detection using findChromPeaks etc).

cbroeckl commented 6 years ago

Yeah - I intend to use the new functions/objects, just haven't gotten there yet. Do you expect any difference in the behavior of new vs old functions?

jorainer commented 6 years ago

the peak detection code etc did not change. It's more how the data is handled. So, no I don't expect many changes (except for obiwarp and peak filling).