pickPeaks fails to centroid if no flanking zeroes are present

RogerGinBer commented 1 year ago

Hi there!

While working with raw LC-IM-MS data (related to https://github.com/sneumann/xcms/pull/647), I've noticed that pickPeaks does not correctly centroid the data if no flanking zero values are present. Concretely, it can:

Fail to keep valid data points
Alter very signifcantly the mz values via refineCentroids (uses mz points that are far away)

Here's a reproducible example, using directly the internal .peaks_pick. The input data is a real, rather noisy, TOF scan where, despite being in profile (I've double-checked it), it's very sparse (has no zero values):

pdata <- matrix(
    c(57.07044,89.07047,171.15151,186.95607,188.18357,190.06249,191.10797,196.08091,
      201.11109,205.08090,210.13168,210.97000,215.13854,218.10776,
      2996,2170,269,1607,21,273,288,314,203,197,3102,236,1779,47),
    ncol = 2
)
pdata #Optimal result, applying the function shouldn't change anything
Spectra:::.peaks_pick(pdata, spectrumMsLevel = 1) #Only keeps 3 mz peaks
Spectra:::.peaks_pick(pdata, spectrumMsLevel = 1, halfWindowSize = 1L) #Lowering hws helps, but still not good
Spectra:::.peaks_pick(pdata, spectrumMsLevel = 1, halfWindowSize = 1L, k = 2L) #Using the neighbors for refining messes up the mzs

To work around this problem, I'd like to contribute with this solution (still a sketch, probably should refactor to remove dependency on R.utils), which basically appends an arbitrary amount of zero values in the peak matrix when a mass "gap" larger than a tolerance is found (usually not larger than 0.01Da for HRMS instruments). Also, we keep the mz order of the data by imputing mz values in between (see last examples). This function would be called first and then the regular peakPicks would go on as usual:

.add_profile_zeros <- function (x, tol = 0.2, n = 5) {
    if (!nrow(x)) return()
    pos <- which(diff(x[,1]) > tol) + 1
    if (!length(pos)) return(x)
    mzs <- R.utils::insert(x[,1], ats = pos, values = lapply(x[pos - 1,1], function(x) rep(x + tol/2, each = n))) 
    ints <- R.utils::insert(x[,2], ats = pos, values = list(rep(0, times = n)))
    matrix(c(mzs, ints), ncol = 2)
}
pickPeaksSparse <- function (object, ...) 
{
    .local <- function (object, massGap = 0.2, nZeros = 5L, ...) 
    {
        if (!is.numeric(massGap) || length(massGap) != 1L || massGap <= 0L) 
            stop("Argument 'massGap' has to be an numeric of length 1 ", 
                 "and > 0.")
        if (!is.integer(nZeros) || length(nZeros) != 1L || nZeros <= 0L) 
            stop("Argument 'nZeros' has to be an integer of length 1 that is > 0.")

        object <- addProcessing(object, .add_profile_zeros, tol = massGap, 
                                n = nZeros, ...)
        object@processing <- Spectra:::.logging(object@processing, "Filled mz gaps larger than ", 
                                      massGap, " with ", nZeros, " zeros.")
        object <- pickPeaks(object, ...)
    }
    .local(object, ...)
}

And now it would work:

.add_profile_zeros(pdata) 
Spectra:::.peaks_pick(.add_profile_zeros(pdata), spectrumMsLevel = 1, halfWindowSize = 1L, k = 2L)

What do you think? Would this internal .add_profile_zeros belong here or in MsCoreUtils?

I'll be happy to contribute, :+1: Roger

jorainer commented 1 year ago

Thanks @RogerGinBer for the reproducible example. I'm wondering if that's not a problem with the actual data itself? as you say, for profile data this is very sparse. Don't know which instruments you're using, but also Sciex profile mode data has no 0 intensities and is sparse, but not that much that you have here.

While your workaround seems to work I would maybe prefer to fix the core functions used by peaks_pick in the MsCoreUtils package and avoid the additional step of adding 0-values (which will also have a negative impact on the performance). Can you maybe have a look into the noise, localMaxima function to evaluate which one would be the one that needs adjustment?

The refineCentroids also has an issue that needs to be fixed (related to this PR.

sgibb commented 1 year ago

Maybe it is a problem with the export tool? E.g. ages ago, compassXport for Bruker MALDI data removes zeros (and very low values) to keep the size of the mz(X)ML files small. If I remember correctly there was no command line argument but you had to set a key it the windows registry to avoid this behaviour.

RogerGinBer commented 1 year ago

Thanks to both! Actually, it was as @jorainer pointed out, and the data itself (Bruker, Tims-TOF Pro 2) was actually already centroided, but I failed to notice it :man_facepalming:

Digging deeper into the issue, I found what was the problem that led me to believe the data was profile:

In LC-IM-MS data, we acquire multiple scans (each with a slightly different IM value) within a single frame (=RT value): it turns out there are discrete, random fluctuations in the mz values reported across different scans.
When running combineSpectra (context: https://github.com/sneumann/xcms/pull/647) to merge LC-IM-MS scans into a condensed frame scan, the ppm value I set up was too low (regular <5ppm error in a TOF), so the fluctuating mz points were retained separately.

Here's what it this fluctuation looks like, around ~8-10 ppm per step (reflects exactly 1 TOF detector cycle difference):

After summarizing the mz values with combineSpectra the result looks a lot like a profile spectra but with no zero values (and that's where I got confused)

So, in summary, my problem was actually a completely different thing... :sweat_smile:
Could we smooth/correct the mz values across adjacent scans while keeping them as separate? Do we have something for that in Spectra? Or should I just increase the ppm parameter to a generous multiple when using combineSpectra (just like with xcms's centWaveParam)?

jorainer commented 1 year ago

Re smooth/correct the m/z values across adjacent scans: in MSnbase we had the combineSpectraMovingWindow that essentially allowed to smooth spectra along the rt dimension (with a moving window approach). I think we did not (yet) implement that for Spectra though.

Regarding the ppm error - yes, also our instruments should have < 5ppm error - but looking at the data that seems not to be correct. I usually see a bigger error.

rformassspectrometry / Spectra

pickPeaks fails to centroid if no flanking zeroes are present #257