lgatto / MSnbase

Base Classes and Functions for Mass Spectrometry and Proteomics
http://lgatto.github.io/MSnbase/
125 stars 50 forks source link

pickPeaks and estimateNoise use different `isCentroided` methods?! #150

Closed sgibb closed 8 years ago

sgibb commented 8 years ago

In pickPeaks you find the following lines to test for centroided mode: functions-Spectrum.R:395

if (!ignoreCentroided & isTRUE(object@centroided)) {

and in estimateNoise: functions-Spectrum.R:376

if (!ignoreCentroided & (object@centroided | is.na(object@centroided))) {

These lines are the reason why the example of ?plot.Spectrum.Spectrum is not working anymore: man/plotSpectrumSpectrum-methods.Rd:62-73

file <- dir(system.file(package = "MSnbase", dir = "extdata"),
            full.name = TRUE, pattern = "mzXML$")

## create basic MSnExp
msexp <- readMSData(file)

## centroid them
msexp <- pickPeaks(msexp)

While the test in pickPeaks results in a non-centroided spectrum, the test in estimateNoise assumes a centroided on and returns a matrix of NA. That's why SNR comparison fails and no peak is found, even worse an empty spectrum is returned.

We should replace both tests (and maybe more in other functions) by a isCentroided method that handle these cases. The current isCentroided,Spectrum-method uses the quantile function and doesn't investigate the centroided slot. Using the quantile method in all subroutines would unnecessarily slow down the processing. So we should rely on the slot, or?

lgatto commented 8 years ago

I'll start with some clarifications

  1. The default in readMSData[2] have changed from FALSE to NA, which is why some changes in these if tests have been made and the example now fails (sorry for the oversight). This has been fixed now (see 545280fd7cb27b8702b07a24f866c6e4b0127649). Obviously, there is a need for revisiting how the mode is assessed.
  2. isCentroided estimates whether the spectrum in centroided based on the raw data. To access the slot, there's the centroided accessor. We should always use the latter (or @centroided) in tests; the former is meant to be used when this information is not available.

I think we probably will want to identify different cases. Either, NA is acceptable, or one must be able to determine the mode before proceeding. In these cases, one should rely on the slot.

If the slot is NA, we could fall back on isCentroided, but it doesn't have 100% accuracy for individual spectra, especially if we consider different MS levels and different peak picking algorithms (see #131). In such situations, I think it is safer to stop with an error, and ask the user to update the object. isCentroided works well for a full experiment, if one assumes that all MS levels have the same mode. In that case, the majority of spectra will give the right answer, which can be used to set the mode.

> f
[1] "/home/lg390/R/x86_64-pc-linux-gnu-library/3.3/msdata/proteomics/TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01.mzML.gz"
> f <- msdata::proteomics(full.names = TRUE)
> x <- readMSData2(f)
Reading 509 spectra from file TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01.mzML.gz
> all(is.na(centroided(x)))
[1] TRUE
> invisible(isCentroided(x))
invisible(isCentroided(x))

ctrd      1   2
  FALSE  58   0
  TRUE    0 451
> centroided(x, msLevel = 1) <- FALSE
> centroided(x, msLevel = 2) <- TRUE
> table(centroided(x))

FALSE  TRUE 
   58   451 

May we could have something like this at the beginning of functions that require the mode to be defined.

modeIsDefined <- function(x) {
  .def <- all(!is.na(centroided(x)))
  if (!.def) stop("The mode of your data is undefined. See ?isCentroided for details.")
  invisible(TRUE)
}

So for pickPeaks, we would have

if (modeIsDefined(object) & centroided(x)) {
  message("Object already centroided")
  return(object)
}

For estimateNoise, we would have

 if (modeIsDefined(object) & !centroided(object)) {
    warning("Noise estimation is only supported for profile spectra.")
    return(matrix(NA, nrow=0L, ncol = 2L, dimnames = list(c(), c("mz", "intensity"))))
  }

Any thoughts?

sgibb commented 8 years ago

Looks good. What about a single function to save some keystrokes and errors (maybe the name is not the best, could lead to confusion with isCentroided)?

isCentroidedMode <- function(x) {
  cd <- centroided(x)
  if (anyNA(cd)) 
    stop("The mode of your data is undefined. See ?isCentroided for details.")
  cd
}
lgatto commented 8 years ago

Or, we could add a strict argument to centroided:

centroided(x) ## NA
centroided(x, strict = TRUE) ## error
sgibb commented 8 years ago

This is even better than a new function. "strict" is a good name for the argument. Nevertheless I like to suggest "na.fail" (default: FALSE) to be in line with stats::na.fail:

centroided(x) ## NA
centroided(x, na.fail=TRUE) ## error