lgatto / MSnbase

Base Classes and Functions for Mass Spectrometry and Proteomics
http://lgatto.github.io/MSnbase/
124 stars 50 forks source link

Detect centroid/profile mode #131

Closed lgatto closed 7 years ago

lgatto commented 8 years ago

It would be great to have a function to detect if a spectrum is in profile mode of centroided. Are there any examples of this in the literature? Could we infer this from an mzML/mzXML file?

sgibb commented 8 years ago

At least for mzML there is a cvParam:

In MALDIquant I use some simplistic function to check for centroided data. This function assumes that profile data have a constant or increasing distance between mz_n and mz_n+1 (https://github.com/sgibb/MALDIquant/blob/master/R/irregular-functions.R#L28-L33, https://github.com/sgibb/MALDIquant/blob/master/R/isRegular-methods.R#L20-L26). That works well for most MALDI raw data but fails for some exporting tools, e.g. if you use Bruker's compassXport it removes all mz values with zero intensity which results in larger gaps.

jorainer commented 8 years ago

xcms has also a method implemented:

setMethod("isCentroided", "xcmsRaw", function(object){
    if (length(getScan(object,length(object@scantime) / 2)) >2 ) {
        quantile(diff(getScan(object,length(object@scantime) / 2)[,"mz"]),.25)  > 0.025
    } else {
        TRUE
    }
})

I'm not quite sure how reliably this estimates whether the data is centroided. It just checks that the difference between most of the M/Z values are larger 0.025.

lgatto commented 8 years ago

@jotsetung - given that I am going to add my own isCentroided methods, what about putting the generic

setGeneric("isCentroided", function(object, ...) standardGeneric("isCentroided"))

in ProtGenerics?

jorainer commented 8 years ago

That would be awesome!

lgatto commented 8 years ago

@jotsetung - available in ProtGenerics 1.5.1 (just committed to master and Bioc devel).

lgatto commented 8 years ago

I tried a couple is existing functions: from MALDIquant (but didn't seem to work for non MALDI data), xcms/MSsary (same ones) and I played a bit with alternative implementation. I ended up using the xcms code, but with only the top 10% peaks (other percentages seem to work too, but less is a bit faster).

> MSnbase:::.isCentroided
function (pk, k = 0.025, qtl = 0.9) 
{
    .qtl <- quantile(pk[, 2], qtl)
    x <- sort(pk[pk[, 2] > .qtl, 1])
    quantile(diff(x), 0.25) > k
}
<environment: namespace:MSnbase>

The reason using the top most intense points is necessary is because I want the methods to work on MS1 and MS2 spectra and it to be robust to different peak picking algorithms, and some (msconvert's vendor libraries, for example, leave smaller/noise peaks around the main ones). It's not 100% accurate, but good enough for a start. As mentioned in the manual, for whole experiments, I would take the majority of the results per MS level rather than assigning directly (see the Thermo_Hela_PRTC_1.mzML example below).

> suppressPackageStartupMessages("MSnbase")
[1] "MSnbase"
> ## MS1 FALSE, MS2 TRUE
> x <- readMSData2(f)
Reading 509 spectra from file TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01.mzML.gz
> invisible(isCentroided(x))

ctrd      1   2
  FALSE  58   0
  TRUE    0 451
> ## All FALSE
> x <- readMSData2("/home/lg390/Data2/Thermo_HELA_PRT/Thermo_Hela_PRTC_1.mzML")
Reading 21299 spectra from file Thermo_Hela_PRTC_1.mzML
> invisible(isCentroided(x))

ctrd        1     2
  FALSE  8106 13130
  TRUE      0    63
> ## All TRUE - msconvert basic
> x <- readMSData2("/home/lg390/Data2/Thermo_HELA_PRT/out/Thermo_Hela_PRTC_1cent.mzML")
Reading 21299 spectra from file Thermo_Hela_PRTC_1cent.mzML
> invisible(isCentroided(x))

ctrd       1     2
  TRUE  8106 13193
> ## All TRUE - msconvert cwt
> x <- readMSData2("/home/lg390/Data2/Thermo_HELA_PRT/out/Thermo_Hela_PRTC_1cwt.mzML")
Reading 21299 spectra from file Thermo_Hela_PRTC_1cwt.mzML
> invisible(isCentroided(x))

ctrd       1     2
  TRUE  8106 13135
> ## All TRUE, msconvert vendor
> x <- readMSData2("/home/lg390/Data2/Thermo_HELA_PRT/out/Thermo_Hela_PRTC_1vendor.mzML")
Reading 21299 spectra from file Thermo_Hela_PRTC_1vendor.mzML
> invisible(isCentroided(x))

ctrd       1     2
  TRUE  8106 13193

> ## All TRUE
> x <- readMSData2("/home/lg390/Data2/Thermo_HELA_PRT/Thermo_Hela_PRTC_1.mzML")
Reading 21299 spectra from file Thermo_Hela_PRTC_1.mzML
> x <- pickPeaks(x)
> invisible(isCentroided(x))
Apply lazy processing step(s):
 o 'pickPeaks' with 4 argument(s).

ctrd       1     2
  TRUE  8106 13167
jorainer commented 8 years ago

In https://github.com/lgatto/MSnbase/commit/22fef698f0b59ab8494ad3e9a8533f50895507ad I propose an alternative, slightly faster and less memory demanding, isCentroided method. Basically, the isCentroided2 uses the spectrapply method to apply the isCentroided method to each spectrum directly, avoiding to first read all spectra, return the list and use lapply on that list. The benefit comes for larger experiments or more spectra per file.

The benchmark below was run on an OnDiskMSnExp consisting of two mzML files, each having 3670 spectra.

microbenchmark(isCentroided(onDisk), MSnbase:::isCentroided2(onDisk), times = 10)
Unit: seconds
                            expr       min        lq     mean    median
            isCentroided(onDisk) 10.829776 11.102866 11.39593 11.315227
 MSnbase:::isCentroided2(onDisk)  8.044935  8.062418  8.26521  8.217012
        uq       max neval cld
 11.553037 12.641166    10   b
  8.301742  8.901333    10  a 

The big advantage would be for very large experiments (many mzML files) as the isCentroided2 processes each file separately thus there is no need to read the full data in memory (which can become problematic).

lgatto commented 8 years ago

Thanks @jotsetung - I will use for former implementation for an MSnExp method and rename the one using spectrapply isCentroided.

Done is 17b7cc74a6db5e2973ff25abf8c0f710acabe2fc

lgatto commented 8 years ago

I have added the (unexported) .isCentroidedFromFile function that checks the mode of the spectra by parsing the file names. This only works for mzML files and thus returns NA if the file extension doesn't match.

This function needs to be changed as follows: needs to be vectorised and, when called with > 1 file, needs to reorder the results according to the features in the input object.

lgatto commented 7 years ago

Closed with commit 7588383241844cd2cf3ff5df656258ce44b467cc that add the isCentroidedFromFile function for OnDiskMSnExp objects.