lgatto / MSnbase

Base Classes and Functions for Mass Spectrometry and Proteomics
http://lgatto.github.io/MSnbase/

New/update quantify method #297

Open lgatto opened 6 years ago

lgatto commented 6 years ago

Given the new OnDiskMSnExp infrastructure, it is time to upgrade the quantify method. I am considering the following signature (only focusing on the most important arguments for now):

quantify <- function(object, msLevel,
                     reporters,
                     method) 

with the following behaviour:

## msLevel:
## 1 -> see method
## 2 -> see method: either spectral counting (if reporters is
##                  missing) or isobaric labelling (requires reporters)
## 3 -> isobaric labelling, requires reporters
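
A minimal sketch of how that behaviour could be dispatched (quantifyMS1, countSpectra and quantifyReporters are hypothetical placeholders, not existing functions):

quantify <- function(object, msLevel = 2L, reporters, method) {
    if (msLevel == 1L) {
        ## MS1: quantitation delegated to 'method' (e.g. xcms-based)
        quantifyMS1(object, method = method)
    } else if (msLevel == 2L && missing(reporters)) {
        ## MS2 without reporters: spectral counting
        countSpectra(object, method = method)
    } else {
        ## MS2/MS3 with reporters: isobaric labelling
        if (missing(reporters))
            stop("MS level ", msLevel, " quantitation requires 'reporters'")
        quantifyReporters(object, reporters = reporters, method = method)
    }
}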

Another possibility would be to have a QuantitationParams class (with essentially the arguments shown above) and pre-defined instances for the most common methods. At this point it feels like overkill, but it could be a way to more easily extend quantification parametrisation in the future. This could, by the way, also be useful for xcms.
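
A rough sketch of what such a QuantitationParams class and pre-defined instances could look like (slot names, method strings and constructor names are only illustrative; TMT6 is the ReporterIons instance exported by MSnbase):

setClass("QuantitationParams",
         slots = c(msLevel = "integer",
                   reporters = "ANY",   ## e.g. a ReporterIons object or NULL
                   method = "character"))

## pre-defined instances for common set-ups
TMT6Params <- function()
    new("QuantitationParams", msLevel = 2L, reporters = TMT6, method = "max")

SpectralCountingParams <- function()
    new("QuantitationParams", msLevel = 2L, reporters = NULL, method = "count")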

Another point I am considering is to allow either returning an MSnSet containing the features of the MS level passed as argument to quantify (current behaviour), or returning a new OnDiskMSnExp in which the quantification matrix (for all features) would be stored in the assayData slot, similar to what xcms does in the XCMSnExp object.

@sgibb @jotsetung - any comments or suggestions?

jorainer commented 6 years ago

Re QuantitationParams class. In xcms I am using parameter classes to switch between analysis methods (such as CentWaveParam, ObiwarpParam, etc.). I think that works nicely, especially because I can put these parameter classes into the process history - so it is possible to exactly reproduce the analysis based on the result object alone. What was overkill is that I implemented getter and setter methods for all slots of each parameter class - I would never do that again.
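
To illustrate that point, a lightweight parameter class can get away without any per-slot accessors, relying on slot()/@ instead (a sketch, all names made up):

setClass("SimpleQuantParam",
         slots = c(fun = "character", args = "list"))

p <- new("SimpleQuantParam", fun = "sum", args = list(na.rm = TRUE))

## generic access instead of dedicated getter/setter methods
slot(p, "fun")
slot(p, "args")$na.rm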

Re OnDiskMSnExp as a result object. As you know, I am using that in xcms. I find it useful because, even though you have the processed/quantified feature data, it lets you access the original raw data at any point - and the OnDiskMSnExp is very lightweight anyway. So I like the idea.

lgatto commented 6 years ago

I think it would be good to keep MSnbase and xcms similar, hence having QuantitationParams for both might be good - especially since MS1 quantitation will be using xcms in the future.

I will look into your XCMSnExp again and see to what extent it can be used or adapted.

lgatto commented 6 years ago

@jotsetung - a few questions following up from the discussion above.

In xcms, features are extracted from an OnDiskMSnExp to create an XCMSnExp that contains the additional MsFeatureData environment. What is the method/function that creates an XCMSnExp from an OnDiskMSnExp? It would be nice to keep some consistency.

What would you think of me reusing the XCMSnExp for my quantitation purposes? If so, would you consider renaming it to something like FeatureMSnExp or QuantMSnExp? Alternatively, I could create a class that inherits from XCMSnExp without adding anything and go from there. I know there will be some cyclic dependency issues, but it's good to think about sharing infrastructure early on.

Re parameters above, I saw that you have the Params/GenericParam classes, and a whole range of specialisations thereof for CentWave, MatchedFilter, ... My idea was to have a QuantitationParam and then specialisations for LFQ, TMT, iTRAQ, spectral counting, ... Do you think it would be worth having some link, such as mine inheriting from GenericParam? This would make label-free quantitation using xcms more coherent - I envision something like

qx <- quantify(x, method = "CentWaveParam")

jorainer commented 6 years ago

The problem is that quantification in xcms is a multi-step approach: 1) chromatographic peak detection (within each sample): findChromPeaks; 2) alignment (retention time adjustment between samples): adjustRtime; 3) correspondence (grouping of chromatographic peaks across samples): groupChromPeaks.

Step 1) already returns an XCMSnExp object that contains the chromatographic peak table in the MsFeatureData. The (optional) step 2) adds adjusted retention times to the MsFeatureData, and step 3) groups peaks across samples and does the quantification (it adds the feature definitions, which are in essence just a DataFrame with indices linking to the individual chromatographic peaks belonging to the same feature). The quantification matrix can then be extracted using the featureValues method, which builds the matrix on-the-fly by extracting the corresponding quantifications from the chromatographic peak table.
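
For reference, the three steps map roughly onto the following user-level calls (file names, sample groups and parameter settings are made up for illustration):

library(MSnbase)
library(xcms)

raw <- readMSData(c("a.mzML", "b.mzML"), mode = "onDisk")

## 1) chromatographic peak detection within each sample
xod <- findChromPeaks(raw, param = CentWaveParam(peakwidth = c(20, 80)))

## 2) (optional) alignment of retention times between samples
xod <- adjustRtime(xod, param = ObiwarpParam())

## 3) correspondence: group chromatographic peaks across samples
xod <- groupChromPeaks(xod, param = PeakDensityParam(sampleGroups = c(1, 2)))

## feature x sample quantification matrix, built on-the-fly
head(featureValues(xod, value = "into"))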

The idea of having a single quantify method to do all of this might be tricky. It would, however, work for applying quantification using the yams package from the Hansen group (something I might want to implement anyway), as they're not doing any alignment etc.

Now, re quantify(x, param = CentWaveParam()): this would identify chromatographic peaks within each sample - but there is no way yet to generate a quantification matrix, as the peaks still have to be matched across samples (by groupChromPeaks). The problem is really that xcms is designed as a multi-step approach.

To me it would make sense to apply a quantification method once the features have been defined (i.e. the 2D areas in the rt-m/z space). This can be done with a combination of findChromPeaks and groupChromPeaks. An alternative possibility would be to define yet another parameter class that holds several Param classes, e.g. an XCMSParam class with a parameter class for peak detection, one (optional) for alignment and one for correspondence. quantify with such a parameter class would then apply all the necessary steps sequentially and return the object with the quantification data (not necessarily an XCMSnExp object, see below).
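
One way such a composite parameter class could look (the name XCMSParam and the slot names are placeholders; slot types are kept loose on purpose):

setClass("XCMSParam",
         slots = c(peakDetection  = "ANY",   ## e.g. a CentWaveParam
                   alignment      = "ANY",   ## optional, e.g. ObiwarpParam or NULL
                   correspondence = "ANY"))  ## e.g. a PeakDensityParam

prm <- new("XCMSParam",
           peakDetection  = CentWaveParam(),
           alignment      = ObiwarpParam(),
           correspondence = PeakDensityParam(sampleGroups = c(1, 2)))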

Re re-using the XCMSnExp for your purpose: that depends on what you want to have. I think you want an object with a two-dimensional quantification matrix - that's not in XCMSnExp. What if you defined a similar object (just with a two-dimensional quantification matrix) and quantify turned an XCMSnExp into your object (dropping all the information about individual chromatographic peaks etc.)?

lgatto commented 6 years ago

The idea of having a single quantify method to do all of this might be tricky.

That wasn't my intention. Now that I understand better, I would see something like

Re re-using the XCMSnExp for your purpose: that depends on what you want to have. I think you want an object with a two-dimensional quantification matrix - that's not in XCMSnExp.

No, the two-dimensional object is an MSnSet, which is here to stay. I want an OnDiskMSnExp that also contains a matrix with quantified features - a SomethingMSnExp. Then one can easily look at the quantified features and the raw data and, when ready, proceed with only the quantitative data using as(mySomethingMSnExp, "MSnSet").
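
A possible sketch of that coercion, assuming the quantification matrix is stored under the name exprs in an msFeatureData environment (class and slot names as discussed above, i.e. still hypothetical):

setAs("SomethingMSnExp", "MSnSet",
      function(from) {
          e <- from@msFeatureData$exprs
          MSnSet(exprs = e,
                 fData = fData(from)[rownames(e), , drop = FALSE],
                 pData = pData(from))
      })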

I hope this clarifies what I am looking for. I think these needs are essentially covered by what you already have.

jorainer commented 6 years ago

Looks good to me! I think a ParamList containing the parameter classes for the 3 steps is not a bad idea (it could be called xcmsParam), plus a YamsParam that does the quantification in one go. I like the idea behind yams and that it takes a different approach than xcms. The only caveat is that the data should be nicely aligned for yams.

For me it might also be helpful to then have an as(XCMSnExp, "MSnSet") to get rid of the raw data in the end. Good that we begin to merge metabolomics and proteomics workflows/classes etc. The next thing would then be the identification of the features - for an initial identification based on m/z I've started something over at https://github.com/EuracBiomedicalResearch/CompoundDb. It's a simple database for compounds.

lgatto commented 6 years ago

For me it might also be helpful to then have an as(XCMSnExp, "MSnSet") to get rid of the raw data in the end.

That was implied in my as(SomethingMSnExp, "MSnSet") - as SomethingMSnExp would essentially be an XCMSnExp, i.e. an OnDiskMSnExp with a feature/quantitation environment.

If this looks good and feasible, my next concrete question is:

Or do you see another possibility?

Same for the Params class, but that's more of a detail now.

Good that we begin to merge metabolomics and proteomics workflows/classes etc

Yes!

lgatto commented 6 years ago

See quant2.R for a testing prototype.

sgibb commented 6 years ago

I really like the idea to merge metabolomics and proteomics workflows and to use a QuantitationParam class for the quantitation settings.

I want an OnDiskMSnExp that also contains a matrix with quantified features - a SomethingMSnExp

While it would be great to link quantitation data to raw data, I am against the MonsterMSnExp that contains just everything. IMHO it works against the idea we discussed at EuroBioc2017, namely that we want to develop a universal infrastructure to link quant data to raw data to sequence data to genomic data etc. I think it would be better to keep small, specialised objects ((OnDisk)MSnExp, MSnSet).

BTW: do we want to start a repository for this linkage stuff, to discuss goals/problems from time to time? (A starting point could be the document you wrote at EuroBioc2017.)

jorainer commented 6 years ago

What about moving XCMSnExp to MSnbase, possibly renaming it FeatureMSnExp, QuantMSnExp, ..., and using it as XCMSnExp in xcms? I can't do it the other way round due to circular dependencies.

In principle there shouldn't be a big problem moving the object - there will be stuff that you have to implement in MSnbase (e.g. how to add or access the matrix with quantification data) and stuff that I would like to keep in xcms (all the stuff related to adjusted retention times etc). We would have to define beforehand what has to be moved (depends on what you need in MSnbase). One thing regarding the name: the idea was that you have the MSnExp object for mass spec proteomics and the XCMSnExp the extension for LC/GC-MS (hence XC + MSnExp). No problem moving the Param and GenericParam classes.

Re @sgibb about the monster object: for me it was key to have both the link to the raw and the quantified data in the same object. I really need to get back to exactly the same data on which the quantification was performed, to check whether, e.g., the identified peaks are well aligned across samples etc. Having that in different objects would make it very hard (this is how it was in the old xcms, and the main reason why I wanted to change and update it).

lgatto commented 6 years ago

@sgibb

I am against the MonsterMSnExp

In principle, I agree. But at this point, given that xcms has taken that route (for very sensible reasons) and has something that suits my immediate needs, I think it is a reasonable path for MSnbase too. It could also be done with two objects, making sure they have compatible dimensions. But at this stage, having matching metabolomics and proteomics pipelines is a desirable feature.

The long-term goal, as discussed at EuroBioc2017, is to integrate raw data, quantitation, protein databases, peptide sequences, ..., and I don't want to rush this. The r-bioc-ms-challenges repo can be used for discussions.

I am open to discussion though. I will think about keeping MSnSet and OnDiskMSnExp separate and maintaining compatible dimensions, which isn't the case now as the latter stores all MS levels.
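
If the two-object route were taken, keeping the dimensions compatible could be as simple as restricting the raw-data object to the quantified MS level (a sketch, assuming MS2 quantitation; ondisk and msnset are placeholder objects):

## keep only the quantified MS level so the raw-data object and the
## MSnSet have matching features
ms2 <- filterMsLevel(ondisk, msLevel. = 2L)
stopifnot(identical(featureNames(ms2), featureNames(msnset)))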

lgatto commented 6 years ago

We would have to define beforehand what has to be moved (depends on what you need in MSnbase).

I would have thought the class and accessors would suffice, and all the xcms-specific processing would stay, of course.

Re naming, I get the reason for the XC, but it is confusing because of xcms. Would it be possible to name it something more general in MSnbase and use it as XCMSnExp in xcms? Something along the lines of

name1 <- name2 <- ...

for functions. XCMSnExp wouldn't be exported from MSnbase, only from xcms.
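
A rough sketch of how the two names could coexist (all names are illustrative): MSnbase defines the general class and constructor, and xcms provides XCMSnExp on top of it.

## in MSnbase (hypothetical): general class plus constructor
setClass("FeatureMSnExp", contains = "OnDiskMSnExp")
FeatureMSnExp <- function(...) new("FeatureMSnExp", ...)

## in xcms (hypothetical): the same object under the historical name;
## XCMSnExp would only be exported from xcms
setClass("XCMSnExp", contains = "FeatureMSnExp")
XCMSnExp <- function(...) new("XCMSnExp", ...)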

jorainer commented 6 years ago

Is it possible to have the same object under two different names? Here too, I don't have any strong feelings about the name of the object - XCMSnExp is very hard to pronounce anyway.

The reason I'm not euphoric about moving the object is that, as it is now, it is hardly of any use to you - you don't need the adjusted retention times, the matrix of identified chromatographic peaks or the DataFrame with feature definitions. What you would need is a matrix with quantitation data, which the object does not have at present. Wouldn't it be easier to implement a new object? Coercing from an XCMSnExp to it would then be easy, because the matrix returned by featureValues,XCMSnExp would be the quantification matrix.
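
Such a coercion could be sketched along these lines, using featureValues to build the matrix and a couple of featureDefinitions columns as feature metadata (illustrative only, glossing over name matching):

setAs("XCMSnExp", "MSnSet",
      function(from) {
          fd <- featureDefinitions(from)
          MSnSet(exprs = featureValues(from, value = "into"),
                 fData = data.frame(mzmed = fd$mzmed, rtmed = fd$rtmed,
                                    row.names = rownames(fd)),
                 pData = pData(from))
      })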

If you're already thinking that you will need the object to implement label-free identification in MSnbase - for that you would then need to import all the analysis methods from xcms, and you'll end up with circular dependencies. I think that should be implemented in xcms.

lgatto commented 6 years ago

Is XCMSnExp not an OnDiskMSnExp with the MsFeatureData environment (and a list for the processing history)?

In your case, the environment contains the adjusted retention times, the matrix of identified chromatographic peaks and the DataFrame with feature definitions - accessed with the dedicated accessors; in my case, that environment would contain a matrix accessed by exprs.

Basically, I would do

setClass("FeatureMSnExp",
         slots = c(
             .processHistory = "list", ## optional
             msFeatureData = "MsFeatureData"
         ), 
         contains = c("OnDiskMSnExp"))

setMethod("[", "FeatureMSnExp", 
   ## also subset object@msFeatureData$exprs
)

setMethod("exprs", "FeatureMSnExp", 
   function(object) object@msFeatureData$exprs)

setMethod("show", "FeatureMSnExp", 
   ## just say that there's something in the msFeatureData slot, 
   ## possibly say what it is, then call show,OnDiskMSnExp
)

## all the other accessors would be inherited form OnDiskMSnExp

For later, tighter integration, I think it would be good if XCMSnExp and FeatureMSnExp weren't completely independent. Let's sleep on it and think about it again. I can work around this for now anyway.

jorainer commented 6 years ago

Ah, and here comes the trouble: I have a [ method implemented that we would have to adopt:

#' @description The \code{[} method allows to subset a \code{\link{XCMSnExp}}
#'     object by spectra. Be aware that the \code{[} method removes all
#'     preprocessing results, except adjusted retention times if
#'     \code{keepAdjustedRtime = TRUE} is passed to the method.
#'
#' @param x For \code{[} and \code{[[}: an \code{\link{XCMSnExp}} object.
#'
#' @param i For \code{[}: \code{numeric} or \code{logical} vector specifying to
#'     which spectra the data set should be reduced.
#'     For \code{[[}: a single integer or character.
#'
#' @param j For \code{[} and \code{[[}: not supported.
#'
#' @param drop For \code{[} and \code{[[}: not supported.
#'
#' @rdname XCMSnExp-filter-methods
setMethod("[", "XCMSnExp", function(x, i, j, ..., drop = TRUE) {
    if (!missing(j))
        stop("subsetting by columns ('j') not supported")
    if (missing(i))
        return(x)
    else if (!(is.numeric(i) | is.logical(i)))
        stop("'i' has to be either numeric or logical")
    ## Check if we have keepAdjustedRtime as an additional parameter
    ## in ...
    keepAdjustedRtime <- list(...)$keepAdjustedRtime
    if (is.null(keepAdjustedRtime))
        keepAdjustedRtime <- FALSE
    if (hasFeatures(x) | hasChromPeaks(x)) {
        suppressMessages(
            x <- dropFeatureDefinitions(x, keepAdjustedRtime =
                                               keepAdjustedRtime))
        suppressMessages(
            x <- dropChromPeaks(x, keepAdjustedRtime =
                                       keepAdjustedRtime))
        warning("Removed preprocessing results")
    }
    if (hasAdjustedRtime(x)) {
        if (keepAdjustedRtime) {
            ## Subset the adjusted rtime
            new_adj <- rtime(x, adjusted = TRUE)[i]
            newFd <- new("MsFeatureData")
            newFd@.xData <- .copy_env(x@msFeatureData)        
            adjustedRtime(newFd) <-
                unname(split(new_adj, f = fromFile(x)[i]))
            lockEnvironment(newFd, bindings = TRUE)
            x@msFeatureData <- newFd
        } else {
            suppressMessages(x <- dropAdjustedRtime(x))
        }
    }
    callNextMethod()
})

So, if there is an exprs, subset that and then call all of the other subsetting operations above. The same is true for the filter... methods. For XCMSnExp they all override the OnDiskMSnExp implementation to ensure that the data remains consistent. To me it would make sense to keep these methods in xcms, or eventually override the ones exported from MSnbase. All of this is what I have to do to keep the adjusted retention times, identified chromatographic peaks and feature definitions aligned and consistent.
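
A sketch of how the quantitation-centric object could hook into this: subset its exprs matrix first and then let the subsetting above do the rest (class and slot names as in the earlier sketch, i.e. hypothetical, and glossing over the environment locking used in xcms):

setMethod("[", "FeatureMSnExp", function(x, i, j, ..., drop = FALSE) {
    if (!missing(i) && !is.null(x@msFeatureData$exprs)) {
        ## keep the quantification matrix in sync with the selected spectra
        x@msFeatureData$exprs <- x@msFeatureData$exprs[i, , drop = FALSE]
    }
    callNextMethod()
})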

Actually, what if we keep XCMSnExp as a class that extends your FeatureMSnExp object? That way we could keep the same methods and keep the logic (i.e. subsetting the exprs and/or all the xcms stuff) separate.

lgatto commented 6 years ago

Here's what I was thinking of: we would have a top class in MSnbase

setClass("FeatureMSnExp",
         slots = c(
             .processHistory = "list", ## optional
             msFeatureData = "MsFeatureData"
         ), 
         representation = "VIRTUAL",
         contains = c("OnDiskMSnExp"))
## possibly a show,FeatureMSnExp method
## common infrastructure inherited from OnDiskMSnExp

and in MSnbase

setClass("MSnExpSet", contains = "FeatureMSnExp") ## horrible name
setMethod("[", "MSnExpSet", ...) ## accessing exprs
## all the specific proteomics processing methods/functions

and in xcms

setClass("XCMSnExp", contains = "FeatureMSnExp")
setMethod("[", "XCMSnExp", ...) ## accessing other elements
## all the specific metabolomics processing methods/functions

This allows us to keep a link between the MSnbase- and xcms-specific implementations of FeatureMSnExp, while the two hold different data in the MsFeatureData slot and have different accessors. I am just unsure to what extent this will be useful. But I don't think it will make anything more difficult.

Let's see tomorrow if this still makes sense...

jorainer commented 6 years ago

Looks good to me. So, you're moving the MsFeatureData to MSnbase, right?

lgatto commented 6 years ago

So, you're moving the MsFeatureData to MSnbase, right?

From the above, I rather meant FeatureMSnExp (a renamed, virtual XCMSnExp) and MsFeatureData, and then we can implement our respective XCMSnExp (with all it already has) and SomethingMSnExp (with exprs).

I won't have time to further test/think about this today, so no hurry.

lgatto commented 6 years ago

I am going to wait until I have the new interface with QuantitationParam in place before doing this.