rformassspectrometry / QFeatures

Quantitative features for mass spectrometry data
https://RforMassSpectrometry.github.io/QFeatures/
25 stars 7 forks source link

Tabular data input #199

Closed lgatto closed 7 months ago

lgatto commented 9 months ago

This should ideally be harmonised, possibly all moved to QFeatures (with more standard names), to avoid multiple re-implementations.

lgatto commented 9 months ago

@cvanderaa - I plan to move scp::readSCPfromDIANN() to QFeatures and use the latest data in MsDataHub to show its working. What about renaming it QFeatures::readQFeaturesFromDIANN() and keep a copy as scp::readSCPfromDIANN() pointing to the former's man page.

Edit: there should be one difference between QFeatures::readQFeaturesFromDIANN() and scp::readSCPfromDIANN() though - the former produces a QFeatures of SummarizedExperiment objects, the latter of SingleCellExperiment, which might require some refactoring.

lgatto commented 9 months ago

Here's a proposed hack, with minimal refactoring, to illustrate the desired behaviour (shown in the next comment). See this comment for a description of what needs to be done for a proper implementation.

## in QFeatures - original readSCPfromDIANN implementation
.readQFeaturesFromDIANN <-
    function (colData, reportData, extractedData = NULL,
              ecol = "MS1.Area", multiplexing = "none", ...) {
    diannReportCols <- c("File.Name", "Precursor.Id", "Modified.Sequence")
    if (!all(diannReportCols %in% colnames(reportData)))
        stop("'reportData' is not an expected DIA-NN report table ",
            "output. This function expects the main output file as ",
            "described here: https://github.com/vdemichev/DiaNN#main-output-reference")
    if (!ecol %in% colnames(reportData))
        stop("'", ecol, "' not found in 'reportData'")
    if (!"File.Name" %in% colnames(colData))
        stop("'colData' must contain a column named 'File.Name' that provides ",
            "a link to the 'File.Name' column in 'reportData'")
    if (multiplexing == "none" && !is.null(extractedData))
        stop("Providing 'extractedData' for label-free experiments ",
            "('multiplexed == \"none\"') is not expected. Raise an ",
            "issue if you need this feature: ",
            "https://github.com/UCLouvain-CBIO/scp/issues/new/choose")
    args <- list(...)
    if (multiplexing == "mTRAQ") {
        reportData$Label <- sub("^.*[Q-](\\d).*$", "\\1", reportData$Modified.Sequence)
        reportData$Precursor.Id <- gsub("\\(mTRAQ.*?\\)", "(mTRAQ)",
            reportData$Precursor.Id)
        args$sep <- "."
        if (!"Label" %in% colnames(colData))
            stop("'colData' must contain a column named 'Label' that ",
                "provides the mTRAQ reagent used to label the ",
                "samples and/or single cells.")
        if (any(mis <- !colData$Label %in% reportData$Label)) {
            stop("Some labels from 'colData$Label' were not found as",
                "part of the mTRAQ labels found in ", "'reportData$Modified.Sequence': ",
                paste0(unique(colData$Label[mis]), collapse = ", "))
        }
        nIds <- length(unique(paste0(reportData$Precursor.Id,
            reportData$File.Name)))
        nLevels <- sapply(colnames(reportData), function(x) {
            nrow(unique(reportData[, c("Precursor.Id", "File.Name",
                x)]))
        })
        idCols <- names(nLevels)[nLevels == nIds]
        reportData <- pivot_wider(reportData, id_cols = all_of(idCols),
            names_from = "Label", values_from = ecol)
    }
    else if (multiplexing == "none") {
        colData$Label <- ecol
        args$sep <- ""
        args$suffix <- ""
    }
    else {
        stop("The '", multiplexing, "' multiplexing strategy is not ",
            "implemented. Raise an issue if you need this feature: ",
            "https://github.com/UCLouvain-CBIO/scp/issues/new/choose")
    }
    out <- do.call(readSCP, c(args, list(featureData = reportData,
        colData = colData, batchCol = "File.Name", channelCol = "Label")))
    if (!is.null(extractedData)) {
        labs <- unique(colData$Label)
        quantCols <- grep(paste0("[", paste0(labs, collapse = ""),
            "]$"), colnames(extractedData))
        extractedData <- readSingleCellExperiment(extractedData,
            ecol = quantCols, fnames = "Precursor.Id")
        cnames <- unique(unlist(colnames(out)))
        if (any(mis <- !cnames %in% colnames(extractedData)))
            stop("Some columns present in reportData are not found in ",
                "extracted data", paste0(cnames[mis], collapse = ", "),
                "\nAre you sure the two tables were generated from ",
                "the same experiment?")
        extractedData <- extractedData[, cnames]
        anames <- names(out)
        out <- addAssay(out, extractedData, name = "Ms1Extracted")
        out <- addAssayLink(out, from = anames, to = "Ms1Extracted",
            varFrom = rep("Precursor.Id", length(anames)), varTo = "Precursor.Id")
    }
    out
}

readQFeaturesFromDIANN <-
    function (colData, reportData, extractedData = NULL,
              ecol = "MS1.Area", multiplexing = "none", ...) {
        ans <- .readQFeaturesFromDIANN(colData, reportData, extractedData,
                                       ecol, multiplexing, ...)
        ## The .readQFeaturesFromDIANN() function initially lived as
        ## readSCPfromDIANN() in the scp package, and hence returns
        ## SingleCellExperiment elements. We thus need to convert the
        ## assays to SummarizedExperiment instances.
        el <- ExperimentList(lapply(experiments(ans),
                                    as, "SummarizedExperiment"))
        experiments(ans) <- el
        if (validObject(ans)) ans
    }
## in scp

readSCPfromDIANN <-
        function (colData, reportData, extractedData = NULL,
              ecol = "MS1.Area", multiplexing = "none", ...)
            .readQFeaturesFromDIANN(colData, reportData, extractedData,
                                    ecol, multiplexing, ...)
lgatto commented 9 months ago

And here's an example


> x <- read.delim(benchmarkingDIA.tsv()) ## DIA data
see ?MsDataHub and browseVignettes('MsDataHub') for documentation
loading from cache
> ## fix file names
> x[[1]] <- sub("^.+raw-data\\\\", "", x[[1]])
> cd <- data.frame(File.Name = unique(x[[1]]))
> readSCPfromDIANN(colData = cd, reportData = x, ecol = "Ms1.Area")
Loading data as a 'SingleCellExperiment' object
Splitting data based on 'File.Name'
Formatting sample metadata (colData)
Formatting data as a 'QFeatures' object
An instance of class QFeatures containing 24 assays:
 [1] RD139_Overlap_UPS1_0_1fmol_inj1.mzML: SingleCellExperiment with 28980 rows and 1 columns 
 [2] RD139_Overlap_UPS1_0_1fmol_inj2.mzML: SingleCellExperiment with 29495 rows and 1 columns 
 [3] RD139_Overlap_UPS1_0_1fmol_inj3.mzML: SingleCellExperiment with 29210 rows and 1 columns 
 ...
 [22] RD139_Overlap_UPS1_5fmol_inj1.mzML: SingleCellExperiment with 30941 rows and 1 columns 
 [23] RD139_Overlap_UPS1_5fmol_inj2.mzML: SingleCellExperiment with 30321 rows and 1 columns 
 [24] RD139_Overlap_UPS1_5fmol_inj3.mzML: SingleCellExperiment with 24168 rows and 1 columns 
> readQFeaturesFromDIANN(colData = cd, reportData = x, ecol = "Ms1.Area")
readQFeaturesFromDIANN(colData = cd, reportData = x, ecol = "Ms1.Area")
Loading data as a 'SingleCellExperiment' object
Splitting data based on 'File.Name'
Formatting sample metadata (colData)
Formatting data as a 'QFeatures' object
An instance of class QFeatures containing 24 assays:
 [1] RD139_Overlap_UPS1_0_1fmol_inj1.mzML: SummarizedExperiment with 28980 rows and 1 columns 
 [2] RD139_Overlap_UPS1_0_1fmol_inj2.mzML: SummarizedExperiment with 29495 rows and 1 columns 
 [3] RD139_Overlap_UPS1_0_1fmol_inj3.mzML: SummarizedExperiment with 29210 rows and 1 columns 
 ...
 [22] RD139_Overlap_UPS1_5fmol_inj1.mzML: SummarizedExperiment with 30941 rows and 1 columns 
 [23] RD139_Overlap_UPS1_5fmol_inj2.mzML: SummarizedExperiment with 30321 rows and 1 columns 
 [24] RD139_Overlap_UPS1_5fmol_inj3.mzML: SummarizedExperiment with 24168 rows and 1 columns 
``
lgatto commented 9 months ago

Still need to transfer scp::readSCP() to QFeatures.

@cvanderaa , what do you think:

lgatto commented 9 months ago

Ignore the silly implementation in the comment above, but the example still hold.

As for the comment above, my preference is to merge readQFeatures() and readSCP() to have both behaviours. The best might even to have methods that dispatch on two arguments, the first one containing a data.frame in both cases (could also be a character with a file name, at least in the simple, current readQFeatures() case), and either a data.frame (current readSCP()), or a vector (current readQFeatures()).

setGeneric("readFOO",
           function(x, y, ...)
               standardGeneric("readFOO"))

setMethod("readFOO",
          c("data.frame", "data.frame"),
          function(x, y)
              message("two dfs"))

setMethod("readFOO",
          c("data.frame", "vector"),
          function(x, y)
              message("one df and one vector"))

setMethod("readFOO",
          c("character", "vector"),
          function(x, y)
              message("one character and one vector"))

readFOO(data.frame(), data.frame())

readFOO(data.frame(), character())
readFOO(data.frame(), numeric())
readFOO(data.frame(), integer())
readFOO(data.frame(), logical())

readFOO(character(), character())
lgatto commented 9 months ago

What annoys me though with this is the naming of the arguments. Often, the names in a generic function are generic, such as object or x and y. The problem is that the first argument refers to the actual data (either as a data.frame, or as the file that contains it), and the second one refers to either the vector ecols or the data.frame colData.

lgatto commented 9 months ago

Last point, I will also add a ecol argument for backwards compatibility in the single-set case.

lgatto commented 9 months ago

Work in progress in branch issue199: https://github.com/rformassspectrometry/QFeatures/tree/issue199

lgatto commented 8 months ago

Merged, but see here for more todos

lgatto commented 8 months ago

@cvanderaa - could you review the readQFeatures man page, when you have some time.