Tabular data input - Githubissues

lgatto commented 9 months ago

QFeatures::readQFeatures()
QFeatures::readSummarizedExperiment()
scp::readSCP()
scp::readSCPfromDIANN()

This should ideally be harmonised, possibly all moved to QFeatures (with more standard names), to avoid multiple re-implementations.

lgatto commented 9 months ago

@cvanderaa - I plan to move scp::readSCPfromDIANN() to QFeatures and use the latest data in MsDataHub to show its working. What about renaming it QFeatures::readQFeaturesFromDIANN() and keep a copy as scp::readSCPfromDIANN() pointing to the former's man page.

Edit: there should be one difference between QFeatures::readQFeaturesFromDIANN() and scp::readSCPfromDIANN() though - the former produces a QFeatures of SummarizedExperiment objects, the latter of SingleCellExperiment, which might require some refactoring.

lgatto commented 9 months ago

Here's a proposed hack, with minimal refactoring, to illustrate the desired behaviour (shown in the next comment). See this comment for a description of what needs to be done for a proper implementation.

## in QFeatures - original readSCPfromDIANN implementation
.readQFeaturesFromDIANN <-
    function (colData, reportData, extractedData = NULL,
              ecol = "MS1.Area", multiplexing = "none", ...) {
    diannReportCols <- c("File.Name", "Precursor.Id", "Modified.Sequence")
    if (!all(diannReportCols %in% colnames(reportData)))
        stop("'reportData' is not an expected DIA-NN report table ",
            "output. This function expects the main output file as ",
            "described here: https://github.com/vdemichev/DiaNN#main-output-reference")
    if (!ecol %in% colnames(reportData))
        stop("'", ecol, "' not found in 'reportData'")
    if (!"File.Name" %in% colnames(colData))
        stop("'colData' must contain a column named 'File.Name' that provides ",
            "a link to the 'File.Name' column in 'reportData'")
    if (multiplexing == "none" && !is.null(extractedData))
        stop("Providing 'extractedData' for label-free experiments ",
            "('multiplexed == \"none\"') is not expected. Raise an ",
            "issue if you need this feature: ",
            "https://github.com/UCLouvain-CBIO/scp/issues/new/choose")
    args <- list(...)
    if (multiplexing == "mTRAQ") {
        reportData$Label <- sub("^.*[Q-](\\d).*$", "\\1", reportData$Modified.Sequence)
        reportData$Precursor.Id <- gsub("\\(mTRAQ.*?\\)", "(mTRAQ)",
            reportData$Precursor.Id)
        args$sep <- "."
        if (!"Label" %in% colnames(colData))
            stop("'colData' must contain a column named 'Label' that ",
                "provides the mTRAQ reagent used to label the ",
                "samples and/or single cells.")
        if (any(mis <- !colData$Label %in% reportData$Label)) {
            stop("Some labels from 'colData$Label' were not found as",
                "part of the mTRAQ labels found in ", "'reportData$Modified.Sequence': ",
                paste0(unique(colData$Label[mis]), collapse = ", "))
        }
        nIds <- length(unique(paste0(reportData$Precursor.Id,
            reportData$File.Name)))
        nLevels <- sapply(colnames(reportData), function(x) {
            nrow(unique(reportData[, c("Precursor.Id", "File.Name",
                x)]))
        })
        idCols <- names(nLevels)[nLevels == nIds]
        reportData <- pivot_wider(reportData, id_cols = all_of(idCols),
            names_from = "Label", values_from = ecol)
    }
    else if (multiplexing == "none") {
        colData$Label <- ecol
        args$sep <- ""
        args$suffix <- ""
    }
    else {
        stop("The '", multiplexing, "' multiplexing strategy is not ",
            "implemented. Raise an issue if you need this feature: ",
            "https://github.com/UCLouvain-CBIO/scp/issues/new/choose")
    }
    out <- do.call(readSCP, c(args, list(featureData = reportData,
        colData = colData, batchCol = "File.Name", channelCol = "Label")))
    if (!is.null(extractedData)) {
        labs <- unique(colData$Label)
        quantCols <- grep(paste0("[", paste0(labs, collapse = ""),
            "]$"), colnames(extractedData))
        extractedData <- readSingleCellExperiment(extractedData,
            ecol = quantCols, fnames = "Precursor.Id")
        cnames <- unique(unlist(colnames(out)))
        if (any(mis <- !cnames %in% colnames(extractedData)))
            stop("Some columns present in reportData are not found in ",
                "extracted data", paste0(cnames[mis], collapse = ", "),
                "\nAre you sure the two tables were generated from ",
                "the same experiment?")
        extractedData <- extractedData[, cnames]
        anames <- names(out)
        out <- addAssay(out, extractedData, name = "Ms1Extracted")
        out <- addAssayLink(out, from = anames, to = "Ms1Extracted",
            varFrom = rep("Precursor.Id", length(anames)), varTo = "Precursor.Id")
    }
    out
}

readQFeaturesFromDIANN <-
    function (colData, reportData, extractedData = NULL,
              ecol = "MS1.Area", multiplexing = "none", ...) {
        ans <- .readQFeaturesFromDIANN(colData, reportData, extractedData,
                                       ecol, multiplexing, ...)
        ## The .readQFeaturesFromDIANN() function initially lived as
        ## readSCPfromDIANN() in the scp package, and hence returns
        ## SingleCellExperiment elements. We thus need to convert the
        ## assays to SummarizedExperiment instances.
        el <- ExperimentList(lapply(experiments(ans),
                                    as, "SummarizedExperiment"))
        experiments(ans) <- el
        if (validObject(ans)) ans
    }

## in scp

readSCPfromDIANN <-
        function (colData, reportData, extractedData = NULL,
              ecol = "MS1.Area", multiplexing = "none", ...)
            .readQFeaturesFromDIANN(colData, reportData, extractedData,
                                    ecol, multiplexing, ...)

lgatto commented 9 months ago

And here's an example


> x <- read.delim(benchmarkingDIA.tsv()) ## DIA data
see ?MsDataHub and browseVignettes('MsDataHub') for documentation
loading from cache
> ## fix file names
> x[[1]] <- sub("^.+raw-data\\\\", "", x[[1]])
> cd <- data.frame(File.Name = unique(x[[1]]))
> readSCPfromDIANN(colData = cd, reportData = x, ecol = "Ms1.Area")
Loading data as a 'SingleCellExperiment' object
Splitting data based on 'File.Name'
Formatting sample metadata (colData)
Formatting data as a 'QFeatures' object
An instance of class QFeatures containing 24 assays:
 [1] RD139_Overlap_UPS1_0_1fmol_inj1.mzML: SingleCellExperiment with 28980 rows and 1 columns 
 [2] RD139_Overlap_UPS1_0_1fmol_inj2.mzML: SingleCellExperiment with 29495 rows and 1 columns 
 [3] RD139_Overlap_UPS1_0_1fmol_inj3.mzML: SingleCellExperiment with 29210 rows and 1 columns 
 ...
 [22] RD139_Overlap_UPS1_5fmol_inj1.mzML: SingleCellExperiment with 30941 rows and 1 columns 
 [23] RD139_Overlap_UPS1_5fmol_inj2.mzML: SingleCellExperiment with 30321 rows and 1 columns 
 [24] RD139_Overlap_UPS1_5fmol_inj3.mzML: SingleCellExperiment with 24168 rows and 1 columns 
> readQFeaturesFromDIANN(colData = cd, reportData = x, ecol = "Ms1.Area")
readQFeaturesFromDIANN(colData = cd, reportData = x, ecol = "Ms1.Area")
Loading data as a 'SingleCellExperiment' object
Splitting data based on 'File.Name'
Formatting sample metadata (colData)
Formatting data as a 'QFeatures' object
An instance of class QFeatures containing 24 assays:
 [1] RD139_Overlap_UPS1_0_1fmol_inj1.mzML: SummarizedExperiment with 28980 rows and 1 columns 
 [2] RD139_Overlap_UPS1_0_1fmol_inj2.mzML: SummarizedExperiment with 29495 rows and 1 columns 
 [3] RD139_Overlap_UPS1_0_1fmol_inj3.mzML: SummarizedExperiment with 29210 rows and 1 columns 
 ...
 [22] RD139_Overlap_UPS1_5fmol_inj1.mzML: SummarizedExperiment with 30941 rows and 1 columns 
 [23] RD139_Overlap_UPS1_5fmol_inj2.mzML: SummarizedExperiment with 30321 rows and 1 columns 
 [24] RD139_Overlap_UPS1_5fmol_inj3.mzML: SummarizedExperiment with 24168 rows and 1 columns 
``

lgatto commented 9 months ago

Still need to transfer scp::readSCP() to QFeatures.

@cvanderaa , what do you think:

Should it completely replace readQFeatures()?
readQFeatures() is very easy to use, and works in quite a few cases.
We could consider merging both, but the signature might become a bit complicated, with different arguments matching different uses.
If we want to keep both functions, how to name the new function in QFeatures?

lgatto commented 9 months ago

Ignore the silly implementation in the comment above, but the example still hold.

change readSCP() so that it produces SE. The only change required is to use readSummarizedExperiment() instead of readSingleCellExperiment(). (I will also move and rename .splitSCE() to .splitSE().
This new function can then be used by readQFeaturesFromDIANN() to generate a QFeatures object with SEs. readSCPfromDIANN() would then use that one, and convert elements to SCE objects.

As for the comment above, my preference is to merge readQFeatures() and readSCP() to have both behaviours. The best might even to have methods that dispatch on two arguments, the first one containing a data.frame in both cases (could also be a character with a file name, at least in the simple, current readQFeatures() case), and either a data.frame (current readSCP()), or a vector (current readQFeatures()).

setGeneric("readFOO",
           function(x, y, ...)
               standardGeneric("readFOO"))

setMethod("readFOO",
          c("data.frame", "data.frame"),
          function(x, y)
              message("two dfs"))

setMethod("readFOO",
          c("data.frame", "vector"),
          function(x, y)
              message("one df and one vector"))

setMethod("readFOO",
          c("character", "vector"),
          function(x, y)
              message("one character and one vector"))

readFOO(data.frame(), data.frame())

readFOO(data.frame(), character())
readFOO(data.frame(), numeric())
readFOO(data.frame(), integer())
readFOO(data.frame(), logical())

readFOO(character(), character())

lgatto commented 9 months ago

What annoys me though with this is the naming of the arguments. Often, the names in a generic function are generic, such as object or x and y. The problem is that the first argument refers to the actual data (either as a data.frame, or as the file that contains it), and the second one refers to either the vector ecols or the data.frame colData.

We could simplify this by only supporting a data.frame like object for the first argument. That first argument is currently called table (single-assay case) and featureData (multi-assay case). This argument could be called x for generality, but given that the second one should be more explicit than y(see next point), featureData or assayData would also work.
The second argument is called ecol (single-assay case) and colData (multi-assay case). Instead of calling it y, it could be called colAnnotation, as it either annotates the columns to be used to populate the assay, or the colData.

lgatto commented 9 months ago

Last point, I will also add a ecol argument for backwards compatibility in the single-set case.

lgatto commented 9 months ago

Work in progress in branch issue199: https://github.com/rformassspectrometry/QFeatures/tree/issue199

lgatto commented 8 months ago

Merged, but see here for more todos

[x] names = "psms" is not a good idea (eg "precursor" is more relevant to DIA), I suggest to use "quants" instead, but I'm open to discussion.
[x] I thoroughly unit tested readQFeatures, but we could also unit test the new internal functions for safety
[x] Documentation (man + vignette) still needs to be adapted
[x] readQFeaturesFromDIANN still needs to be refactored, that is create a function that converts DIA-NN report tables in the expected assayData tables.

lgatto commented 8 months ago

@cvanderaa - could you review the readQFeatures man page, when you have some time.

rformassspectrometry / QFeatures

Tabular data input #199