rformassspectrometry / QFeatures

Quantitative features for mass spectrometry data
https://RforMassSpectrometry.github.io/QFeatures/
25 stars 7 forks source link

`runCol` argument in `readQFeatures()` #216

Open samgregoire opened 4 months ago

samgregoire commented 4 months ago

Hello,

I have an issue / question regarding the runCol argument from the readQFeatures() function. My sample annotations are stored in a data.frame called meta. The column containing the name of the runs from which the samples come from is called fileID. I thought the purpose of the runCol argument was to identify the column where the names of the runs are stored. However, when I set runCol = "fileID", I get this error saying that the 'colData' must contain a column called 'runCol'.

> readQFeatures(assayData = psm,
+               colData = meta,
+               runCol = "fileID",
+               quantCols = "quant",
+               sep = "_",
+               verbose = TRUE)
Checking arguments.
Error in .checkRunCol(assayData, colData, runCol) : 
  When 'runCol' is not NULL, 'colData' must contain a column called 'runCol'.

If I change the column name in my sample annotations from "fileID" to "runCol", everything works and my colData are exactly as I want them to be.

> colnames(meta)[colnames(meta) == "fileID"] <- "runCol"
> readQFeatures(assayData = psm,
+               colData = meta,
+               runCol = "fileID",
+               quantCols = "quant",
+               sep = "_",
+               verbose = TRUE)
Checking arguments.
Loading data as a 'SummarizedExperiment' object.
Splitting data in runs.
Formatting sample annotations (colData).
Formatting data as a 'QFeatures' object.
An instance of class QFeatures containing 215 assays:

Since there is a runCol argument, shouldn't I be free to name my "run" column whatever I want in the metadata as long as I indicate it through the said runCol argument? With the current implementation of the readQFeatures() function, it seems that I need the column to be imperatively named runCol.

From what I found, the QFeatures:::.checkRunCol() function is responsible for the error because if (!"runCol" %in% colnames(colData)) induces a stop.

function (assayData, colData, runCol) 
{
    if (is.null(runCol)) 
        return(NULL)
    if (length(runCol) > 1) 
        stop("'runCol' must contain the name of a single column ", 
            "in 'assayData'.")
    if (!runCol %in% colnames(assayData)) 
        stop("'", runCol, "' (provided as 'runCol') not found ", 
            "in 'assayData'.")
    runs <- assayData[[runCol]]
    if (!is.null(colData)) {
        if (!"runCol" %in% colnames(colData)) 
            stop("When 'runCol' is not NULL, 'colData' must ", 
                "contain a column called 'runCol'.")
        mis <- !runs %in% colData$runCol
        if (any(mis)) {
            warning("Some runs are missing in 'colData': ", paste0(unique(runs[mis]), 
                collapse = ", "))
        }
    }
    assayData[[runCol]]
}

I think that removing the quotation marks in the condition inducing the stop and using colData[[runCol]] instead of colData$runCol would solve this issue, unless of course that's how runCol is supposed to behave.

PS. The QFeatures:::.formatColData() function also uses the colData$runCol notation.

cvanderaa commented 4 months ago

(ping @leopoldguyot who raised the same issue to us)

When implementing this, I had in mind that we have to impose a "runCol" in the colData, but reading again the documentation, I get the confusion. In the param description:

runCol: For the multi-set case, a numeric(1) or character(1) pointing to the column of assayData (and colData, is set) that contains the runs/batches. Make sure that the column name in both tables are identical and syntactically valid (if you supply a character) or have the same index (if you supply a numeric). Note that characters are converted to syntactically valid names using make.names

Indeed, the doc mentions here that runCol must be the same across the two tables, but if the argument runCol is something else than "runCol", then the two table (in the current implementation) cannot share the same columns...

Later in the Details section:

Multi-set case: the colData must contain a column named quantCols that provides the names of the columns in assayData with the quantitative values for each sample, and a column named runCol that provides the MS runs/batches in which each sample has been acquired. The entries in colData[["runCol"]] are matched against the entries provided by assayData[[runCol]].

The first sentence mentions runCol as a variable (so can be any character) while the second sentence indicates that the column named "runCol" is matched, so here the name is imposed.

I agree this is super confusing and must be solved indeed, 2 solutions:

  1. As you mention, adapt .checkRunCol() so to allow that runCol points to a column name that is shared across the two tables. For instance, if runCol == "FileID, then readQFeatures() should expect that assayData and colData both contain a column named FileID.
  2. Fix the documentation to clarify that colData must contain a column named "runCol"

After discussion with Léopold and now your issue, you have convinced me of going for 1., but @lgatto what's your opinion?