Closed lgatto closed 7 months ago
@cvanderaa - I plan to move scp::readSCPfromDIANN()
to QFeatures
and use the latest data in MsDataHub
to show its working. What about renaming it QFeatures::readQFeaturesFromDIANN()
and keep a copy as scp::readSCPfromDIANN()
pointing to the former's man page.
Edit: there should be one difference between QFeatures::readQFeaturesFromDIANN()
and scp::readSCPfromDIANN()
though - the former produces a QFeatures
of SummarizedExperiment
objects, the latter of SingleCellExperiment
, which might require some refactoring.
Here's a proposed hack, with minimal refactoring, to illustrate the desired behaviour (shown in the next comment). See this comment for a description of what needs to be done for a proper implementation.
## in QFeatures - original readSCPfromDIANN implementation
.readQFeaturesFromDIANN <-
function (colData, reportData, extractedData = NULL,
ecol = "MS1.Area", multiplexing = "none", ...) {
diannReportCols <- c("File.Name", "Precursor.Id", "Modified.Sequence")
if (!all(diannReportCols %in% colnames(reportData)))
stop("'reportData' is not an expected DIA-NN report table ",
"output. This function expects the main output file as ",
"described here: https://github.com/vdemichev/DiaNN#main-output-reference")
if (!ecol %in% colnames(reportData))
stop("'", ecol, "' not found in 'reportData'")
if (!"File.Name" %in% colnames(colData))
stop("'colData' must contain a column named 'File.Name' that provides ",
"a link to the 'File.Name' column in 'reportData'")
if (multiplexing == "none" && !is.null(extractedData))
stop("Providing 'extractedData' for label-free experiments ",
"('multiplexed == \"none\"') is not expected. Raise an ",
"issue if you need this feature: ",
"https://github.com/UCLouvain-CBIO/scp/issues/new/choose")
args <- list(...)
if (multiplexing == "mTRAQ") {
reportData$Label <- sub("^.*[Q-](\\d).*$", "\\1", reportData$Modified.Sequence)
reportData$Precursor.Id <- gsub("\\(mTRAQ.*?\\)", "(mTRAQ)",
reportData$Precursor.Id)
args$sep <- "."
if (!"Label" %in% colnames(colData))
stop("'colData' must contain a column named 'Label' that ",
"provides the mTRAQ reagent used to label the ",
"samples and/or single cells.")
if (any(mis <- !colData$Label %in% reportData$Label)) {
stop("Some labels from 'colData$Label' were not found as",
"part of the mTRAQ labels found in ", "'reportData$Modified.Sequence': ",
paste0(unique(colData$Label[mis]), collapse = ", "))
}
nIds <- length(unique(paste0(reportData$Precursor.Id,
reportData$File.Name)))
nLevels <- sapply(colnames(reportData), function(x) {
nrow(unique(reportData[, c("Precursor.Id", "File.Name",
x)]))
})
idCols <- names(nLevels)[nLevels == nIds]
reportData <- pivot_wider(reportData, id_cols = all_of(idCols),
names_from = "Label", values_from = ecol)
}
else if (multiplexing == "none") {
colData$Label <- ecol
args$sep <- ""
args$suffix <- ""
}
else {
stop("The '", multiplexing, "' multiplexing strategy is not ",
"implemented. Raise an issue if you need this feature: ",
"https://github.com/UCLouvain-CBIO/scp/issues/new/choose")
}
out <- do.call(readSCP, c(args, list(featureData = reportData,
colData = colData, batchCol = "File.Name", channelCol = "Label")))
if (!is.null(extractedData)) {
labs <- unique(colData$Label)
quantCols <- grep(paste0("[", paste0(labs, collapse = ""),
"]$"), colnames(extractedData))
extractedData <- readSingleCellExperiment(extractedData,
ecol = quantCols, fnames = "Precursor.Id")
cnames <- unique(unlist(colnames(out)))
if (any(mis <- !cnames %in% colnames(extractedData)))
stop("Some columns present in reportData are not found in ",
"extracted data", paste0(cnames[mis], collapse = ", "),
"\nAre you sure the two tables were generated from ",
"the same experiment?")
extractedData <- extractedData[, cnames]
anames <- names(out)
out <- addAssay(out, extractedData, name = "Ms1Extracted")
out <- addAssayLink(out, from = anames, to = "Ms1Extracted",
varFrom = rep("Precursor.Id", length(anames)), varTo = "Precursor.Id")
}
out
}
readQFeaturesFromDIANN <-
function (colData, reportData, extractedData = NULL,
ecol = "MS1.Area", multiplexing = "none", ...) {
ans <- .readQFeaturesFromDIANN(colData, reportData, extractedData,
ecol, multiplexing, ...)
## The .readQFeaturesFromDIANN() function initially lived as
## readSCPfromDIANN() in the scp package, and hence returns
## SingleCellExperiment elements. We thus need to convert the
## assays to SummarizedExperiment instances.
el <- ExperimentList(lapply(experiments(ans),
as, "SummarizedExperiment"))
experiments(ans) <- el
if (validObject(ans)) ans
}
## in scp
readSCPfromDIANN <-
function (colData, reportData, extractedData = NULL,
ecol = "MS1.Area", multiplexing = "none", ...)
.readQFeaturesFromDIANN(colData, reportData, extractedData,
ecol, multiplexing, ...)
And here's an example
> x <- read.delim(benchmarkingDIA.tsv()) ## DIA data
see ?MsDataHub and browseVignettes('MsDataHub') for documentation
loading from cache
> ## fix file names
> x[[1]] <- sub("^.+raw-data\\\\", "", x[[1]])
> cd <- data.frame(File.Name = unique(x[[1]]))
> readSCPfromDIANN(colData = cd, reportData = x, ecol = "Ms1.Area")
Loading data as a 'SingleCellExperiment' object
Splitting data based on 'File.Name'
Formatting sample metadata (colData)
Formatting data as a 'QFeatures' object
An instance of class QFeatures containing 24 assays:
[1] RD139_Overlap_UPS1_0_1fmol_inj1.mzML: SingleCellExperiment with 28980 rows and 1 columns
[2] RD139_Overlap_UPS1_0_1fmol_inj2.mzML: SingleCellExperiment with 29495 rows and 1 columns
[3] RD139_Overlap_UPS1_0_1fmol_inj3.mzML: SingleCellExperiment with 29210 rows and 1 columns
...
[22] RD139_Overlap_UPS1_5fmol_inj1.mzML: SingleCellExperiment with 30941 rows and 1 columns
[23] RD139_Overlap_UPS1_5fmol_inj2.mzML: SingleCellExperiment with 30321 rows and 1 columns
[24] RD139_Overlap_UPS1_5fmol_inj3.mzML: SingleCellExperiment with 24168 rows and 1 columns
> readQFeaturesFromDIANN(colData = cd, reportData = x, ecol = "Ms1.Area")
readQFeaturesFromDIANN(colData = cd, reportData = x, ecol = "Ms1.Area")
Loading data as a 'SingleCellExperiment' object
Splitting data based on 'File.Name'
Formatting sample metadata (colData)
Formatting data as a 'QFeatures' object
An instance of class QFeatures containing 24 assays:
[1] RD139_Overlap_UPS1_0_1fmol_inj1.mzML: SummarizedExperiment with 28980 rows and 1 columns
[2] RD139_Overlap_UPS1_0_1fmol_inj2.mzML: SummarizedExperiment with 29495 rows and 1 columns
[3] RD139_Overlap_UPS1_0_1fmol_inj3.mzML: SummarizedExperiment with 29210 rows and 1 columns
...
[22] RD139_Overlap_UPS1_5fmol_inj1.mzML: SummarizedExperiment with 30941 rows and 1 columns
[23] RD139_Overlap_UPS1_5fmol_inj2.mzML: SummarizedExperiment with 30321 rows and 1 columns
[24] RD139_Overlap_UPS1_5fmol_inj3.mzML: SummarizedExperiment with 24168 rows and 1 columns
``
Still need to transfer scp::readSCP()
to QFeatures
.
@cvanderaa , what do you think:
readQFeatures()
? readQFeatures()
is very easy to use, and works in quite a few cases. QFeatures
? Ignore the silly implementation in the comment above, but the example still hold.
readSCP()
so that it produces SE. The only change required is to use readSummarizedExperiment()
instead of readSingleCellExperiment()
. (I will also move and rename .splitSCE()
to .splitSE()
.readQFeaturesFromDIANN()
to generate a QFeatures
object with SEs. readSCPfromDIANN()
would then use that one, and convert elements to SCE objects.As for the comment above, my preference is to merge readQFeatures()
and readSCP()
to have both behaviours. The best might even to have methods that dispatch on two arguments, the first one containing a data.frame in both
cases (could also be a character with a file name, at least in the simple, current readQFeatures()
case), and either a data.frame (current readSCP()
), or a vector (current readQFeatures()
).
setGeneric("readFOO",
function(x, y, ...)
standardGeneric("readFOO"))
setMethod("readFOO",
c("data.frame", "data.frame"),
function(x, y)
message("two dfs"))
setMethod("readFOO",
c("data.frame", "vector"),
function(x, y)
message("one df and one vector"))
setMethod("readFOO",
c("character", "vector"),
function(x, y)
message("one character and one vector"))
readFOO(data.frame(), data.frame())
readFOO(data.frame(), character())
readFOO(data.frame(), numeric())
readFOO(data.frame(), integer())
readFOO(data.frame(), logical())
readFOO(character(), character())
What annoys me though with this is the naming of the arguments. Often, the names in a generic function are generic, such as object
or x
and y
. The problem is that the first argument refers to the actual data (either as a data.frame, or as the file that contains it), and the second one refers to either the vector ecols
or the data.frame colData
.
We could simplify this by only supporting a data.frame
like object for the first argument. That first argument is currently called table
(single-assay case) and featureData
(multi-assay case). This argument could be called x
for generality, but given that the second one should be more explicit than y
(see next point), featureData
or assayData
would also work.
The second argument is called ecol
(single-assay case) and colData
(multi-assay case). Instead of calling it y
, it could be called colAnnotation
, as it either annotates the columns to be used to populate the assay, or the colData.
Last point, I will also add a ecol
argument for backwards compatibility in the single-set case.
Work in progress in branch issue199: https://github.com/rformassspectrometry/QFeatures/tree/issue199
Merged, but see here for more todos
readQFeaturesFromDIANN
still needs to be refactored, that is create a function that converts DIA-NN report tables in the expected assayData tables. @cvanderaa - could you review the readQFeatures man page, when you have some time.
QFeatures::readQFeatures()
QFeatures::readSummarizedExperiment()
scp::readSCP()
scp::readSCPfromDIANN()
This should ideally be harmonised, possibly all moved to
QFeatures
(with more standard names), to avoid multiple re-implementations.