Open jorainer opened 6 months ago
This is also something that is brewing in QFeatures. We already have aggregateFeatures() for rows, but want something similar for pseudo-bulking.
averageSE() isn't a good fit IMHO, way too general: it would be better to make an explicit reference to columns, and there might be other options than averaging.
In QFeatures, we will implement methods for SummarizedExperiment and QFeatures
objects. aggregateFeatures()
. aggregateFeatures()
I would very strongly suggest adding this to QFeatures. Any low-level function that operates on the matrix could end up in MsCoreUtils.
Also adding @cvanderaa to the discussion.
Indeed, you can find an initial discussion on the matter here: https://github.com/rformassspectrometry/QFeatures/issues/188
yeeees, just discussed also with @philouail - and we agreed that it would be better to have that in a more high-level package instead of MetaboCoreUtils! So 100% for adding this to QFeatures.
Also, I totally agree that we need to change the name. Maybe aggregateColumns() to use a similar naming?
What I would like from this method/function:
a choice of the aggregation function (mean, median etc)
an estimate of the variance (sd, mad ...)
I'll move this issue to QFeatures then.
aggregateColumns() looks good to me.
The fun argument allows defining the aggregation function in aggregateFeatures().
aggregateFeatures() stores the number of features that were aggregated in an assay called aggcounts, alongside the aggregated assay data. The same could be done for other summary statistics.
I don't know what you originally thought of this function, but in my use cases I would use it like this:
f <- factor(se$sample_id) # sample_id is the unique identifier for biological samples; we have technical replicates of the same sample
f[se$sample_id == "blank"] <- NA # we want to exclude/drop blanks here; no need to aggregate them
f <- droplevels(f)
se_avg <- aggregateColumns(se, f = f, aggregate = mean, variance = sd)
So, the se_avg would be a QFeatures/SummarizedExperiment where the assays() contain the aggregated assays from the original se. And in addition, it could have, for each assay, one assay with information on the variance - since it would have the same dimension.
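A minimal base-R sketch of the matrix-level step behind such a call, assuming a plain numeric matrix instead of a SummarizedExperiment. The helper name aggregate_columns_matrix() is hypothetical and not part of QFeatures or MsCoreUtils; it only illustrates how columns with an NA grouping value (the blanks) would be dropped and how the mean and sd matrices end up with the same dimensions.

```r
## Hypothetical base-R helper (not an existing QFeatures/MsCoreUtils function):
## aggregate the columns of a numeric matrix `x` within the groups given by `f`.
aggregate_columns_matrix <- function(x, f, fun = mean, ...) {
    stopifnot(is.matrix(x), length(f) == ncol(x))
    f <- droplevels(as.factor(f))
    keep <- !is.na(f)                    # columns with an NA group are dropped
    idx <- split(which(keep), f[keep])   # column indices of each group
    res <- do.call(cbind, lapply(idx, function(j) {
        apply(x[, j, drop = FALSE], 1, fun, ...)
    }))
    rownames(res) <- rownames(x)
    res
}

## Toy data: two features, two samples with two replicates each, one blank.
x <- matrix(c(1, 3, 10, 12, 100,
              2, 4, 20, 22, 200),
            nrow = 2, byrow = TRUE,
            dimnames = list(c("ft1", "ft2"),
                            c("a_1", "a_2", "b_1", "b_2", "blank")))
f <- factor(c("a", "a", "b", "b", "blank"))
f[f == "blank"] <- NA                    # exclude the blank from aggregation

avg <- aggregate_columns_matrix(x, f, fun = mean)
sdv <- aggregate_columns_matrix(x, f, fun = sd)
```

Both avg and sdv have one column per remaining level of f, so they could be stored side by side as assays of the same object.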
I don't think it's required to keep everything in the same QFeatures/SummarizedExperiment object. The result I would like to have is then simply a QFeatures/SummarizedExperiment with the aggregated data - for which the colData() was also aggregated. IMHO keeping the original and aggregated data within the same object would make the object unnecessarily more complicated - if we need that, why not add that functionality to the MsExperiment? We could have several QFeatures within that object, i.e. the original and the aggregated...
@cvanderaa , @lgatto , could you maybe explain how you envisioned the function? and/or maybe some use case(s)? We could then agree on a consensus :)
In my opinion, aggregateColumns() should be as close as possible to aggregateFeatures() since they conceptually should do the same thing except using another margin (columns and rows, respectively). If you agree on this, then:
we need an aggregateColumns() method that takes a QFeatures object and returns a QFeatures object, and an aggregateColumns() method that takes a SummarizedExperiment object and returns a SummarizedExperiment object.
aggregateColumns() should be consistent with aggregateFeatures(), that is: aggregateColumns(QFobject, i, fcol, name, fun, ...) and aggregateColumns(SEobject, fcol, fun, ...).
Regarding the functionality of providing more assays, as Laurent mentioned, aggregateFeatures() automatically adds an assay (called aggcounts) with the number of observations used to generate each aggregated data point. But you want more/other functions such as sd, mad (I could also think of the coefficient of variation, cv). One solution I could think of is to provide a new argument, e.g. called moreFuns, that contains a list of functions, with default: moreFuns = list(aggcounts = MsCoreUtils::colCounts). Each function would lead to adding an assay to the aggregated set, named after the list names.
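A rough base-R sketch of how such a list argument could behave, assuming a plain matrix as input. The function aggregate_with_extras() is hypothetical, and the default counting function here is a simple stand-in written for this example, not MsCoreUtils::colCounts.

```r
## Hypothetical sketch: aggregate columns by `f` with `fun`, and return one
## extra matrix per entry of `moreFuns`, named after the list names.
aggregate_with_extras <- function(
    x, f, fun = mean,
    moreFuns = list(aggcounts = function(v) sum(!is.na(v)))) {
    stopifnot(is.matrix(x), length(f) == ncol(x), !is.null(names(moreFuns)))
    f <- droplevels(as.factor(f))
    keep <- !is.na(f)
    idx <- split(which(keep), f[keep])   # column indices of each group
    by_group <- function(fn) {
        res <- do.call(cbind, lapply(idx, function(j) {
            apply(x[, j, drop = FALSE], 1, fn)
        }))
        rownames(res) <- rownames(x)
        res
    }
    ## First element: the aggregated data; then one matrix per extra function.
    c(list(aggregated = by_group(fun)), lapply(moreFuns, by_group))
}

x <- matrix(c(1, 3, 10, NA,
              2, 4, 20, 22),
            nrow = 2, byrow = TRUE,
            dimnames = list(c("ft1", "ft2"), c("a_1", "a_2", "b_1", "b_2")))
f <- c("a", "a", "b", "b")

res <- aggregate_with_extras(
    x, f,
    fun = function(v) mean(v, na.rm = TRUE),
    moreFuns = list(aggcounts = function(v) sum(!is.na(v)),
                    sd = function(v) stats::sd(v, na.rm = TRUE)))
names(res)
```

Every element of the returned list has the same dimensions, which is what would allow each one to become a separate assay of the aggregated object.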
Update: if we implement this additional argument, we need to think about how to retrieve the supplementary assays.
Regarding the colData, aggregated columns should lead to new names, hence to new rows in the colData. We could therefore simply populate the new colData rows with the sample annotations that are consistent across the samples used for each aggregated sample. This would not be efficient (any annotation that is not consistent will be filled with NAs), but that is probably not a big problem. Also, the link between which samples correspond to which aggregated sample is not stored (unlike for features with AssayLinks). I think this is the major bottleneck, which (I think) would imply a major refactoring of colData management within a QFeatures. Therefore, at least as a first step, we could simply skip implementing an aggregateColumns() method that takes a QFeatures object and returns a QFeatures object, and focus only on SummarizedExperiment.
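The colData rule described above (keep an annotation only if it is consistent within an aggregated sample, otherwise fill with NA) can be sketched in base R on a plain data.frame. The helper aggregate_coldata() is hypothetical and only illustrates the rule; it does not address the missing link between original and aggregated samples.

```r
## Hypothetical sketch: collapse the rows of a sample annotation table `cd`
## within the groups defined by `f`. Annotations that differ within a group
## are set to NA; consistent ones are kept.
aggregate_coldata <- function(cd, f) {
    stopifnot(is.data.frame(cd), length(f) == nrow(cd))
    f <- droplevels(as.factor(f))
    keep <- !is.na(f)
    idx <- split(which(keep), f[keep])   # row indices of each group
    rows <- lapply(idx, function(i) {
        one <- cd[i[1], , drop = FALSE]  # start from the group's first row
        for (col in names(cd)) {
            if (length(unique(cd[[col]][i])) > 1L)
                one[[col]][1] <- NA      # inconsistent annotation
        }
        one
    })
    res <- do.call(rbind, unname(rows))
    rownames(res) <- names(idx)          # new sample names = group names
    res
}

cd <- data.frame(sample_id = c("a", "a", "b", "b"),
                 injection = 1:4,
                 group = c("ctrl", "ctrl", "case", "case"),
                 row.names = c("a_1", "a_2", "b_1", "b_2"),
                 stringsAsFactors = FALSE)
cd_agg <- aggregate_coldata(cd, cd$sample_id)
```

Here injection differs between replicates and becomes NA, while sample_id and group survive the aggregation.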
thanks for the explanation - sounds rather complex. maybe we should make a dev-call to discuss this "in person".
In some cases we might want to average (technical) replicates (columns) of a SummarizedExperiment. Why a SummarizedExperiment:
The function (method?) should average all assays and also update the colData after reducing/combining the data of the technical replicates.
A template for the function could be the averageSE() function, but I would change the name (maybe to averageColumns()?) and maybe see if we could improve it a bit? The documentation definitely needs to be improved...
Also, to avoid adding the SummarizedExperiment package as a dependency to MetaboCoreUtils we should rather use requireNamespace() and call the SummarizedExperiment-specific functions with SummarizedExperiment::.
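A small sketch of that optional-dependency pattern. The helper .require_pkg() and the function average_columns_se() are hypothetical names used only for illustration; the body of the latter just shows where fully qualified SummarizedExperiment:: calls would go.

```r
## Hypothetical helper: stop with a clear message if an optional package
## (one that is suggested rather than imported) is not installed.
.require_pkg <- function(pkg) {
    if (!requireNamespace(pkg, quietly = TRUE))
        stop("Package '", pkg, "' is required for this function. ",
             "Please install it first.", call. = FALSE)
    invisible(TRUE)
}

## Hypothetical function skeleton: check for the package at call time and
## use fully qualified calls, so the package is never attached or imported.
average_columns_se <- function(x) {
    .require_pkg("SummarizedExperiment")
    SummarizedExperiment::assayNames(x)  # placeholder for the real work
}

## The check passes for an installed package and fails for a missing one.
ok <- .require_pkg("stats")
msg <- tryCatch(.require_pkg("thisPackageDoesNotExist"),
                error = function(e) conditionMessage(e))
```

With this pattern the package would be listed as a suggested rather than a hard dependency, and users who never call the function would not need it installed.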