add a function to average columns of a SummarizedExperiment

jorainer commented 7 months ago

In some cases we might want to average (technical) replicates (columns) of a SummarizedExperiment. Why a SummarizedExperiment:

in addition to the abundance matrix it provides also information on samples
it is a container for the preprocessing results - averaging replicates during preprocessing does not make much sense.

The function (method?) should average all assays and also update the colData after reducing/combining the data of the technical replicates.

A template for the function could be the averageSE() function, but I would change the name (maybe to averageColumns()?) and maybe see if we could improve it a bit? The documentation definitely needs to be improved...

Also, to avoid adding the SummarizedExperiment package as a dependency to MetaboCoreUtils we should rather use requireNamespace() and call the SummarizedExperiment-specific functions with SummarizedExperiment::.
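A minimal sketch of the soft-dependency pattern described above. The helper names `.check_se_available()` and `assay_names()` are hypothetical and only illustrate the idea; `requireNamespace()` and `SummarizedExperiment::assayNames()` are existing functions.

```r
# Hypothetical helper: fail with a clear message if the optional
# package is not installed, instead of listing it in Imports/Depends.
.check_se_available <- function() {
    if (!requireNamespace("SummarizedExperiment", quietly = TRUE))
        stop("Package 'SummarizedExperiment' is required for this function.")
    invisible(TRUE)
}

# Hypothetical wrapper: SummarizedExperiment functions are always
# called with the explicit `SummarizedExperiment::` prefix.
assay_names <- function(x) {
    .check_se_available()
    SummarizedExperiment::assayNames(x)
}
```

With this pattern the package would typically be listed under Suggests rather than Imports.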

lgatto commented 7 months ago

This is also something that is brewing in QFeatures. We already have aggregateFeatures() for rows, but want something similar for pseudo-bulking.

The name averageSE() isn't a good fit IMHO, way too general: it would be better to make an explicit reference to columns, and there might be other options than averaging.
In QFeatures, we will implement methods for SummarizedExperiment and QFeatures objects.
It is important to keep a consistent API (see here, something we should discuss more widely for RforMS) - in this case, make sure the signature/arguments are consistent with aggregateFeatures().
There are also probably some code-reuse and refactoring opportunities with aggregateFeatures().

I would very strongly suggest adding this to QFeatures. Any low-level function that operates on the matrix could end up in MsCoreUtils.

Also adding @cvanderaa to the discussion.

cvanderaa commented 7 months ago

Indeed, you can find an initial discussion on the matter here: https://github.com/rformassspectrometry/QFeatures/issues/188

jorainer commented 7 months ago

yeeees, just discussed also with @philouail - and we agreed that it would be better to have that in a more high-level package instead of MetaboCoreUtils! So 100% for adding this to QFeatures.

Also, I totally agree that we need to change the name. Maybe aggregateColumns() to use a similar naming?

What I would like from this method/function:

allow specifying which function to use for aggregating (mean, median etc.)
in addition to reporting the aggregated values, also add the possibility to report something on the variance of the aggregated values (sd, mad...)
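The two wishes above can be sketched at the matrix level in base R. This is only an illustration under stated assumptions: `aggregate_columns_matrix()` is a hypothetical name, not an existing QFeatures or MsCoreUtils function, and columns whose grouping value is `NA` are simply dropped.

```r
# Hypothetical low-level helper: aggregate the columns of a numeric
# matrix `x` within the groups defined by `f`, using `fun`.
aggregate_columns_matrix <- function(x, f, fun = mean, ...) {
    stopifnot(length(f) == ncol(x))
    # split() drops columns with NA in `f`; factor() drops unused levels
    groups <- split(seq_len(ncol(x)), factor(f))
    res <- vapply(groups, function(idx)
        apply(x[, idx, drop = FALSE], 1, fun, ...),
        numeric(nrow(x)))
    matrix(res, nrow = nrow(x),
           dimnames = list(rownames(x), names(groups)))
}

# Toy data: two samples with two technical replicates each, plus a blank
x <- matrix(c(1, 3, 2, 4, 10, 30, 20, 40, 5, 5), nrow = 2,
            dimnames = list(c("ft1", "ft2"),
                            c("a_1", "a_2", "b_1", "b_2", "blank")))
f <- c("a", "a", "b", "b", NA)                    # NA excludes the blank

avg <- aggregate_columns_matrix(x, f, fun = mean) # aggregated values
spread <- aggregate_columns_matrix(x, f, fun = sd) # variability per group
```

Both results have the same dimensions (features by aggregated samples), which is why the variability could be stored as an additional assay next to the aggregated values.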

jorainer commented 7 months ago

I'll move this issue to QFeatures then.

lgatto commented 7 months ago

The name aggregateColumns() looks good to me.
Yes, the fun argument allows defining the aggregation function in aggregateFeatures().
aggregateFeatures() stores the number of features that were aggregated in an assay called aggcounts, alongside the aggregated assay data. The same could be done for other summary statistics.

jorainer commented 7 months ago

I don't know what you originally thought of this function, but in my use cases I would use it like this:

f <- factor(se$sample_id)  # sample_id is the unique identifier for biological samples; we have technical replicates of the same sample
f[se$sample_id == "blank"] <- NA  # we want to exclude/drop blanks here; no need to aggregate them
f <- droplevels(f)
se_avg <- aggregateColumns(se, f = f, aggregate = mean, variance = sd)

So, the se_avg would be a QFeatures/SummarizedExperiment where the assays() contain the aggregated assays from the original se. And in addition, it could have, for each assay, one assay with information on the variance - since it would have the same dimension.

I don't think it's required to then keep everything in the same QFeatures/SummarizedExperiment object. The result I would like to have is simply a QFeatures/SummarizedExperiment with the aggregated data, for which the colData() was also aggregated. IMHO keeping the original and aggregated data within the same object would make the object unnecessarily complicated - if we need that, why not add that functionality to the MsExperiment? We could have several QFeatures within that object, i.e. the original and the aggregated...

jorainer commented 7 months ago

@cvanderaa , @lgatto , could you maybe explain how you envisioned the function? and/or maybe some use case(s)? We could then agree on a consensus :)

cvanderaa commented 7 months ago

In my opinion, aggregateColumns() should be as close as possible to aggregateFeatures(), since they conceptually should do the same thing except on another margin (columns and rows, respectively). If you agree on this, then:

There should be an aggregateColumns() method that takes a QFeatures object and returns a QFeatures object, and an aggregateColumns() method that takes a SummarizedExperiment object and returns a SummarizedExperiment object.
The signature of aggregateColumns() should be consistent with aggregateFeatures(), that is: aggregateColumns(QFobject, i, fcol, name, fun, ...) and aggregateColumns(SEobject, fcol, fun, ...).

Regarding the functionality of providing more assays: as Laurent mentioned, aggregateFeatures() automatically adds an assay (called aggcounts) with the number of observations used to generate each aggregated data point. But you want more/other functions such as sd, mad (I could also think of the coefficient of variation, cv). One solution I could think of is to provide a new argument, e.g. called moreFuns, that contains a list of functions, with default: moreFuns = list(aggcounts = MsCoreUtils::colCounts). Each function would lead to adding an assay to the aggregated set, named after the list names.
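A rough base R sketch of the moreFuns idea, working on a plain matrix. Everything here is an assumption for illustration: `aggregate_with_extras()` is a hypothetical name, the result is a plain named list of matrices rather than assays of a SummarizedExperiment, and a simple non-missing counter stands in for the proposed `MsCoreUtils::colCounts` default.

```r
# Hypothetical sketch: the main aggregation plus one extra matrix per
# entry in `moreFuns`, each named after its list name.
aggregate_with_extras <- function(
    x, f, fun = mean,
    moreFuns = list(aggcounts = function(v) sum(!is.na(v)))) {
    stopifnot(length(f) == ncol(x))
    groups <- split(seq_len(ncol(x)), factor(f))
    # Apply one summary function per feature within each column group
    one <- function(fn) {
        res <- vapply(groups, function(idx)
            apply(x[, idx, drop = FALSE], 1, fn),
            numeric(nrow(x)))
        matrix(res, nrow = nrow(x),
               dimnames = list(rownames(x), names(groups)))
    }
    c(list(aggregated = one(fun)), lapply(moreFuns, one))
}

x <- matrix(c(1, NA, 3, 4, 5, 6), nrow = 2)   # columns: a, a, b
res <- aggregate_with_extras(
    x, f = c("a", "a", "b"),
    fun = function(v) mean(v, na.rm = TRUE),
    moreFuns = list(aggcounts = function(v) sum(!is.na(v)),
                    sd = function(v) sd(v, na.rm = TRUE)))
names(res)   # "aggregated" "aggcounts" "sd"
```

Returning all matrices with identical dimensions would make it straightforward to store them as assays of a single object, which is one possible answer to the retrieval question.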

Update: if we implement this additional argument, we need to think about how to retrieve the supplementary assays.

Regarding the colData, aggregated columns should lead to new names, hence to new rows in the colData. We could simply populate the new colData rows with the sample annotations that are consistent across the samples used for each aggregated sample. This would not be efficient (any annotation that is not consistent will be filled with NAs), but that is probably not a big problem. Also, the link between which samples correspond to which aggregated sample is not stored (unlike for features with AssayLinks). I think this is the major bottleneck, which (I think) would imply a major refactoring of colData management within a QFeatures. Therefore, at least as a first step, we could simply skip implementing an aggregateColumns() method that takes a QFeatures object and returns a QFeatures object, and focus only on SummarizedExperiment.
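The colData rule described above (keep annotations that are consistent within a group, fill the rest with NA) could look roughly like this in base R. This is a sketch under assumptions: `aggregate_coldata()` is a hypothetical name, and it works on a plain data.frame rather than a DataFrame.

```r
# Hypothetical sketch: one row per aggregated sample; a column keeps its
# value only if all replicates in the group agree, otherwise it is NA.
aggregate_coldata <- function(cd, f) {
    stopifnot(length(f) == nrow(cd))
    groups <- split(seq_len(nrow(cd)), factor(f))
    rows <- lapply(groups, function(idx) {
        block <- cd[idx, , drop = FALSE]
        out <- block[1L, , drop = FALSE]
        for (col in names(block)) {
            if (length(unique(block[[col]])) > 1L)
                is.na(out[[col]]) <- TRUE   # sets NA, keeps column type
        }
        out
    })
    res <- do.call(rbind, rows)
    rownames(res) <- names(groups)
    res
}

cd <- data.frame(sample_id = c("a", "a", "b", "b"),
                 injection = 1:4,
                 group = c("ctrl", "ctrl", "case", "case"),
                 row.names = c("a_1", "a_2", "b_1", "b_2"))
cd_agg <- aggregate_coldata(cd, cd$sample_id)
# `sample_id` and `group` are kept; `injection` differs between
# replicates and becomes NA
```

Note that this sketch, like the proposal, does not record which original samples were combined into each aggregated sample.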

jorainer commented 6 months ago

thanks for the explanation - sounds rather complex. maybe we should make a dev-call to discuss this "in person".

rformassspectrometry / QFeatures

add a function to average columns of a SummarizedExperiment #215