rformassspectrometry / QFeatures

Quantitative features for mass spectrometry data
https://RforMassSpectrometry.github.io/QFeatures/
25 stars 7 forks source link

Feature request: aggregate by sample (i.e. column) #188

Open csdaw opened 1 year ago

csdaw commented 1 year ago

Hi there,

I'm not sure if this functionality already exists, but I'd like to propose a function aggregateSamples() which is companion to aggregateFeatures(), and would do essentially the same thing but would combine columns instead of rows. I'm currently using this to calculate log2 SILAC ratios, but it would have other general uses such as calculating row means.

Thanks!

Functions

suppressPackageStartupMessages({
  library(SummarizedExperiment)
  library(QFeatures)
})

# Define 'column-wise' version of MsCoreUtils::aggregate_by_vector()
aggregate_cols_by_vector <- function(x, INDEX, FUN, ...) {
  ## Input checks go here

  FUN <- match.fun(FUN)

  res <- lapply(
    split(seq_len(ncol(x)), INDEX),
    FUN = function(j) FUN(x[, j, drop = FALSE], ...)
  )
  nms <- names(res)
  res <- do.call(cbind, res)
  colnames(res) <- nms
  rownames(res) <- rownames(x)

  ## HDF5Matrix check goes here
  res
}

# .aggregateSamples, essentally a 'column-wise' version of .aggregateQFeatures
.aggregateSamples <- function(object, scol, fun, ...) {
  if (missing(scol))
    stop("'scol' is required")
  m <- assay(object, 1)
  cd <- colData(object)
  if (!scol %in% names(cd))
    stop("'scol' not found in the assay's colData.")
  groupBy <- cd[[scol]]

  ## Store class of assay i in case it is not a SummarizedExperiment
  ## so that the aggregated assay can be reverted to that class
  .class <- class(object)

  ## Message about NA values in quant/col data goes here...

  if (is.vector(groupBy)) {
    aggregated_assay <- aggregate_cols_by_vector(m, groupBy, fun, ...)
    aggcount_assay <- aggregate_cols_by_vector(m, groupBy, MatrixGenerics::rowCounts)
    aggregated_coldata <- QFeatures::reduceDataFrame(cd, cd[[scol]],
                                                     simplify = TRUE,
                                                     drop = TRUE,
                                                     count = TRUE)
    assays <- SimpleList(assay = aggregated_assay, aggcounts = aggcount_assay)
    coldata <- aggregated_coldata[colnames(aggregated_assay), , drop = FALSE]
  } else {
    stop("'scol' must be a vector")
  }
  se <- SummarizedExperiment(assays = assays,
                             colData = coldata,
                             rowData = rowData(object))

  ## SummarizedExperiment class check goes here...

  return(se)
}

# Define generic function and QFeatures method
setGeneric("aggregateSamples", function(object, ...) standardGeneric("aggregateSamples"))
#> [1] "aggregateSamples"

setMethod("aggregateSamples", "QFeatures",
          function(object, i, scol, name = "newAssay",
                   fun = base::rowSums, ...) {
            if (isEmpty(object))
              return(object)
            ## Check arguments
            if (any(present <- name %in% names(object)))
              stop("There's already one or more assays named: '",
                   paste0(name[present], collapse = "', '"), "'.")
            i <- QFeatures:::.normIndex(object, i)
            if (length(i) != length(name)) stop("'i' and 'name' must have same length")
            if (length(scol) == 1) scol <- rep(scol, length(i))
            if (length(i) != length(scol)) stop("'ii and 'scol' must have same length")
            ## Aggregate each assay
            for (j in seq_along(i)) {
              from <- i[[j]]
              to <- name[[j]]
              by <- scol[[j]]

              ## Create the aggregated assay
              aggAssay <- .aggregateSamples(object[[from]],
                                            by, fun, ...)
              ## Add the assay to the QFeatures object
              object <- addAssay(object, aggAssay, name = to)
              ## Link the input assay to the aggregated assay
              ## No new rows created, therefore linking is automatic!
            }
            object
          })

Usage example


# Make example SummarizedExperiment
m1 <- matrix(
  rep(1:4, each = 5),
  ncol = 4
)

sample_names <- c("c1", "c2", "t1", "t2")
feature_names <- paste0("pep", 1:5)

colnames(m1) <- sample_names
rownames(m1) <- feature_names

cd1 <- data.frame(
  treatment = rep(c("control", "treatment"), each = 2)
)

rownames(cd1) <- sample_names

se1 <- SummarizedExperiment(m1, colData = cd1)

assay(se1)
#>      c1 c2 t1 t2
#> pep1  1  2  3  4
#> pep2  1  2  3  4
#> pep3  1  2  3  4
#> pep4  1  2  3  4
#> pep5  1  2  3  4
colData(se1)
#> DataFrame with 4 rows and 1 column
#>      treatment
#>    <character>
#> c1     control
#> c2     control
#> t1   treatment
#> t2   treatment

# Make example QFeatures
qf1 <- QFeatures(list(peptides = se1))

qf1
#> An instance of class QFeatures containing 1 assays:
#>  [1] peptides: SummarizedExperiment with 5 rows and 4 columns

# Aggregate samples
qf1 <- aggregateSamples(qf1, i = "peptides", scol = "treatment", name = "sums", fun = rowSums)

qf1
#> An instance of class QFeatures containing 2 assays:
#>  [1] peptides: SummarizedExperiment with 5 rows and 4 columns 
#>  [2] sums: SummarizedExperiment with 5 rows and 2 columns
assay(qf1[['sums']])
#>      control treatment
#> pep1       3         7
#> pep2       3         7
#> pep3       3         7
#> pep4       3         7
#> pep5       3         7
lgatto commented 1 year ago

Hi Charlotte.

Thank you for the feature request - I indeed see some utility in the request, even though I am not sure that the current suggestion/implementation is adapted to QFeatures instances:

  1. One fundamental feature of the aggregateFeatures() function and the whole QFeatures class is that the link between features are recorded.
  2. After calling aggregateFeatures(), we still have features (precursors, peptides, proteins), and don't fundamentally change the nature of these rows.

Both of these rules would be broken with your request:

  1. There is no tracking along the columns (i.e. AssayLink)
  2. We don't have samples anymore.

The second point is merely a change in semantics (that would need some reflection), but the former seems essential to comply with the spirit of the package. It would be possible to implement (but probably not trivial), but we don't have the time at the moment and given point 2, I am not convinced it is advised (without more motivated use cases, at least).

However, in the light of the discussion above, I think it would make sense to add your request to the packge and to return a SummarizedExperiment, and leave it to the user to decide what to do with it, i.e. either continue with the SE, or start a new QFeatures object with a new set of columns/(aggregated) samples.

-> added to the TODO list.

csdaw commented 1 year ago

Thank you for taking a look! On your points:

  1. I agree it would be more time/trouble than its worth to extend AssayLink to work on columns as well.

  2. I just wasn't sure there is a standard way to refer to 'columns in an assay' so I defaulted to 'sample' which seems to be what is used in MultiAssayExperiment docs.

I hope to have some time to submit a PR once I've wrapped up experiments.

cvanderaa commented 5 months ago

Hello @csdaw,

Thanks a lot for this concrete suggestion and implementation. I wanted you to know that I couldn't find the time to look into it yet, but this request is very relevant and I did not forget it.

To complement the discussion, here are two additional use cases where aggregating samples would come very handy:

  1. pseudo-bulking single-cell data, aggregate multiple cells within a cluster (eg cell type) within a patient (or other biological replication). This could facilitate assessing patient-level contrasts from single-cell data.
  2. Aggregate technical replicates (eg same sample acquired twice through LC-MS) into a single observation.

I'll be working on a project to evaluate the performance 1., so I plan to have a look at this issue soon.

csdaw commented 5 months ago

Thanks @cvanderaa, good to hear there are other applications this could facilitate. I'm happy to test anything you come up with if that helps.