Design notes

sa-lee commented 5 years ago

@lawremi's notes moved over from the plyranges wiki

Design Goals

We have basically the same goals as for (G)Ranges: make SummarizedExperiment easier to use.

Data Structures

The trouble with SE is that it is a collection of tables, like a database, not just a simple single table. There is a lot of pressure to denormalize SE into a table, so that it folds into existing R infrastructure that operates on tables. Meanwhile, we do not want to discard the semantic notions around feature-by-sample assay data coupled with metadata on the features (rows) and samples (columns).

There is a real risk of breaking the constraints that support those semantics, and that risk is higher than when treating GRanges as a table. For example, while it is possible for someone to attempt to drop the "start" column from a GRanges, it should be fairly obvious that doing so will either fail or degrade the GRanges to a tibble. However, in the case of SE, a user could, for example, break the rectangularity of the assay matrix in subtle ways. How can we help the user avoid mistakes?
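To make the risk concrete, here is a minimal base R sketch (not plyexperiment code; the toy matrix and the `is_rectangular()` helper are made up for illustration). It denormalizes a small assay matrix into a long table and then applies an ordinary value-level row filter, which silently leaves a feature-by-sample cell missing.

```r
# Toy assay: 3 features x 2 samples (values are invented).
exprs <- matrix(1:6, nrow = 3,
                dimnames = list(feature = c("g1", "g2", "g3"),
                                sample  = c("s1", "s2")))

# Denormalize into one row per (feature, sample) pair.
long <- as.data.frame(as.table(exprs), responseName = "exprs")

# Hypothetical check: every feature-sample combination appears exactly once.
is_rectangular <- function(tbl) {
  all(table(tbl$feature, tbl$sample) == 1L)
}

# A plain table filter on assay values drops individual cells, not whole
# features or samples, so the result no longer maps back onto a matrix.
long_filtered <- long[long$exprs > 1, ]

is_rectangular(long)           # TRUE
is_rectangular(long_filtered)  # FALSE: the (g1, s1) cell is gone
```

Nothing in the table API warns about this; the constraint only exists in the data model.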

One way would be to force the user to be explicit about features and samples. Besides preserving data integrity, this would be beneficial in communicating semantics.

  filter_features(x, pathway=="Glycolysis")
  filter_samples(x, treatment=="A")

The semantics of SE are actually more general than features and samples. For example, the metadata accessors are =rowData()= and =colData()=, but it would be confusing to use those terms when they are inconsistent with the tabular view we are presenting.

The denormalization really becomes a problem when working with assay data. Let us assume the user wants to filter the features that have an average expression above some value. The assay values should then be implicitly grouped by feature; otherwise, the user needs to do more work and will make mistakes doing that work. If we stored the assay values in arrays (as they are now), then the user could do:

  filter_features(x, rowMeans(exprs) > 5)

But if we wanted to abstract away the array notion, we might have:

  filter_features(x, mean(exprs) > 5)

We have effectively grouped the assay data by feature, when filtering features. We could do the opposite when filtering samples, such as when restricting to samples where at least half of their values are non-zero:

  filter_samples(x, mean(exprs != 0L) >= 0.5)

When filtering by sample, we group by sample. When filtering by feature, we group by feature.
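The implicit grouping could be sketched in base R roughly as follows. This is only an illustration under stated assumptions, not how plyexperiment implements it: `toy_se()` and `eval_by()` are hypothetical helpers, and a plain list stands in for a real SummarizedExperiment. The filter expression is evaluated once per feature (or sample), with each assay bound to the values for that single feature (or sample).

```r
# Toy stand-in for an SE: named assay matrices plus one metadata data.frame
# per dimension. All names here are hypothetical.
toy_se <- function(assays, features, samples) {
  list(assays = assays, features = features, samples = samples)
}

# Evaluate `expr` once per feature (margin = 1) or per sample (margin = 2).
# Metadata columns are visible as scalars; each assay is visible as the
# vector of values belonging to that one feature or sample.
eval_by <- function(x, expr, margin, enclos) {
  meta <- if (margin == 1L) x$features else x$samples
  vapply(seq_len(nrow(meta)), function(i) {
    slices <- lapply(x$assays, function(a) if (margin == 1L) a[i, ] else a[, i])
    scope <- c(as.list(meta[i, , drop = FALSE]), slices)
    isTRUE(eval(expr, scope, enclos))
  }, logical(1))
}

filter_features <- function(x, cond) {
  keep <- eval_by(x, substitute(cond), 1L, parent.frame())
  x$assays <- lapply(x$assays, function(a) a[keep, , drop = FALSE])
  x$features <- x$features[keep, , drop = FALSE]
  x
}

filter_samples <- function(x, cond) {
  keep <- eval_by(x, substitute(cond), 2L, parent.frame())
  x$assays <- lapply(x$assays, function(a) a[, keep, drop = FALSE])
  x$samples <- x$samples[keep, , drop = FALSE]
  x
}

# Invented example data: 3 features x 3 samples.
exprs <- matrix(c(0, 8, 9,
                  0, 7, 2,
                  6, 9, 0), nrow = 3, byrow = TRUE,
                dimnames = list(c("g1", "g2", "g3"), c("s1", "s2", "s3")))
x <- toy_se(
  assays   = list(exprs = exprs),
  features = data.frame(pathway = c("Glycolysis", "TCA", "Glycolysis"),
                        row.names = rownames(exprs)),
  samples  = data.frame(treatment = c("A", "B", "A"),
                        row.names = colnames(exprs))
)

by_pathway <- filter_features(x, pathway == "Glycolysis")  # metadata only
by_feature <- filter_features(x, mean(exprs) > 5)          # mean per feature
by_sample  <- filter_samples(x, mean(exprs != 0) >= 0.5)   # fraction per sample
```

Because both verbs subset the assays and the matching metadata together, the result stays rectangular whatever the condition is.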

This is similar to the OLAP cube approach, where the user is ultimately interested in a denormalized table, but the data model preserves the structure in order to interpret high-level queries.

The current /RangedSummarizedExperiment/ API directly supports many range operations. Do we want to do the same, for convenience and consistency? Semantically, SE is a different beast than a GRanges: it is primarily experimental data, with metadata, some of which happen to be ranges. It is the data, not the ranges, that are primary. If we do not support direct range operations, then we will need accessors to get and set the ranges.

Construction

Ideally users do not have to directly construct these objects, because they are derived from instrument output, or curated resources. There really is no standard mechanism for communicating data corresponding to a SummarizedExperiment, but Bioconductor provides interfaces that map different data sources to SE objects.

Restriction

Users should be able to restrict by row or column. As indicated above, the assay values should probably be grouped for convenience.

High-level API:

Features :: =filter_features()=
Samples :: =filter_samples()=

Aggregation

We should support aggregation by feature (row) or sample (column). A common use case of feature aggregation is moving from transcripts to genes, or from genes to pathways. Sample aggregation generally happens through linear modeling, i.e., samples are converted to contrasts. There is nothing wrong with the samples in an SE corresponding to contrasts. But simple sample-level aggregations might also make sense, for example, over technical replicates.

We should support grouping and summarizing over either dimension:

=group_samples()=, =summarize_samples()=
=group_features()=, =summarize_features()=

If we explicitly group by feature (or sample), then the assay values should be implicitly and additionally grouped by sample (or feature). This lets the user write:
```
summarize_features(x, exprs=mean(exprs))
```
Instead of:
```
summarize_features(x, exprs=colMeans(exprs))
```
Or is that being too smart?
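One way to read the implicit grouping, sketched in base R under assumptions: `summarize_features_by()` is a hypothetical helper, a bare matrix stands in for the SE, and the transcript-to-gene mapping is invented. Within each feature group, the summary function is applied separately to each sample, so a scalar function like `mean()` yields one value per (group, sample) cell.

```r
# Invented assay: 4 transcripts x 2 samples, with a transcript-to-gene mapping.
exprs <- matrix(c(1, 3, 5, 7,
                  2, 4, 6, 8), nrow = 4,
                dimnames = list(paste0("tx", 1:4), c("s1", "s2")))
gene <- c("A", "A", "B", "B")

# Hypothetical helper: collapse rows by `group`, applying `fun` within each
# group separately for every sample (the implicit second grouping).
summarize_features_by <- function(assay, group, fun) {
  groups <- unique(group)
  rows <- lapply(groups, function(g) {
    block <- assay[group == g, , drop = FALSE]
    apply(block, 2, fun)  # one value per sample
  })
  out <- do.call(rbind, rows)
  dimnames(out) <- list(groups, colnames(assay))
  out
}

# What the user writes with implicit grouping: a plain scalar summary.
implicit <- summarize_features_by(exprs, gene, mean)

# What the user would otherwise spell out by hand with colMeans().
explicit <- rbind(A = colMeans(exprs[gene == "A", , drop = FALSE]),
                  B = colMeans(exprs[gene == "B", , drop = FALSE]))
```

Both forms give the same genes-by-samples matrix; the implicit form just hides the array bookkeeping from the user.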

Merging

We might want to merge:

Feature metadata
Sample metadata
Experiments as a whole

This suggests having join variants for each of those.
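As an illustration of what a feature-metadata join would need to guarantee, here is a small base R sketch. The `left_join_features()` name, the list standing in for an SE, and the annotation table are all assumptions made for this example. The point is that the join may add columns but must never add, drop, or reorder features, or the metadata would fall out of step with the assay rows.

```r
# Toy stand-in for an SE: one assay matrix plus feature metadata (invented).
x <- list(
  exprs    = matrix(1:6, nrow = 3,
                    dimnames = list(c("g1", "g2", "g3"), c("s1", "s2"))),
  features = data.frame(gene = c("g1", "g2", "g3"), stringsAsFactors = FALSE)
)

# External annotation, in a different order and missing g2.
annot <- data.frame(gene    = c("g3", "g1"),
                    pathway = c("TCA", "Glycolysis"),
                    stringsAsFactors = FALSE)

# Hypothetical left join on feature metadata. match() keeps the features in
# their original order and returns exactly one index (or NA) per feature.
left_join_features <- function(x, y, by) {
  if (anyDuplicated(y[[by]])) stop("join key must be unique in `y`")
  idx <- match(x$features[[by]], y[[by]])
  extra <- y[idx, setdiff(names(y), by), drop = FALSE]
  rownames(extra) <- NULL
  x$features <- cbind(x$features, extra)
  x
}

joined <- left_join_features(x, annot, by = "gene")
joined$features
```

Features without a match get `NA` rather than being dropped, which keeps the metadata at one row per assay row.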

Sorting

Sorting could happen in either dimension, suggesting:

=arrange_features()=
=arrange_samples()=
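A rough base R sketch of these two verbs follows. It assumes a plain list as a stand-in for the SE and takes the sort column as a string for simplicity; the real verbs would presumably accept unquoted expressions. Each verb reorders the assay and the matching metadata with the same permutation.

```r
# Toy stand-in for an SE (all values invented).
x <- list(
  exprs    = matrix(c(5, 1, 3, 6, 2, 4), nrow = 3,
                    dimnames = list(c("g1", "g2", "g3"), c("s1", "s2"))),
  features = data.frame(width = c(300, 100, 200),
                        row.names = c("g1", "g2", "g3")),
  samples  = data.frame(dose = c(10, 5), row.names = c("s1", "s2"))
)

# Sort features by a metadata column; assay rows move with their metadata.
arrange_features <- function(x, by) {
  ord <- order(x$features[[by]])
  x$exprs    <- x$exprs[ord, , drop = FALSE]
  x$features <- x$features[ord, , drop = FALSE]
  x
}

# Sort samples by a metadata column; assay columns move with their metadata.
arrange_samples <- function(x, by) {
  ord <- order(x$samples[[by]])
  x$exprs   <- x$exprs[, ord, drop = FALSE]
  x$samples <- x$samples[ord, , drop = FALSE]
  x
}

sorted <- arrange_samples(arrange_features(x, "width"), "dose")
```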

Accessors

SE consists of multiple components and it sometimes will be desirable to manipulate them independently. That means we will need accessors for components like:

Feature metadata
Sample metadata
Specific assays
Ranges

For example,

  set_feature_data(x, feature_data(x) %>% mutate(...) %>% etc)
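A minimal base R sketch of such an accessor pair is shown here. The bodies are assumptions; only the names `feature_data()` and `set_feature_data()` come from the notes, and a list stands in for the SE. The setter is the natural place to enforce the constraint that feature metadata keeps one row per assay row.

```r
# Toy stand-in for an SE (values invented).
x <- list(
  exprs    = matrix(1:6, nrow = 3,
                    dimnames = list(c("g1", "g2", "g3"), c("s1", "s2"))),
  features = data.frame(symbol = c("a", "b", "c"), width = c(300, 100, 200))
)

feature_data <- function(x) x$features

# Assumed behavior: refuse metadata whose row count no longer matches the assay.
set_feature_data <- function(x, value) {
  if (nrow(value) != nrow(x$exprs)) {
    stop("feature metadata must have one row per assay row")
  }
  x$features <- value
  x
}

# Adding a column keeps the row count, so it is accepted.
x2 <- set_feature_data(x, transform(feature_data(x), width_kb = width / 1000))

# Filtering the metadata on its own would desynchronize it from the assay,
# so the setter rejects it; the error message is captured for inspection.
rejected <- tryCatch(
  set_feature_data(x, subset(feature_data(x), width > 150)),
  error = function(e) conditionMessage(e)
)
```

With this guard, users can manipulate the extracted table freely, and the mistake is caught when it is written back.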

LTLA commented 5 years ago

I'm curious as to whether dplyr will interact sensibly with more complex S4 objects that can occasionally be stored as fields in the colData() of a SE object. For example, scater::calculateQCMetrics can store the QC metrics as a hierarchy of nested DFs when compact=TRUE.

lawremi commented 5 years ago

Ideally one of the dplyr implementations on top of the base R API should help there, since Bioc faithfully implements the base API.

lazappi commented 5 years ago

Just thought I would mention that an alternative to having different functions for rows and columns (filter_rows, filter_cols etc.) is the activate verb in the tidygraph package, which switches between contexts: https://github.com/thomasp85/tidygraph. Not sure how that would work in this context but it might be something to consider.

sa-lee / plyexperiment

Design notes #1