single-cell-data / SOMA

A flexible and extensible API for annotated 2D matrix data stored in multiple underlying formats.
MIT License

Filtered matrices implemented as views of raw matrices #4

Open ambrosejcarr opened 2 years ago

ambrosejcarr commented 2 years ago

From @nlhepler

It would be cool/efficient if there was an option for filtered matrices to refer to the original sc_group dataframe(s) along with the selection(s) along each axis, so that associated metadata is not duplicated, and relationships are preserved.

falexwolf commented 2 years ago

Since early on in Scanpy workflows, we expressed filtering as masking instead of actually modifying data. This comes down to storing masks in .var and .obs with canonical names.

For instance, filtering by highly variable genes through sc.pp.highly_variable_genes adds a boolean field highly_variable to .var, which masks out all genes of low variance.

Whether tools adopt "filtering through masking" or "tracing filtering through connected sc_groups" probably involves several pros and cons. One advantage of the former is that it avoids duplicating data.
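For concreteness, the masking pattern can be sketched with plain NumPy (the matrix values are made up; the column name mirrors the Scanpy convention mentioned above):

```python
import numpy as np

# Toy expression matrix: 5 cells x 4 genes (made-up data).
X = np.arange(20, dtype=float).reshape(5, 4)

# Instead of subsetting X, store a boolean mask alongside the gene
# annotations, as sc.pp.highly_variable_genes does with .var["highly_variable"].
var = {"highly_variable": np.array([True, False, True, False])}

# Downstream steps apply the mask on the fly; the raw matrix is untouched.
X_hvg = X[:, var["highly_variable"]]

print(X_hvg.shape)  # (5, 2)
```

The raw data and all gene-level metadata stay in one place; the filter is just another annotation column.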

joshua-d-campbell commented 2 years ago

This is an interesting problem which we tried to tackle in the ExperimentSubset package for the Bioconductor toolchain. The major limitation of just using masks for cells/features is that a new annotation may get produced that is only the length of the masked object. For example, suppose the obs data frame is the length of the original raw matrix, but certain QC tools (e.g. doublet detection) only work on the non-empty drops. You could subset the matrix based on the empty-drop mask and run the tool, but then your new vector of doublet calls is only the length of the non-empty droplets, not the length of the original raw matrix. To store it back in the original obs data frame, you would have to pad it with NAs or something, which is very undesirable IMO. Thus, having a separate obs/var data frame for each subset is advantageous from that perspective.
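The length mismatch can be illustrated with a toy example (the mask and scores are invented): a doublet score computed only on the non-empty drops has to be padded with NaN before it fits back into the full-length obs column:

```python
import numpy as np

n_cells = 6
non_empty = np.array([True, True, False, True, False, True])  # QC mask

# Hypothetical doublet scores, computed only on the 4 non-empty drops.
scores_subset = np.array([0.1, 0.9, 0.2, 0.8])

# To store them in the original full-length obs, pad with NaN -- the
# awkward step the masking-only approach forces on you.
scores_full = np.full(n_cells, np.nan)
scores_full[non_empty] = scores_subset

print(scores_full)  # [0.1 0.9 nan 0.2 nan 0.8]
```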

The downside of using sc_groups for different subsets is redundancy of the data as @falexwolf mentioned. We got around this by making a special subset class in which the new subsetted matrix is actually just a pointer to the parent matrix with the indices. Then a new SCE object is used for that subset which has the ability to store additional matrices that are the same dimension as the subset as well as the new obs (colData) and vars (rowData) which match the subset.
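In spirit (this is a Python sketch, not the actual ExperimentSubset API, which is R/Bioconductor), the pointer idea looks roughly like:

```python
import numpy as np

class MatrixSubset:
    """Sketch of a subset that stores only indices into a parent matrix,
    plus its own subset-length annotations. Names are illustrative."""

    def __init__(self, parent, row_idx, col_idx):
        self.parent = parent          # pointer to the parent, not a copy
        self.row_idx = np.asarray(row_idx)
        self.col_idx = np.asarray(col_idx)
        self.obs = {}                 # annotations matching the subset length

    @property
    def shape(self):
        return (len(self.row_idx), len(self.col_idx))

    def materialize(self):
        # Only here is data actually pulled out of the parent.
        return self.parent[np.ix_(self.row_idx, self.col_idx)]

parent = np.arange(12).reshape(3, 4)
sub = MatrixSubset(parent, row_idx=[0, 2], col_idx=[1, 3])
sub.obs["cluster"] = ["A", "B"]       # subset-length annotation, no NA padding
print(sub.shape, sub.materialize().tolist())  # (2, 2) [[1, 3], [9, 11]]
```

New annotations live on the subset object at the subset's length, while the parent matrix is never duplicated.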

falexwolf commented 2 years ago

Thank you for the interesting reference and expounding on this, @joshua-d-campbell!

I agree with the points made!

I just want to make a connection with the scoping discussion of the present repository, for instance, here.

  1. Should data provenance tooling as in ExperimentSubset (connecting & representing different subsets through pointers) be a part of the fundamental data structure & matrix-api?
  2. Or does one merely want to foresee metadata slots that will enable plugging in different data provenance tools that establish pointers between datasets & groups?

In both cases (1 & 2), it'd make sense to account for ExperimentSubset (either for designing the slot of the metadata schema or for foreseeing some hard-coded pointer structure related to subsetting). One should probably also look at other conventions and tools around data provenance.

joshua-d-campbell commented 2 years ago

It is definitely a good question. I don't think data provenance just for "record keeping" sake needs to be a part of the first iteration of the matrix api, although it may be nice to include in a future version. The only reason we have to use pointers in the ExperimentSubset class is to eliminate data redundancy between the subset and the original matrix (i.e. we don't actually create a new matrix when creating a subset, we just point to the subset of rows/columns in the parent matrix). The provenance that comes with the pointer is just a nice bonus. However, it may be a bit easier for you in the first round of the api implementation just to have redundant data between the original matrix and the subset. Although I'm still looking over things, my initial vote would be to have each new subset be a new sc_group, so that all sc_groups have the same set of cells/features, matching the cells/features in the obs/var data frames for that sc_group.

Then ExperimentSubset-like functionality can be added in the future without the user really having to know that there was a change (i.e. they don't need to know whether a copy of the data is being stored in a subset or whether it is just a pointer to data in the original matrix). The major work is really in the accessor functions that retrieve data from the matrix. They need to know whether to grab the data as usual or go grab a subset of the data from the parent matrix.

stavrospapadopoulos commented 2 years ago

The way we (TileDB) intend to address this issue generically (i.e., for all applications we model as arrays) is via a feature we are currently working on, called array views. These will be similar to non-materialized table views, which are quite common in databases. In addition to facilitating the use cases described in this thread, it will be quite useful for us for implementing fine-grained access policies on arrays (data owners will be able to share any number of different array views, instead of the original array).
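The actual array views are a TileDB backend feature; purely as an illustration of the non-materialized idea, a view can be modeled as stored selections that compose and are only applied at read time (class and method names here are hypothetical):

```python
import numpy as np

class ArrayView:
    """Non-materialized view: remembers row/column selections and applies
    them on read. An illustration of the concept, not TileDB's implementation."""

    def __init__(self, backing, rows, cols):
        self.backing = backing        # reference to the underlying array
        self.rows = np.asarray(rows)
        self.cols = np.asarray(cols)

    def subview(self, rows, cols):
        # Compose selections without touching the data: a view of a view.
        return ArrayView(self.backing, self.rows[rows], self.cols[cols])

    def read(self):
        # Data is materialized only here, at read time.
        return self.backing[np.ix_(self.rows, self.cols)]

raw = np.arange(16).reshape(4, 4)
v = ArrayView(raw, rows=[0, 1, 3], cols=[0, 2])
v2 = v.subview(rows=[0, 2], cols=[1])   # still no copy of the data
print(v2.read().tolist())  # [[2], [14]]
```

Sharing `v` or `v2` instead of `raw` is the access-control use case: the recipient sees only the selected slice.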

joshua-d-campbell commented 2 years ago

Just to further help describe the challenge, here is a visual summary of a real-world use case where subsetting is done for both QC at the initial stages but also at later stages to re-analyze subsets of cells. Note that separate cell annotations (i.e. cluster labels) were produced for each of the later subsets and therefore each subset needed its own obs data frame. Although per @stavrospapadopoulos, it sounds like it could be handled on the backend with non-materialized table views (assuming that there is a way to store annotation matrices like the obs without having a corresponding materialized data matrix), it is not entirely clear to me how the current API description handles this more complex use case.

ivirshup commented 2 years ago

CC @adamgayoso, who has been looking into approaches for this around scvi-tools

pablo-gar commented 1 year ago

As of today this has not been implemented and there are no plans to do so in the near term. However, it is worth keeping it open in case we want to revisit it later.