ambrosejcarr opened 2 years ago
Since early on in Scanpy workflows, we have expressed filtering as masking instead of actually modifying the data. This comes down to storing masks in `.var` and `.obs` under canonical names.
For instance, filtering by highly variable genes through `sc.pp.highly_variable_genes` adds a field `highly_variable` to `.var`, which masks out low-variance genes.
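To make the convention concrete, here is a minimal sketch of masking versus subsetting (using the public `pbmc3k` example dataset; any AnnData works):

```python
import scanpy as sc

adata = sc.datasets.pbmc3k()
sc.pp.normalize_total(adata)
sc.pp.log1p(adata)

# Filtering as masking: annotate genes, don't drop them.
sc.pp.highly_variable_genes(adata)
print(adata.var["highly_variable"].sum(), "of", adata.n_vars, "genes kept")

# The matrix keeps its full shape; downstream steps read the mask.
hvg_view = adata[:, adata.var["highly_variable"]]  # a view, not a copy
```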
Whether tools should adopt "filtering through masking" or instead "trace filtering through connected `sc_group`s" probably has several pros and cons each way. One advantage of the former is that it avoids duplicating data.
This is an interesting problem which we tried to tackle in the ExperimentSubset package for the Bioconductor toolchain. The major limitation of just using masks for cells/features is when a new annotation gets produced that only has the length of the masked object. For example, suppose the `obs` data frame has the length of the original raw matrix, but certain QC tools only work on the non-empty droplets (e.g. doublet detection). You could subset the matrix based on the empty-drop mask and run the tool, but then your new vector of doublet calls only has the length of the non-empty droplets, not the length of the original raw matrix. If you wanted to store it back in the original `obs` data frame, you would have to pad with NAs or something, which is very undesirable IMO. Thus, having a separate `obs`/`var` data frame for each subset is advantageous from that perspective.
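A minimal sketch of that length mismatch, with random stand-ins for the QC mask and the doublet caller (the column names here are illustrative, not from any particular tool):

```python
import numpy as np
import anndata as ad

adata = ad.AnnData(np.random.poisson(1.0, size=(100, 50)).astype(np.float32))
adata.obs["non_empty"] = np.random.rand(100) > 0.2  # stand-in empty-drop mask

# Run a tool on the subset only; its output matches the subset's length.
subset = adata[adata.obs["non_empty"]].copy()
doublet_score = np.random.rand(subset.n_obs)  # stand-in for a doublet caller

# Writing back to the parent obs forces NA padding for the masked-out cells.
padded = np.full(adata.n_obs, np.nan)
padded[adata.obs["non_empty"].to_numpy()] = doublet_score
adata.obs["doublet_score"] = padded
```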
The downside of using `sc_group`s for different subsets is redundancy of the data, as @falexwolf mentioned. We got around this by making a special subset class in which the new subsetted matrix is actually just a pointer to the parent matrix together with the subset indices. A new SCE object is then used for that subset; it can store additional matrices with the same dimensions as the subset, as well as the new `obs` (`colData`) and `var` (`rowData`) annotations that match the subset.
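A toy sketch of the pointer idea, written in Python rather than R/Bioconductor for consistency with the rest of the thread (the class and method names are made up for illustration):

```python
import numpy as np

class MatrixView:
    """A non-materialized subset: stores only indices into a parent matrix."""
    def __init__(self, parent, rows, cols):
        self.parent = parent
        self.rows = np.asarray(rows)
        self.cols = np.asarray(cols)

    @property
    def shape(self):
        return (len(self.rows), len(self.cols))

    def materialize(self):
        # Only here is data actually copied out of the parent.
        return self.parent[np.ix_(self.rows, self.cols)]

parent = np.arange(20).reshape(4, 5)
view = MatrixView(parent, rows=[0, 2], cols=[1, 3, 4])
assert view.shape == (2, 3)
print(view.materialize())
```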
Thank you for the interesting reference and expounding on this, @joshua-d-campbell!
I agree with the points made!
I just want to make a connection to the scoping discussion in the present repository, for instance, here.
In both cases (1 & 2), it'd make sense to account for ExperimentSubset (either when designing the slots of the metadata schema or when foreseeing some hard-coded pointer structure related to subsetting). One should probably also look at other conventions and tools around data provenance.
It is definitely a good question. I don't think data provenance just for record-keeping's sake needs to be part of the first iteration of the matrix API, although it may be nice to include in a future version. The only reason we use pointers in the ExperimentSubset class is to eliminate data redundancy between the subset and the original matrix (i.e. we don't actually create a new matrix when creating a subset; we just point to the subset of rows/columns in the parent matrix). The provenance that comes with the pointer is just a nice bonus. However, it may be a bit easier for you, in the first round of the API implementation, just to have redundant data between the original matrix and the subset. Although I'm still looking over things, my initial vote would be to have each new subset be a new `sc_group`, so that all `sc_group`s have the same set of cells/features, matching the cells/features in the `obs`/`var` data frames for that `sc_group`.
ExperimentSubset-like functionality can then be added in the future without the user really having to know that anything changed (i.e. they don't need to know whether a copy of the data is stored in a subset or whether it is just a pointer to data in the original matrix). The major work is really in the accessor functions that retrieve data from the matrix: they need to know whether to grab the data as usual or to grab a subset of the data from the parent matrix.
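For instance, a hypothetical accessor might dispatch like this (the names are illustrative, not taken from ExperimentSubset):

```python
import numpy as np

def get_assay(obj):
    """Return a concrete matrix, resolving pointer-style subsets on access."""
    if isinstance(obj, np.ndarray):
        return obj  # already materialized: grab the data as usual
    parent, rows, cols = obj  # a (parent, row_idx, col_idx) pointer triple
    return parent[np.ix_(rows, cols)]  # resolve against the parent matrix

parent = np.arange(12).reshape(3, 4)
assert get_assay(parent).shape == (3, 4)                    # direct
assert get_assay((parent, [0, 2], [1, 3])).shape == (2, 2)  # via the parent
```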
The way we (TileDB) intend to address this issue generically (i.e., for all applications we model as arrays) is via a feature we are currently working on, called array views. These will be similar to the non-materialized table views common in databases. In addition to facilitating the use cases described in this thread, array views will be quite useful for implementing fine-grained access policies on arrays (data owners will be able to share any number of different array views instead of the original array).
Just to further describe the challenge, here is a visual summary of a real-world use case where subsetting is done both for QC at the initial stages and at later stages to re-analyze subsets of cells. Note that separate cell annotations (i.e. cluster labels) were produced for each of the later subsets, so each subset needed its own `obs` data frame. Per @stavrospapadopoulos, it sounds like this could be handled on the backend with non-materialized table views (assuming there is a way to store annotation matrices like `obs` without a corresponding materialized data matrix), but it is not entirely clear to me how the current API description handles this more complex use case.
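To spell the use case out in code (random stand-in data and labels; the point is only that each level of subsetting carries its own `obs`):

```python
import numpy as np
import anndata as ad

raw = ad.AnnData(np.random.poisson(1.0, size=(500, 30)).astype(np.float32))

# QC subset: non-empty droplets, with its own obs annotations.
qc = raw[np.random.rand(500) > 0.3].copy()
qc.obs["cluster"] = np.random.choice(["T", "B", "Myeloid"], qc.n_obs)

# Re-analysis subset: T cells only, re-clustered, again with its own obs.
tcells = qc[qc.obs["cluster"] == "T"].copy()
tcells.obs["subcluster"] = np.random.choice(["CD4", "CD8"], tcells.n_obs)

# Each level's labels exist only at that level; nothing maps back to `raw`
# without NA padding, which is exactly the problem described above.
```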
CC @adamgayoso, who has been looking into approaches for this around scvi-tools
As of today this has not been implemented and there are no plans to do so in the near term. However, it is worth keeping this issue open in case we want to revisit it later.
From @nlhepler