ivirshup / sc-interchange

Better interchange for single cell tools

Questions about SingleCellExperiment layout #2

Open ivirshup opened 5 years ago

ivirshup commented 5 years ago

Hey @LTLA, thanks for posting that description! I've got a couple questions about the SCE format:

LTLA commented 5 years ago
  • Where do feature loadings for the dimensionality reductions get stored?

We have a LinearEmbeddingMatrix class in SingleCellExperiment that can store these explicitly as metadata on the reduced dimension coordinates; the instance of the class itself is still a matrix where the number of rows is the number of samples, so it still fits as (an element of) the reducedDims slot.

TBH, this class should probably get moved to BiocSingular where it would be more appropriate.

  • For colData and rowData, in practice, what gets included as nested values? Will an external tool store output in these?

scater is the only current example from a function: calculateQCMetrics() will vomit out >50 QC metric columns that will pollute the colData(). If a user doesn't like that, they can set compact=TRUE and the function will return all the QC columns in a nested DataFrame scater_qc. This can be accessed with stuff like sce$scater_qc$total_counts. Outside of functions, users may want to store results of different analyses on the same data without resorting to prefixing all the column names, e.g., sce$monocle$trajectory, sce$TSCAN$trajectory, sce$destiny$trajectory.

  • Is there any canonical location or representation for graph data?

No. We had discussed this, but we didn't think it made much sense to have a formal representation for a graph structure. In particular, we didn't know how the graph was meant to react upon subsetting or other manipulations of the SCE object. For example, if it's a SNN graph, and I subset the object, I can't just remove the nodes that I subsetted out; I would have to recompute the NNs and reconstruct the graph to get the "right" result. At that point, it becomes cheaper and easier just to ask any function that needs the graph to recompute the graph itself.

Of course, if no manipulations of the SCE are to be performed, then the graph is always valid and can be re-used in downstream steps. But in such cases, there's not a lot that can be done with the graph structure in the context of the SCE - we can just treat it as a black box and stuff it in the experiment-wide metadata (e.g., int_metadata) for methods to retrieve as desired.

The explicit use of a graph is also fairly limited to a specific subset of algorithms. If this is to be done, I would parametrize it as storing cell-cell matrices, which are more generally useful. Your typical NN graphs can be easily converted to/from sparse matrices, so there shouldn't be a problem there.

  • Is subtyping or modifying the structure of the object common? It seems like it could map well to AnnData as is, but I'm not sure how much the structure of the object can change.

Yes for SummarizedExperiment. Less so for SingleCellExperiment, though there's at least one subclass that I know of (TreeSummarizedExperiment). Generally, slots get added rather than modified, so there shouldn't be compatibility problems with derived classes unless the author intends it.

LTLA commented 5 years ago

After some rumination, I've realized that the objection to storing the graph would also be applicable to the storage of the reduced dimensions; certainly the subset of the t-SNE coordinates would not be correct compared to a recomputation of t-SNE on the subset. But we're definitely going to store the reduced dimensions, so I guess the same reasoning would justify storing the graph. (Even though the latter is a bit more egregious with its failure upon subsetting, e.g., if I built a 10-NN graph and subsetted to a random 10% of cells, the resulting subgraph would basically have no edges.)
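That parenthetical example is easy to check numerically. A quick sketch with toy data and a brute-force neighbour search (not from any real pipeline):

```python
import numpy as np

# Build a 10-NN graph on 1000 random "cells", then subset to a random 10%.
rng = np.random.default_rng(0)
n, k = 1000, 10
x = rng.normal(size=(n, 5))

# Brute-force neighbour search: pairwise squared distances, self-edges excluded.
d2 = ((x[:, None, :] - x[None, :, :]) ** 2).sum(-1)
np.fill_diagonal(d2, np.inf)
nn = np.argsort(d2, axis=1)[:, :k]

# The graph as a boolean cell-by-cell adjacency matrix.
adj = np.zeros((n, n), dtype=bool)
adj[np.repeat(np.arange(n), k), nn.ravel()] = True

keep = rng.choice(n, size=n // 10, replace=False)  # random 10% of cells
sub = adj[np.ix_(keep, keep)]

# Fraction of the original k edges per cell that survive the subset
# (roughly 10%, since each neighbour is kept with probability ~0.1).
frac = sub.sum() / (len(keep) * k)
```

Each surviving cell keeps on the order of one of its ten neighbours, which is nothing like a freshly built 10-NN graph on the subset.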

Even so, I'd consider not storing the graph itself, but the matrix of edge weights. At least this can be generalized to any cell-cell metric (e.g., distances, correlations) and it won't be so embarrassing down the line when a non-graph-based clustering method shows up.

ivirshup commented 5 years ago

Thanks for the answers!

LinearEmbeddingMatrix class

Are the feature loadings in LinearEmbeddingMatrix tied to the rows of the parent SingleCellExperiment?

nested dataframes

Good to know. I know we don't have an equivalent for nested dataframes with our current libraries, and I'm not quite sure how to handle this yet. These could become dataframes in our obsm mapping?
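For reference, one way this nesting could look on the Python side, sketched with a pandas MultiIndex on the columns (the scater_qc names just mirror the R example above; this isn't an actual AnnData API):

```python
import pandas as pd

# Per-cell annotations with one nested group of QC columns.
obs = pd.DataFrame(
    {
        ("", "cell_type"): ["B", "T", "T"],
        ("scater_qc", "total_counts"): [1000, 2000, 1500],
        ("scater_qc", "total_features"): [300, 450, 400],
    },
    index=["cell1", "cell2", "cell3"],
)
obs.columns = pd.MultiIndex.from_tuples(obs.columns)

# Selecting the whole nested group, analogous to sce$scater_qc in R:
qc = obs["scater_qc"]
```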

we didn't know how the graph was meant to react upon subsetting

Yeah, I had this issue with subsetting graphs as well, but the argument about reduced dimensions changed my mind. My POV at the moment is that there are times when subsetting a graph by its nodes is the operation you want, and as long as it's inexpensive for the other cases, it's fine.

Next questions:

You had mentioned creating summarized experiment subclasses to deal with interchange, how different would the objects have to be for you to want to take this approach?

Within an AnnData, we typically store fairly standard datatypes: numbers, strings, categoricals, and various containers of these. Do objects like the genomic ranges in SingleCellExperiments have a standardized representation in an hdf5 file? Are there other objects in the SummarizedExperiment which might not have an obvious encoding?

LTLA commented 5 years ago

Are the feature loadings in LinearEmbeddingMatrix tied to the rows of the parent SingleCellExperiment?

No, the feature loadings don't respond to any subsetting of the containing object. This would be difficult from an implementation perspective - the LEM itself doesn't even have an option to subset by feature, only by sample (rows) or PC/factors (columns). IIRC, this was partially motivated by the fact that fixed feature loadings would guarantee recovery of the same PC coordinates in the LEM, regardless of how much subsetting was performed to the LEM. Moreover, the reducedDims slot also does not respond to feature subsetting, so subsetting the SCE by feature would never trigger any hypothetical feature-based subsetting of the LEM even if it were available.

Of course, one could just store the loadings in the (int_)rowData with an appropriate label, if one wanted it to respond to SCE subsetting. This is actually a good use case for the nesting; you could do something like the below for any number of different reduced dimension types:

library(S4Vectors)
df <- DataFrame(row.names=LETTERS) # one row per gene
loadings <- matrix(rnorm(260), nrow=26) # 26 genes x 10 PCs
df$reducedDimLoadings <- DataFrame(PCA=I(loadings)) # I() keeps the matrix as a single column
df$reducedDimLoadings$PCA

If this was important, we could add augmentations to the reducedDim() method to keep the feature loadings synchronized with the reduced dimensions. But I haven't gotten any requests for this.

You had mentioned creating summarized experiment subclasses to deal with interchange, how different would the objects have to be for you to want to take this approach?

Hmm. I would say that, if I wasn't convinced that a particular slot/data member was sufficiently general, I would make a subclass rather than modifying SCE directly. For example, the SingleCellLoomExperiment is derived from the SCE with additional row/column graph constructs. This is fair enough, as it's explicitly designed for loom, which is pretty clear on the use of graphs; but I wasn't willing to lock myself in to supporting graphs formally in SCEs.

I actually suspect we could get away with not doing any subclassing if we're smart with our data members. If something is to be subsetted by sample, it goes into colData as a vector-like object; if anything is to be subsetted by feature, it goes into rowData; and if anything is to be subsetted by both, it goes into the assays. The slots are pretty generous in terms of stuff you can put into them.

Do objects like the genomic ranges in SingleCellExperiments have a standardized representation in an hdf5 file?

Not sure. Some of the methylation folks might know (@petehaitch), as they do more big-data-with-genomic-intervals than I do. In the simplest case, the GRanges are just collections of vectors of integers and strings anyway (parallel to the SCE rows); I'm sure we could serialize it into a reasonably palatable format for ingestion elsewhere. However, this really is the simplest case...

Are there other objects in the SummarizedExperiment which might not have an obvious encoding?

Oh boy. SEs are really generous with what you can put into them. Literally, as long as the object has methods for [ and length() (and, for assays, dim()), you can put in anything you want. Most objects will be based on the usual data types(*), so storing them in HDF5 is not the issue - it's trying to interpret them that will be a real pain. To make sense of them, you'd have to reimplement a lot of Bioconductor's base infrastructure in Python, and I think it's fair to say that's out of scope.

Personally, I wouldn't worry about the weird stuff. Just say that everything has to be atomic in this format, possibly with a few exceptions (e.g., some special handling of sparse matrices).

(*): some classes might hold C-style pointers to memory via file connection objects or R environments. This is pretty esoteric, though.

PeteHaitch commented 5 years ago

Do objects like the genomic ranges in SingleCellExperiments have a standardized representation in an hdf5 file?

The focus with the HDF5 work over the last 2-3 years has been on storing the assay data (e.g., counts, normalized log-expression) and not on any of the other data (colData, rowData, etc.). So there's no standardised representation in an HDF5 file.

ivirshup commented 5 years ago

Thanks @PeteHaitch! I think Aaron's suggestion sounds reasonable. Possibly even strings of GFF records could be used here.

@LTLA I'm trying to figure out how we can represent your version of the PCA object in the anndata structure (as we tie most dimensions to the parent object). Do you have thoughts on this? I'm thinking we could have pointers in some per-object metadata (to keep arrays associated). How would you try to represent a PCA which possibly had its feature loadings subset?

I'm also wondering how we deal with objects that can't be read by one library or another. I'm thinking that each element in the on-disk representation could be tagged with some encoding descriptor / type information. That way a reader can tell if it has a method to read an element (I think this would work well for arbitrary serialized objects). This could allow any AnnData or SingleCellExperiment object to be round-tripped within its ecosystem. To me, this would also imply the more general feature of partial reading, since a reader should be able to skip data it can't do anything with.
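To make the tagging idea concrete, here is a minimal sketch in Python (the tag names and registry are hypothetical, not a proposed spec):

```python
# Each stored element carries an "encoding-type" tag; a reader skips
# anything it has no handler for, enabling partial reads.
READERS = {
    "array": lambda payload: payload,
    "dataframe": lambda payload: dict(payload),  # stand-in decoder
}

def read_element(element):
    """Return the decoded value, or None if the encoding is unknown."""
    reader = READERS.get(element["encoding-type"])
    return reader(element["payload"]) if reader else None

stored = {
    "X": {"encoding-type": "array", "payload": [1, 2, 3]},
    "weird": {"encoding-type": "r-serialized-object", "payload": b"\x00"},
}

# Only elements with a known encoding survive; "weird" is skipped, so the
# rest of the file remains usable and round-trips within its own ecosystem.
decoded = {k: v for k, e in stored.items()
           if (v := read_element(e)) is not None}
```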

LTLA commented 5 years ago

How would you try to represent a PCA which possibly had its feature loadings subset?

IMO, there are two ways to play this with the current design.

  1. Stuff the loadings in our rowData() (i.e., metadata parallel to the rows) while putting the actual PC scores in the reducedDims() (parallel to the samples). This will ensure synchronized subsetting but does have the side-effect of separating the loadings from the scores.
  2. Create a LinearEmbeddingMatrix containing the subset loadings and store it in the reducedDims(). This keeps everything together but will not respond to subsetting of the rows.

These two formulations represent different philosophies to the treatment of the loadings. In 1, the loadings are treated directly as statistics of interest for downstream interpretation on a gene-by-gene basis; hence they respond to subsetting, because each row is handled independently. In 2, the loadings are not of interest in themselves, but are only kept around to project new cells into the same embedding as the PC scores. In this case, it is correct that these loadings do not respond to subsetting of the SCE, because any modification of the number of genes in the loading matrix would preclude proper projection.

So in conclusion, we can use either of these approaches, depending on how we expect the loadings to be used. Personally, I use 2 a lot more than 1, as I don't use the loadings as statistics.

I'm also wondering how we deal with objects that can't be read by one library or another.

Hm. Perhaps the safest approach would be for the layout-writing functions to refuse to write something that isn't acceptable in the format. Otherwise you'll end up with situations like:

  1. I do an analysis with SCEs and dump out something in our format. Part of it can't get saved into the standard format so instead gets quietly serialized using R's saveRDS() mechanism.
  2. I pass this file to you, and you try to open it in Python with scanpy. And you can, but you can't access the part that didn't get saved portably, and thus cannot do a particular downstream analysis.
  3. Much wailing and gnashing of teeth ensues. Vice versa for Python to R.

If the writers are not permissive, then I can't do 1 and I have to fix it before 2. I would say that this is the best approach with the shortest feedback cycle, especially when combined with good error messages on what is not permissible. Emitting warning messages is not enough; people just ignore them.

I'm not particularly concerned about round-trips. It would be nice, of course, but intra-language serialization is straightforward. It's the inter-language stuff that's the main game.

ivirshup commented 5 years ago

These two formulations represent different philosophies to the treatment of the loadings

This makes sense to me, but I think I view this as a different split: what the purpose and meaning of subsetting is. I think you can't promise that subsetting preserves meaning. It's impossible to know, for example, what it means if some gene isn't present in your object. Was it filtered, was it not annotated, or was it checked for and never observed?

Separately, is it common to subset a dataset by variables, then project another dataset into the subsetted object's space? I'm not sure what's reasonable to expect to happen in this circumstance. I could imagine this coming up if you were subsetting to a shared gene space and then projecting – though there are of course issues with "correctness" here.

Perhaps the safest approach would be for the layout-writing functions to refuse to write something that isn't acceptable in the format.... If the writers are not permissive, then I can't do 1 and I have to fix it before 2.

I think this would be the right approach for the test suite of each language's main library. These libraries should ensure that most common types can be written (arrays, strings, numbers, mappings, etc.). That said, if some separate analysis package needs to store some custom object, I think that should be allowed. If they want, they should also be able to register readers and writers for their type so it can be stored in a cross-compatible way. But I think they should also be free not to. The cost here is that their tool is less usable for other people, which carries the typical cost of being used less.

I'm not particularly concerned about round-trips. It would be nice, of course, but intra-language serialization is straightforward. It's the inter-language stuff that's the main game.

I think round-tripping (at least to and from disk, maybe to and from ecosystems) would be a useful goal here. I think one of the best ways to make sure the interchange format is stable and useful is to make sure it's used. This could happen if it makes a perfectly fine analysis format on its own, i.e., if it's the default format you'd choose to write out.

This actually reminds me of a couple questions I've had about the usage of serialized R objects, which are kind of connected here:

LTLA commented 5 years ago

I think you can't promise that subsetting preserves meaning.

Then it sounds like you'd want to represent it as approach 1.

Separately, is it common to subset a dataset by variables, then project another dataset into the subsetted object's space?

A common practice is to do the PCA on a subset of genes (e.g., HVGs), but store the PCA results in the full object. If one wanted to project another data set into that PC space - which seems like a fairly reasonable step - one would need to use the same HVGs. For this, it would be natural to store the loadings for the HVGs somewhere, which would not be parallel to the full set of rows.

I guess you could expand the subsetted loading matrix to the full row dimensions by filling in the values for the unused genes with NAs or something. But that seems pretty awkward. More concerning is the possibility that someone re-subsets by row - possibly dropping a few of the HVGs - sees the subsetted loading matrix and tries (incorrectly) to project using those subsetted loadings. This would go through without errors but would give wrong results, while any attempt to project on the unmodified HVG loadings would correctly fail if the matrix dimensions are not compatible.
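A small numeric sketch of that failure mode, with a toy PCA via SVD (assumed data; not code from any SCE/AnnData implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 20))          # 50 cells x 20 HVGs
X = X - X.mean(0)                      # centre before the "PCA"
U, s, Vt = np.linalg.svd(X, full_matrices=False)
loadings = Vt[:2].T                    # 20 genes x 2 PCs

new_cell = rng.normal(size=20)
good = new_cell @ loadings             # correct projection

# Someone re-subsets by row, dropping 5 genes from the loading matrix.
bad_loadings = loadings[:15]
bad = new_cell[:15] @ bad_loadings     # runs without error, wrong answer

# Projecting a 15-gene cell onto the *unmodified* loadings instead fails
# loudly with a shape mismatch:
#   new_cell[:15] @ loadings  ->  ValueError
```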

  • How does backwards compatibility work?

Not entirely sure in the case where class definitions move across packages. The failsafe is to manually run an updateObject method that updates the class attributes, but that may not be necessary.

  • Can you partially load or inspect a serialized object?

Nope. The closest would be a lazy-loaded database of R objects, but that requires each object to have been saved as its own variable; you can't pick out parts of a single object, AFAIK. The latter approach would not be good practice anyway, as the internal slots of an object are none of our business.

ivirshup commented 5 years ago

@flying-sheep, responding to you about https://github.com/ivirshup/sc-interchange/issues/3#issuecomment-509552625 here as I'm realizing it fits this thread better.

What do you mean with flattening? Are there nested data.frames in SCE?

I believe it's so you don't have too many keys in colData or rowData. pyarrow's Tables allow this as well. Aaron talked about it here.


Separately, relating to the SingleCellExperiment's LinearEmbeddingMatrix: I'm wondering how we could maintain the connection between the sample embedding and the feature loadings of a PCA. In particular, I'm wondering about generalizing references like our ViewArgs. This could allow our PCA metadata to refer to which matrix the PCA was calculated on, and which loadings it's connected to, without relying on typical key strings. This could of course extend to other analysis objects as well. Would this be useful, or worth the added complexity?

ivirshup commented 5 years ago

@LTLA

Then it sounds like you'd want to represent it as approach 1.

I would like to know when a projection is valid; I'm just trying to figure out a simple common model for the data. I like the ability to keep operations on the dimensions consistent. For example, if I calculate a PCA then map genes to another identifier, how do I keep the names consistent? Does a callback have to be registered so the loading axis gets updated too? Do we need to have a concept of sub-dimensions (i.e. loading features ⊂ object features) for this?

I think there's some middle ground here: if we recorded the feature set used for a calculation, you could still allow subsetting and tell whether the full meaning had been retained. It would leave open being able to project after subsetting, which could be meaningful for other embeddings. I suppose this is like having sub-dimensions.

I guess you could expand the subsetted loading matrix to the full row dimensions by filling in the values for the unused genes with NAs or something.

We use zeroes, I think based on a sklearn convention.

Perhaps the safest approach would be for the layout-writing functions to refuse to write something that isn't acceptable in the format.

A few more thoughts on this:

flying-sheep commented 5 years ago

re nested arrays: there are multiple kinds of nested data frames, the usual R interpretation these days being basically a hierarchical row index:

  id_sensor             data    model       pred
1         a <tbl_df [327,2]> <S3:nls> <dbl[327]>
2         b <tbl_df [327,2]> <S3:nls> <dbl[327]>
3         c <tbl_df [327,2]> <S3:nls> <dbl[327]>

But what Aaron is talking about looks more like a hierarchical column index, right? How does that structure look in R?

We use zeroes, I think based on a sklearn convention

That was actually a happy accident/convergent evolution: I independently came up with it when helping Laleh Haghverdi with making her original diffusion map implementation (MATLAB!!) faster and more memory efficient. Both destiny and scanpy are based on that MATLAB implementation.

LTLA commented 5 years ago

I like the ability to keep operations on the dimensions consistent.

It sounds like we're... agreeing? If we put the loadings in the rowData, then it's sync'd up to the rows. If the PCA was done on a subset of features, we pad out the unused features with zeroes. If the same feature was duplicated in the subset, we aggregate the loadings for all duplicate rows.

As an aside, I'll note that the padding out with zeroes is slightly misleading, as it suggests the unused features have loadings of zero in the PCA, rather than not being used at all. But the practical benefits seem to outweigh this; it's more convenient, as you can just matrix-multiply with the whole thing.
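A quick sketch of the padding convention with toy numbers (the layout is hypothetical, not any library's actual representation):

```python
import numpy as np

n_genes, n_pcs = 6, 2
hvg_idx = np.array([1, 3, 4])                 # genes the PCA actually used
hvg_loadings = np.arange(6, dtype=float).reshape(3, 2)

# Expand the HVG loadings to the full gene set; unused genes stay all-zero.
full = np.zeros((n_genes, n_pcs))
full[hvg_idx] = hvg_loadings

# Projection is then a single matrix multiply against the whole object...
expr = np.ones((4, n_genes))                  # 4 new cells, full gene space
scores = expr @ full

# ...and gives the same result as subsetting to the HVGs first.
scores_subset = expr[:, hvg_idx] @ hvg_loadings
```

(Duplicated rows would additionally need their loadings aggregated.)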

Do we need to have a concept of sub-dimensions (i.e. loading features ⊂ object features ) for this?

This sounds hairy. Perhaps a simpler approach would be to simply prevent the user from trying to project with a loading matrix that has been irrevocably modified by subsetting. Say we start from a valid SCE (or AnnData, or whatever) with feature loadings in the rowData. We use a special matrix class for the feature loadings that checks, upon subsetting (or related operations, e.g., combining), whether or not the operation discards/duplicates non-all-zero rows. If it does, it trips a flag that marks the loadings as "not to be used for projection". Projection methods can pick up on this and throw an error. We do not attempt to recover the original loadings, but we do at least protect the user from themselves.
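A rough sketch of such a guarded class (hypothetical, in Python rather than R, and not an actual SCE feature):

```python
import numpy as np

class GuardedLoadings:
    """Loading matrix that marks itself unusable for projection once a
    subset drops (or duplicates) any row with non-zero loadings."""

    def __init__(self, values, valid=True):
        self.values = np.asarray(values, dtype=float)
        self.valid_for_projection = valid

    def subset_rows(self, idx):
        idx = np.asarray(idx)
        kept = np.zeros(len(self.values), dtype=bool)
        kept[idx] = True
        # Dropping a non-zero row, or taking any row twice, trips the flag.
        harmless = (not np.any(self.values[~kept])
                    and len(set(idx.tolist())) == len(idx))
        return GuardedLoadings(self.values[idx],
                               valid=self.valid_for_projection and harmless)

    def project(self, new_cells):
        if not self.valid_for_projection:
            raise ValueError("loadings were modified by subsetting; "
                             "refusing to project")
        return np.asarray(new_cells) @ self.values
```

Subsetting that only drops all-zero padding rows keeps the flag intact; dropping or duplicating a row with real loadings trips it, and projection then errors out instead of silently returning garbage.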

Of course, it's debatable how much of this is the responsibility of the file format vs the in-memory representation vs the user. Subsetting by row is a common operation, but I'm not sure it's a frequent one, i.e., it's done in all my analysis pipelines but only once at the start. All other subsetting is done inside the relevant functions so that the user only interacts with the full object. So for me, getting to the point of performing an incorrect projection with a subsetted loading matrix would be rather unusual. Not impossible, but odd enough for me to not spend the engineering effort protecting against it.

ivirshup commented 5 years ago

It sounds like we're... agreeing?

I think so. Just to be clear, what are you proposing would happen within SCE? If a user calculates a PCA, writes it to disk in this format, and reads it back, will they have a LinearEmbeddingMatrix?

So for me, getting to the point of performing an incorrect projection with a subsetted loading matrix would be rather unusual.

The use case that I've been thinking of for this is when you receive a processed object from someone else, maybe a datasource like your scRNAseq package. How do you tell whether you can use some object in the experiment to project new data? How do I know that the network wasn't calculated on a larger dataset, or some genes were filtered after a PCA was computed?

We use a special matrix class for the feature loadings that checks, upon subsetting (or related operations, e.g., combining), whether or not the operation discards/duplicates non-all-zero rows.

This would work, but it relies on wherever the subsetting happens knowing which constituent objects need to have a flag toggled. I think it would be easier to manage if we relied on the creation of the object to provide the necessary data (e.g. the PCA function records enough metadata for the user to tell if it's reasonable to use in projection). This could be some simple guess, like the shape of the input matrix, or potentially the set of features and samples present when it was calculated – probably ignoring issues of sorting or renaming. It could also be a hash of the input matrix.
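A sketch of what that creation-time record might look like (all field names invented for illustration):

```python
import hashlib
import numpy as np

def pca_provenance(X, feature_names):
    """Record input shape, feature set, and a content hash at PCA time."""
    X = np.ascontiguousarray(X, dtype=np.float64)
    return {
        "input_shape": X.shape,
        "features": tuple(feature_names),
        "sha256": hashlib.sha256(X.tobytes()).hexdigest(),
    }

def projection_ok(provenance, new_feature_names):
    # Simple heuristic: the new data must present exactly the features the
    # PCA was computed on, in the same order.
    return tuple(new_feature_names) == provenance["features"]

prov = pca_provenance(np.eye(3), ["g1", "g2", "g3"])
```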

Of course, it's debatable how much of this is the responsibility of the file format vs the in-memory representation vs the user.

Definitely agree with this. I don't like the idea of limiting the user too much, but it would be good for them to have enough information available to tell what operations are valid.


To me, this has brought up the issue of metadata. Do you have some standard for how metadata is recorded for applied functions? Something like a record of the parameter values?

LTLA commented 5 years ago

Oops, didn't realize you had a non-rhetorical question in there.

Just to be clear, what are you proposing would happen within SCE? If a user calculates a PCA, writes it to disk in this format, and reads it back, will they have a LinearEmbeddingMatrix?

Yes, that would be the plan.

Well, while I'm here, I might as well add my thoughts on your other points.

How do I know that the network wasn't calculated on a larger dataset, or some genes were filtered after a PCA was computed?

In the case of the PCA and a LinearEmbeddingMatrix, I would use the fact that the rotation matrix is not sync'd to the rows of the parent object, and thus doesn't get modified upon gene filtering. This enables a simple heuristic of just checking that any new cells to be projected have the same number and order of gene names as in the rotation matrix. Of course, this approach would complain unnecessarily if someone deliberately renamed the genes at some point, but I think that's tolerable.

It could also be a hash of the input matrix.

I've used hashes to track changes in other projects. It works but can be frustratingly fragile in interactive analyses. Perhaps it's okay in this limited application.

Do you have some standard for how metadata is recorded for applied functions? Something like a record of the parameter values?

R has this match.call() function that reports how the function was called. This is commonly seen throughout a lot of the base R functions, e.g., lm() will return an object that contains the call to itself. If you re-ran the call, you should get the same output (assuming nothing had changed in the meantime). You can also dissect the call to recover the argument values and put them somewhere for safe keeping. I don't use it much because I haven't needed to.
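For comparison, a rough Python analogue of match.call() can be sketched with inspect (a hypothetical helper, not part of any library):

```python
import functools
import inspect

def record_call(func):
    """Return (result, bound-arguments dict) so the call can be stashed
    as metadata, roughly like stashing match.call() output in R."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        bound = inspect.signature(func).bind(*args, **kwargs)
        bound.apply_defaults()              # record defaulted args too
        result = func(*args, **kwargs)
        return result, dict(bound.arguments)
    return wrapper

@record_call
def run_pca(X, n_comps=50, zero_center=True):
    return f"pca({len(X)} cells)"           # stand-in for real work

result, call = run_pca([1, 2, 3], n_comps=2)
# call records X, n_comps, and the defaulted zero_center
```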