theislab / zellkonverter

Conversion between scRNA-seq objects
https://theislab.github.io/zellkonverter/

Semantic conversions? #62

Open ivirshup opened 2 years ago

ivirshup commented 2 years ago

Picking up a point from a recent FOM meeting, is there any interest in performing more advanced IO based on semantic labelling?

For a concrete example, I'm thinking of reading in a PCA as a LinearEmbeddingMatrix (as discussed waaay back: https://github.com/ivirshup/sc-interchange/issues/2#issuecomment-509084997).

Is round-trip support for a LinearEmbeddingMatrix (and the ability to read one from a scanpy-computed object) in scope for this library? Is it worth it?

cc @joshua-d-campbell

lazappi commented 2 years ago

It might be useful to know more about what you mean by "semantic labelling" (particularly beyond this example).

For the PCA example, I think what happens now is that in AnnData2SCE() the coordinates are placed in reducedDims(sce) and the loadings are placed in a special varm column of rowData(sce). For SCE2AnnData(), I'm guessing if there is a LinearEmbeddingMatrix in reducedDims(sce) the coordinates will be stored in obsm but the loadings would be lost (I haven't tested this though).

Making sure the loadings get converted in SCE2AnnData() should be relatively straightforward but I think the other way would be trickier (potential issues around matching the names of things).
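To make the pieces concrete, here is a rough sketch in plain Python (no anndata or zellkonverter calls) of what would have to be gathered to build a LinearEmbeddingMatrix from a scanpy-style PCA. The keys `X_pca`, `PCs` and `uns["pca"]` follow scanpy's naming convention, and the output field names mirror the LinearEmbeddingMatrix accessors `sampleFactors`, `featureLoadings` and `factorData`. The dict layout and the helper `to_linear_embedding` are hypothetical and only illustrate the idea.

```python
# Plain-Python stand-in for an AnnData object after a scanpy PCA.
# Only the key names follow scanpy's convention; the values are made up.
adata_like = {
    "obsm": {"X_pca": [[1.0, 0.5], [-0.3, 0.2], [0.7, -0.9]]},  # cells x components
    "varm": {"PCs": [[0.8, 0.1], [0.6, -0.2], [0.0, 0.9], [0.1, 0.3]]},  # genes x components
    "uns": {"pca": {"variance_ratio": [0.7, 0.2]}},  # per-component metadata
}


def to_linear_embedding(adata_like, obsm_key="X_pca", varm_key="PCs", uns_key="pca"):
    """Hypothetical helper: collect the parts a LinearEmbeddingMatrix would hold.

    The pairing of obsm_key with varm_key is an assumption based on naming
    convention only; nothing in the object itself links the two arrays.
    """
    scores = adata_like["obsm"][obsm_key]
    loadings = adata_like["varm"][varm_key]
    n_components = len(scores[0])
    if len(loadings[0]) != n_components:
        raise ValueError("scores and loadings disagree on the number of components")
    return {
        "sampleFactors": scores,
        "featureLoadings": loadings,
        "factorData": adata_like["uns"].get(uns_key, {}),
    }


lem = to_linear_embedding(adata_like)
print(sorted(lem))
```

The default arguments are where the "matching the names of things" problem shows up: if another package stores its loadings under a different key, the helper has no way to know which arrays belong together.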

I'm also not sure how much LinearEmbeddingMatrix is actually used. I can't remember hearing about it before now but maybe I have been using it without noticing.

ivirshup commented 2 years ago

> It might be useful to know more about what you mean by "semantic labelling" (particularly beyond this example).

Basically that we have a way of saying "this array (or set of arrays) has this semantic meaning, so read it in as this type". E.g. "this is a PCA observation loading array, so zellkonverter knows to load it in as a LinearEmbeddingMatrix". This could also apply to a dataframe you may want to read in as GenomicRanges.
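As a minimal sketch of what that could look like, assuming some metadata block that declares the meaning of each group of arrays: the label names, the `type` values, the path strings and every function below are hypothetical, since no such standard exists in anndata or zellkonverter. The readers only return a description of the target type rather than building real objects.

```python
# Hypothetical semantic annotations that could travel with a file.
semantic_labels = {
    "pca": {"type": "linear_embedding", "scores": "obsm/X_pca", "loadings": "varm/PCs"},
    "peaks": {"type": "genomic_ranges", "table": "var"},
    "clusters": {"type": "something_unknown", "table": "obs"},
}

# Registry mapping a semantic type to the function that knows how to read it.
READERS = {}


def register(type_name):
    """Decorator that records a reader for one semantic type."""
    def decorator(fn):
        READERS[type_name] = fn
        return fn
    return decorator


@register("linear_embedding")
def read_linear_embedding(entry):
    return f"LinearEmbeddingMatrix from {entry['scores']} + {entry['loadings']}"


@register("genomic_ranges")
def read_genomic_ranges(entry):
    return f"GenomicRanges from {entry['table']}"


def plan_conversion(labels):
    """Pick a typed reader per label, falling back to a generic conversion."""
    plan = {}
    for name, entry in labels.items():
        reader = READERS.get(entry["type"])
        plan[name] = reader(entry) if reader else "generic conversion"
    return plan


for name, target in plan_conversion(semantic_labels).items():
    print(name, "->", target)
```

The point of the registry is that unlabelled or unrecognised data still converts the way it does today, so the semantic layer would be purely additive.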

@joshua-d-campbell could probably expand here.

> I'm guessing if there is a LinearEmbeddingMatrix in reducedDims(sce) the coordinates will be stored in obsm but the loadings would be lost (I haven't tested this though).

The variable loadings and other values are lost, I believe.

> think the other way would be trickier (potential issues around matching the names of things).

Yeah. I guess the question here is "how important is this for Bioconductor" and then "how could scverse make this unambiguous" (probably by deciding on some metadata standard). So:

> I'm also not sure how much LinearEmbeddingMatrix is actually used.

This would be important. It was one of the first things @LTLA brought up previously, but if you haven't run into issues maybe it's not a high priority.

lazappi commented 2 years ago

I think we might only want to support this kind of labelling if it was an official part of anndata with a standard way of storing it in the file. If it's just a convention that scanpy uses I can see a lot of potential matching issues, particularly with conflicts in objects processed by other packages.

ivirshup commented 2 years ago

> I think we might only want to support this kind of labelling if it was an official part of anndata with a standard way of storing it in the file.

I think that's fair. However, I think storing specifications for AnnData objects is out of scope for the anndata package. Instead, I think it could live in a specification project like single-cell-data/mams, which scanpy would commit to follow.