adata slots for table and point spec in ngff

giovp commented 2 years ago

Related to #64 and https://github.com/kevinyamauchi/ome-ngff-tables-prototype. As discussed this morning with @kevinyamauchi, this is a description of the adata slots and how they are used in https://github.com/theislab/squidpy and other spatial analysis tools of the https://github.com/theislab/scanpy ecosystem.

adata.X and adata.layers["layer"] store molecular info (gene/protein expression etc.).
adata.obsm stores various "latent" representations of obs (e.g. PCA/UMAP coordinates) but also:
- adata.obsm["spatial"] stores obs coordinates in space, with shape (N,2) or (N,3).
- adata.obsm["molecule_spatial"] will store molecule location in FISH-based data (with awkward arrays).
adata.varm no real use in spatial data afaik
adata.obsp stores adjacency matrices of e.g. graphs in spatial coordinates, knn graphs in latent spaces etc.
adata.varp no real use in spatial data afaik
adata.uns stores a bunch of image-related data. It is structured as follows:
- adata.uns["spatial"] contains library_id keys that correspond to unique identifiers of images (e.g. tissue slides). These values are also stored in adata.obs["library_ids"], which can be used to subset anndata based on the tissue slide of interest. Furthermore, inside adata.uns["spatial"][<library_id>] there are 2 more dictionaries:
  - images: small tissue images (on the order of MBs)
  - scalefactors: metadata for scaling the original coordinates in adata.obsm["spatial"], as well as other info
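
To make the nesting concrete, here is a minimal sketch of that layout using plain Python containers in place of a real AnnData object. The library ids `"slide_A"` and `"slide_B"`, the image key `"hires"`, and the scale-factor key `"tissue_hires_scalef"` are assumptions for illustration (Visium-style names, not taken from this thread), and the images are placeholder strings rather than actual arrays.

```python
# Plain-dict stand-ins for the relevant AnnData slots (illustration only).
obsm = {"spatial": [(100.0, 200.0), (150.0, 260.0), (400.0, 80.0)]}  # shape (N, 2)
obs = {"library_ids": ["slide_A", "slide_A", "slide_B"]}  # one entry per obs

uns = {
    "spatial": {
        "slide_A": {
            "images": {"hires": "<small image array>"},  # placeholder value
            "scalefactors": {"tissue_hires_scalef": 0.5},  # assumed key name
        },
        "slide_B": {
            "images": {"hires": "<small image array>"},
            "scalefactors": {"tissue_hires_scalef": 0.25},
        },
    }
}


def scaled_coordinates(library_id):
    """Return the coordinates of one slide, scaled to its hires image."""
    factor = uns["spatial"][library_id]["scalefactors"]["tissue_hires_scalef"]
    return [
        (x * factor, y * factor)
        for (x, y), lib in zip(obsm["spatial"], obs["library_ids"])
        if lib == library_id
    ]


coords_a = scaled_coordinates("slide_A")
coords_b = scaled_coordinates("slide_B")
```

The point of the sketch is only that the library id ties three things together: the rows of obs, the image, and the scale factors needed to map coordinates onto that image.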

adata.uns also stores intermediate results from several analysis tools in the ecosystem, e.g. trajectory analysis, velocity, various plotting params etc. I would therefore consider supporting it for better integration in the ecosystem. It would also be OK to store the same type of info in metadata in ngff, and then handle this on the API side (it'd be fine for us at Squidpy, not so sure for others).

@kevinyamauchi next week I'll try out https://github.com/kevinyamauchi/ome-ngff-tables-prototype and report back, thanks again for sharing.

Just want to mention one more time that this is super exciting and I am really looking forward to seeing how it develops!

pinging various people @ivirshup @michalk8 @hspitzer @AnnaChristina @LucaMarconato

kevinyamauchi commented 2 years ago

Thanks, @giovp ! This is very helpful. In principle, it seems like all of these attributes could be included in the table spec. I don't think they add any additional data types (potentially with the exception of awkward arrays), so I don't think it adds much extra work for the non-python implementations.

A couple of follow-up questions:

- Are there any constraints on what is allowed to be stored in adata.uns? Is it generally dictionaries, numbers, strings, arrays of numbers, and arrays of strings? There wouldn't be some scanpy or squidpy object stored in there, right?
- Do you think the images in adata.uns could be stored as an image in the OME-NGFF file? If so, maybe there could be a reference in uns that says where the image is stored. I suppose the additional bonus is this could potentially open the door to linking to images in different files. From our conversation, it seemed like this would be okay, but I just wanted to double check.
- Are you already using the awkward arrays in anndata.obsm? If so, do you already have a way to write them to zarr? I was looking around and only found https://github.com/zarr-developers/zarr-specs/issues/62. I think we briefly chatted about this in our call, but I can't remember what the conclusion was. Maybe @joshmoore remembers?
- I was playing around with scanpy today and learned there is also AnnData.raw. Does that also need to be saved to disk?

joshmoore commented 2 years ago

> Maybe @joshmoore remembers?

There hasn't been a zarr proposal yet. Interestingly, @eriknw joined the zarr call last night and I mentioned to him that @ivirshup might be getting in touch. I know @martindurant is interested as well. @msankeys963 and I can spend some time getting the existing issues cleaned up.

giovp commented 2 years ago

Thanks @kevinyamauchi for the prompt reply!

> are there any constraints to what is allowed to be stored in adata.uns? Is it generally dictionaries, numbers, strings, arrays of numbers, and arrays of strings? There wouldn't be some scanpy or squidpy object stored in there, right?

Exactly. I think there could also be pandas dataframes, and there is interest in tuples and named tuples afaik, but I think what you listed should be enough. Maybe @ivirshup can comment more on this?

> Do you think the images in adata.uns could be stored as an image in the OME-NGFF file? If so, maybe there could be a reference in uns that says where the image is stored. I suppose the additional bonus is this could potentially open the door to linking to images in different files. From our conversation, it seemed like this would be okay, but I just wanted to double check.

Yes, exactly. This is something that would be very useful, and we'd be happy to change the current API in squidpy to accommodate that eventually. The current solution is not sustainable and doesn't really scale.

> Are you already using the awkward arrays in anndata.obsm? If so, do you already have a way to write them to zarr? I was looking around and only found https://github.com/zarr-developers/zarr-specs/issues/62. I think we briefly chatted about this in our call, but I can't remember what the conclusion was. Maybe @joshmoore remembers?

I'm working on adding awkward array support in both varm and obsm here: https://github.com/theislab/anndata/pull/647/ . IO is what is currently missing. I am not sure how easy it'd be to write them to zarr; maybe @ivirshup can chip in here?

> I was playing around with scanpy today and learned there is also AnnData.raw. Does that also need to be saved to disk?

I completely forgot about raw, sorry for that. Yes, I guess that would also need to be saved to disk. Afaik it's not used much anymore; it was more useful when anndata didn't have layers. Again, I'd like @ivirshup to comment on that.

kevinyamauchi commented 2 years ago

Thank you for all of the feedback, @giovp ! I will have a look at your awkward array PR.

Would the best way for me to play with an AnnData object containing typical spatial data be to make one using the instructions from one of your nice tutorials? Perhaps this one?

ivirshup commented 2 years ago

Values in `.uns`

> are there any constraints to what is allowed to be stored in adata.uns? Is it generally dictionaries, numbers, strings, arrays of numbers, and arrays of strings? There wouldn't be some scanpy or squidpy object stored in there, right?

In the new release candidate you can put an AnnData in .uns. Basically anything that we're able to write can be put in uns.

ragged/ awkward array storage

> I'm working on adding awkward array support in both varm and obsm here https://github.com/theislab/anndata/pull/647. IO is what it is currently missing, I am not sure how easy it'd be to write them to zarr

My hope is this should be fairly straightforward with ak.to_buffers, but maybe @joshmoore or other zarr developers would know more here. Would also be happy to take a different approach, like directly using the zarr ragged array encoding.

raw

So, I would like to deprecate .raw. Still need to figure out just how feasible that is, whether it needs to be directly replaced.

One option here is using mudata for shared observations with non-shared variables. I've written a bit more on this here: https://github.com/scverse/mudata/issues/13. Ambrose has also proposed this kind of approach attached around the singlecelldata/matrix-api, but I'm not sure there's much of that conversation on github.

joshmoore commented 2 years ago

> My hope is this should be fairly straightforward with ak.to_buffers, but maybe @joshmoore or other zarr developers would know more here. Would also be happy to take a different approach, like directly using the zarr ragged array encoding.

:+1: Work here will likely start to ramp up soon (perhaps along with the soon to be listed https://github.com/zarr-developers/gsoc/tree/main/2022). Seems like it's a good time to get all of us working in the same direction.

ivirshup commented 2 years ago

Is there a good place to discuss the awkward array proposal? I'm wondering whether the goal here is more like ak.to_buffers or zarr ragged array. In particular, where do current needs sit on interoperability? Many languages do ragged arrays, but I think awkward adds some features on top of that.

ivirshup commented 2 years ago

@kevinyamauchi, one more point I forgot to add about potential future changes in AnnData: X may be just another layer. It's also optional at the moment.

One place this might play into the design on OME is for point data, if X is still being used to store the coordinates. Have you considered putting the coordinates in obsm instead, and having X be a sparse matrix? The points table could then have shape n_points x n_var_types. I think there could be a couple advantages here:

- There's probably more annotation per kind of point than per coordinate dimension (and the dimension's metadata is captured elsewhere)
- Less repetition of metadata per kind of point in the .obs table.
- Numeric values per point, e.g. intensities, probabilistic assignments
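
A rough sketch of that layout, using plain Python instead of AnnData and scipy: each point is an observation, its coordinates live in an obsm-like list, and X is a sparse n_points x n_var_types matrix held here as a dict keyed by (row, column). The gene names, coordinates, and intensity values are made up for illustration.

```python
# Hypothetical detected points: (gene, x, y, intensity).
points = [
    ("Actb", 10.5, 3.2, 0.9),
    ("Gapdh", 4.0, 8.1, 0.7),
    ("Actb", 7.7, 1.0, 0.4),
]

# One column per kind of point (the "var" axis).
var_names = sorted({gene for gene, *_ in points})
col_of = {gene: j for j, gene in enumerate(var_names)}

# obsm["spatial"]-like coordinates, shape (n_points, 2).
spatial = [(x, y) for _, x, y, _ in points]

# Sparse X with a single nonzero per row, stored as {(row, col): value}.
X = {
    (i, col_of[gene]): intensity
    for i, (gene, _, _, intensity) in enumerate(points)
}

shape = (len(points), len(var_names))
```

Per-gene annotation would then sit once in a var-like table with one row per entry of `var_names`, instead of being repeated for every point.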

giovp commented 2 years ago

@joshmoore an update regarding reading/writing awkward arrays in zarr: we ended up doing it with ak.to_buffers.

This is the relevant code from https://github.com/theislab/anndata/pull/647

```python
from types import MappingProxyType

# _REGISTRY, IOSpec, H5Group, ZarrGroup, AwkArray, write_elem, read_elem and
# _read_attr are anndata IO internals; see the PR for where they are defined.


@_REGISTRY.register_write(H5Group, AwkArray, IOSpec("awkward-array", "0.1.0"))
@_REGISTRY.register_write(ZarrGroup, AwkArray, IOSpec("awkward-array", "0.1.0"))
def write_awkward(f, k, v, dataset_kwargs=MappingProxyType({})):
    import awkward as ak

    group = f.create_group(k)
    # Decompose the array into a layout description, its length, and flat buffers.
    form, length, container = ak.to_buffers(v)
    group.attrs["length"] = length
    group.attrs["form"] = form.tojson()
    # The buffers are written as a regular mapping of arrays.
    write_elem(group, "container", container, dataset_kwargs=dataset_kwargs)


@_REGISTRY.register_read(H5Group, IOSpec("awkward-array", "0.1.0"))
@_REGISTRY.register_read(ZarrGroup, IOSpec("awkward-array", "0.1.0"))
def read_awkward(elem):
    import awkward as ak

    form = _read_attr(elem.attrs, "form")
    length = _read_attr(elem.attrs, "length")
    container = read_elem(elem["container"])

    # Reassemble the awkward array from its stored parts.
    return ak.from_buffers(form, length, container)
```

where:

- form is a JSON-formatted string describing the array layout
- length is an int giving the array length
- container is a dict holding the actual data buffers

as per API and tutorial
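
For readers unfamiliar with this kind of decomposition, here is a conceptual sketch in plain Python of how a ragged array can be split into a description, a length, and flat buffers, and then rebuilt. It does not call awkward: the helper names `to_flat_buffers` and `from_flat_buffers` and the buffer keys are hypothetical, and the real form produced by `ak.to_buffers` is richer than the one shown here.

```python
import json


def to_flat_buffers(ragged):
    """Split a list of lists into (form, length, container)."""
    offsets = [0]
    content = []
    for row in ragged:
        content.extend(row)
        offsets.append(len(content))  # end position of this row in `content`
    # A toy "form": a JSON string naming the buffers that hold each part.
    form = json.dumps({"offsets": "offsets", "content": "content"})
    container = {"offsets": offsets, "content": content}
    return form, len(ragged), container


def from_flat_buffers(form, length, container):
    """Rebuild the ragged list from the flat buffers."""
    keys = json.loads(form)
    offsets = container[keys["offsets"]]
    content = container[keys["content"]]
    return [content[offsets[i]:offsets[i + 1]] for i in range(length)]


ragged = [[1.0, 2.0], [], [3.0, 4.0, 5.0]]
form, length, container = to_flat_buffers(ragged)
restored = from_flat_buffers(form, length, container)
```

Because the container only holds flat, regular arrays and the form and length are small attributes, each piece maps onto something zarr or HDF5 can already store.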
