scverse / spatialdata

An open and interoperable data framework for spatial omics data
https://spatialdata.scverse.org/
BSD 3-Clause "New" or "Revised" License

New multiple table in-memory design #298

Closed LucaMarconato closed 5 months ago

LucaMarconato commented 1 year ago

Scope of the proposal

After receiving feedback from @EliHei2 @lopollar @melonora @LLehner @AdemSaglamRB, hearing users ask for multiple table support as an important requirement for adopting SpatialData in their tools (Maks, Louis), and having personally examined the development implications of the current table specification, I propose a new in-memory design for supporting multiple tables in SpatialData, which I have been maturing over the past months. Importantly, the on-disk table specification will not change.

Timeline and feedback

I propose to implement the new in-memory representation after the following work: aggregation and query refactoring (me), io refactoring (@giovp), a new pipeline for daily testing of all the notebooks (me), synthetic datasets for R/JS interoperability (me), napari bugfix (me and others). So ideally we would start in 1-1.5 months, enough time for receiving feedback and improving the design. In particular I kindly ask for feedback from @kevinyamauchi @ivirshup @gtca @ilia-kats. I put some usernames in raw form so as not to spam them now, but I will tag everybody after the first draft is refined, and additionally @timtreis @berombau @joshmoore.

Disk specification

The disk storage will remain the one described in https://github.com/ome/ngff/pull/64. What I propose to change is the in-memory representation of the data: instead of directly mirroring in-memory the metadata that we have on-disk, I propose to put the data in a form that is more ergonomic and that follows the common indexing practices of (geo)pandas, anndata and muon.

In-memory representation

I propose the following.

Code example 1: 3 visium slides, one common table.

Let's see how the current code and the new code would compare.

Current approach

from spatialdata.models import TableModel, ShapesModel
from anndata import AnnData
import numpy as np
from spatialdata import SpatialData

# shapes
visium_locations0 = ShapesModel.parse(np.random.rand(10, 2), geometry=0, radius=1)
visium_locations1 = ShapesModel.parse(np.random.rand(10, 2), geometry=0, radius=1)
visium_locations2 = ShapesModel.parse(np.random.rand(10, 2), geometry=0, radius=1)

# shared table annotating all three shapes elements
adata = AnnData(np.random.rand(30, 20000))
adata.obs['region'] = ['visium0'] * 10 + ['visium1'] * 10 + ['visium2'] * 10
adata.obs['region'] = adata.obs['region'].astype('category')
adata.obs['instance_id'] = np.tile(np.arange(10), 3)
adata = TableModel.parse(adata, region=['visium0', 'visium1', 'visium2'], region_key='region', instance_key='instance_id')

sdata = SpatialData(shapes={'visium0': visium_locations0, 'visium1': visium_locations1, 'visium2': visium_locations2}, table=adata)
sdata

which gives

SpatialData object with:
├── Shapes
│     ├── 'visium0': GeoDataFrame shape: (10, 2) (2D shapes)
│     ├── 'visium1': GeoDataFrame shape: (10, 2) (2D shapes)
│     └── 'visium2': GeoDataFrame shape: (10, 2) (2D shapes)
└── Table
      └── AnnData object with n_obs × n_vars = 30 × 20000
          obs: 'region', 'instance_id'
          uns: 'spatialdata_attrs'
with coordinate systems:
▸ 'global', with elements:
        visium0 (Shapes), visium1 (Shapes), visium2 (Shapes)

Common operations on the table, such as finding all the rows that correspond to an element (such as visium1), sorted in the same order, are available via helper functions, in this case the following:

from spatialdata import match_table_to_element

match_table_to_element(sdata=sdata, element_name='visium1')

which gives

AnnData object with n_obs × n_vars = 10 × 20000
    obs: 'region', 'instance_id'
    uns: 'spatialdata_attrs'

But even a simple operation like this one requires laborious code (see here and the function called internally here). This means a lot of work for us to implement and maintain, and it hinders the user's ability to easily extend the codebase or write a custom implementation to cover specific use cases.

In addition, some users have told me that the region, region_key, instance_key approach doesn't feel natural to them (or that they don't get it), and they end up not using that metadata.

New approach (pseudocode)

from spatialdata.models import ShapesModel
from anndata import AnnData
import numpy as np
from spatialdata import SpatialData

# shapes
visium_locations0 = ShapesModel.parse(np.random.rand(10, 2), geometry=0, radius=1)
visium_locations1 = ShapesModel.parse(np.random.rand(10, 2), geometry=0, radius=1)
visium_locations2 = ShapesModel.parse(np.random.rand(10, 2), geometry=0, radius=1)

# one table per shapes element
adata0 = AnnData(np.random.rand(10, 20000))
adata1 = AnnData(np.random.rand(10, 20000))
adata2 = AnnData(np.random.rand(10, 20000))
sdata = SpatialData(
    shapes={
        "visium0": visium_locations0, 
        "visium1": visium_locations1, 
        "visium2": visium_locations2
    }, 
    tables={
        "visium0": adata0, 
        "visium1": adata1, 
        "visium2": adata2
    },
)
sdata

which is much simpler than the previous approach. The code would give

SpatialData object with:
└── Shapes
      ├── 'visium0': GeoDataFrame shape: (10, 2) (2D shapes)
      │   └── AnnData object with n_obs × n_vars = 10 × 20000
      ├── 'visium1': GeoDataFrame shape: (10, 2) (2D shapes)
      │   └── AnnData object with n_obs × n_vars = 10 × 20000
      └── 'visium2': GeoDataFrame shape: (10, 2) (2D shapes)
          └── AnnData object with n_obs × n_vars = 10 × 20000
with coordinate systems:
▸ 'global', with elements:
        visium0 (Shapes), visium1 (Shapes), visium2 (Shapes)

Getting the table corresponding to visium1, with the order of the rows matching the element, does not require helper functions and would be as simple as

indices = sdata.shapes['visium1'].index
sdata.tables['visium1'][indices, :]

Concatenating and splitting

If the user needs to concatenate the various subtables into a unique one, a way to do this would be something like the following (still a bit laborious, but far less than using region, region_key and instance_key, and without introducing a knowledge entry barrier for the user). Also, we could bundle it in a helper function; see the sketch after the splitting example.

Merging the table into a global one:

from anndata import concat

adata_full = concat((adata0, adata1, adata2), keys=['visium0', 'visium1', 'visium2'], index_unique='_')

Splitting the merged table back:

adata0 = adata_full[adata_full.obs.index.map(lambda x: x.endswith('_visium0'))].copy()
adata0.obs.index = adata0.obs.index.map(lambda x: x.replace('_visium0', ''))
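A possible shape for such a helper, as a minimal sketch (the function names here are hypothetical, not part of the current API):

from anndata import AnnData, concat

def concatenate_tables(tables: dict[str, AnnData], sep: str = "_") -> AnnData:
    # the dict keys become an 'element' column; obs names get suffixed with the key
    return concat(tables, label="element", index_unique=sep)

def split_table(adata_full: AnnData, key: str = "element", sep: str = "_") -> dict[str, AnnData]:
    out = {}
    for name, obs in adata_full.obs.groupby(key):
        sub = adata_full[obs.index].copy()
        # strip the suffix added by index_unique to recover the original indices
        sub.obs.index = sub.obs.index.str.replace(f"{sep}{name}", "", regex=False)
        out[str(name)] = sub
    return out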

Code example 2: Visium + Xenium

Currently, to represent the Visium + Xenium data we need multiple SpatialData objects, and we must also use AnnData layers and .obsm to store the results of aggregation. For instance, this is the visium_roi_sdata object produced by the Visium + Xenium notebook "00":

SpatialData object with:
├── Shapes
│     └── 'CytAssist_FFPE_Human_Breast_Cancer': GeoDataFrame shape: (2826, 2) (2D shapes)
└── Table
      └── AnnData object with n_obs × n_vars = 2826 × 307
    obs: 'in_tissue', 'array_row', 'array_col', 'spot_id', 'region', 'dataset', 'clone'
    var: 'gene_ids', 'feature_types', 'genome'
    uns: 'spatial', 'spatialdata_attrs'
    obsm: 'spatial', 'xe_rep1_celltype_major', 'xe_rep1_celltype_minor', 'xe_rep2_celltype_major', 'xe_rep2_celltype_minor'
    layers: 'xe_rep1_cells', 'xe_rep2_cells', 'xe_rep1_tx', 'xe_rep2_tx'
with coordinate systems:
▸ 'aligned', with elements:
        CytAssist_FFPE_Human_Breast_Cancer (Shapes)

Note in particular that:

Using MuData tables would keep things more organized, something like this

SpatialData object with:
└── Shapes
    └── 'CytAssist_FFPE_Human_Breast_Cancer': GeoDataFrame shape: (2826, 2) (2D shapes)
        └── MuData object with n_obs × n_vars = 2826 × 1535.
            ├─── 'visium expression': AnnData object with n_obs × n_vars = 2826 × 307.
            ├─── 'xe_rep1_cells': AnnData object with n_obs × n_vars = 2826 × 307.
            ├─── 'xe_rep2_cells': AnnData object with n_obs × n_vars = 2826 × 307.
            ├─── 'xe_rep1_tx': AnnData object with n_obs × n_vars = 2826 × 307.
            ├─── 'xe_rep2_tx': AnnData object with n_obs × n_vars = 2826 × 307.
            ├─── 'xe_rep1_celltype_major': AnnData object with n_obs × n_vars = 2826 × 10.
            ├─── 'xe_rep2_celltype_major': AnnData object with n_obs × n_vars = 2826 × 10.
            ├─── 'xe_rep1_celltype_minor': AnnData object with n_obs × n_vars = 2826 × 10.
            └─── 'xe_rep2_celltype_minor': AnnData object with n_obs × n_vars = 2826 × 10.
with coordinate systems:
▸ 'aligned', with elements:
        CytAssist_FFPE_Human_Breast_Cancer (Shapes)

Notes:

Further considerations

Rows annotating multiple regions

Sometimes we would like to have rows of a table annotating multiple instances in different regions (e.g. the same cell represented as a circle, as a polygon, or as a label). We had some discussions in https://github.com/scverse/spatialdata/issues/34 and https://github.com/scverse/spatialdata/issues/99 and on Zulip, but I came to the conclusion that the easiest solution is just to duplicate the table and have one copy annotate each regions object. In particular this solves the problem of having different indices, or more or fewer instances, in one regions object. Matching indices can always be recomputed on the fly using geopandas.sjoin() (see the sketch below) or equivalent functions for raster data that we will need to write anyway to extend aggregate().
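A minimal sketch of recomputing the matching on the fly with geopandas (assuming geopandas >= 0.10 for the predicate argument; the geometries here are invented for illustration):

import geopandas as gpd
from shapely.geometry import Point

# hypothetical circles (cells) and polygons (the same cells, or annotated regions)
circles = gpd.GeoDataFrame(geometry=[Point(0, 0).buffer(1), Point(5, 5).buffer(1)])
polygons = gpd.GeoDataFrame(geometry=[Point(0, 0).buffer(2), Point(5, 5).buffer(2)])

# rows of `circles` matched to the rows of `polygons` they intersect
matches = gpd.sjoin(circles, polygons, how="inner", predicate="intersects")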

Subset by var/obs

One of the arguments for having one single table instead of multiple tables was to support syntaxes like this one (see more here: https://github.com/scverse/spatialdata/issues/43)

idx = (sdata.table.obs.cell_type == "Neuron") & (sdata.table.obs.n_counts > 100)
sdata = sdata[idx, :]

But the reality is that we haven't implemented such a function yet, and implementing these things can be really messy (as shown by the helper function discussed above to subset and reorder a table by a regions object). Furthermore, the implementation would also be made complex by the fact that the cell_type information is not necessarily found in the table, but could also be located in a column of the GeoDataFrame, as discussed above.

The syntax above would instead simply become something like this, which also feels very natural

idx = (
    (sdata.tables['visium'].obs.cell_type == "Neuron")
    & (sdata.tables['visium'].obs.n_counts > 100)
)
sdata = SpatialData(
    shapes={"visium": sdata.shapes["visium"][idx]},
    tables={"visium": sdata.tables["visium"][idx, :]},
)

Other linked issues

Linked discussion issues:

Issues that will no longer be relevant with this new design:

giovp commented 1 year ago

Thanks @LucaMarconato for this detailed post. I will try to answer point by point. While I agree that the current implementation is not optimal, I am still not convinced by the multiple table idea. I agree nonetheless that it is probably worth exploring alternatives.

Response to multiple tables arguments

Subset by var/obs

One of the arguments for having one single table instead of multiple tables was to support syntaxes like this one (see more here: https://github.com/scverse/spatialdata/issues/43)

idx = (sdata.table.obs.cell_type == "Neuron") & (sdata.table.obs.n_counts > 100)
sdata = sdata[idx, :]

This is partially true. While this is still very useful and indeed currently complicated to implement, another big motivation for the single table was that in practical use cases the user will do most of the molecular analyses at the single-table level, e.g.

sc.pp.normalize_total(sdata.table)

and not

for table in sdata.tables.values():
    sc.pp.normalize_total(table)

which in fact has a drastically different outcome.
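To make the difference concrete, a minimal sketch (reusing adata0–adata2 from the example above): with the default target_sum=None, normalize_total scales each cell by the median count of all the cells it sees, so per-table medians (each slide alone) generally differ from the global median.

import scanpy as sc
from anndata import concat

# single-table behavior: one shared scaling factor across all slides
adata_full = concat({"visium0": adata0, "visium1": adata1, "visium2": adata2}, label="region", index_unique="_")
sc.pp.normalize_total(adata_full)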

Furthermore, the implementation would also be made complex by the fact that the cell_type information is not necessarily found in the table, but could also be located in a column of the GeoDataFrame, as discussed above.

This is also not completely true, as we implicitly discourage storing additional annotations in the geodataframe. For example, no spatialdata-plot or napari-spatialdata function will be able to leverage that info for visualization, since both packages expect anything usable as annotation to live in the table. I have fairly strong priors on why I would prefer to keep it like this.

  • To not represent region, region_key, instance_key in-memory, and instead rely on the same shared indexing between the table and the (single) corresponding shapes/labels element. Code examples below. This may require allowing AnnData to use custom indices and not just strings (or at least also integers), or it would force the (geo)dataframes to have string indices. We will find a way around this.

I don't think this is very trivial tbh, and it was in fact a big blocker for putting geopandas dataframes directly in anndata (but not the only one). However, it might be a useful solution, will describe below. We might actually be able to overcome this.

  • For points, which are DaskDataFrame (and soon also DataFrame), and shapes, which are GeoDataFrame (and soon maybe also DaskGeoDataFrame), the user can choose to store metadata as additional columns of the dataframe, without having to add a new table. I wrote a helper function (get_values()), similar to scanpy.get.obs_df(), that allows retrieving a column from the df, from obs or from var. So having information located either in the dataframe or in the table is not a problem. Further note: the implementation of get_values() would become much simpler with the new table representation.

Again, I feel pretty strongly against this, and while it's true that for points we don't have an alternative, it's also true that for points the type of annotation is "different" from what we want for regions (shapes, labels).

Xenium + Visium example Notes:

  • now in the same object we could also store the Xenium data if we wanted. No need to use different SpatialData objects.
  • if we start with an implementation that doesn't use MuData but only AnnData, we can still obtain something similar to the above by duplicating the shapes layer. Suboptimal, but the duplicated data still has a low memory impact compared to the images, so it's a good temporary compromise.

I agree that in the above point having a single mudata object would be useful. The data duplication, while maybe not impactful from a memory perspective, is still quite confusing IMHO.

Response to spatialdata_attrs arguments

Concatenating and splitting

If the user needs to concatenate the various subtables into a unique one, a way to do this would be something like the following (still a bit laborious, but far less than using region, region_key and instance_key, and without introducing a knowledge entry barrier for the user). Also, we could bundle it in a helper function.

While I understand the critique, I would be very interested to see a different proposal on this, in particular in the case where regions are labels. We thought quite hard and long about this while discussing the ngff proposal. It is true that it is suboptimal in the case where regions are shapes, but I really don't see an alternative in the labels case.

Separate proposal

I feel like a lot of the complexity that is raised in the above points, as well as in practical experience with spatialdata, comes from the fact that we don't have a global index. If we were able to store the shapes region as a geopandas dataframe directly in anndata, then a lot of the indexing confusion would be resolved, and situations like "same region with different features" and "same feature with different region types" (e.g. circles and polygons in Xenium) could be directly handled by anndata or mudata.

Labels would still be stored separately since they are raster types, but I feel there is less of a problem if the index mismatch happens with labels. I am not sure how coordinate systems and transformations could be handled in this case yet.

Nevertheless, my very short proposal is the following: store shapes information directly into anndata (potentially obsm)

LucaMarconato commented 1 year ago

Thanks for adding your comments to the proposal. I answer point by point.

Response to multiple tables arguments

On concatenating tables

another big motivation for the single table was that in practical use cases the user will do most of the molecular analyses at the single-table level, e.g.

sc.pp.normalize_total(sdata.table)

and not

for table in sdata.tables.values():
    sc.pp.normalize_total(table)

I agree, the user would need to concatenate and deconcatenate the table; it would be one of the drawbacks of the proposal, a cost that I think is still worth paying.

But I also see a solution to this: the users, instead of concatenating and deconcatenating the table, can instead concatenate both the shapes elements and the table (sketched below). This is something that the user would probably like to do anyway when stitching together samples that have some overlap: in that case the overlapping regions would be chosen only from one shapes element or the other, so operating on a new geodataframe and a new unified table would be handier.
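A minimal sketch of what this could look like under the proposed multi-table design (element names taken from the example in the first post):

import pandas as pd
from anndata import concat

keys = ["visium0", "visium1", "visium2"]

# concatenate the GeoDataFrames; the keys become the first level of a MultiIndex
shapes_full = pd.concat([sdata.shapes[k] for k in keys], keys=keys)

# concatenate the per-element tables with matching keys
table_full = concat([sdata.tables[k] for k in keys], keys=keys, index_unique="_")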

Extra columns in dataframes

this is also not completely true, as we implicitly discourage storing additional annotations in the geodataframe. For example, no spatialdata-plot or napari-spatialdata function will be able to leverage that info for visualization, since both packages expect anything usable as annotation to live in the table. I have fairly strong priors on why I would prefer to keep it like this.

While I was reluctant to have this, I actually realized when working on the data and talking to users (see for instance https://github.com/scverse/spatialdata/issues/311) that it would be convenient to provide this flexibility. The get_values() function is designed to help with this: I made it in the context of aggregate(), since I wanted to be able to aggregate the columns present in the points. Furthermore, if we decide to store points as a (dask)geodataframe, it will be very natural to add columns to the geodataframe object used for circles, since then circles and points would be very similar.
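As a hedged sketch, usage could look like the following (the exact signature of get_values() may differ; the element name here is hypothetical):

from spatialdata import get_values

# retrieve 'cell_type' regardless of whether it is stored as a dataframe
# column of the element or in the obs/var of the annotating table
values = get_values("cell_type", sdata=sdata, element_name="cells")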

GeoDataFrame inside AnnData

I don't think this is very trivial tbh, and it was in fact a big blocker for putting geopandas dataframes directly in anndata (but not the only one). However, it might be a useful solution, will describe below.

One clarification: the blocker of having int indices in AnnData or str indices in GeoDataFrame would remain, but I don't propose to have a GeoDataFrame in AnnData; rather, to have a separate AnnData with common indices. The advantages of this are:

Response to spatialdata_attrs arguments

Annotations for labels can be done without region, region_key, instance_key.

The new proposal would also extend to labels. Say that you have a labels element for which np.unique(labels) is [0, 1, 2, 3]. Then you can have the annotation in a table with indices 1, 2, 3. The mismatch due to the background is not a problem because we can allow, as said above, for mismatches between the indices, and there is no need for extra spatialdata_attrs metadata.

More precisely, there is no need for region_key because now we have one table for each labels (regions) element. So to get the table for the labels sdata['labels'], one simply does sdata.tables['labels']; hence region is also not needed. And finally instance_key is not needed because the table stores what was the instance_key column in its index. A minimal sketch follows.
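A minimal sketch under the proposed design (the table-assignment API is hypothetical; note the string indices, since AnnData currently requires them):

import numpy as np
import pandas as pd
from anndata import AnnData

# annotate a labels element where np.unique(labels) == [0, 1, 2, 3]:
# background (0) simply has no row, since index mismatches are allowed
table = AnnData(
    X=np.random.rand(3, 5),
    obs=pd.DataFrame(index=pd.Index(["1", "2", "3"])),
)
# proposed: sdata.tables['labels'] annotates sdata['labels'], matched by name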

I see that in the labels case, contrary to the shapes case, the user needs to concatenate and deconcatenate the table (because stitching into a unique large labels element may not be an option), but I see this as a price worth paying. Also because the user can always approximate cells as polygons/centroids, and merge the shapes and tables together.

Other considerations

A free outcome of this proposal is that, if needed, one could even add tables for points. This could be useful when only the centroids are needed, for instance when using a table to store a graph between the centroids of cells, or when the distinction between points and circles is arbitrary (like for the seqfish data, since the radius of the circles is not known and we add one just so that the visualization is sensible).

My take

Considering the above comments, here is my answer to this:

I feel like a lot of the complexity that is raised in the above points, as well as in practical experience with spatialdata, comes from the fact that we don't have a global index. While the index conflict between geopandas and anndata is definitely a big challenge, it might be worth looking into solutions for this. If we were able to store the shapes region as a geopandas dataframe directly in anndata, then a lot of the indexing confusion would be resolved, and situations like "same region with different features" and "same feature with different region types" (e.g. circles and polygons in Xenium) could be directly handled by anndata or mudata. Labels would still be stored separately since they are raster types, but I feel there is less of a problem if the index mismatch happens with labels. I am not sure how coordinate systems and transformations could be handled in this case yet.

Nevertheless, my very short proposal is the following: store shapes information directly into anndata (potentially obsm)

  1. I think that we should not have a global index and should really focus on local tables with shared indices with the respective geometries.
  2. I think there will not be problems with labels.
  3. With the new coordinate systems refactoring there will be no problems with coordinate systems and transformations (since they will be stored in the parent SpatialData object and each element will only have the string name of the coordinate system in which it lives)
  4. As explained above I am more in favor of having tables and shapes separate, and linked only by having a common name.
aeisenbarth commented 1 year ago

Another aspect I didn't find when skimming through the proposal:

Parallel table updates

As far as I remember, one of the strong points for Zarr as storage backend was parallel write access.

Imagine a SpatialData object with n images. A user creates n parallel processes where each segments an image and stores a label (no write conflict, except maybe labels/.zattrs?). Then further n parallel processes where each performs region measurements and adds the resulting table to the same SpatialData, which has only a single (!) table.

Currently, when "appending" a table, I have to check whether the current table is None and create one, or do anndata.concat/AnnData.concatenate with the existing and new table and assign it to SpatialData.table.

  1. I might be wrong, but for a backed AnnData, concatenation seems not to write changes back to the storage backend, but only returns the concatenated result as a new in-memory object.
  2. When assigning to SpatialData.table, it then tells me I have to delete the existing table. I am sure deleting and writing the whole table again will cause conflicts between processes.

Whether we have a single table or multiple tables, there should be a thread-safe way to append a batch of new rows to it (and only write the new data).

ivirshup commented 1 year ago

I might be wrong, but for a backed AnnData, concatenation seems not to write changes back to the storage backend, but only returns the concatenated as a new in-memory object.

Next release of anndata will have support for concatenating stored anndata objects.
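For reference, a minimal sketch with the API that landed as anndata.experimental.concat_on_disk in anndata 0.10 (the paths here are hypothetical):

from anndata.experimental import concat_on_disk

# concatenate stored tables without loading them fully into memory
concat_on_disk(
    ["well0_table.zarr", "well1_table.zarr"],
    "combined_table.zarr",
    index_unique="_",
)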

MobiTobi commented 12 months ago

User Interface

they don't feel natural/get the region, region_key, instance_key approach

Hard agree. For me it was impossible to add a table without referring to models.py. And even then I did not immediately grok why I need a REGION_KEY and a REGION_KEY_KEY.

I really like the approach where each element can be equipped with a table. This organisation leads to a very obvious mental model of the contents of a spatialdata object. This is less relevant for homogeneous sdata objects, so let me give you an example for a combined Visium + Xenium experiment:

In this multitable world the relationship between table entries and spatial elements is plainly obvious. Everything is easily discoverable, even without documentation by the creators of a specific sdata object. A monolithic table would be much less transparent, intermingling features of spots, cells and tissue domains.

"There should be one — and preferably only one — obvious way to do it."

we implicitly discourage to store additional annotations in the geodataframe

Users repeatedly storing additional annotation in the geodataframe points to unmet needs on the user side. By reusing established objects from geopandas/pandas/xarray/dask, users of spatialdata inherit access to all their methods and attributes. This flexibility is not a bad thing per se, but it invites users to develop idiosyncratic usage patterns. Idiosyncratic usage of spatialdata inhibits smooth collaboration between users, leads to unexpected edge cases/bugs and prevents users from benefiting from the full range of spatialdata's features.

An obvious and flexible user interface is the best preventative measure against users feeling the need to bend spatialdata to their will by falling back on the flexibility of the underlying objects. I think multiple tables can provide both the flexibility and the obviousness to guide users to the one way to do things.

lopollar commented 12 months ago

Whether I prefer one table or multiple tables depends on the datasets I am using.

When having multiple fields of view in xenium that match one visium experiment, in an ideal world you would be able to group the xenium experiments in one table and have the visium in another table. However, I think this will become hard to keep intuitive.

I would prefer to move everything to different tables. A way to overcome the first issue would be (as @berombau already suggested somewhere) to have a function sdata.table that concatenates all the different sdata tables. Ideally, for me, you would be able to perform analysis steps on this concatenated table and have them also updated on the subtables (but to me, this sounds like a lot of work to program). The indices, however, could possibly become a mess, as cells with the same index can appear in the table. You would need a key (how different from region_key?) or a multi-index to keep them separated.

But I also see as a solution to this: the users, instead of concatenating and deconcatenating the table, can instead concatenate both the shapes elements and the table.

If the different tables and shapes refer to different images, with different labels layers, and contain duplicated indices, it will be difficult without the key (which you would want to deprecate) to match the cells back to the correct image, which is necessary for all downstream spatial analysis (plotting, neighborhood analysis, ...).

Matching tables to images

In the whole discussion, you mention matching the shapes, labels and table layers by name. However, how would you save the link between images and table/shapes/labels when not using a region_key? This is handy when plotting, or taking spatial subsets, or linking tables based on the images. Maybe this isn't really necessary, and it is possible to refer to the image manually every time.

they don't feel natural/get the region, region_key, instance_key approach

I was one of them, because when you start analysing data, you start with one field of view and one analysis pipeline. In that case, you don't need region_key, because all your cells belong to the same image and coordinate system. I think they are useful for analyses that go beyond a single image. Maybe make it possible to not use them, so you only have to think about them when working with multiple images? Or explain clearly why they are there?

but I came to the conclusion that the easiest solution is just to duplicate the table and have one copy annotate each regions object.

Make sure, if you do this, that when you update one table, the duplicated table is updated as well.

shapes in anndata

Nevertheless, my very short proposal is the following: store shapes information directly into anndata (potentially obsm)

Although this is intuitive, it is not always practical, certainly not if you want to include shapes layers that aren't cells (veins, regions, ...); these you don't want to store inside the anndata. When taking subsets of the data, you might still want to plot all the other shapes (plot the shapes that are in the anndata in colour, and all the other ones just as outlines, for example).

Also, if you would do it, solve the issues with interoperability.

shapes annotation

I wouldn't necessarily use this to annotate the shapes of the cells, but to annotate the shapes of all objects that aren't cells, like veins, necrotic layers, or annotated regions. Right now, if you want to annotate your regions, they all need a separate layer. I think these things are still very interesting to save in the same sdata object, as you use them to calculate features of the sdata tables, and you visualize them on the same images.

mudata

I am currently not yet working on this type of data, but this will be very handy for multiome data. However, not all cells will always match up.

The reason why I don't think this will solve the issue with transcriptomics is related to the fixed dimensions. It happens that a cell is picked up based on membrane staining but has no nucleus. In this case, the cell will only be present in the anndata of the whole cell and not in the one of the nucleus, and I don't know how well mudata can handle that (it happens in both directions btw, because the membrane stainings are not yet perfect).

keller-mark commented 11 months ago

Changing the on-disk format seems out of scope for this in-memory discussion. Nevertheless, it might be useful to imagine the ideal on-disk format and incrementally build towards it by first having an in-memory format that reflects it. (Also, as a user interested in accessing the data from JS for visualization, the on-disk representation is the primary way I interact with SpatialData.)

Apologies if any of this is redundant with things already discussed above.

Idea

Generalization of MuData to allow for storing separate tables by element, region, and modality. Each of these tables could have different shapes/columns, like in MuData, but the semantics of "element", "region", and "modality" could help determine how concatenation should occur.

E.g., index values would only need to be unique per region. If index values are shared across elements (or modalities) within a region, they would be assumed to refer to the same entity. SpatialData could provide helper functions for augmenting index values with region/element IDs to account for this (if using a multi-index is not feasible); a sketch of such a helper follows.
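A minimal sketch of such a helper (hypothetical, not an existing API):

import pandas as pd
from anndata import AnnData

def augment_index(adata: AnnData, region: str, element: str, sep: str = "_") -> AnnData:
    # suffix obs names with the region/element IDs so that indices
    # remain unique after concatenating across regions/elements
    adata = adata.copy()
    adata.obs["region"] = pd.Categorical([region] * adata.n_obs)
    adata.obs["element"] = pd.Categorical([element] * adata.n_obs)
    adata.obs_names = [f"{i}{sep}{region}{sep}{element}" for i in adata.obs_names]
    return adata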

APIs could look like:

sdata.tables.mod['rna'].region['tumor'].element['shapes']

or

sdata.tables.mod['rna'].element['shapes'].region['tumor']

Modality-agnostic:

sdata.tables.region['tumor'].element['shapes'] would return only the modality-agnostic table information. Perhaps this is the only option if we ignore the multimodal support.

If we omit region or element information, we can concatenate the tables by adding extra columns:

sdata.tables.region['tumor'] would have an appended element column.

sdata.tables.element['shapes'] would have an appended region column.

sdata.table (and sdata.tables.mod['rna']) would have both element and region columns appended.

A functional form could allow using custom names for these appended columns:

sdata.tables.get(element='shapes', region_key='custom_region_key')

or

sdata.tables.get(element_key='custom_element_colname', region_key='custom_region_colname').

Constructor syntax proposal:

sdata = SpatialData(
    shapes=[
        ShapesModel(region='visium0'),
        ShapesModel(region='visium1'),
        ShapesModel(region='visium2')
    ],
    tables=[
        TableModel(mdata0, region='visium0', element='shape'), # If MuData or modality-agnostic AnnData
        TableModel(adata1, region='visium1', element='shape', mod='rna'), # If MuData or AnnData
        TableModel(adata2), # If the region/element columns already exist in the AnnData object
    ],
)

On-disk format:

The on-disk format could have a deterministic structure, hierarchical like table/mod/{mod}/region/{region}/shapes (with something like table/region/{region}/shapes for modality-agnostic info), or somehow flattened as mentioned above:

On disk, the MuData table would be saved as a list of tables, so that the current specs are still valid.

This splitting of the tables on-disk, while out of scope here, would be more favorable to the use case of loading subsets of data via HTTP requests.

keller-mark commented 11 months ago

Matching tables to images

Related, from the perspective of any one SpatialElement, is the user able to easily understand which other subset of data it is related to?

Currently, there is a way to "navigate" from a table to the other elements. I can check table/table/uns/spatialdata_attrs and then look in shapes/{region}

Once there are multiple tables, can I still "navigate" from shapes/{region} or labels/{region} to the table which annotates those shapes (without checking table/table/uns/spatialdata_attrs)?

Perhaps this question only becomes relevant with a different on-disk format though, since in-memory it should be easy to iterate over multiple tables' metadata.

A solution to using string keys to match region information could be to use the extent of the coordinates of the region. This could help allow matching data between regions that do not 100% overlap. For example, if the tables for each region are stored separately on-disk, then the extents of their coordinates can also be stored easily on-disk. Then users can use "collision detection" to understand which subsets of data overlap across elements.

E.g., if a user of our visualization tool specifies that they are interested in viewing the Visium spot shapes, then we look up the extent of the coordinates of the region containing those spots, and then check if there is any tabular data that both annotates shapes and overlaps with this region in space.
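A minimal sketch of that lookup with shapely (the extents here are invented for illustration):

from shapely.geometry import box

# bounding boxes (min_x, min_y, max_x, max_y) stored alongside each region's table
spots_extent = box(0, 0, 100, 100)
candidate_table_extent = box(50, 50, 200, 200)

if spots_extent.intersects(candidate_table_extent):
    print("this table may annotate data overlapping the Visium spots")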

EliHei2 commented 11 months ago

Howdy @LucaMarconato @giovp @ivirshup @kevinyamauchi @others!

I'd like to share my thoughts on why using multiple tables could be a great idea for the future:

Here are just a few reasons I could name, but I'm sure there are more. I'd be happy to hear why people are rooting for one table only, and to discuss further!

LucaMarconato commented 11 months ago

Thank you everyone for taking the time to write your very precious feedback!

I have discussed the content of this issue and all the proposals with @berombau @SilverViking @lopollar @melonora @ivirshup @arnedefauw in-person and all the participants agreed that adding multiple tables support is a priority.

The proposal that I added in the first post was well received; I summarize it here very briefly for those who didn't read the full discussion:

The main technical challenge for implementing the proposal is the fact that AnnData does not support integer indices, but I will try to work out a solution for this, at worst representing the same integers as strings in AnnData and as strings/ints elsewhere.

The three main design points that were highlighted as delicate were the following:

  1. in multimodal cases the new specification works, while for single-modality multi-sample cases the specification would be less ergonomic because the user would have to manually merge and unmerge the tables, or loop over them. The user can work around this (while with the current specs there is not really a way to work around the multiple table limitation); so the consensus was to implement multiple tables support first, and then iterate after receiving user feedback to try to allow for a more ergonomic merged-table experience in the multi-sample case.
  2. related to the previous one: if tables are being concatenated and deconcatenated, one would get something similar to region, region_key and instance_key. I think that the approach described by @keller-mark could be a solution to this, since it replaces those three variables with a smart use of a single one (named region). I will talk about this next.
  3. with the current or new specs one would still not be able to annotate multiple equivalent representations of the same set of instances (like cells that are seen both as polygons and as labels). With the new specs an (ok) workaround is to duplicate the table. We should think of something better later on to avoid data redundancy.

The proposal from @keller-mark would help with 2 because, by requiring the user to specify region when calling the model parsers for elements and tables (so the user knows what's going on with region), and by implementing shared indexing (as in my proposal), one would effectively achieve the behavior of region, region_key, instance_key by using the new region variable in place of the old region and region_key variables, but with a self-documented/intuitive user experience. A downside of this is that all the data models would need to be changed to accept the region value, and this would need to be saved in the metadata. We initially had this (called name), but we removed it to keep all the metadata at the SpatialData level. Changing the models on top of implementing multiple tables would be a much bigger piece of work, so better to keep them separate. The consensus from the discussion was to start with the implementation I described, test and get feedback, and then try to iterate to get closer to the APIs from Mark.

Another advantage of the proposal from Mark would be a tighter integration with muon. Since MuData would be supported in a follow-up PR and not in the initial implementation of the multiple tables support, it is better to postpone the muon-like APIs to that moment.

The issue will remain open for additional comments and feedback. I plan a few weeks of work to merge the other PRs and finalize other work across the framework. If no blockers or strong objections arise, I plan to start implementing this after that.

dbdimitrov commented 10 months ago

Hey @LucaMarconato,

Great to see that you're actively working on this!

It wasn't exactly clear to me whether the preferred format for multi-modal data is multiple AnnDatas or a single MuData. Would be nice to get your suggestion on this :)

Daniel

LucaMarconato commented 10 months ago

Hi @dbdimitrov, ultimately we want to support MuData and this would be the preferred format. For the moment the conversions between MuData and SpatialData need to be done by hand, and MuData is generally preferred over multiple AnnData objects for its extended capabilities via the muon APIs.

aeisenbarth commented 10 months ago

Hi, I've been implementing SpatialData support for a high-throughput use case (slides with multiple wells). With the single-table design I am having some struggles, but I also see its benefits. I'll try to describe our use case here so it can be taken into account in the multi-table design.

Principles:

Use case: In a single SpatialData, we have ca. 2 microscopy modalities stored in images (can be whole-slide or per well). We crop wells from it and for each we compute two segmentations as labels. We compute region properties and store them in obs, with the region referring to the labels. We also store mass spectrometry measurements in X for a subset of segmented instances (where there are no measurements, X will contain NaN). We compute pairwise relationships between labels instances (per well, but not between different wells).

[Diagram: spacem-ht-spatialdata-erd]

One SpatialData for all wells:

Single-table for all elements:

Multiple-table (1 table per element):

In my view, the new proposal should work for us, but since it is about the in-memory representation without changing the single on-disk table, it still needs incremental table updates. The need to add some rows/columns without Pandas operations that return a new (merged) table is pressing for us.

aeisenbarth commented 10 months ago

With the single-table design, I see a huge limitation. I don't know whether this has already been considered, or how a new in-memory API on top of the single on-disk table solves this.

When concatenating SpatialData tables (for example multiple tables for different regions), not all columns might be available in all tables. There will be missing values, and a single table must support missing values for any data type. With multiple tables, we can simply avoid empty columns and minimize the chance of missing values.

As it is now, Pandas/AnnData fill them with NaN, changing the data type if necessary (!). For example, an integer column concatenated with an empty column becomes a column containing floats or NaN. That means whoever wants to read values from the supposed integer column must defensively convert to int to avoid follow-up errors.
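A minimal sketch reproducing the coercion (the outer join fills the missing obs column with NaN, which forces the int column to float):

import numpy as np
import pandas as pd
from anndata import AnnData, concat

a = AnnData(np.zeros((2, 1)), obs=pd.DataFrame({"count": [1, 2]}, index=["a", "b"]))
b = AnnData(np.zeros((2, 1)), obs=pd.DataFrame(index=["c", "d"]))

merged = concat([a, b], join="outer")
print(merged.obs["count"].dtype)  # float64, no longer int64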

kevinyamauchi commented 10 months ago

Hey @aeisenbarth ! I agree the single table design has serious limitations. Thank you for sharing your thoughts and use case. I just wanted to mention that @melonora is going to be taking on adding multiple table support starting next week.

LucaMarconato commented 10 months ago

@aeisenbarth thank you for sharing your multi-well use case in your comment.

Multiple-table (1 table per element):

➕ Easier to build incrementally
➕ Easier for users to select AnnDatas for specific regions
➕ var is sometimes used to store summary statistics (per well), like n_counts from scanpy.pp.filter_genes. In a single table for many wells, we have no place to store such statistics for slices of some rows, since var values refer to all rows.
➖ Requires stand-alone tables for data associated with no element or multiple elements (inter-element relationships).

I agree with you on the benefits that the new multiple table design would have. In addition, the last point should not be a problem either, because the new design will allow singleton tables (not associated to elements) to be stored, so it will be possible to store, for instance, one such table, containing only the obsp, for each well.

One note on this: if one index is changed in one "regionprops" table, the change will not be forwarded to the corresponding "obsp" table. Anyway, I think that this should not be a problem because, in the context of constructing data incrementally, the indices would not change after constructing the object.

One additional note relative to your use case is that we are going to allow for nested NGFF hierarchies. So it will be possible to open and write a single SpatialData object with a custom subgroup hierarchy, which in your case could be organized by slides and wells. The issue to keep track of this is this one: https://github.com/scverse/spatialdata/issues/398. We will work on this on a separate branch that we will merge to main only after some other breaking changes are implemented. You will be able to keep track of the progress here: https://github.com/scverse/spatialdata/milestone/1.

LucaMarconato commented 10 months ago

I also comment on the benefits of the other two approaches that you mentioned: "One SpatialData for all wells" and "Single-table for all elements". I will not comment on the limitations, since the multiple table design would relax them.

One SpatialData for all wells:

➕ Users can directly analyze high-throughput data and avoid concatenation. I'd rather not take all per-well SpatialData objects and write a concatenated copy of them again.

Concatenation and deconcatenation will be required; this will introduce some overhead, but I believe that this is acceptable considering all the other implied benefits of the multiple table design. Also, we could mitigate this by providing some convenient helper functions.

Single-table for all elements:

➕ We can have annotations in obs referring to instances that have multiple representations (points, labels).

The reality is that also with a single table one needs to choose what to annotate (for instance labels or shapes), and to switch one would have to update the table metadata. We do this for example in the screenshot here (from the Xenium + Visium notebook).

With the new design we will still not allow one table to annotate multiple elements, but it will be easier to map a table to another element, since the table will be matched to the shapes/labels by table.obs.index and not by the combination of the region_key and instance_key columns.

➕ We would like to store inter-element relationships in obsp. These are between instances of segmentations of a well, but not between wells. In a single table, obsp grows with n_obs², but is very very sparse. We would need to be sure it is sparse on disk. In general, the same is true for all other AnnData elements that may have values only for some rows.

As discussed in my previous message, storing tables containing obsp and not associated with any element will be allowed.

aeisenbarth commented 8 months ago

I'm glad to see the progress on this issue! Just documenting another point that can/will be solved through multiple tables:

Per-region variable/feature aggregation metrics

table.var["sum"] = table[table.obs.region == "region1"].X.sum(axis=0)

table.var["sum"] = table[table.obs.region == "region2"].X.sum(axis=0)  # overwrites the region1 values

These are metrics computed for each matrix column (feature), like mean/std (which could easily be computed later whenever needed), but also custom metrics which we really want to store. Typically, these are stored as variable annotations in the var dataframe (e.g. n_counts in scanpy). However, with a single table comprising multiple regions (including regions with empty/NaN X matrix), there is only a single var dataframe which applies to all observations, not a subset of them.
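Under the multi-table design this has a natural home, as a minimal sketch (assuming one table per region, named as in the use case above):

import numpy as np

# each region's table has its own var dataframe, so per-region metrics don't clash
for name in ["region1", "region2"]:
    t = sdata.tables[name]
    t.var["sum"] = np.asarray(t.X.sum(axis=0)).ravel()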

keller-mark commented 1 month ago

Is there a plan for the multiple tables to be reflected in the on-disk format? (or is that not intended?)

I did not see any issues related to this, but I only did a quick search.

LucaMarconato commented 1 month ago

To not represent region, region_key, instance_key in-memory, and instead rely on the same shared indexing between the table and the (single) corresponding shapes/labels element.

In the end, the original idea brainstormed in the first post didn't make it into the final implementation. This came from first trying a draft implementation where we avoided using region, region_key and instance_key, but we then realized that indices were not suitable for safely storing the relationship between tables and instances (because reindexing operations are common).

Long story short, the multiple table in-memory design now still matches the original storage format 100%; we simply allow multiple tables to be present (which was never a restriction on-disk).

LucaMarconato commented 1 month ago

Two more comments.

First, to improve the user experience (which was one of the goals of the original design proposed in this issue), we implemented a series of APIs, such as the join operations and functions to easily re-assign the target of a table to another element.
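For example, a hedged sketch of the join API mentioned above (argument names may differ slightly; see the spatialdata documentation for the current form):

from spatialdata import join_spatialelement_table

elements_dict, table = join_spatialelement_table(
    sdata=sdata,
    spatial_element_names="visium1",
    table_name="table",
    how="left",
)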

Second, we realized that in-memory we can actually drop region; once we implement this, it will be the only difference between the in-memory and storage formats: https://github.com/scverse/spatialdata/issues/629.