scverse / spatialdata

An open and interoperable data framework for spatial omics data
https://spatialdata.scverse.org/
BSD 3-Clause "New" or "Revised" License

ImageTilesDataset throws RuntimeError during validation when instantiated #628

Closed tsvvas closed 1 month ago

tsvvas commented 2 months ago

Hello @LucaMarconato,

I am continuing the analysis of Xenium + post-Xenium IHC data. After rasterization, I want to visualize the cell nuclei in IHC, and ImageTilesDataset from the deep learning tutorial seems to be the way to do that.

However, ImageTilesDataset throws a RuntimeError when instantiated, telling me that the provided indices are not annotated by the table. The problem seems to be the check in lines 213-215, which at first glance looks almost unnecessary.

In my dataset get_table_keys returns keys only for one spatialdata element, which is cell_circles:

>>> spatialdata.models.get_table_keys(sdata.tables["table"])
('cell_circles', 'region', 'cell_id')
>>> sdata.tables["table"].obs.region.value_counts()
region
cell_circles            148354
Name: count, dtype: int64

Should I manually add the annotations for the other elements? I couldn't find a setter for those in the documentation.

It also seems that the data structure in uns doesn't allow a table to have many annotation columns at the same time, as there is only one region key per table.

How can I correctly set the attributes to avoid the RuntimeError?

Many thanks, Vasily

tsvvas commented 2 months ago

Reading through the issues, I think I found the setter. However, it also throws an error while checking the target region column symmetry, probably because my new element has only a subset of the indices from the table:

>>> sdata.set_table_annotates_spatialelement(table_name="table", region="nuclei_subset_shapes")
File ...spatialdata/models/models.py:1078, in check_target_region_column_symmetry(table, region_key, target)
...
ValueError: Mismatch(es) found between regions in region column in obs and target element: cell_circles

LucaMarconato commented 2 months ago

Hi Vasily,

This tutorial shows how to manipulate the table. I suggest having a look at it, as it illustrates some functions that you may find useful for your issue: https://spatialdata.scverse.org/en/latest/tutorials/notebooks/notebooks/examples/tables.html

Also, the function get_element_instances() (recently added, not in the tutorial yet) could be useful.

Anyway, a quick fix could be to call join_spatialelement_table() with how='right' and then use the dataloader on the returned circles object.
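
As a rough analogy for what a right join does here, the following is a minimal sketch in plain pandas rather than in spatialdata itself. The instance ids, radii and counts are made up for illustration, and join_spatialelement_table() may differ in its details. The table sits on the right-hand side, so every table row is kept and only the circles that a table row points to survive:

```python
import pandas as pd

# Hypothetical circles element: one row per instance, indexed by instance id.
circles = pd.DataFrame(
    {"radius": [1.0, 1.5, 2.0, 2.5]},
    index=pd.Index([10, 11, 12, 13], name="cell_id"),
)

# Hypothetical table obs: annotates only some of the circles.
obs = pd.DataFrame(
    {
        "region": ["cell_circles"] * 3,
        "cell_id": [11, 12, 13],
        "n_counts": [5, 7, 9],
    }
)

# Right join: keep every table row, and only the circles matched by a table row.
joined = circles.reset_index().merge(obs, on="cell_id", how="right")

# The circles that remain are exactly those annotated by the table.
kept_circles = circles.loc[joined["cell_id"]]
```

In this sketch the unannotated circle (id 10) is dropped, which is why a dataset built on the returned circles no longer sees instances without a table row.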

If you had a labels object, this would not be possible because, as the docs for join_spatialelement_table() say, the right join is not available for labels. In that case you could add rows to the matrix, keeping the same value in the region_key column (in your case 'region') and adding extra rows for the unannotated label instances via the instance_key column (in your case 'cell_id').
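
The row-adding idea can be sketched on the obs DataFrame alone, using plain pandas. The element name "cell_labels" and the ids are hypothetical, and a real table would also need matching rows in the expression matrix, which this sketch does not cover:

```python
import pandas as pd

region_key, instance_key = "region", "cell_id"

# Hypothetical obs of the existing table: two annotated label instances.
obs = pd.DataFrame(
    {region_key: ["cell_labels", "cell_labels"], instance_key: [1, 2]}
)

# Hypothetical instance ids present in the labels element.
label_ids = [1, 2, 3, 4]

# Instances that no table row points to yet.
missing = sorted(set(label_ids) - set(obs[instance_key]))

# New rows reuse the same region value and carry the missing instance ids.
extra = pd.DataFrame({region_key: "cell_labels", instance_key: missing})
obs_full = pd.concat([obs, extra], ignore_index=True)
```

After this, every label instance has a row with the same region value, so the table annotates the whole element.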

Please let me know if this leads to a solution for your problem.

tsvvas commented 2 months ago

Hi Luca,

I managed to patch the region-related attributes with the following function:

import pandas as pd
import spatialdata
from pandas.api.types import is_numeric_dtype

def patch_table_region_attrs(
    sdata: spatialdata.SpatialData,
    element: str,
    table: str = "table",
    region_key: str = "region",
    instance_key: str = "cell_id",
):
    other = "other"
    ids = spatialdata.get_element_instances(sdata[element])
    tab = sdata.tables[table]
    if not is_numeric_dtype(ids):
        ids = tab.obs[instance_key][tab.obs[instance_key].isin(ids)].index
    region_values = [other] * tab.shape[0]
    region_col = pd.Categorical(
        region_values, categories=[element, other], ordered=False
    )
    region_col[ids] = element
    attrs = {
        "region": element,
        "region_key": region_key,
        "instance_key": instance_key,
    }
    sdata.tables[table].obs[region_key] = region_col
    sdata.tables[table].uns["spatialdata_attrs"] = attrs
    return sdata

Now I get another error during ImageTilesDataset instantiation for AnnData:

File ...site-packages/spatialdata/dataloader/datasets.py:262, in ImageTilesDataset._preprocess(self, tile_scale, tile_dim_in_units, rasterize, table_name)
    261 if table_name is not None:
--> 262     table_subset = filtered_table[filtered_table.obs[region_key] == region_name]
File ...site-packages/anndata/_core/anndata.py:1066, in AnnData._normalize_indices(self, index)
    1065 def _normalize_indices(self, index: Index | None) -> tuple[slice, slice]:
--> 1066     return _normalize_indices(index, self.obs_names, self.var_names)
File ...site-packages/anndata/_core/index.py:53, in _normalize_index(indexer, index)
    53 if not isinstance(index, pd.RangeIndex):
    54    msg = "Don’t call _normalize_index with non-categorical/string names"
    55    assert index.dtype != float, msg
    56    assert index.dtype != int, msg
AssertionError: Don’t call _normalize_index with non-categorical/string names

Seems like the message is a bit misleading, and the problem is in the way obs_names are stored in the table:

>>> sdata.tables["table"].obs_names.dtype
dtype('int64')
>>> sdata.tables["table"].obs_names.dtype != int
False

LucaMarconato commented 2 months ago

Nice that you managed to fix the problem. I believe that the bug you reported is due to a limitation of anndata, which doesn't currently allow integers as obs_names (see here: https://github.com/scverse/anndata/issues/777). Converting the obs_names or obs.index to strings should fix the issue.

In spatialdata we don't rely on obs_names or obs.index because we want to allow both integers and strings to be used as names for the instances; this is the reason why we introduced the instance_key column. In other words, the obs_names could be anything and spatialdata will not look at them; the link between the elements and the table is made exclusively via the region, region_key and instance_key information.
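
To illustrate the point, here is a minimal sketch in plain pandas, not spatialdata's actual implementation. The attrs dict mirrors the keys stored in uns["spatialdata_attrs"] earlier in this thread, and annotated_instances is a hypothetical helper. It shows that a lookup driven only by region_key and instance_key gives the same result before and after the index is converted to strings:

```python
import pandas as pd

# Attributes analogous to table.uns["spatialdata_attrs"] in this thread.
attrs = {
    "region": "cell_circles",
    "region_key": "region",
    "instance_key": "cell_id",
}

# Hypothetical obs with a default integer index.
obs = pd.DataFrame({"region": ["cell_circles"] * 3, "cell_id": [7, 8, 9]})


def annotated_instances(obs, attrs, element_name):
    """Return the instance ids linked to element_name (hypothetical helper)."""
    rows = obs[attrs["region_key"]] == element_name
    return obs.loc[rows, attrs["instance_key"]].tolist()


before = annotated_instances(obs, attrs, "cell_circles")

# Converting the index to strings does not touch the region/instance columns.
obs.index = obs.index.map(str)
after = annotated_instances(obs, attrs, "cell_circles")
```

Both lookups return the same ids, since the index never takes part in the match.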

tsvvas commented 1 month ago

Yes, changing the data type solves the last issue.

sdata.tables["table"].obs_names = sdata.tables["table"].obs_names.map(str)

Thank you!