scverse / spatialdata-io

BSD 3-Clause "New" or "Revised" License
42 stars 27 forks source link

Xenium 2.0.0 may give nuclei with NaN area: need change logic `cells_as_circles` #173

Closed pakiessling closed 3 months ago

pakiessling commented 4 months ago

Hi,

I am running into the following problem:

polygon = Polygon(
    [
        (200, 200),
        (450, 600),
        (450, 350),
        (900, 1000),
        (1200, 600),
    ]
)
from spatialdata import polygon_query
crop = polygon_query(sdata,polygon,target_coordinate_system="global")
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[48], line 2
      1 from spatialdata import polygon_query
----> 2 crop = polygon_query(sdata,polygon,target_coordinate_system="global")

File ~/mambaforge/envs/spatial_data/lib/python3.9/functools.py:888, in singledispatch.<locals>.wrapper(*args, **kw)
    884 if not args:
    885     raise TypeError(f'{funcname} requires at least '
    886                     '1 positional argument')
--> 888 return dispatch(args[0].__class__)(*args, **kw)

File ~/mambaforge/envs/spatial_data/lib/python3.9/site-packages/spatialdata/_core/query/spatial_query.py:823, in _(sdata, polygon, target_coordinate_system, filter_table, shapes, points, images, labels)
    821 for element_type in ["points", "images", "labels", "shapes"]:
    822     elements = getattr(sdata, element_type)
--> 823     queried_elements = _dict_query_dispatcher(
    824         elements,
    825         polygon_query,
    826         polygon=polygon,
    827         target_coordinate_system=target_coordinate_system,
    828     )
    829     new_elements[element_type] = queried_elements
    831 tables = _get_filtered_or_unfiltered_tables(filter_table, new_elements, sdata)

File ~/mambaforge/envs/spatial_data/lib/python3.9/site-packages/spatialdata/_core/query/spatial_query.py:422, in _dict_query_dispatcher(elements, query_function, **kwargs)
    420 assert isinstance(d, dict)
    421 if target_coordinate_system in d:
--> 422     result = query_function(element, **kwargs)
    423     if result is not None:
    424         # query returns None if it is empty
    425         queried_elements[key] = result

File ~/mambaforge/envs/spatial_data/lib/python3.9/functools.py:888, in singledispatch.<locals>.wrapper(*args, **kw)
    884 if not args:
    885     raise TypeError(f'{funcname} requires at least '
    886                     '1 positional argument')
--> 888 return dispatch(args[0].__class__)(*args, **kw)

File ~/mambaforge/envs/spatial_data/lib/python3.9/site-packages/spatialdata/_core/query/spatial_query.py:902, in _(element, polygon, target_coordinate_system, **kwargs)
    899 polygon_gdf = _get_polygon_in_intrinsic_coordinates(element, target_coordinate_system, polygon)
    900 polygon = polygon_gdf["geometry"].iloc[0]
--> 902 buffered = to_polygons(element) if ShapesModel.RADIUS_KEY in element.columns else element
    904 OLD_INDEX = "__old_index"
    905 if OLD_INDEX in buffered.columns:

File ~/mambaforge/envs/spatial_data/lib/python3.9/functools.py:888, in singledispatch.<locals>.wrapper(*args, **kw)
    884 if not args:
    885     raise TypeError(f'{funcname} requires at least '
    886                     '1 positional argument')
--> 888 return dispatch(args[0].__class__)(*args, **kw)

File ~/mambaforge/envs/spatial_data/lib/python3.9/site-packages/spatialdata/_core/operations/vectorize.py:267, in _(gdf, buffer_resolution)
    265 if isinstance(gdf.geometry.iloc[0], Point):
    266     buffered_df = gdf.copy()
--> 267     buffered_df["geometry"] = buffered_df.apply(
    268         lambda row: row.geometry.buffer(row[ShapesModel.RADIUS_KEY], resolution=buffer_resolution), axis=1
    269     )
    271     # Ensure the GeoDataFrame recognizes the updated geometry column
    272     buffered_df = buffered_df.set_geometry("geometry")

File ~/mambaforge/envs/spatial_data/lib/python3.9/site-packages/geopandas/geodataframe.py:1568, in GeoDataFrame.apply(self, func, axis, raw, result_type, args, **kwargs)
   1566 @doc(pd.DataFrame)
   1567 def apply(self, func, axis=0, raw=False, result_type=None, args=(), **kwargs):
-> 1568     result = super().apply(
   1569         func, axis=axis, raw=raw, result_type=result_type, args=args, **kwargs
   1570     )
   1571     # pandas <1.4 re-attach last geometry col if lost
   1572     if (
   1573         not compat.PANDAS_GE_14
   1574         and isinstance(result, GeoDataFrame)
   1575         and result._geometry_column_name is None
   1576     ):

File ~/mambaforge/envs/spatial_data/lib/python3.9/site-packages/pandas/core/frame.py:10361, in DataFrame.apply(self, func, axis, raw, result_type, args, by_row, engine, engine_kwargs, **kwargs)
  10347 from pandas.core.apply import frame_apply
  10349 op = frame_apply(
  10350     self,
  10351     func=func,
   (...)
  10359     kwargs=kwargs,
  10360 )
> 10361 return op.apply().__finalize__(self, method="apply")

File ~/mambaforge/envs/spatial_data/lib/python3.9/site-packages/pandas/core/apply.py:916, in FrameApply.apply(self)
    913 elif self.raw:
    914     return self.apply_raw(engine=self.engine, engine_kwargs=self.engine_kwargs)
--> 916 return self.apply_standard()

File ~/mambaforge/envs/spatial_data/lib/python3.9/site-packages/pandas/core/apply.py:1063, in FrameApply.apply_standard(self)
   1061 def apply_standard(self):
   1062     if self.engine == "python":
-> 1063         results, res_index = self.apply_series_generator()
   1064     else:
   1065         results, res_index = self.apply_series_numba()

File ~/mambaforge/envs/spatial_data/lib/python3.9/site-packages/pandas/core/apply.py:1081, in FrameApply.apply_series_generator(self)
   1078 with option_context("mode.chained_assignment", None):
   1079     for i, v in enumerate(series_gen):
   1080         # ignore SettingWithCopy here in case the user mutates
-> 1081         results[i] = self.func(v, *self.args, **self.kwargs)
   1082         if isinstance(results[i], ABCSeries):
   1083             # If we have a view on v, we need to make a copy because
   1084             #  series_generator will swap out the underlying data
   1085             results[i] = results[i].copy(deep=False)

File ~/mambaforge/envs/spatial_data/lib/python3.9/site-packages/spatialdata/_core/operations/vectorize.py:268, in _.<locals>.<lambda>(row)
    265 if isinstance(gdf.geometry.iloc[0], Point):
    266     buffered_df = gdf.copy()
    267     buffered_df["geometry"] = buffered_df.apply(
--> 268         lambda row: row.geometry.buffer(row[ShapesModel.RADIUS_KEY], resolution=buffer_resolution), axis=1
    269     )
    271     # Ensure the GeoDataFrame recognizes the updated geometry column
    272     buffered_df = buffered_df.set_geometry("geometry")

File ~/mambaforge/envs/spatial_data/lib/python3.9/site-packages/shapely/geometry/base.py:543, in BaseGeometry.buffer(self, distance, quad_segs, cap_style, join_style, mitre_limit, single_sided, **kwargs)
    541     raise ValueError("Cannot compute offset from zero-length line segment")
    542 elif not np.isfinite(distance).all():
--> 543     raise ValueError("buffer distance must be finite")
    545 return shapely.buffer(
    546     self,
    547     distance,
   (...)
    552     single_sided=single_sided,
    553 )

ValueError: buffer distance must be finite

Any idea what causes this?

sdata.query.bounding_box works fine

-----
geopandas           0.14.3
napari_spatialdata  0.5.0.post2.dev2+gd6e55f9
pandas              2.2.1
session_info        1.0.0
shapely             2.0.3
spatialdata         0.2.2.dev10+g6d8aeb8
-----
IPython             8.18.1
jupyter_client      8.6.1
jupyter_core        5.7.2
-----
Python 3.9.19 | packaged by conda-forge | (main, Mar 20 2024, 12:55:20) [Clang 16.0.6 ]
macOS-14.4.1-arm64-arm-64bit
-----
LucaMarconato commented 4 months ago

Hi, thank for reporting! Can you please paste the output of values count on the radius column of the circles (shapes elements) that are present in sdata? Thanks 😊

pakiessling commented 4 months ago

len(sdata.shapes["cell_circles"])

122259

sdata.shapes["cell_circles"].radius.isna().sum()

6978

sdata.shapes["cell_circles"].hist(bins=100) grafik

Interesting, I guess the NaN are the problem? This is Xenium data loaded with spatialdata_io.xeniun

LucaMarconato commented 4 months ago

Thanks for the details, yes we expect the NaN to be the problem. The radius is computed from the raw data with this line: https://github.com/scverse/spatialdata-io/blob/a26feba6092a79bd1d7d141f42a89e0bd9027ed5/src/spatialdata_io/readers/xenium.py#L509, could you please check what the area contains in the AnnData table?

Is the dataset public? In such a case I'd also like to have a look.

LucaMarconato commented 4 months ago

In your case a fix would be to recompute the area using geopandas https://geopandas.org/en/stable/docs/reference/api/geopandas.GeoSeries.area.html, and manually adjust the radius column. Alternatively you could set the able to annotate the shapes and work directly with those instead of working with the cell circles. You can achieve this by setting cells_as_circles=False in xenium(), or, if the object is already created, you can use the method set_table_annotates_spatialelement() of the SpatialData class.

pakiessling commented 4 months ago

Unfortunately I can't share the data.

I subset the table to the problematic cells and I can see that they all have valid cell_area values but that nucleus_area is NaN for all of them.

Is it possible that those are cells where the Xenium pipeline only segmented cells?

The cells.parquet file already has the NaN in it

LucaMarconato commented 4 months ago

You are right, it is probably due to cells without nuclei, as shown in the first figure here: https://pages.10xgenomics.com/rs/446-PBO-704/images/AGBT_2024_Cell_Segmentation_Poster.pdf.

They changed the file format but we added support for it. I will move this issue to spatialdata-io and make a fix to the reader.

LucaMarconato commented 3 months ago

A solution to this issue, discussed in https://github.com/scverse/spatialdata-io/issues/123, is now available via https://github.com/scverse/spatialdata-io/pull/179.