scverse / anndata

Annotated data.
http://anndata.readthedocs.io
BSD 3-Clause "New" or "Revised" License

Saving .h5ad with pd.Series in .uns results in IORegistryError #1429

Open JWatter opened 6 months ago

JWatter commented 6 months ago

Report

Hi all,

AnnData objects don't serialize to .h5ad if .uns contains a pandas Series. This is related to some comments in #797.

Code:

import anndata as ad
import pandas as pd
adata_uns_series = ad.AnnData()
adata_uns_series.uns['series'] = pd.Series({'a':1,'b':2,'c':3})
adata_uns_series.write('adata_uns_series.h5ad') # error

Traceback:

Traceback (most recent call last):
  File "scratch/anndata_error.py", line 5, in <module>
    adata_uns_series.write('adata_uns_series.h5ad') # error
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/scratch/envs/anndata_0.10.6/lib/python3.12/site-packages/anndata/_core/anndata.py", line 1929, in write_h5ad
    write_h5ad(
  File "/scratch/envs/anndata_0.10.6/lib/python3.12/site-packages/anndata/_io/h5ad.py", line 111, in write_h5ad
    write_elem(f, "uns", dict(adata.uns), dataset_kwargs=dataset_kwargs)
  File "/scratch/envs/anndata_0.10.6/lib/python3.12/site-packages/anndata/_io/specs/registry.py", line 359, in write_elem
    Writer(_REGISTRY).write_elem(store, k, elem, dataset_kwargs=dataset_kwargs)
  File "/scratch/envs/anndata_0.10.6/lib/python3.12/site-packages/anndata/_io/utils.py", line 243, in func_wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/scratch/envs/anndata_0.10.6/lib/python3.12/site-packages/anndata/_io/specs/registry.py", line 309, in write_elem
    return write_func(store, k, elem, dataset_kwargs=dataset_kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/scratch/envs/anndata_0.10.6/lib/python3.12/site-packages/anndata/_io/specs/registry.py", line 57, in wrapper
    result = func(g, k, *args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/scratch/envs/anndata_0.10.6/lib/python3.12/site-packages/anndata/_io/specs/methods.py", line 312, in write_mapping
    _writer.write_elem(g, sub_k, sub_v, dataset_kwargs=dataset_kwargs)
  File "/scratch/envs/anndata_0.10.6/lib/python3.12/site-packages/anndata/_io/utils.py", line 243, in func_wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/scratch/envs/anndata_0.10.6/lib/python3.12/site-packages/anndata/_io/specs/registry.py", line 304, in write_elem
    self.find_writer(dest_type, elem, modifiers),
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/scratch/anndata_0.10.6/lib/python3.12/site-packages/anndata/_io/specs/registry.py", line 269, in find_writer
    return self.registry.get_writer(dest_type, type(elem), modifiers)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/scratch/envs/anndata_0.10.6/lib/python3.12/site-packages/anndata/_io/specs/registry.py", line 117, in get_writer
    raise IORegistryError._from_write_parts(dest_type, src_type, modifiers)
anndata._io.specs.registry.IORegistryError: No method registered for writing <class 'pandas.core.series.Series'> into <class 'h5py._hl.group.Group'>
Error raised while writing key 'series' of <class 'h5py._hl.group.Group'> to /uns

Versions

-----
anndata             0.10.6
session_info        1.0.0
-----
cython_runtime      NA
dateutil            2.9.0
h5py                3.10.0
natsort             8.4.0
numpy               1.26.4
packaging           24.0
pandas              2.2.1
pytz                2024.1
scipy               1.12.0
six                 1.16.0
-----
Python 3.12.2 | packaged by conda-forge | (main, Feb 16 2024, 20:50:58) [GCC 12.3.0]
Linux-5.4.0-144-generic-x86_64-with-glibc2.31
-----
Session information updated at 2024-03-20 16:53

ivirshup commented 6 months ago

Could you share your use case for this?

To me, storing a pandas Series is basically the same thing as storing a 1D xarray DataArray. I want to support storing xarray objects, and I don't want to have two ways to store the same thing.

Would making this a single column dataframe on your end work here?
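A minimal sketch of that workaround, using pandas only. The helper names `series_to_frame` and `frame_to_series` and the column name `"value"` are hypothetical choices for this example, not part of anndata:

```python
import pandas as pd

# Hypothetical column name used to hold the Series values.
VALUE_COLUMN = "value"


def series_to_frame(series: pd.Series) -> pd.DataFrame:
    """Wrap a Series in a single-column DataFrame, keeping its index."""
    return series.to_frame(name=VALUE_COLUMN)


def frame_to_series(frame: pd.DataFrame) -> pd.Series:
    """Recover the Series from the single-column DataFrame."""
    restored = frame[VALUE_COLUMN].copy()
    restored.name = None  # the original Series name is not preserved in this sketch
    return restored


series = pd.Series({"a": 1, "b": 2, "c": 3})
frame = series_to_frame(series)
restored = frame_to_series(frame)
```

Assigning `series_to_frame(series)` to `adata.uns['series']` before calling `adata.write(...)` should then go through the DataFrame writer rather than hitting the missing Series writer, and `frame_to_series` undoes the wrapping after reading.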

JWatter commented 6 months ago

Thanks for the quick response! I am gonna argue a bit for the pandas series :)

My immediate use case is serializing a mapping between coarse and fine categories which are also columns of .obsm dataframes. A pandas series is the natural object to use for this.

In principle, using a single-column dataframe would work, as would using a plain dictionary. But both would be hacky. The cleaner way is using a pandas Series. As far as I know, pandas is the de facto standard for storing labeled 1D and 2D arrays, and labeled 2D arrays are already supported as pandas DataFrames.
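For comparison, a small sketch of the plain-dictionary alternative for such a fine-to-coarse mapping. The category labels are invented for illustration, and the assumption is that a dict of strings in .uns is written without problems:

```python
import pandas as pd

# Hypothetical fine -> coarse category mapping (labels invented for illustration).
fine_to_coarse = pd.Series(
    {"CD4 T": "T cell", "CD8 T": "T cell", "naive B": "B cell"}
)

# A plain dict is one possible stand-in for the Series inside .uns ...
as_dict = fine_to_coarse.to_dict()

# ... and the Series can be rebuilt from it after loading.
restored = pd.Series(as_dict)

# Typical use: translate a column of fine labels into coarse labels.
fine_labels = pd.Series(["CD8 T", "naive B", "CD4 T"])
coarse_labels = fine_labels.map(restored)
```

The dict carries the key-to-value pairs but not the Series name or any index name, which is part of why it feels like a workaround rather than a natural fit.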

I do understand that it is somewhat redundant to have two ways to store the same thing. Then again, if you support xarray objects, then you introduce a second way of storing 2d labeled arrays as pandas dataframes are already supported, right?

Except for serialization, a pandas Series in .uns works just fine. One could restrict the types that can be written to .uns, or construct xarray objects whenever pandas Series or DataFrames (or Python dicts) are written into .uns. Personally, I would prefer the flexibility and symmetry of supporting both pandas DataFrames and pandas Series.
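The "convert on write" idea could look roughly like the sketch below. Note that it swaps in single-column DataFrames as the target type instead of xarray objects, purely to keep the example pandas-only. The function `convert_series` is hypothetical and not an anndata API:

```python
from collections.abc import Mapping

import pandas as pd


def convert_series(obj, column="value"):
    """Return a copy of a nested mapping with every pd.Series replaced
    by a single-column DataFrame; other values are passed through."""
    if isinstance(obj, pd.Series):
        return obj.to_frame(name=column)
    if isinstance(obj, Mapping):
        return {key: convert_series(value, column) for key, value in obj.items()}
    return obj


# A dict standing in for adata.uns, with a top-level and a nested Series.
uns = {
    "series": pd.Series({"a": 1, "b": 2, "c": 3}),
    "nested": {"s": pd.Series([0.5, 1.5])},
    "note": "unchanged",
}
converted = convert_series(uns)
```

The input mapping is left untouched, so the in-memory object keeps its Series while only the copy handed to the writer is converted.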

Just some thoughts. What do you think?

ilan-gold commented 5 months ago

Good to look back once xarray support lands