scverse / anndata

Annotated data.
http://anndata.readthedocs.io
BSD 3-Clause "New" or "Revised" License
576 stars 152 forks source link

Writing to H5AD fails if AnnData contains datetime64 objects #455

Open lazappi opened 3 years ago

lazappi commented 3 years ago

Writing an AnnData object to disk fails if it contains numpy datetime64 objects. Example:

import anndata
import numpy as np

ad = anndata.AnnData()

ad.uns['now'] = np.datetime64('now')

ad.write("test_h5py.h5ad")
ERROR MESSAGE ```pytb Traceback (most recent call last): File "/Users/luke.zappia/miniconda3/envs/test-h5py/lib/python3.9/site-packages/anndata/_io/utils.py", line 188, in func_wrapper return func(elem, key, val, *args, **kwargs) File "/Users/luke.zappia/miniconda3/envs/test-h5py/lib/python3.9/site-packages/anndata/_io/h5ad.py", line 144, in write_not_implemented raise NotImplementedError( NotImplementedError: Failed to write value for uns/now, since a writer for type has not been implemented yet. The above exception was the direct cause of the following exception: Traceback (most recent call last): File "/Users/luke.zappia/Desktop/test_h5py.py", line 8, in ad.write("test_h5py.h5ad") File "/Users/luke.zappia/miniconda3/envs/test-h5py/lib/python3.9/site-packages/anndata/_core/anndata.py", line 1846, in write_h5ad _write_h5ad( File "/Users/luke.zappia/miniconda3/envs/test-h5py/lib/python3.9/site-packages/anndata/_io/h5ad.py", line 112, in write_h5ad write_attribute(f, "uns", adata.uns, dataset_kwargs=dataset_kwargs) File "/Users/luke.zappia/miniconda3/envs/test-h5py/lib/python3.9/functools.py", line 877, in wrapper return dispatch(args[0].__class__)(*args, **kw) File "/Users/luke.zappia/miniconda3/envs/test-h5py/lib/python3.9/site-packages/anndata/_io/h5ad.py", line 126, in write_attribute_h5ad _write_method(type(value))(f, key, value, *args, **kwargs) File "/Users/luke.zappia/miniconda3/envs/test-h5py/lib/python3.9/site-packages/anndata/_io/h5ad.py", line 286, in write_mapping write_attribute(f, f"{key}/{sub_key}", sub_value, dataset_kwargs=dataset_kwargs) File "/Users/luke.zappia/miniconda3/envs/test-h5py/lib/python3.9/functools.py", line 877, in wrapper return dispatch(args[0].__class__)(*args, **kw) File "/Users/luke.zappia/miniconda3/envs/test-h5py/lib/python3.9/site-packages/anndata/_io/h5ad.py", line 126, in write_attribute_h5ad _write_method(type(value))(f, key, value, *args, **kwargs) File "/Users/luke.zappia/miniconda3/envs/test-h5py/lib/python3.9/site-packages/anndata/_io/utils.py", line 191, in func_wrapper raise type(e)( NotImplementedError: Failed to write value for uns/now, since a writer for type has not been implemented yet. Above error raised while writing key 'uns/now' of from /. ```

This behaviour is from h5py rather than anndata. From v3.0.0 they suggest storing these values using opaque dtypes https://docs.h5py.org/en/stable/special.html#opaque-dtypes.

Not sure how big a problem this is but it came up here https://github.com/theislab/zellkonverter/issues/24 so I just wanted to raise it. Would be nice if anndata could handle this (at least with a nicer error) but maybe not a priority.

ivirshup commented 3 years ago

We are aiming to support more types, as well as an interface for registering extensions, for reading and writing. Date times seem like a good case for first party support, depending on how complicated this gets. I'm worried this is going to be complicated.

Some points on this:

lazappi commented 3 years ago

Yeah, I agree it could get complicated so up to maintainers whether it is worth the effort. The datetime64 type came up as this is what reticulate converts R datetime to, but possibly not many use cases beyond that.

flying-sheep commented 3 years ago

datetime64 basically means “UTC time”, so if we use it, we should make sure to convert e.g. datetime objects with a set timezone to UTC.

Timezones are hella complicated and change all the time, so what we shouldn’t do is convert them to offsets and store them that way I think …

ivirshup commented 3 years ago

@flying-sheep, I think we would need to store them in a way that pd.Timestamp object read out exactly equal to the way they are written. We can't try to normalizes timezone due to problems like "mean expression of samples collected on tuesday" giving different results after you read the data from disk.


Looking at the arrow implementation, this might actually be straightforward. A date time column would be stored as a dataset of 64 bit integers. The attributes include the time unit (typically nanoseconds) and optionally a timezone. All times in a column must share a timezone (pandas represents these as object arrays anyways). Numpy date times are the same except there is no timezone.

Draft functions (these will need decorators for dispatch, additional metadata etc.):

import numpy as np
import pandas as pd
import h5py

def write_datetime(f, k, v, *, dataset_kwargs={}):
    d = f.create_dataset(k, data=v.view(np.int64), **dataset_kwargs)
    d.attrs["unit"] = np.datetime_data(v)[0]

def read_datetime(d: h5py.Dataset) -> np.ndarray:
    unit = d.attrs["unit"]
    return d[...].view(f"datetime[{unit}]")

def write_timestamp(f, k, v, *, dataset_kwargs={}):
    write_datetime(f, k, v, dataset_kwargs=dataset_kwargs)
    if v.tz is not None:
        d.attrs["tz"] = v.tz

def read_timestamp(d: h5py.Dataset) -> pd.arrays.DatetimeArray:
    arr = read_datetime(d)
    if "tz" in d.attrs:
        tz = d.attrs["tz"]
        unit = d.attrs["unit"]
        dtype = pd.DatetimeTZDtype(unit=unit, tz=tz)
    else:
        dtype = arr.dtype
    return pd.arrays.DatetimeArray(arr, dtype=dtype)

This should be similar for zarr, except that zarr natively supports the numpy datetime dtype, so no need for read_datetime or write_datetime.

ivirshup commented 2 years ago

So, it turns out datetimes are super complicated and contentious. To some extent we are beholden to whatever pandas does, but it looks like what pandas does is under active development (https://github.com/pandas-dev/pandas/issues/40932).

Some notes on the state of time types in the our dependencies:

Other libraries: