pydata / xarray

N-D labeled arrays and datasets in Python
https://xarray.dev
Apache License 2.0

Storing np.datetime64 attributes in zarr files #9567

Open CarlAndersson opened 1 day ago

CarlAndersson commented 1 day ago

What happened?

I have a dataset with an attribute which is a time, stored as a np.datetime64 value with nanosecond precision. Saving this to a zarr store and loading the dataset again drops the type of this attribute and loads it as an integer.

Example dataset:

<xarray.DataArray (x: 5)> Size: 20B
array([0, 1, 2, 3, 4])
Dimensions without coordinates: x
Attributes:
    time:     2024-10-02T07:39:39.000000000

gets loaded back as

<xarray.DataArray (x: 5)> Size: 20B
[5 values with dtype=int32]
Dimensions without coordinates: x
Attributes:
    time:     1727854779000000000

Using second precision for the datetime64 (instead of nanosecond) raises an error during JSON serialization, because the value is converted into a datetime.datetime object at some point.
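The difference between the two precisions can be reproduced with NumPy alone. The sketch below assumes the attribute is converted to a native Python object with something like `.item()` before it reaches the JSON encoder; this thread does not confirm that this is the exact code path in xarray or zarr. It shows that a nanosecond value becomes a plain integer, while a second-precision value becomes a `datetime.datetime`, which the standard `json` module rejects.

```python
import datetime
import json

import numpy as np

# Same instant as in the example above, at two different precisions.
ns_value = np.datetime64("2024-10-02T07:39:39", "ns")
s_value = np.datetime64("2024-10-02T07:39:39", "s")

# .item() converts a NumPy scalar to a native Python object.
# Nanosecond precision does not fit datetime.datetime, so NumPy returns an int.
ns_native = ns_value.item()
# Second precision fits, so NumPy returns a datetime.datetime.
s_native = s_value.item()

print(type(ns_native).__name__, ns_native)
print(type(s_native).__name__, s_native)

# The int serializes without complaint, which silently drops the type.
ns_json = json.dumps({"time": ns_native})

# The datetime does not serialize, matching the TypeError in the log output.
try:
    json.dumps({"time": s_native})
    s_error = None
except TypeError as err:
    s_error = str(err)
print(s_error)
```

If this assumption holds, it would explain both symptoms: the silent integer round trip for nanoseconds and the serialization error for seconds.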

What did you expect to happen?

The time gets stored and read back properly.

Minimal Complete Verifiable Example

import xarray as xr
import numpy as np

arr = xr.DataArray(
    np.arange(5),
    dims="x",
    attrs={"time": np.datetime64("now", "ns")},
)
print(arr)
arr.to_zarr("temp.zarr", mode="w")
print(xr.open_dataarray("temp.zarr", engine="zarr"))

arr = xr.DataArray(
    np.arange(5),
    dims="x",
    attrs={"time": np.datetime64("now", "s")},
)
print(arr)
arr.to_zarr("temp.zarr", mode="w")

Relevant log output

Traceback (most recent call last):
  File "C:\Users\<username>\repos\test\.pixi\envs\default\Lib\site-packages\xarray\backends\zarr.py", line 395, in _put_attrs
    zarr_obj.attrs.put(attrs)
  File "C:\Users\<username>\repos\test\.pixi\envs\default\Lib\site-packages\zarr\attrs.py", line 124, in put
    self._write_op(self._put_nosync, d)
  File "C:\Users\<username>\repos\test\.pixi\envs\default\Lib\site-packages\zarr\attrs.py", line 83, in _write_op
    return f(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^
  File "C:\Users\<username>\repos\test\.pixi\envs\default\Lib\site-packages\zarr\attrs.py", line 150, in _put_nosync
    self.store[self.key] = json_dumps(d)
                           ^^^^^^^^^^^^^
  File "C:\Users\<username>\repos\test\.pixi\envs\default\Lib\site-packages\zarr\util.py", line 69, in json_dumps
    return json.dumps(
           ^^^^^^^^^^^
  File "C:\Users\<username>\repos\test\.pixi\envs\default\Lib\json\__init__.py", line 238, in dumps
    **kw).encode(obj)
          ^^^^^^^^^^^
  File "C:\Users\<username>\repos\test\.pixi\envs\default\Lib\json\encoder.py", line 202, in encode
    chunks = list(chunks)
             ^^^^^^^^^^^^
  File "C:\Users\<username>\repos\test\.pixi\envs\default\Lib\json\encoder.py", line 432, in _iterencode
    yield from _iterencode_dict(o, _current_indent_level)
  File "C:\Users\<username>\repos\test\.pixi\envs\default\Lib\json\encoder.py", line 406, in _iterencode_dict
    yield from chunks
  File "C:\Users\<username>\repos\test\.pixi\envs\default\Lib\json\encoder.py", line 439, in _iterencode
    o = _default(o)
        ^^^^^^^^^^^
  File "C:\Users\<username>\repos\test\.pixi\envs\default\Lib\site-packages\zarr\util.py", line 64, in default
    return json.JSONEncoder.default(self, o)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\<username>\repos\test\.pixi\envs\default\Lib\json\encoder.py", line 180, in default
    raise TypeError(f'Object of type {o.__class__.__name__} '
TypeError: Object of type datetime is not JSON serializable

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Users\<username>\repos\test\.pixi\envs\default\Lib\site-packages\xarray\core\dataarray.py", line 4355, in to_zarr
    return to_zarr(  # type: ignore[call-overload,misc]
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\<username>\repos\test\.pixi\envs\default\Lib\site-packages\xarray\backends\api.py", line 1784, in to_zarr
    dump_to_store(dataset, zstore, writer, encoding=encoding)
  File "C:\Users\<username>\repos\test\.pixi\envs\default\Lib\site-packages\xarray\backends\api.py", line 1467, in dump_to_store
    store.store(variables, attrs, check_encoding, writer, unlimited_dims=unlimited_dims)
  File "C:\Users\<username>\repos\test\.pixi\envs\default\Lib\site-packages\xarray\backends\zarr.py", line 720, in store
    self.set_variables(
  File "C:\Users\<username>\repos\test\.pixi\envs\default\Lib\site-packages\xarray\backends\zarr.py", line 831, in set_variables
    zarr_array = _put_attrs(zarr_array, encoded_attrs)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\<username>\repos\test\.pixi\envs\default\Lib\site-packages\xarray\backends\zarr.py", line 397, in _put_attrs
    raise TypeError("Invalid attribute in Dataset.attrs.") from e
TypeError: Invalid attribute in Dataset.attrs.

Anything else we need to know?

No response

Environment

INSTALLED VERSIONS
------------------
commit: None
python: 3.12.6 | packaged by conda-forge | (main, Sep 30 2024, 17:48:58) [MSC v.1941 64 bit (AMD64)]
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 142 Stepping 12, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: ('Swedish_Sweden', '1252')
libhdf5: None
libnetcdf: None
xarray: 2024.9.0
pandas: 2.2.3
numpy: 2.1.1
scipy: None
netCDF4: None
pydap: None
h5netcdf: None
h5py: None
zarr: 2.18.3
cftime: None
nc_time_axis: None
iris: None
bottleneck: None
dask: None
distributed: None
matplotlib: None
cartopy: None
seaborn: None
numbagg: None
fsspec: None
cupy: None
pint: None
sparse: None
flox: None
numpy_groupies: None
setuptools: None
pip: None
conda: None
pytest: None
mypy: None
IPython: None
sphinx: None

max-sixty commented 1 day ago

Do we want to have arbitrary python objects stored in attrs? We serialize to json so arguably need to constrain ourselves to types that are JSON-compatible...
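As a user-side illustration of staying within JSON-compatible types, the sketch below stores the timestamp as an ISO 8601 string and parses it back after loading. The helper names `encode_time_attr` and `decode_time_attr` are hypothetical and not part of xarray or zarr. The JSON round trip stands in for the attribute storage step, so the example runs with NumPy alone.

```python
import json

import numpy as np


def encode_time_attr(value: np.datetime64) -> str:
    """Hypothetical helper: turn a datetime64 into a JSON-friendly ISO string."""
    # np.datetime_as_string keeps the precision implied by the value's unit.
    return str(np.datetime_as_string(value))


def decode_time_attr(text: str) -> np.datetime64:
    """Hypothetical helper: parse the ISO string back into a datetime64."""
    # NumPy infers the unit from the number of fractional digits in the string.
    return np.datetime64(text)


original = np.datetime64("2024-10-02T07:39:39", "ns")

# Attributes now contain only a string, which any JSON encoder accepts.
attrs = {"time": encode_time_attr(original)}
stored = json.dumps(attrs)

# After loading, the caller converts the string back explicitly.
restored = decode_time_attr(json.loads(stored)["time"])
print(stored)
print(restored, restored.dtype)
```

The cost of this approach is that the caller must know which attributes hold timestamps, since nothing in the stored string marks it as a time.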

keewis commented 23 hours ago

the question is, would zarr be able to store datetimes without encoding? If so, I believe it may be possible to extend the zarr backend specifically to allow this (though not sure if that would make the encoding machinery too complicated?).

max-sixty commented 23 hours ago

We could ofc serialize and deserialize into our own proprietary format. But I'm not sure what the interface would be?

keewis commented 23 hours ago

In this case I was just wondering whether we can get away with not serializing datetimes at all (but only for the zarr backend, if the zarr format supports this).

I agree that serializing attributes might be useful (see the many CRS representations, for example) but potentially too complex at this point. Also, a custom format convention would both mean a lot of work and be incompatible with other libraries, especially those written in other languages.