pydata / xarray

N-D labeled arrays and datasets in Python
https://xarray.dev
Apache License 2.0

Updating DataArray attributes in a Zarr archive does not work as for Dataset #8116

Open krokosik opened 1 year ago

krokosik commented 1 year ago

What happened?

When using the new functionality of calling to_zarr on DataArrays, I've noticed that attributes are not updated: they can only be set once, on initial archive creation. In our case, we perform sequential writes to Zarr regions and also want to update attributes. The regular Zarr API allows updating metadata, and this also works with Datasets, so we suspect the issue is specific to DataArrays. In fact, when calling to_zarr, the DataArray is converted to a Dataset, but when we inspect that Dataset it has no attributes of its own, as they live on the nested array.
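
The conversion described above can be observed without writing anything to disk. This is a minimal sketch, assuming that to_zarr wraps the DataArray in roughly the same way as the public to_dataset method; the variable name "data" is a hypothetical placeholder, not the name xarray uses internally.

```python
import numpy as np
import xarray as xr

# A small DataArray carrying one attribute.
da = xr.DataArray(np.zeros(3), dims="x", attrs={"a": 1})

# Wrap it in a Dataset; "data" is a placeholder name chosen for this sketch.
ds = da.to_dataset(name="data")

# The Dataset itself has no attributes; they stay on the nested variable.
print(ds.attrs)
print(ds["data"].attrs)
```

If the wrapping Dataset is what gets written, its own (empty) attributes are the ones handled at the group level, while the DataArray attributes belong to the array inside it.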

What did you expect to happen?

I would like to update attributes in the same manner as with plain Zarr or with Datasets. Currently, we have to resort to modifying the .zmetadata JSON, which we consider dangerous and hacky.

Minimal Complete Verifiable Example

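import xarray as xr

# Dataset: attributes are set on creation, then updated in append mode.
ds_path = "ds"
xr.Dataset().assign_attrs(a=1).to_zarr(ds_path, mode="w")
xr.open_dataset(ds_path, engine="zarr").assign_attrs(b=2).to_zarr(ds_path, mode="a")

# DataArray: the same steps, but the second write does not update the attributes.
da_path = "da"
xr.DataArray().assign_attrs(a=1).to_zarr(da_path, mode="w")
xr.open_dataarray(da_path, engine="zarr").assign_attrs(b=2).to_zarr(da_path, mode="a")

# Expected to be True if both archives ended up with the same attributes.
print(xr.open_dataset(ds_path, engine="zarr").attrs == xr.open_dataarray(da_path, engine="zarr").attrs)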

Relevant log output

No response

Anything else we need to know?

No response

Environment

INSTALLED VERSIONS
------------------
commit: None
python: 3.11.3 (tags/v3.11.3:f3909b8, Apr 4 2023, 23:49:59) [MSC v.1934 64 bit (AMD64)]
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 183 Stepping 1, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: ('Polish_Poland', '1250')
libhdf5: None
libnetcdf: None

xarray: 2023.7.0
pandas: 2.0.3
numpy: 1.25.2
scipy: 1.11.2
netCDF4: None
pydap: None
h5netcdf: None
h5py: None
Nio: None
zarr: 2.16.1
cftime: None
nc_time_axis: None
PseudoNetCDF: None
iris: None
bottleneck: None
dask: None
distributed: None
matplotlib: 3.7.2
cartopy: None
seaborn: None
numbagg: None
fsspec: None
cupy: None
pint: None
sparse: None
flox: None
numpy_groupies: None
setuptools: 68.0.0
pip: 23.2.1
conda: None
pytest: None
mypy: None
IPython: 8.14.0
sphinx: 7.1.2

krokosik commented 1 year ago

I've managed to track down the code causing the issue. It looks like the comment there was left by @shoyer. Could you elaborate on the issues you encountered when updating variables? I could help implement this mechanism.

https://github.com/pydata/xarray/blame/f13da94db8ab4b564938a5e67435ac709698f1c9/xarray/backends/zarr.py#L670-L681