Open etienneschalk opened 8 months ago
Thanks for (re-)raising this @etienneschalk.
This means xarray cannot open any Zarr file, but only those that possess xarray's special private attribute, _ARRAY_DIMENSIONS.
Note that in zarr v3 this was formalized as dimension_names, but those are optional, so we still have the same problem.
In a first phase, the error message can probably be more explicit (better than a low-level KeyError), explaining that xarray cannot yet open arbitrary Zarr data.
A better error message would be good.
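As a rough illustration of what a more explicit message could look like, here is a minimal sketch in plain Python. The helper name `read_dims` and the wording of the message are hypothetical and not part of xarray; only the `_ARRAY_DIMENSIONS` attribute name comes from the discussion.

```python
DIMENSION_KEY = "_ARRAY_DIMENSIONS"


def read_dims(var_name, attrs):
    """Return the dimension names stored in a Zarr array's attributes.

    Hypothetical helper: it re-raises the low-level KeyError with a message
    that explains what is missing and why xarray needs it.
    """
    try:
        return tuple(attrs[DIMENSION_KEY])
    except KeyError as err:
        raise KeyError(
            f"Zarr array {var_name!r} has no {DIMENSION_KEY!r} attribute. "
            "xarray uses it to name the array's dimensions and cannot yet "
            "open arbitrary Zarr data that lacks it."
        ) from err
```

Calling `read_dims("my_xda", {})` would then still raise a `KeyError`, but one that names the variable and the missing attribute.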
The other option would be to automatically create default dimension names, like we do in the constructors:
In [7]: da = xr.DataArray(np.arange(12).reshape(3, 2, 2))
In [8]: da.dims
Out[8]: ('dim_0', 'dim_1', 'dim_2')
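The same fallback could in principle be applied when reading a Zarr array. The following is only a sketch, assuming a hypothetical helper `dims_or_default` (not an xarray function) that receives the array's attributes and its number of dimensions:

```python
def dims_or_default(attrs, ndim):
    """Use stored dimension names if present, else dim_0 ... dim_{ndim-1}.

    Hypothetical helper mirroring the default naming of the DataArray
    constructor shown above.
    """
    names = attrs.get("_ARRAY_DIMENSIONS")
    if names is None:
        # No names stored: fall back to positional default names.
        return tuple(f"dim_{i}" for i in range(ndim))
    if len(names) != ndim:
        raise ValueError(
            f"expected {ndim} dimension names, got {len(names)}: {list(names)}"
        )
    return tuple(names)
```

With this fallback, an array of shape (3, 2, 2) without stored names would get the same three default names as the `DataArray` above.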
I think there is actually a similar but separate issue with another type of zarr store that cannot be opened with xarray. My understanding is that Zarr doesn't require arrays in the same group with shared dimension names to have the same size along that dimension. This would break the model of xarray.Dataset for that group.
The other option would be to automatically create default dimension names like we do in the constructors
Here are some examples:
(1) A Zarr group with two 1-dimensional variables with anonymous dimensions (but the same positional dimension sizes) should be openable with xarray, just as we can create this Dataset:
xr.Dataset({"xda_1": xr.DataArray([1]), "xda_2": xr.DataArray([2])})
<xarray.Dataset>
Dimensions: (dim_0: 1)
Dimensions without coordinates: dim_0
Data variables:
xda_1 (dim_0) int64 1
xda_2 (dim_0) int64 2
(2) A Zarr group with two 1-dimensional variables with anonymous dimensions and differing positional dimension sizes would not be openable with xarray. The error would be similar to the one raised when trying to create this Dataset:
xr.Dataset({"xda_1": xr.DataArray([1]), "xda_2": xr.DataArray([1, 2])})
ValueError: cannot reindex or align along dimension 'dim_0' because of conflicting dimension sizes: {1, 2}
This error would be better than the KeyError: xarray made its best effort to open the Zarr, but will not break its own data model.
To generalize:
Xarray could read Zarr groups by following the same rules as when building a Dataset from DataArrays with no explicit dimension names. Error handling would be delegated to the same logic that throws the ValueError above (to avoid duplicating logic by re-implementing existing error handling).
If the arrays have no dimension names, the default names dim_0 ... dim_n are used. Then, the only way for the Zarr to be readable is to have ordered dimensions of the same size from left to right: the k-th dim must have the same size across all variables of the group. This is the same requirement as when passing multiple DataArrays without explicit dimension names to the Dataset constructor.
If the arrays have dimension names (_ARRAY_DIMENSIONS or Zarr 3's dimension_names), then the Zarr store gets more freedom for ordering the dimensions (named rather than positional, to re-use the terminology from Indexing and Selecting data). However, all dimensions sharing the same name should be of the same size (xarray constraint).
Do you think this makes sense?
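To make the proposed rules concrete, here is a minimal sketch in plain Python. It is not xarray's actual alignment code; the function name `check_group_dims` and its input format (a mapping from variable name to a `(dims, shape)` pair, with `dims` set to `None` for anonymous dimensions) are assumptions made for illustration.

```python
def check_group_dims(variables):
    """Return {dim_name: size} for a group, or raise on conflicting sizes.

    `variables` maps a variable name to (dims, shape). When dims is None,
    positional default names dim_0 ... dim_n are assumed.
    """
    sizes = {}
    for name, (dims, shape) in variables.items():
        if dims is None:
            dims = tuple(f"dim_{i}" for i in range(len(shape)))
        # Record every size seen for each dimension name.
        for dim, size in zip(dims, shape):
            sizes.setdefault(dim, set()).add(size)

    conflicts = {dim: sorted(seen) for dim, seen in sizes.items() if len(seen) > 1}
    if conflicts:
        raise ValueError(f"conflicting dimension sizes: {conflicts}")
    return {dim: next(iter(seen)) for dim, seen in sizes.items()}
```

Under these assumptions, example (1) above passes the check, example (2) raises a `ValueError`, and named dimensions may appear in any order as long as each name keeps a single size across the group.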
I think there is actually a similar but separate issue with another type of zarr store that cannot be opened with xarray.
Do you have a link to the aforementioned issue? This would then be expected behaviour according to the existing Dataset rules: this kind of Zarr would not be readable by xarray, as dimensions within a group are shared between its variables and must be of the same size.
To conclude, there is not a perfect match between the dimensions concept from xarray and the dimensions concept from Zarr. However, reusing the rules of the Dataset constructor to open Zarr groups might enlarge the space of "xarray-compatible" Zarrs, while preserving xarray's data model.
What is your issue?
Original issue: https://github.com/xarray-contrib/datatree/issues/280
Note: this issue description was generated from a notebook. You can use it to reproduce the bug locally.
Lack of resilience towards missing _ARRAY_DIMENSIONS, xarray's special Zarr attribute
Utilities
This section only declares utility functions and does not contain any additional value for the reader.
Data Creation
I create a dummy Dataset containing a single (label, z)-dimensional DataArray named my_xda.
Data Writing
I persist the Dataset to Zarr
Data Initial Reading
I successfully read the Dataset.
Data Alteration
Then, I alter the Zarr by successively removing _ARRAY_DIMENSIONS from the .zattrs of each variable (z, label, my_xda), and try to reopen the Zarr. It is a success in all cases. ✔️
However, the last alteration, which removes the _ARRAY_DIMENSIONS key-value pair from one of the variables in the .zmetadata file present at the root of the Zarr, results in an exception when reading. The error message is explicit: KeyError: '_ARRAY_DIMENSIONS' ❌
This means xarray cannot open any Zarr file, but only those that possess xarray's special private attribute, _ARRAY_DIMENSIONS.
See https://docs.xarray.dev/en/latest/internals/zarr-encoding-spec.html
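For readers who do not have the notebook at hand, the alteration of the consolidated metadata can be sketched as follows. This is not the notebook's code: the dictionary below is an illustrative stand-in for a Zarr v2 `.zmetadata` file (its entries are assumptions, reduced to the attributes that matter here), and `drop_array_dimensions` is a hypothetical helper. In practice the dictionary would be loaded from and written back to the `.zmetadata` file with `json.load` and `json.dump`.

```python
import copy
import json

# Illustrative stand-in for the content of a Zarr v2 consolidated
# metadata file (.zmetadata); real files also hold .zarray entries.
consolidated = {
    "zarr_consolidated_format": 1,
    "metadata": {
        ".zgroup": {"zarr_format": 2},
        "label/.zattrs": {"_ARRAY_DIMENSIONS": ["label"]},
        "z/.zattrs": {"_ARRAY_DIMENSIONS": ["z"]},
        "my_xda/.zattrs": {"_ARRAY_DIMENSIONS": ["label", "z"]},
    },
}


def drop_array_dimensions(zmetadata, var_name):
    """Return a copy of the metadata without _ARRAY_DIMENSIONS for var_name."""
    altered = copy.deepcopy(zmetadata)
    altered["metadata"][f"{var_name}/.zattrs"].pop("_ARRAY_DIMENSIONS", None)
    return altered


altered = drop_array_dimensions(consolidated, "my_xda")
print(json.dumps(altered["metadata"]["my_xda/.zattrs"]))
```

Working on a deep copy keeps the original metadata intact, which makes it easy to compare the store before and after the alteration.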
In a first phase, the error message can probably be more explicit (better than a low-level KeyError), explaining that xarray cannot yet open arbitrary Zarr data.
xr.show_versions()