pydata / xarray

N-D labeled arrays and datasets in Python
https://xarray.dev
Apache License 2.0

h5netcdf-engine now reads attributes with array length 1 as scalar #6570

Closed erik-mansson closed 1 year ago

erik-mansson commented 2 years ago

What is your issue?

The h5netcdf engine for reading NetCDF4 files was recently changed (https://github.com/h5netcdf/h5netcdf/pull/151) so that, when reading attributes, any 1D array/list of length 1 is turned into a scalar element/item. The change happened with version 0.14.0.

The issue is that the xarray documentation still describes the old h5netcdf-behaviour on https://docs.xarray.dev/en/stable/user-guide/io.html?highlight=attributes%20h5netcdf#netcdf

Could we mention this also on https://docs.xarray.dev/en/stable/generated/xarray.open_dataset.html#xarray.open_dataset under the engine argument, or just make sure it links to the above page?

I initially looked under https://docs.xarray.dev/en/stable/user-guide/io.html?highlight=string#string-encoding because my issue was with a string array/list, but it may be too much to mention there if this is a general change that affects attributes of all types.

As explained on the h5netcdf issue tracker, the reason for squeezing length-1 array attributes to scalars is compatibility with the other NetCDF4 engine, or with NetCDF in general (opinions may vary on how good that is compared with fully using the features available in HDF5). Interestingly, when writing, an attribute given as a Python list of length 1 does produce an array of length 1 in the HDF5/NetCDF4 file; the array dimension is dropped only when reading.

Adding the invalid_netcdf=True argument when loading does not change the behaviour. Maybe it could be used to generally allow 1-length attribute arrays? As it stands, I think every use of array attributes needs a conversion to support both old and new versions, such as list_in_recent_version = attribute if isinstance(attribute, list) else [attribute], or always_list = list(attribute if isinstance(attribute, (list, np.ndarray)) else [attribute]). Otherwise, iterating over a string attribute causes surprises: the loop runs over its characters instead of doing a single iteration that yields the single string (as in older versions).
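The conversions mentioned above can be wrapped in one small helper. This is only a minimal sketch: the name as_list is hypothetical (it is not part of xarray or h5netcdf), and it assumes that array-like attribute values expose a tolist() method, as numpy arrays do.

```python
def as_list(value):
    """Return an attribute value as a list, whatever the reader produced.

    Hypothetical helper, not part of xarray or h5netcdf.
    """
    # Strings and bytes are iterable, but here they count as single items.
    if isinstance(value, (str, bytes)):
        return [value]
    # Assumption: array-like values (e.g. numpy arrays) provide tolist().
    if hasattr(value, "tolist"):
        value = value.tolist()
    # Lists and tuples are copied into a new list.
    if isinstance(value, (list, tuple)):
        return list(value)
    # Anything else is treated as a scalar and wrapped.
    return [value]


# A squeezed length-1 attribute and a real list are now handled the same way.
for item in as_list("abc"):
    print(item)  # prints abc once, not a, b, c
```

With this, code that loops over an attribute does not need to know whether the file was read with an older or a newer h5netcdf.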

Minimal example

This serves to clarify what happens. The issue is not about reverting to the old behaviour (although I liked it), just updating the xarray documentation.

import xarray as xr
import numpy as np
ds = xr.Dataset()
ds['stuff'] = xr.DataArray(np.random.randn(2), dims='x')
ds['stuff'].attrs['strings_0D_one'] = 'abc'
ds['stuff'].attrs['strings_1D_two'] = ['abc', 'def']
ds['stuff'].attrs['strings_1D_one'] = ['abc']
path = 'demo.nc'
ds.to_netcdf(path, engine='h5netcdf', format='netCDF4')
ds2 = xr.load_dataset(path, engine='h5netcdf')

print(type(ds2['stuff'].attrs['strings_0D_one']).__name__, repr(ds2['stuff'].attrs['strings_0D_one']))
print(type(ds2['stuff'].attrs['strings_1D_two']).__name__, repr(ds2['stuff'].attrs['strings_1D_two']))
print(type(ds2['stuff'].attrs['strings_1D_one']).__name__, repr(ds2['stuff'].attrs['strings_1D_one']))

With h5netcdf: 0.12.0 (python: 3.7.9, OS: Windows, OS-release: 10, libhdf5: 1.10.4, xarray: 0.20.1, pandas: 1.3.4, numpy: 1.21.5, netCDF4: None, h5netcdf: 0.12.0, h5py: 2.10.0) the printouts are:

str 'abc'
ndarray array(['abc', 'def'], dtype=object)
ndarray array(['abc'], dtype=object)

With h5netcdf: 1.0.0 (python: 3.8.11, OS: Linux, OS-release: 3.10.0-1160.49.1.el7.x86_64, libhdf5: 1.10.4, xarray: 0.20.1, pandas: 1.4.2, numpy: 1.21.2, netCDF4: None, h5netcdf: 1.0.0, h5py: 2.10.0) the printouts are:

str 'abc'
list ['abc', 'def']
str 'abc'

I have tested that direct reading by h5py.File gives str, ndarray, ndarray so the change is not in the writing or h5py.

kmuehlbauer commented 2 years ago

The change is somewhere between version 0.12.0 and 1.0.0 (one user says from 0.14, but I haven't verified).

It's from h5netcdf 0.14 onwards, please see https://github.com/h5netcdf/h5netcdf/blob/main/CHANGELOG.rst.
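For code that has to branch on this, a version check along these lines might help. It is a rough sketch only: the function name squeezes_length_one_attrs is hypothetical, and it assumes the version string starts with numeric major and minor components, as in "0.12.0" or "1.0.0".

```python
def squeezes_length_one_attrs(h5netcdf_version: str) -> bool:
    """Guess whether this h5netcdf version reads length-1 attributes as scalars.

    Hypothetical helper; the 0.14 threshold is taken from the h5netcdf changelog.
    """
    # Assumption: the first two dot-separated components are plain integers.
    major, minor = (int(part) for part in h5netcdf_version.split(".")[:2])
    return (major, minor) >= (0, 14)


print(squeezes_length_one_attrs("0.12.0"))  # False
print(squeezes_length_one_attrs("1.0.0"))   # True
```

In practice the argument would be the installed version string, for example h5netcdf.__version__.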

Adding the invalid_netcdf=True argument when loading does not change the behaviour.

That keyword argument is passed through verbatim to h5netcdf. It tells h5netcdf to write invalid netCDF features if there are any, which can produce files that are not readable by netcdf-c/netcdf4-python. It has no meaning when reading files.

I have tested that direct reading by h5py.File gives str, ndarray, ndarray so the change is not in the writing or h5py.

The change is in h5netcdf reading attributes, where we have aligned it with netcdf4-python.

I think updating the docs as suggested is fine.