pydata / xarray

N-D labeled arrays and datasets in Python
https://xarray.dev
Apache License 2.0

Allow track_order to be passed to h5netcdf #7680

Open abunimeh opened 1 year ago

abunimeh commented 1 year ago

Is your feature request related to a problem?

When using h5netcdf as a backend, writing exactly the same content to two different files produces a different MD5 checksum for each file, even though the xarray datasets are identical.

See https://github.com/h5netcdf/h5netcdf/issues/211

Describe the solution you'd like

When saving a netCDF file, allow track_order=False to be passed as an argument.

Describe alternatives you've considered

Using the netcdf4 engine.

Additional context

No response

jhamman commented 1 year ago

@abunimeh - Thanks for opening this issue. Can you expand on the feature a bit more? What API would you like to see? ds.to_netcdf(..., track_order=False)?

I suspect this will need to be treated like invalid_netcdf as it will only apply to the h5netcdf backend:

https://github.com/pydata/xarray/blob/86f3f21ab3d0dff6fdb4a0bccd27c62f9e4a3238/xarray/core/dataset.py#L1892-L1895

Note: it would be nice if we had backend_kwargs on to_netcdf, since the options that scipy/netcdf4/h5netcdf support are increasingly different.

kmuehlbauer commented 1 year ago

First, I totally agree with @jhamman about having backend_kwargs on to_netcdf.

For this particular use case: netcdf-c/netCDF4-python create HDF5 files (NETCDF4 format) with track order enabled, as required; see https://docs.unidata.ucar.edu/netcdf-c/current/file_format_specifications.html#creation_order.

h5netcdf has used track_order=True as the default since version 1.1.0. There have been (and still are, https://github.com/HDFGroup/hdf5/issues/1388) some corner-case issues upstream which netcdf-c can somehow circumvent, but h5netcdf can't. Nevertheless, to be compliant with netcdf-c, track_order=True is the default for h5netcdf.

@abunimeh As a workaround until this is sorted out, you could create the file (or subgroup) using h5py/h5netcdf with track_order=False. If a file (root group) or a subgroup in a file is created with track_order=False, the setting is persistent because it is set at group-definition time. Then you can use to_netcdf as usual with mode="a" to append.


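from time import sleep

import h5netcdf
import xarray as xr

ds = xr.Dataset(data_vars=dict(hello=(["x"], [1., 1., 1., 1., 1.])))

track_order = False
group = "/track"

# Create the file first so that the root group is defined with track_order=False.
with h5netcdf.File("sample1.nc", "a", track_order=track_order) as f1:
    # Only create a subgroup if `group` is not the root group ("/").
    if group.split("/")[-1]:
        f1.create_group(group)

# Append the dataset to the group that already exists.
ds.to_netcdf("sample1.nc", mode="a", engine="h5netcdf", group=group)

# Wait so that the second file is written at a clearly different time.
sleep(5)

# Same procedure for the second file.
with h5netcdf.File("sample2.nc", "a", track_order=track_order) as f2:
    if group.split("/")[-1]:
        f2.create_group(group)

ds.to_netcdf("sample2.nc", mode="a", engine="h5netcdf", group=group)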

Update: Use mode="a" everywhere.

Update 2 (caveat): You will never again be able to append to this file with netcdf-c/netCDF4-python.
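To check whether two files written this way are byte-identical, their MD5 digests can be compared. The following is a minimal sketch that uses only the Python standard library. The helper names file_md5 and files_identical are hypothetical and are not part of xarray or h5netcdf, and the file names assume that sample1.nc and sample2.nc were written by the workaround above.

```python
import hashlib
from pathlib import Path


def file_md5(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of the file at `path`, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        # iter() with a sentinel keeps reading until read() returns b"" (EOF).
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def files_identical(path_a, path_b):
    """Return True if both files have the same MD5 digest."""
    return file_md5(path_a) == file_md5(path_b)


if __name__ == "__main__":
    # Assumption: these files were produced by the workaround snippet.
    paths = [Path("sample1.nc"), Path("sample2.nc")]
    if all(p.exists() for p in paths):
        for p in paths:
            print(p, file_md5(p))
        print("identical:", files_identical(*paths))
```

Running the same check on files written with the default track_order=True should make the difference reported in this issue visible.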

abunimeh commented 1 year ago

Thanks @kmuehlbauer for explaining this.

@jhamman Yes, I was hoping that I could pass ds.to_netcdf(..., track_order=False) when the engine is h5netcdf.

It would be nice to enhance backend_kwargs