pydata / xarray

N-D labeled arrays and datasets in Python
https://xarray.dev
Apache License 2.0

merging and saving loaded datasets can lead to string truncation #9757

Open jcmgray opened 5 days ago

jcmgray commented 5 days ago

What happened?

If one:

  1. loads a dataset saved using engine="h5netcdf" with a string coordinate, say <U2
  2. merges it with another dataset which matches but has longer strings in the same coordinate, say <U4
  3. then saves that merged dataset using engine="h5netcdf"

then the encoding from loading the initial dataset, which survives the merge, causes the coordinate to be silently truncated back to <U2, so that when it is loaded again the data is incorrect.

This is specific to the "h5netcdf" engine; it does not happen with the "scipy" engine.

What did you expect to happen?

I guess the encoding should be dropped or updated during the merge call.

Minimal Complete Verifiable Example

import xarray as xr

engine = "h5netcdf"

# save a dataset with a short string coordinate (<U2), then load it back
ds1 = xr.Dataset(coords={'x': ['ab', 'bc', 'c']})
ds1.to_netcdf('ds1.h5', engine=engine)
ds1 = xr.open_dataset('ds1.h5', engine=engine)
ds1.close()

# merge with a dataset that has longer strings in the same coordinate
ds2 = xr.Dataset(coords={'x': ['abc', 'bcd', 'cd']})
ds2 = ds1.merge(ds2)
print(ds2.x.encoding)
print("expected", ds2.x.values)
ds2.to_netcdf('ds1.h5', engine=engine)

# reload the merged dataset: the strings come back truncated
ds2 = xr.open_dataset('ds1.h5', engine=engine)
ds2.close()
print("loaded  ", ds2.x.values)

# output:
# {'dtype': dtype('<U2')}
# expected ['ab' 'abc' 'bc' 'bcd' 'c' 'cd']
# loaded   ['ab' 'ab' 'bc' 'bc' 'c' 'cd']

Environment

INSTALLED VERSIONS
------------------
commit: None
python: 3.12.6 | packaged by conda-forge | (main, Sep 30 2024, 18:08:52) [GCC 13.3.0]
python-bits: 64
OS: Linux
OS-release: 5.15.0-124-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: ('en_US', 'UTF-8')
libhdf5: 1.14.3
libnetcdf: None
xarray: 2024.10.0
pandas: 2.2.3
numpy: 2.0.2
scipy: 1.14.1
netCDF4: None
pydap: None
h5netcdf: 1.4.0
h5py: 3.11.0
zarr: None
cftime: None
nc_time_axis: None
iris: None
bottleneck: None
dask: 2024.10.0
distributed: 2024.10.0
matplotlib: 3.9.2
cartopy: None
seaborn: 0.13.2
numbagg: None
fsspec: 2024.9.0
cupy: None
pint: None
sparse: 0.15.4
flox: None
numpy_groupies: None
setuptools: 75.1.0
pip: 24.2
conda: None
pytest: 8.3.3
mypy: None
IPython: 8.28.0
sphinx: 8.1.3

kmuehlbauer commented 3 days ago

@jcmgray Thanks for the well written issue and the MCVE.

I get the same output with engine="netcdf4", so I would not say this is specific to the h5netcdf engine; it has to do with how encoding is propagated.

The best answer is to drop the encoding in such cases. See #6323 for a more in-depth discussion of encoding. Please also have a read here: https://docs.xarray.dev/en/stable/user-guide/io.html#reading-encoded-data and let us know if and how the documentation could be improved to make this clearer.
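
A minimal sketch of what dropping the encoding could look like, using Dataset.drop_encoding() (present in recent xarray releases, including the 2024.10.0 reported above). To keep the sketch free of file I/O, the stale encoding is set by hand here, which is an assumption; in the report it is attached by open_dataset. The output file name in the last comment is hypothetical.

```python
import numpy as np
import xarray as xr

# Stand-in for the dataset loaded from disk. In the report the encoding is
# attached by open_dataset; here it is set by hand (an assumption made so
# that the sketch needs no netCDF backend).
ds1 = xr.Dataset(coords={"x": ["ab", "bc", "c"]})
ds1["x"].encoding["dtype"] = np.dtype("<U2")

ds2 = xr.Dataset(coords={"x": ["abc", "bcd", "cd"]})
merged = ds1.merge(ds2)

# Any encoding carried over from ds1 is visible here.
print(merged.x.encoding)

# drop_encoding() returns a new Dataset whose variables have empty encoding,
# so the writer derives the on-disk dtype from the data itself.
cleaned = merged.drop_encoding()
print(cleaned.x.encoding)

# Writing would then proceed as before (hypothetical file name):
# cleaned.to_netcdf("merged.h5", engine="h5netcdf")
```

Note that drop_encoding() does not modify the dataset in place, so the returned object is the one to save.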

jcmgray commented 1 day ago

Hi @kmuehlbauer, thanks for the response. Apologies for missing the prior discussion around encoding - indeed simply dropping the encoding works perfectly for me, feel free to close.

For what it's worth, my thoughts behavior/docs-wise (as a user who hasn't needed to think about encoding before):

  1. having it dropped automatically would make sense to me
  2. a warning that data truncation is happening on write might be nice (it was quite hard to pin down exactly where this was happening!)
  3. similarly, it might be good to warn in the docs that if encoding and data-type get out of sync it can lead to truncation of data
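
Until something like points 2 or 3 exists, a user-side check before writing is one option. The sketch below is not part of xarray: find_truncating_encodings is a hypothetical helper that flags variables whose encoded fixed-width string dtype is narrower than the in-memory dtype. It assumes the mismatch shows up in encoding["dtype"], as in the output above.

```python
import numpy as np
import xarray as xr


def find_truncating_encodings(ds):
    """Hypothetical helper: names of variables whose encoding dtype is a
    fixed-width string type narrower than the data's own dtype."""
    risky = []
    for name, var in ds.variables.items():
        enc = var.encoding.get("dtype")
        if enc is None:
            continue
        enc = np.dtype(enc)
        same_kind = var.dtype.kind in "US" and enc.kind == var.dtype.kind
        # itemsize 0 means a flexible dtype (e.g. plain str), so no fixed width
        if same_kind and 0 < enc.itemsize < var.dtype.itemsize:
            risky.append(name)
    return risky


# The data is <U3, but the (hand-set) encoding still claims <U2.
ds = xr.Dataset(coords={"x": ["abc", "bcd", "cd"]})
ds["x"].encoding["dtype"] = np.dtype("<U2")

print(find_truncating_encodings(ds))
print(find_truncating_encodings(ds.drop_encoding()))
```

A check like this could run just before to_netcdf and either raise or call drop_encoding() on the dataset, depending on how strict the caller wants to be.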