pydata / xarray

N-D labeled arrays and datasets in Python
https://xarray.dev
Apache License 2.0

`to_zarr` with `mode='a'` fails when `_FillValue` is present, caused by `open_zarr` with `mask_and_scale=False`. #9053

Open cpegel opened 3 months ago

cpegel commented 3 months ago

What happened?

When opening a Zarr dataset with open_zarr and then writing it back using to_zarr, we get a ValueError if the _FillValue attribute is present on at least one data variable or coordinate. This happens, for example, when the dataset was opened with mask_and_scale=False.

ValueError: failed to prevent overwriting existing key _FillValue in attrs. This is probably an encoding field used by xarray to describe how a variable is serialized. To proceed, remove this key from the variable's attributes manually.

The error can be worked around by deleting the _FillValue attributes of all data variables and coordinates before calling to_zarr. However, if the Zarr metadata contains meaningful fill_value attributes beforehand, they will then be lost after a round-trip of open_zarr (with mask_and_scale=False) and to_zarr.
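
For reference, that workaround looks roughly like the following sketch; it simply strips the attribute from every variable before writing, at the cost of the fill_value metadata mentioned above:

import xarray as xr

zarr_path = "/path/to/storage.zarr"  # same placeholder path as in the MVCE below
ds = xr.open_zarr(zarr_path, mask_and_scale=False)

# Drop `_FillValue` from every data variable and coordinate before writing back.
# Note: this discards the fill_value metadata on the next round-trip.
for var in ds.variables.values():
    var.attrs.pop("_FillValue", None)

ds.to_zarr(zarr_path, mode='a')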

The same behavior, although not caused by mask_and_scale=False there, has been reported in the open issues #6069 and #6329, which remain unresolved.

What did you expect to happen?

I expect to be able to read from and then write to a Zarr store without having to delete attributes in between or losing the fill_value metadata available in the store. Calling to_zarr with mode='a' should simply write any DataArray's _FillValue attribute to the fill_value field in the Zarr metadata instead of failing with a ValueError.
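
For context, the fill_value referred to here lives in the Zarr array metadata and can be inspected directly with the zarr library (same placeholder path as in the MVCE below):

import zarr

zarr_path = "/path/to/storage.zarr"
group = zarr.open_group(zarr_path, mode="r")
print(group["A"].fill_value)  # the Zarr-level fill_value metadata for variable A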

Minimal Complete Verifiable Example

import xarray as xr

# Create a dataset and write to Zarr storage

ds = xr.Dataset(dict(A=xr.DataArray([1.0])))
# ds.A.attrs is empty here.

zarr_path = "/path/to/storage.zarr"
ds.to_zarr(zarr_path, mode='a')

# Read the dataset from Zarr again using `mask_and_scale=False`

ds = xr.open_zarr(zarr_path, mask_and_scale=False)
# ds.A.attrs is now {'_FillValue': nan}

# Write the dataset to Zarr again; this raises the ValueError shown above
ds.to_zarr(zarr_path, mode='a')

MVCE confirmation

Relevant log output

No response

Anything else we need to know?

No response

Environment

INSTALLED VERSIONS
------------------
commit: None
python: 3.12.1 | packaged by conda-forge | (main, Dec 23 2023, 07:53:56) [MSC v.1937 64 bit (AMD64)]
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 158 Stepping 13, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: ('de_DE', 'cp1252')
libhdf5: None
libnetcdf: None
xarray: 2024.3.0
pandas: 2.2.0
numpy: 1.26.4
scipy: 1.12.0
netCDF4: None
pydap: None
h5netcdf: None
h5py: None
Nio: None
zarr: 2.18.0
cftime: None
nc_time_axis: None
iris: None
bottleneck: None
dask: 2024.4.1
distributed: 2024.4.1
matplotlib: 3.8.2
cartopy: None
seaborn: None
numbagg: None
fsspec: 2024.3.1
cupy: None
pint: None
sparse: None
flox: None
numpy_groupies: None
setuptools: 69.0.3
pip: 24.0
conda: None
pytest: None
mypy: None
IPython: 8.21.0
sphinx: None

ghiggi commented 2 months ago

Hi @cpegel

I just stumbled on the same problem while rechunking some large datasets.

This also occurs by just doing ds = xr.open_zarr(zarr_path, decode_cf=False)!

I think fixing this is very relevant! One or two years ago, I was using the decode_cf=False, mask_and_scale=False options when rechunking datasets several terabytes in size with the rechunker package. These arguments allowed me to reduce the overall size of the rechunking tasks by a factor of 2 when the data was smartly saved as scaled integers. I did not encounter any problems back then.

So something must have changed in the past year or so!

ghiggi commented 2 months ago

MVCE for our use case:

import xarray as xr

zarr_src_path = "/ltenas8/tmp/storage_src.zarr"
zarr_dst_path = "/ltenas8/tmp/storage_rechunked.zarr"

# Create a dataset and write to Zarr storage
ds = xr.Dataset(dict(A=xr.DataArray([1.0])))
ds = ds.assign_coords({"dim_0": [1]})
ds["A"].encoding = {"dtype": "uint16", 
                    "scale_factor": 1.0, 
                    "add_offset": 0.0,
                    "_FillValue": 65535}
ds["dim_0"].encoding = {"dtype": "uint16", 
                        "scale_factor": 1.0, 
                        "add_offset": 0.0,
                        "_FillValue": 65535}
ds.to_zarr(zarr_src_path, mode='w')

# Read the dataset from a Zarr Store (using  `decode_cf=False, mask_and_scale=False`)
ds = xr.open_zarr(zarr_src_path, decode_cf=False, mask_and_scale=False)

# Do some rechunking
ds = ds.compute().chunk("auto")

# Write data with new chunking to another Zarr Store
ds.to_zarr(zarr_dst_path, mode='w')

# Add another variable to the Zarr Store
ds = ds.rename({"A": "B"})  # create another dummy variable
ds = ds.compute().chunk("auto")

# This does not work if the coordinate has _FillValue, add_offset or scale_factor
ds["dim_0"].attrs
ds.to_zarr(zarr_dst_path, mode='a')

# This works if we remove _FillValue / add_offset / scale_factor from the coordinates
# --> even if variable B still has _FillValue / add_offset / scale_factor in its attributes!
ds["dim_0"].attrs = {}
ds["B"].attrs  # these attributes are allowed: {'add_offset': 0.0, 'scale_factor': 1.0, '_FillValue': 65535}
ds.to_zarr(zarr_dst_path, mode='a')

kmuehlbauer commented 2 months ago

@ghiggi Can you explain why you are using packed data (add_offset, scale_factor) without any packing? Why not use uint16 in the first place, instead of float?

ghiggi commented 2 months ago

Hey @kmuehlbauer. I am not sure I understand your question. I had a long day. What do you mean by 'without any packing'? Specifying the encoding as in the MVCE automatically packs the data and casts the source floating-point variables to uint16 when writing to disk. If the user opens the dataset with decode_cf=True, they get the float variables back (approximated to 2, 3 or so decimals) with invalid values masked to np.nan...

kmuehlbauer commented 2 months ago

@ghiggi With scale_factor=1.0 and add_offset=0 this packing is a no-op:

unpacked = packed * scale_factor + add_offset

This means that any decimal places present in the unpacked data are effectively removed when packing, and I can't see how you would get them back when unpacking. Maybe I'm missing something, though.
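
To illustrate the arithmetic with plain NumPy (just the formula above, not xarray's actual encoding code):

import numpy as np

unpacked = np.array([1.234, 5.678])

# With scale_factor=1.0 and add_offset=0.0 the cast to uint16 drops the decimals:
packed = ((unpacked - 0.0) / 1.0).astype("uint16")   # [1, 5]
print(packed * 1.0 + 0.0)                            # [1.0, 5.0] -- decimals are gone

# With e.g. scale_factor=0.01 the packed integers retain roughly two decimals:
packed = np.round((unpacked - 0.0) / 0.01).astype("uint16")  # [123, 568]
print(packed * 0.01 + 0.0)                                   # [1.23, 5.68]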

But nevertheless I think I've detected the root cause of the issue:

https://github.com/pydata/xarray/blob/1f3bc7e96ec0b0c32250ec3f8e7e8a2d19d9a306/xarray/backends/zarr.py#L674-L698

In this block the existing variables in the store are decoded (even if the dataset was opened with decode_cf=False) and their encoding overrides the encoding on the appended data. Thus, undecoded data is treated as if it were decoded, which triggers the error. I'm not sure that removing the attrs as suggested in the error message is the right fix, as the undecoded data would then be encoded a second time.

I do not immediately see a solution to this. Maybe just checking whether the data to be written has an encoding specified, and only trying to encode in those cases, would be enough?
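
A very rough sketch of that idea (hypothetical helper, not the actual backend code; encode_cf_variable is used here only as a stand-in for whatever encoding step the Zarr backend applies):

from xarray import Variable
from xarray.conventions import encode_cf_variable

def encode_only_if_needed(var: Variable) -> Variable:
    # Hypothetical check: only run CF encoding for variables that carry their
    # own encoding; variables without encoding are assumed to already be in
    # their on-disk (encoded) form and are passed through unchanged.
    if var.encoding:
        return encode_cf_variable(var)
    return var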

ghiggi commented 2 months ago

Hey @kmuehlbauer. Shame on me. I was too tired yesterday to notice that in the MVCE I provided I unintentionally specified a scale_factor and add_offset that do not actually pack the data :-P For my use case, I got around the issue by just decoding the coordinates using the relevant options of xr.decode_cf. I am not sure either what the best solution to the issue is, but thanks for diving into it.
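
For anyone hitting the same thing, that kind of coordinate-only decoding might look roughly like this sketch (untested here; paths as in the MVCE above, and the exact decode_cf options depend on the dataset):

import xarray as xr

zarr_src_path = "/ltenas8/tmp/storage_src.zarr"        # as in the MVCE above
zarr_dst_path = "/ltenas8/tmp/storage_rechunked.zarr"

# Keep the data variables undecoded, but run only the coordinate variables
# through xr.decode_cf so their CF attributes (_FillValue, scale_factor,
# add_offset) move back into .encoding before appending.
ds = xr.open_zarr(zarr_src_path, decode_cf=False, mask_and_scale=False)
decoded_coords = xr.decode_cf(ds[list(ds.coords)])
ds = ds.assign_coords(decoded_coords.coords)
ds.to_zarr(zarr_dst_path, mode="a")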

kmuehlbauer commented 2 months ago

@jhamman, @rabernat Your expertise is needed here. How should encoding/decoding be handled in the case of undecoded or partially decoded data? Should this raise a more meaningful error?

kmuehlbauer commented 2 months ago

Also connected to #5405.

Again, the issue is due to (unconditionally) encoding variables which are already encoded. See the comment above.

rabernat commented 2 months ago

Your expertise is needed here. How should encoding/decoding be handled in the case of undecoded or partially decoded data?

This is a tricky issue. From a fundamental point of view, this is challenging because there are really two layers of encoding: Xarray's encoding pipelines and the Zarr codecs. In the Zarr world, codecs handle all of the data transformations, while attrs hold user-defined metadata. Xarray overloads the Zarr attrs with encoding information, which Zarr doesn't really know about or understand.

The opinion I have developed is that it would be best if we could push as much encoding as possible down to the Zarr layer, rather than Xarray. (This is already possible today for scale_factor and add_offset). When we originally implemented the Zarr backend, we just copied a lot from NetCDF, which delegated lots of encoding to Xarray. But, especially with Zarr V3, we have the ability to create new, custom codecs that do exactly what we want. CF-time encoding, for example, could be a Zarr codec.

In the case where Xarray is handling the encoding, I feel like we need a more comprehensive set of compatibility checks between the encoding attributes and the zarr attributes on disk in the existing store. Right now these are quite ad hoc. We need a way to clearly distinguish which zarr attributes are actually encoding parameters.
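
As a purely illustrative sketch of such a check (the set of keys below is made up for illustration, not xarray's actual list of encoding attributes):

# Illustrative only: treat a fixed set of attribute names as encoding parameters
# and report where the in-memory variable and the on-disk Zarr attrs disagree,
# instead of silently re-encoding on append.
ENCODING_LIKE_ATTRS = {"_FillValue", "scale_factor", "add_offset", "units", "calendar", "dtype"}

def conflicting_encoding_attrs(mem_attrs: dict, disk_attrs: dict) -> dict:
    return {
        key: (mem_attrs[key], disk_attrs[key])
        for key in ENCODING_LIKE_ATTRS
        if key in mem_attrs and key in disk_attrs and mem_attrs[key] != disk_attrs[key]
    }

print(conflicting_encoding_attrs({"_FillValue": 65535}, {"_FillValue": 0}))  # reports the mismatch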

dopplershift commented 2 months ago

Though wouldn't using a custom codec for writing potentially create some interoperability challenges with other clients?

rabernat commented 2 months ago

"Custom codec" here means an official Zarr extension. We now have this possibility in Zarr V3. Yes, implementations would have to support it.

It's important to recognize that we are already essentially using such custom codecs (without any spec at the Zarr level) when we let Xarray handle the encoding. If a non-Xarray Zarr client opens data with cf-time encoding, or scale_factor and add_offset in the attrs, it will not be decoded properly. If we push this to the Zarr level, at least such clients will be able to raise an error rather than sending un-decoded data to the user.

dopplershift commented 2 months ago

:+1: to official codec.

I agree it's good to formalize this and make the user experience more robust. I don't think it's necessarily better to provide an error rather than undecoded data, though. Undecoded data can be introspected and leaves the user the possibility of handling it manually. An error is a full stop, with no workaround. It is, by definition, user-hostile, especially when the user is not responsible for creating the file.