pydata / xarray

N-D labeled arrays and datasets in Python
https://xarray.dev
Apache License 2.0

Dataset.chunk() does not overwrite encoding["chunks"] #8062

Closed: Metamess closed this issue 2 weeks ago

Metamess commented 1 year ago

What happened?

When using the chunk function to change the chunk sizes of a Dataset (or DataArray, which uses the Dataset implementation of chunk), the chunk sizes of the Dask arrays are changed, but the "chunks" entry of the encoding attribute is not updated accordingly. This causes a NotImplementedError to be raised when attempting to write the Dataset to zarr (and presumably to other formats as well).

Looking at the implementation of chunk, every variable is rechunked using the _maybe_chunk function, which actually has a parameter, overwrite_encoded_chunks, to control exactly this behavior. However, it is an optional parameter that defaults to False, and the call in chunk neither provides a value for it nor offers the caller a way to influence it (for example by exposing an overwrite_encoded_chunks parameter itself).
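
For reference, a minimal, self-contained sketch of the behavior that flag controls (illustrative only, not xarray's actual internals; the function name here is made up):

def maybe_chunk_sketch(var, chunks, overwrite_encoded_chunks=False):
    # var is an xarray Variable or DataArray; rechunk its underlying dask array
    var = var.chunk(chunks)
    if overwrite_encoded_chunks and var.chunks is not None:
        # Only when the flag is True is the encoding updated to the new
        # (uniform) chunk sizes; with the default of False, the stale
        # encoding["chunks"] value survives the rechunk.
        var.encoding["chunks"] = tuple(c[0] for c in var.chunks)
    return var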

I do not know why the default value was chosen to be False, or what could break if it were changed to True, but it seems to be the opposite of the documented behavior. From the documentation of to_zarr:

Zarr chunks are determined in the following way: From the chunks attribute in each variable’s encoding (can be set via Dataset.chunk).

Which is exactly what it does not do.

What did you expect to happen?

I would expect the "chunks" entry of the encoding attribute to be changed to reflect the new chunking scheme.

Minimal Complete Verifiable Example

import xarray as xr
import numpy as np

# Create a test Dataset with dimension x and y, each of size 100, and a chunksize of 50
ds_original = xr.Dataset({"my_var": (["x", "y"], np.random.randn(100, 100))})
# Since 'chunk' does not work, manually set encoding
ds_original.my_var.encoding["chunks"] = (50, 50)
# To best showcase the real-life example, write it to file and read it back again.
# The same could be achieved by just calling .chunk() with chunksizes of 25, but this feels more 'complete'
filepath = "~/chunk_test.zarr"
ds_original.to_zarr(filepath)
ds = xr.open_zarr(filepath)

# Check the chunksizes and "chunks" encoding
print(ds.my_var.chunks)
# >>> ((50, 50), (50, 50))
print(ds.my_var.encoding["chunks"])
# >>> (50, 50)

# Rechunk the Dataset
ds = ds.chunk({"x": 25, "y": 25})
# The chunksizes have changed
print(ds.my_var.chunks)
# >>> ((25, 25, 25, 25), (25, 25, 25, 25))
# But the encoding value remains the same
print(ds.my_var.encoding["chunks"])
# >>> (50, 50)

# Attempting to write this back to zarr raises an error
ds.to_zarr("~/chunk_test_rechunked.zarr")
# NotImplementedError: Specified zarr chunks encoding['chunks']=(50, 50) for variable named 'my_var' would overlap multiple dask chunks ((25, 25, 25, 25), (25, 25, 25, 25)). Writing this array in parallel with dask could lead to corrupted data. Consider either rechunking using `chunk()`, deleting or modifying `encoding['chunks']`, or specify `safe_chunks=False`.
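
For completeness, here is a sketch of the two encoding-related workarounds the error message itself suggests, applied to the rechunked ds above (pick one):

# Workaround 1: update the stale encoding so it matches the new dask chunks
ds.my_var.encoding["chunks"] = (25, 25)
ds.to_zarr("~/chunk_test_rechunked.zarr")

# Workaround 2 (instead of the above): skip the consistency check entirely;
# only safe when the same zarr chunk is never written by multiple dask tasks
# ds.to_zarr("~/chunk_test_rechunked.zarr", safe_chunks=False)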

MVCE confirmation

Relevant log output

No response

Anything else we need to know?

No response

Environment

INSTALLED VERSIONS
------------------
commit: None
python: 3.10.6 (main, May 29 2023, 11:10:38) [GCC 11.3.0]
python-bits: 64
OS: Linux
OS-release: 5.10.16.3-microsoft-standard-WSL2
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: C.UTF-8
LOCALE: ('en_US', 'UTF-8')
libhdf5: 1.10.7
libnetcdf: 4.8.1

xarray: 2023.7.0
pandas: 1.5.3
numpy: 1.24.2
scipy: 1.10.0
netCDF4: 1.5.8
pydap: None
h5netcdf: 0.12.0
h5py: 3.6.0
Nio: None
zarr: 2.14.1
cftime: 1.5.2
nc_time_axis: None
PseudoNetCDF: None
iris: None
bottleneck: 1.3.6
dask: 2022.01.0+dfsg
distributed: 2022.01.0+ds.1
matplotlib: 3.5.1
cartopy: None
seaborn: None
numbagg: None
fsspec: 2023.1.0
cupy: None
pint: None
sparse: None
flox: None
numpy_groupies: None
setuptools: 59.6.0
pip: 23.2.1
conda: None
pytest: 7.2.2
mypy: 1.1.1
IPython: 7.31.1
sphinx: None
TomNicholas commented 1 year ago

I suspect this may be a regression introduced by #7019

Metamess commented 1 year ago

I have attempted to write a fix for this, see PR #8069, but some failing tests in test_backends point to further issues regarding the setting of the encoding attribute: the test xarray/tests/test_backends.py::TestZarrDictStore::test_chunk_encoding_with_dask directly assumes that .chunk() does not set any encoding, and I did not want to overrule that assumption without further input.

dcherian commented 1 year ago

Yes, we do not want to update encoding. We are planning on getting rid of it soon (https://github.com/pydata/xarray/issues/6323#issuecomment-1492098719). You should be able to use .reset_encoding() to get rid of the errors.
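
For example (a minimal sketch, reusing the paths from the example above; note that reset_encoding drops all encoding entries, not just "chunks"):

ds = xr.open_zarr("~/chunk_test.zarr").chunk({"x": 25, "y": 25})
# Drop the stale encoding before writing, so to_zarr derives the zarr
# chunks from the dask chunks instead of from encoding["chunks"]
ds.reset_encoding().to_zarr("~/chunk_test_rechunked.zarr")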

From the chunks attribute in each variable’s encoding (can be set via Dataset.chunk).

We should delete that reference to Dataset.chunk.

Metamess commented 1 year ago

Ah, I see. I was not aware of the intention to deprecate the encoding attribute. However, I think the current state of the .chunk() function is still undesirable, as it leaves the Dataset in a 'broken' state.

I can think of 4 solutions to the current situation, and I am willing to edit my PR to fit whatever resolution is deemed best:

My problem with the last option is also why calling .reset_encoding() after .chunk() is undesirable to me: in my use case there are other properties stored in encoding that I do not want to lose when changing the chunk sizes.
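
Concretely, something like the following (a rough sketch) would suit my use case better than reset_encoding(), since it removes only the stale "chunks" entries and keeps the rest of the encoding:

for var in ds.variables.values():
    # Drop only the stale chunk encoding; compressor, dtype, _FillValue
    # and any other encoding entries are left untouched
    var.encoding.pop("chunks", None)
ds.to_zarr("~/chunk_test_rechunked.zarr")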

Regarding the proposed future without the encoding attribute, where any encoding is stored in a separate object: consider an application where a Dataset is created in one place, goes through several stages of processing, and is finally written to a file. It would be a pain to have to pass an encoding object alongside it through every function, just so that the encoding is not lost along the way, when it is only needed at the end, during the write stage. I do not expect to move the needle on the overall decision with this comment, but I hope it can serve as an argument for why a built-in solution that does not simply clear encoding entirely may have some merit.

I will try and update my PR as soon as possible after further input :)

max-sixty commented 2 weeks ago

Closing as unlikely to inspire change; please reopen if anyone disagrees.