Closed Metamess closed 2 weeks ago
I suspect this may be a regression introduced by #7019
I have attempted to write a fix for this, see PR #8069, but some failing tests in `test_backends` seem to point to further issues regarding the setting of the encoding attribute: the test `xarray/tests/test_backends.py::TestZarrDictStore::test_chunk_encoding_with_dask` seems to directly assume that `.chunk()` does not set any encoding, and I did not want to overrule that assumption without further input.
Yes, we do not want to update encoding. We are planning on getting rid of it soon (https://github.com/pydata/xarray/issues/6323#issuecomment-1492098719). You should be able to use `.reset_encoding()` to get rid of the errors.
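For illustration, here is a minimal pure-Python sketch of what `.reset_encoding()` effectively does (plain dicts stand in for xarray variables; the structure here is a simplified assumption, not xarray's internals): it returns the dataset with every variable's encoding cleared, so a stale `"chunks"` entry can no longer conflict at write time.

```python
# Hypothetical stand-in for a Dataset: each variable carries an
# `encoding` dict, mirroring how xarray attaches on-disk parameters.
def reset_encoding(variables):
    """Return a copy of `variables` with all encoding dicts emptied."""
    return {
        name: {**var, "encoding": {}}
        for name, var in variables.items()
    }

ds = {
    "temperature": {
        "dask_chunks": (50,),  # in-memory chunks after .chunk(50)
        "encoding": {"chunks": (100,), "dtype": "f4"},  # stale on-disk chunks
    }
}

cleared = reset_encoding(ds)
print(cleared["temperature"]["encoding"])  # {} -> no conflicting "chunks" left
```

Note that this clears everything: the unrelated `"dtype"` entry is lost along with the stale `"chunks"`, which is the trade-off discussed below.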
> From the chunks attribute in each variable's encoding (can be set via `Dataset.chunk`).

We should delete that reference to `Dataset.chunk`.
Ah, I see. I was not aware of the intention to deprecate the encoding attribute. However, I think the current state of the `.chunk()` function is still undesirable, as it leaves the Dataset in a 'broken' state.
I can think of 4 solutions to the current situation, and I am willing to edit my PR to fit whatever resolution is deemed best:

1. Always set `encoding["chunks"]` as part of `.chunk()`. When the encoding attribute gets removed from the codebase, this behavior will be taken out again. Until then it would appear (to me, at least) as the 'expected' result of the `.chunk()` function, given the existence of encoding. (This is the current intent in my PR)
2. Only update `encoding["chunks"]` as part of `.chunk()` if it is already present on the variable. This would effectively only "fix" the encoding attribute where it exists. No encoding is added, but a Dataset with a configured `encoding["chunks"]` does not suddenly lose it after passing through `.chunk()`. (Feels a bit inconsistent to me)
3. Remove `encoding["chunks"]` entries when `.chunk()` is called. This provides a 'clean slate', where chunking is now defined by the new chunk sizes of the Dask arrays. Any existing values in encoding are made obsolete by calling `.chunk()` and are thus removed. This would work towards a future without encoding in terms of program flow. (This would be my preferred solution, given my current understanding of the state of things)
4. All encoding is removed when `.chunk()` is called, possibly by adding a call to `.reset_encoding()` to the function. This aggressively moves towards a future without encoding, but has the (possibly significant) downside of removing any other properties stored in encoding, which would probably be unexpected behavior for users. (I would not be in favor of this)

My problem with the last option is also why calling `.reset_encoding()` after a call to `.chunk()` is undesirable to me, as in my use case there are more properties stored in encoding that I do not want to lose when changing the chunk sizes.
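To make the four options concrete, here is a hedged pure-Python sketch of how each one would treat a single variable's encoding dict when rechunking (the function names are my own, and plain dicts stand in for xarray variables):

```python
def option_always_set(encoding, new_chunks):
    # 1. Always write the new chunk sizes into encoding["chunks"].
    return {**encoding, "chunks": new_chunks}

def option_update_if_present(encoding, new_chunks):
    # 2. Only refresh encoding["chunks"] where it already exists.
    if "chunks" in encoding:
        return {**encoding, "chunks": new_chunks}
    return dict(encoding)

def option_drop_chunks(encoding, new_chunks):
    # 3. Remove the now-obsolete encoding["chunks"] entry entirely.
    return {k: v for k, v in encoding.items() if k != "chunks"}

def option_reset_all(encoding, new_chunks):
    # 4. Drop the whole encoding dict (like calling .reset_encoding()).
    return {}

enc = {"chunks": (100,), "dtype": "f4"}
print(option_always_set(enc, (50,)))        # {'chunks': (50,), 'dtype': 'f4'}
print(option_update_if_present({}, (50,)))  # {} -> nothing is added
print(option_drop_chunks(enc, (50,)))       # {'dtype': 'f4'}
print(option_reset_all(enc, (50,)))         # {} -> "dtype" is lost too
```

The last two lines show the difference that matters for the argument above: option 3 keeps unrelated encoding entries such as `"dtype"`, while option 4 discards them.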
Regarding the proposed future without the encoding
attribute, with any encoding stored in a separate object: Consider an application where a Dataset is created in one part, then goes through several stages of processing, and is finally written to a file. It would be a pain to have to pass an encoding object alongside it, in every function, just so that the encoding is not lost along the way, while it is only required at the end during the write stage. I do not expect to move the needle on the overall decision with this comment, but I hope it can serve as an argument for why a built-in solution that does not simply fully clear encoding
may have some merit.
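The ergonomics argument above can be sketched in a few lines (a hypothetical pipeline; none of these function names come from xarray): if encoding lived in a separate object, every processing stage would have to accept and return it unchanged just so the final write stage can still see it.

```python
def load():
    data = [1.0, 2.0, 3.0]
    encoding = {"dtype": "f4", "compressor": "zlib"}
    return data, encoding

def normalize(data, encoding):  # must accept and return encoding...
    total = sum(data)
    return [x / total for x in data], encoding

def smooth(data, encoding):     # ...even though it never uses it
    return data, encoding

def write(data, encoding):
    return f"wrote {len(data)} values with encoding {encoding}"

data, enc = load()
data, enc = normalize(data, enc)
data, enc = smooth(data, enc)
print(write(data, enc))
```

With encoding stored on the object itself, `normalize` and `smooth` would not need the extra parameter at all; only `load` and `write` care about it.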
I will try and update my PR as soon as possible after further input :)
Closing as unlikely to inspire change; please reopen if anyone disagrees.
What happened?
When using the `chunk` function to change the chunk sizes of a Dataset (or DataArray, which uses the Dataset implementation of `chunk`), the chunk sizes of the Dask arrays are changed, but the "chunks" entry of the `encoding` attribute is not changed accordingly. This causes a NotImplementedError to be raised when attempting to write the Dataset to a zarr store (and presumably other formats as well).

Looking at the implementation of `chunk`, every variable is rechunked using the `_maybe_chunk` function, which actually has the parameter `overwrite_encoded_chunks` to control just this behavior. However, it is an optional parameter which defaults to False, and the call in `chunk` does not provide a value for this parameter, nor does it offer the caller a way to influence it (by having an `overwrite_encoded_chunks` parameter itself, for example).

I do not know why this default value was chosen as False, or what could break if it was changed to True, but looking at the documentation, it seems to be the opposite of the intended effect. From the documentation of `to_zarr`:

> From the chunks attribute in each variable's encoding (can be set via `Dataset.chunk`).

Which is exactly what it does not do.
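Here is a hedged pure-Python sketch of the interaction being described (the real `_maybe_chunk` and the zarr backend's consistency check live inside xarray; the dict-based variables and exact logic here are simplified assumptions): with the default `overwrite_encoded_chunks=False`, the stale `encoding["chunks"]` survives rechunking and a write-time check then fails.

```python
def maybe_chunk(var, new_chunks, overwrite_encoded_chunks=False):
    """Simplified stand-in for xarray's _maybe_chunk: rechunk a variable
    and optionally refresh its encoded on-disk chunks."""
    var = {**var, "dask_chunks": new_chunks, "encoding": dict(var["encoding"])}
    if overwrite_encoded_chunks:
        var["encoding"]["chunks"] = new_chunks
    return var

def check_write_chunks(var):
    """Simplified stand-in for the backend's write-time consistency check."""
    enc_chunks = var["encoding"].get("chunks")
    if enc_chunks is not None and enc_chunks != var["dask_chunks"]:
        raise NotImplementedError(
            f"encoded chunks {enc_chunks} != dask chunks {var['dask_chunks']}"
        )
    return "ok"

var = {"dask_chunks": (100,), "encoding": {"chunks": (100,)}}

stale = maybe_chunk(var, (50,))  # default overwrite_encoded_chunks=False
try:
    check_write_chunks(stale)
except NotImplementedError as e:
    print("write failed:", e)

fixed = maybe_chunk(var, (50,), overwrite_encoded_chunks=True)
print(check_write_chunks(fixed))  # ok
```

Passing `overwrite_encoded_chunks=True` through from `chunk` is exactly the knob the issue says is currently neither set nor exposed to callers.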
What did you expect to happen?
I would expect the "chunks" entry of the `encoding` attribute to be changed to reflect the new chunking scheme.

Minimal Complete Verifiable Example
MVCE confirmation
Relevant log output
No response
Anything else we need to know?
No response
Environment