pangeo-data / rechunker

Disk-to-disk chunk transformation for chunked arrays.
https://rechunker.readthedocs.io/
MIT License
163 stars 25 forks source link

Skip writing fill-value-only chunks #94

Closed forman closed 2 years ago

forman commented 3 years ago

It would be really nice if rechunker could exploit the Zarr feature that chunks comprising only fill-values may be missing entirely, given that a fill_value is defined for that array. With a new configuration parameter, e.g. drop_fill_value_chunks, rechunker could be configured not to write a target chunk at all, given all of its values equal fill_value.

In our xcube package we have a dedicated tool called xcube prune for this purpose. But it may be slow for large datasets because it tests and deletes chunks sequentially.

d-v-b commented 2 years ago

this new feature added to zarr might be of interest: https://github.com/zarr-developers/zarr-python/pull/738

DanielLumb commented 2 years ago

I'm new to rechunker and immediately on initial testing I stumbled upon the issue of rechunker always writing empty chunks. My zarr array are pretty much sparse and since zarr 0.11 it was a great performance boost, when zarr supports ability to skip writing empty chunks.

I expected that (after zarr 0.11) libraries like rechunker would actually skip writing empty chunks quite automatically/transparently, as this is being handled by zarr pretty automatically (at least when write_empty_chunks=False). With a bit of tweaking I was able to pass write_empty_chunks=False to rechunker as target_options, but it seems it was ignored and rechunker somehow insisted on writing empty chunks anyway (which in case of large arrays with 10-20% fill ratio kills performance considerably).

The question: is writing empty chunks vital/critical to how rechunker works and there's a deeper reason for that?

Thanks for any thoughts/pointers and thanks for this otherwise great library.

d-v-b commented 2 years ago

I'm not super familiar with the internals of rechunker, but it's possible that write_empty_chunks is being specified when an array is created, but ignored when the array is accessed at a later point in time. Unlike stuff like compressor and chunks, write_empty_chunks is not part of the zarr array metadata that gets written to json; instead, write_empty_chunks must be specified each time you create a new instance of zarr.Array. If rechunker isn't doing this, then it might explain what you are seeing.

DanielLumb commented 2 years ago

Yes, you seem to be perfectly right. I mistakenly assumed also that it was already the default in zarr, which is not the case.

With some tweaks I was able to get it to work for me, the only change necessary in zarr was in zarr.creation.py/empty(), which for some reason didn't accept a fill_value= in kwargs, instead supplying hard-coded None inside the function, which doesn't work with my data. With a quick fix to that (), and adding a few tweaks in rechunker (mainly adding more properties to rechunker.py/ZARR_OPTIONS), I'm already testing and it seems to work nicely.

If there's some good reason why zarr.empty() cannot take fill_value as argument, then maybe changing the way this function is used by rechunker could solve the issue.

Thank you very much for fast and precise hint! I'm not familiar with development process enough to make the necessary changes in the libraries, but can confirm it's only a few lines to fix this and that it works.

d-v-b commented 2 years ago

Glad to hear that was useful, and I don't see any reason why the routines in creation.py can't take write_empty_chunks as an argument. I can add this to an ongoing PR that addresses this in a different part of the zarr-python codebase.

DanielLumb commented 2 years ago

Thanks a lot! Just note that the modification I needed to make in zarr's creation.py was not about write_empty_chunks. In order to properly work, I needed to pass proper fill_value, which rechunker passed on to empty() in creation.py, which does not properly accept fill_value, instead enforcing fixed None value. That's the line I had to fix:

-def empty(shape, **kwargs):
+def empty(shape, fill_value=None, **kwargs):
-    return create(shape=shape, fill_value=None, **kwargs)
+    return create(shape=shape, fill_value=fill_value, **kwargs)
DanielLumb commented 2 years ago

And a suggestion for the maintainers of rechunker: if you would add write_empty_chunks and fill_value to allowed ZARR_OPTIONS - that makes rechunker able to happily create perfectly empty chunks with zarr 0.11+.

rabernat commented 2 years ago

Thanks for the feedback everyone! I understand that this is an important priority. This feature is being implemented in https://github.com/pangeo-data/rechunker/pull/119!