eric-czech commented 3 years ago

This is an attempt at https://github.com/pangeo-data/rechunker/issues/45.

I'm not sure what the best way to go about this is, but I thought I would get something working and then get thoughts from you guys on where to go next. Notes:

The signature I'm exposing here is rechunk_dataset(source: Dataset, encoding: Mapping, max_mem, target_store, temp_store, executor). I'm using encoding to indicate the target chunkings, along with any other compressor/filter options, so there is some consistency with Dataset.to_zarr. I think it would probably be best if Dataset/DataArray were other possible options in the main rechunk function with the same target_chunks parameter and any other options in {target|temp}_options. For the sake of discussion I thought it was easier to review quickly if all the new code was in one place. I'll happily combine the functions if this is on the right track.
I'm probably missing some corner cases, but this seemed like a reasonable lift of the necessary functionality in the zarr backend. I tried to ensure that the condition validated at xarray/backends/zarr.py#L135 is also checked here. I'm not actually sure if that matters in this case. @rabernat / @TomAugspurger do you happen to know if I should need to worry about this? I.e. if the input dataset contains dask arrays backed by a zarr store and the target chunking would create new dask chunks that overlap multiple old zarr chunks, should rechunker throw an error like xarray does?
I'm encoding attributes using encode_zarr_attr_value. Should all of the rechunker functions be using this or something like it too?
I think there are several more tests that would be useful -- the test in this PR is just a start.

codecov[bot] commented 3 years ago

Codecov Report

Merging #52 into master will increase coverage by 2.75%. The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master      #52      +/-   ##
==========================================
+ Coverage   95.00%   97.75%   +2.75%     
==========================================
  Files          10       10              
  Lines         400      445      +45     
  Branches       78       88      +10     
==========================================
+ Hits          380      435      +55     
+ Misses         10        5       -5     
+ Partials       10        5       -5

Impacted Files	Coverage Δ
rechunker/api.py	`100.00% <100.00%> (+7.46%)`	:arrow_up:

Continue to review full report at Codecov.

Legend - Click here to learn more Δ = absolute <relative> (impact), ø = not affected, ? = missing data Powered by Codecov. Last update 8917e20...8502a33. Read the comment docs.

tomwhite commented 3 years ago

This looks great @eric-czech, thanks for working on it.

I think it would probably be best if Dataset/DataArray were other possible options in the main rechunk function

+1

shoyer commented 3 years ago

I'm encoding attributes using encode_zarr_attr_value. Should all of the rechunker functions be using this or something like it too?

I suspect this is only really relevant if the source is from xarray:

If the source is from zarr, attributes are already encoded properly
If the source is from dask, attributes don't exist

eric-czech commented 3 years ago

Ok I re-worked this one a good bit (apologies for the big changes since first review). Some notes on the latest commit at https://github.com/pangeo-data/rechunker/pull/52/commits/fc1b17a6629644ff678f3655dddfe65397b80fcc:

Added Datasets as a possible source in api.rechunk (there is no rechunk_dataset now)
Changed {temp|target}_options to be specific to individual variables and have a structure similar to target_chunks
Defaulted the temporary store location to somewhere in the system temp dir for Zarr groups and Xarray datasets
- It didn't make much sense IMO for temp_store to be optional in the API and not have a default be set for collections of arrays. Otherwise when it's not set, an assertion error was thrown if any one array is rechunked to a different size.
I skipped trying to support DataArrays because of https://github.com/pydata/xarray/issues/2660
The happy path for these changes would result from usage like this:

import zarr
import xarray as xr
import numpy as np
from rechunker.api import rechunk

shape = (100, 50)
ds = xr.Dataset(
    dict(
        a=(("x", "y"), np.ones(shape, dtype='f4')),
        b=(("x"), np.ones(shape[0])),
        c=(("y"), np.ones(shape[1]))
    ),
    coords=dict(
        cx=(("x"), np.ones(shape[0])),
        cy=(("y"), np.ones(shape[1]))
    )
).chunk(chunks=25)

rechunked = rechunk(
    ds,
    target_chunks=dict(a=(10, 10), b=(10,), c=(10,)),
    max_mem='50MB',
    target_store="/tmp/store.zarr",
    target_options=dict(
        a=dict(
            compressor=zarr.Blosc(cname="zstd"),
            dtype="int16",
            scale_factor=0.1,
            _FillValue=-9999,
        )
    )
)
print(rechunked)
<Rechunked>
* Source      : <xarray.Dataset>
Dimensions:  (x: 100, y: 50)
Coordinates:
    cx       (x) float64 dask.array<chunksize=(25,), meta=np.ndarray>
    cy       (y) float64 dask.array<chunksize=(25,), meta=np.ndarray>
Dimensions without coordinates: x, y
Data variables:
    a        (x, y) float32 dask.array<chunksize=(25, 25), meta=np.ndarray>
    b        (x) float64 dask.array<chunksize=(25,), meta=np.ndarray>
    c        (y) float64 dask.array<chunksize=(25,), meta=np.ndarray>

* Intermediate: <zarr.hierarchy.Group '/'>

* Target      : <zarr.hierarchy.Group '/'>

rabernat commented 3 years ago

Thanks for all the hard work happening here!

I would rather require an explicit temp directory for now. My concern is that using a local directory as a default is likely to result in unexpected errors when scaling up rechunker for "production" use cases that run on multiple machines.

👍 to this. Our main use of rechunker is using dask in the cloud with object store, where there is no shared local filesystem. I'd like to avoid any default assumptions about the nature of the storage.

Going forward, maybe we could consider adding some sort of config system for rechunker, which would allow you to specify your preferred way of creating temporary storage.

eric-czech commented 3 years ago

Our main use of rechunker is using dask in the cloud with object store, where there is no shared local filesystem

Ah of course, makes sense.

In https://github.com/pangeo-data/rechunker/pull/52/commits/67ee2aa4674d94cbb11f18c04326504872745283, I removed the default temp store, added a better error when it's not present, and added a NotImplementedError when the source is Xarray and the executor is anything but dask. I think I should probably do the same for when the source is da.Array. Does this sound right to you both?

shoyer commented 3 years ago

In 67ee2aa, I removed the default temp store, added a better error when it's not present, and added a NotImplementedError when the source is Xarray and the executor is anything but dask. I think I should probably do the same for when the source is da.Array. Does this sound right to you both?

This sounds fine to me for now.

Long term, I do think it could make sense to pass an xarray.Dataset backed by multi-threaded dask into alternative executors, such as Beam. But this certainly isn't urgent.

eric-czech commented 3 years ago

In 67ee2aa, I removed the default temp store, added a better error when it's not present, and added a NotImplementedError when the source is Xarray and the executor is anything but dask. I think I should probably do the same for when the source is da.Array. Does this sound right to you both?

This sounds fine to me for now.

Ok, https://github.com/pangeo-data/rechunker/pull/52/commits/8502a33df6b5c5d5416b9b81ed268a2daa4c75e8 adds a similar error for dask array sources.

Long term, I do think it could make sense to pass an xarray.Dataset backed by multi-threaded dask into alternative executors, such as Beam. But this certainly isn't urgent.

I see, maybe the error in https://github.com/pangeo-data/rechunker/pull/52#discussion_r498457119 is actually pretty superficial? I can't tell whether or not that's hinting at a fundamental limitation.

eric-czech commented 3 years ago

Is there anything else you guys think I should address on this one?

eric-czech commented 3 years ago

Hey @shoyer sorry to keep bugging you about this one, but is there anything else you'd like me to change?

rabernat commented 3 years ago

Hi @eric-czech. Thanks for your work on this! And thanks for your patience.

I'm fine with merging now. I assume issues will come up as people try it out, and we can iterate as needed.

eric-czech commented 3 years ago

Thanks @rabernat! And for your suggestions @shoyer.

pangeo-data / rechunker

Add rechunking for Xarray datasets #52

Codecov Report