pangeo-data / rechunker

Disk-to-disk chunk transformation for chunked arrays.
https://rechunker.readthedocs.io/
MIT License

Add rechunking for Xarray datasets #52

Closed · eric-czech closed this 3 years ago

eric-czech commented 3 years ago

This is an attempt at https://github.com/pangeo-data/rechunker/issues/45.

I'm not sure what the best way to go about this is, but I thought I would get something working and then get thoughts from you guys on where to go next. Notes:

codecov[bot] commented 3 years ago

Codecov Report

Merging #52 into master will increase coverage by 2.75%. The diff coverage is 100.00%.


@@            Coverage Diff             @@
##           master      #52      +/-   ##
==========================================
+ Coverage   95.00%   97.75%   +2.75%     
==========================================
  Files          10       10              
  Lines         400      445      +45     
  Branches       78       88      +10     
==========================================
+ Hits          380      435      +55     
+ Misses         10        5       -5     
+ Partials       10        5       -5     
Impacted Files     Coverage Δ
rechunker/api.py   100.00% <100.00%> (+7.46%) ↑


tomwhite commented 3 years ago

This looks great @eric-czech, thanks for working on it.

> I think it would probably be best if Dataset/DataArray were other possible options in the main rechunk function

+1
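
Something like the following dispatch could route each source type to its implementation (a minimal sketch of the idea being discussed, not the PR's actual code; the path labels are placeholders):

import dask.array as da
import xarray as xr
import zarr

def _dispatch(source):
    # hypothetical routing: xarray objects go to a dataset-aware path,
    # everything else to the existing array path
    if isinstance(source, (xr.Dataset, xr.DataArray)):
        return "xarray path"
    if isinstance(source, (da.Array, zarr.Array)):
        return "array path"
    raise TypeError(f"unsupported source type: {type(source)}")

print(_dispatch(xr.Dataset()))  # -> "xarray path"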

shoyer commented 3 years ago
> I'm encoding attributes using encode_zarr_attr_value. Should all of the rechunker functions be using this or something like it too?

I suspect this is only really relevant if the source is from xarray.
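
For context, xarray's helper converts numpy values into plain Python types so they can be serialized as zarr attributes (a small illustration, assuming current xarray behavior):

import numpy as np
from xarray.backends.zarr import encode_zarr_attr_value

# numpy scalars and arrays are not JSON-serializable as-is;
# the helper converts them to plain Python equivalents
print(encode_zarr_attr_value(np.float32(0.5)))   # 0.5 (Python float)
print(encode_zarr_attr_value(np.array([1, 2])))  # [1, 2] (list)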

eric-czech commented 3 years ago

Ok, I reworked this one a good bit (apologies for the big changes since the first review). Some notes on the latest commit at https://github.com/pangeo-data/rechunker/pull/52/commits/fc1b17a6629644ff678f3655dddfe65397b80fcc:

import zarr
import xarray as xr
import numpy as np
from rechunker.api import rechunk

shape = (100, 50)
ds = xr.Dataset(
    dict(
        a=(("x", "y"), np.ones(shape, dtype="f4")),
        b=(("x",), np.ones(shape[0])),
        c=(("y",), np.ones(shape[1])),
    ),
    coords=dict(
        cx=(("x",), np.ones(shape[0])),
        cy=(("y",), np.ones(shape[1])),
    ),
).chunk(chunks=25)

rechunked = rechunk(
    ds,
    target_chunks=dict(a=(10, 10), b=(10,), c=(10,)),
    max_mem='50MB',
    target_store="/tmp/store.zarr",
    target_options=dict(
        a=dict(
            compressor=zarr.Blosc(cname="zstd"),
            dtype="int16",
            scale_factor=0.1,
            _FillValue=-9999,
        )
    )
)
print(rechunked)
<Rechunked>
* Source      : <xarray.Dataset>
Dimensions:  (x: 100, y: 50)
Coordinates:
    cx       (x) float64 dask.array<chunksize=(25,), meta=np.ndarray>
    cy       (y) float64 dask.array<chunksize=(25,), meta=np.ndarray>
Dimensions without coordinates: x, y
Data variables:
    a        (x, y) float32 dask.array<chunksize=(25, 25), meta=np.ndarray>
    b        (x) float64 dask.array<chunksize=(25,), meta=np.ndarray>
    c        (y) float64 dask.array<chunksize=(25,), meta=np.ndarray>

* Intermediate: <zarr.hierarchy.Group '/'>

* Target      : <zarr.hierarchy.Group '/'>
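
For reference, the Rechunked object above is a lazy plan; as with rechunker's array API, calling execute() performs the copy, after which the target can be opened as usual:

rechunked.execute()                       # run the rechunking plan
ds_out = xr.open_zarr("/tmp/store.zarr")  # read back the rechunked dataset
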
rabernat commented 3 years ago

Thanks for all the hard work happening here!

> I would rather require an explicit temp directory for now. My concern is that using a local directory as a default is likely to result in unexpected errors when scaling up rechunker for "production" use cases that run on multiple machines.

👍 to this. Our main use of rechunker is using dask in the cloud with object store, where there is no shared local filesystem. I'd like to avoid any default assumptions about the nature of the storage.

Going forward, maybe we could consider adding some sort of config system for rechunker, which would allow you to specify your preferred way of creating temporary storage.
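
For example, reusing the example above with an explicit intermediate store on object storage (the bucket paths here are hypothetical):

rechunked = rechunk(
    ds,
    target_chunks=dict(a=(10, 10), b=(10,), c=(10,)),
    max_mem="50MB",
    target_store="gs://my-bucket/target.zarr",  # hypothetical bucket
    temp_store="gs://my-bucket/tmp.zarr",       # explicit, no local default
)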

eric-czech commented 3 years ago

> Our main use of rechunker is using dask in the cloud with object store, where there is no shared local filesystem

Ah of course, makes sense.

In https://github.com/pangeo-data/rechunker/pull/52/commits/67ee2aa4674d94cbb11f18c04326504872745283, I removed the default temp store, added a better error when it's not present, and added a NotImplementedError when the source is Xarray and the executor is anything but dask. I think I should probably do the same for when the source is da.Array. Does this sound right to you both?
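
Roughly the kind of checks being described, as a sketch with hypothetical names rather than the actual commit:

import dask.array as da
import xarray as xr

def _validate(source, temp_store, executor_name):
    # dask-backed sources need an intermediate store for the two-pass copy
    if temp_store is None:
        raise ValueError(
            "A temp_store is required when rechunking dask-backed sources"
        )
    # for now, only the dask executor understands xarray sources
    if isinstance(source, (xr.Dataset, xr.DataArray)) and executor_name != "dask":
        raise NotImplementedError(
            f"Executor {executor_name!r} does not support xarray sources"
        )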

shoyer commented 3 years ago

> In 67ee2aa, I removed the default temp store, added a better error when it's not present, and added a NotImplementedError when the source is Xarray and the executor is anything but dask. I think I should probably do the same for when the source is da.Array. Does this sound right to you both?

This sounds fine to me for now.

Long term, I do think it could make sense to pass an xarray.Dataset backed by multi-threaded dask into alternative executors, such as Beam. But this certainly isn't urgent.

eric-czech commented 3 years ago

> > In 67ee2aa, I removed the default temp store, added a better error when it's not present, and added a NotImplementedError when the source is Xarray and the executor is anything but dask. I think I should probably do the same for when the source is da.Array. Does this sound right to you both?
>
> This sounds fine to me for now.

Ok, https://github.com/pangeo-data/rechunker/pull/52/commits/8502a33df6b5c5d5416b9b81ed268a2daa4c75e8 adds a similar error for dask array sources.

> Long term, I do think it could make sense to pass an xarray.Dataset backed by multi-threaded dask into alternative executors, such as Beam. But this certainly isn't urgent.

I see, maybe the error in https://github.com/pangeo-data/rechunker/pull/52#discussion_r498457119 is actually pretty superficial? I can't tell whether or not that's hinting at a fundamental limitation.

eric-czech commented 3 years ago

Is there anything else you guys think I should address on this one?

eric-czech commented 3 years ago

Hey @shoyer sorry to keep bugging you about this one, but is there anything else you'd like me to change?

rabernat commented 3 years ago

Hi @eric-czech. Thanks for your work on this! And thanks for your patience.

I'm fine with merging now. I assume issues will come up as people try it out, and we can iterate as needed.

eric-czech commented 3 years ago

Thanks @rabernat! And for your suggestions @shoyer.