pangeo-data / rechunker

Disk-to-disk chunk transformation for chunked arrays.
https://rechunker.readthedocs.io/
MIT License

more informative repr of rechunking plan #6

Open rabernat opened 4 years ago

rabernat commented 4 years ago

Right now we just return a delayed object. I think instead we should return an object called a RechunkingPlan. This object could expose useful parameters and implement the dask dunder methods, allowing us to write code like

    plan = rechunk_zarr2zarr(...)
    dask.compute(plan)

(Unfortunately I can't find the docs on those dunder methods.)

We could have an HTML repr with a table containing information like

etc.
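A minimal sketch of what such an HTML table could look like. The fields shown (source/target chunks) are illustrative placeholders, not rechunker's actual API:

```python
def plan_repr_html(source_chunks, target_chunks):
    """Render a small HTML table summarizing a rechunking plan.

    The field names here are illustrative, not rechunker's API.
    """
    rows = [
        ("Source chunks", source_chunks),
        ("Target chunks", target_chunks),
    ]
    body = "".join(
        f"<tr><th>{name}</th><td>{value}</td></tr>" for name, value in rows
    )
    return f"<table>{body}</table>"

html = plan_repr_html((1000, 100), (100, 1000))
print(html)
```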

TomAugspurger commented 4 years ago

(Unfortunately I can't find the docs on those dunder methods.)

I think https://docs.dask.org/en/latest/custom-collections.html has what we want.
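For reference, those docs describe a protocol of dunder methods that any object can implement to become a dask collection. A toy sketch, with a placeholder graph standing in for a real rechunking task graph:

```python
import dask
from dask.threaded import get as threaded_get


class RechunkingPlan:
    """Toy sketch of dask's custom-collection protocol.

    The graph and key here are placeholders; a real plan would hold
    the rechunking task graph plus metadata like source/target chunks.
    """

    def __init__(self, dsk, key):
        self._dsk = dsk
        self._key = key

    def __dask_graph__(self):
        return self._dsk

    def __dask_keys__(self):
        return [self._key]

    def __dask_tokenize__(self):
        return self._key

    @staticmethod
    def __dask_optimize__(dsk, keys, **kwargs):
        # No custom graph optimization in this sketch.
        return dsk

    # Default scheduler used when none is specified.
    __dask_scheduler__ = staticmethod(threaded_get)

    def __dask_postcompute__(self):
        # After execution, unwrap the single result for our one key.
        return (lambda results: results[0]), ()

    def __dask_postpersist__(self):
        def rebuild(dsk, key):
            return RechunkingPlan(dsk, key)

        return rebuild, (self._key,)


plan = RechunkingPlan({"store": (sum, [1, 2, 3])}, "store")
(result,) = dask.compute(plan)
```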

TomAugspurger commented 4 years ago

I thought briefly about having that method return a dask.Array that would read from the eventual files. But that doesn't feel quite right, since .compute() would have a strange meaning (does it mean kick off the computation, or bring the results into memory, or both?)

rabernat commented 4 years ago

I thought briefly about having that method return a dask.Array that would read from the eventual files

:-1:

I think we should focus exclusively on storing data here. We should make it clear in the docs how to read the target data, but shouldn't actually return it.

Should we implement a custom collection?

TomAugspurger commented 4 years ago

You might try just subclassing Delayed first,

from dask.delayed import Delayed

class RechunkingPlan(Delayed):
    @property
    def source_chunks(self):
        ...

    def _repr_html_(self):
        ...

I'm not 100% sure if this would work, but it'll be less effort than implementing a custom collection.
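For what it's worth, a quick runnable sketch suggests subclassing Delayed does work; the chunk metadata and HTML fields below are illustrative, not rechunker's API:

```python
import dask
from dask.delayed import Delayed


class RechunkingPlan(Delayed):
    """Sketch: a Delayed subclass carrying extra plan metadata.

    source_chunks/target_chunks are illustrative, not rechunker's API.
    """

    def __init__(self, key, dsk, source_chunks=None, target_chunks=None):
        super().__init__(key, dsk)
        self._source_chunks = source_chunks
        self._target_chunks = target_chunks

    @property
    def source_chunks(self):
        return self._source_chunks

    def _repr_html_(self):
        return (
            "<table>"
            f"<tr><th>Source chunks</th><td>{self._source_chunks}</td></tr>"
            f"<tr><th>Target chunks</th><td>{self._target_chunks}</td></tr>"
            "</table>"
        )


# Wrap a toy delayed computation standing in for a rechunk graph.
d = dask.delayed(sum)([1, 2, 3])
plan = RechunkingPlan(d.key, d.dask, source_chunks=(1000,), target_chunks=(100,))
(result,) = dask.compute(plan)
```

Since the subclass inherits Delayed's collection protocol, `dask.compute(plan)` executes the wrapped graph while the extra properties and HTML repr come along for free.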