pangeo-data / rechunker

Disk-to-disk chunk transformation for chunked arrays.
https://rechunker.readthedocs.io/
MIT License

Separate scheduling API from dask implementation #30

Closed shoyer closed 4 years ago

shoyer commented 4 years ago

xref #29

This PR moves the Dask-specific scheduling logic into a separate dask.py file, as a first step toward supporting alternative schedulers. (I'm particularly interested in supporting Apache Beam.)

The existing tests pass (with minor modifications), but the documentation still needs updating.

Notes:

codecov[bot] commented 4 years ago

Codecov Report

Merging #30 into master will increase coverage by 2.02%. The diff coverage is 96.96%.

@@            Coverage Diff             @@
##           master      #30      +/-   ##
==========================================
+ Coverage   90.30%   92.33%   +2.02%     
==========================================
  Files           2        5       +3     
  Lines         196      274      +78     
  Branches       45       57      +12     
==========================================
+ Hits          177      253      +76     
  Misses         10       10              
- Partials        9       11       +2     
Impacted Files                   Coverage Δ
rechunker/api.py                 90.26% <88.88%> (-2.87%) ↓
rechunker/executors/dask.py      100.00% <100.00%> (ø)
rechunker/executors/python.py    100.00% <100.00%> (ø)
rechunker/types.py               100.00% <100.00%> (ø)

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data. Last update 270a107...6732366.

rabernat commented 4 years ago

Stephan, it is really great to get this PR. I'm so happy that you have found time to contribute to rechunker.

Overall, I really like your design. I will try to find time in the next few days (realistically, early next week) for a more thorough review. In the meantime, maybe @TomAugspurger can have a look?

shoyer commented 4 years ago

> Stephan, it is really great to get this PR. I'm so happy that you have found time to contribute to rechunker.

Thank you for releasing this tool in the first place! This fills an important niche for our current project, so I'm excited to be able to work with you on it.

rabernat commented 4 years ago

Just a thought that occurred to me last night: it would be awesome to implement a Prefect scheduler as well.

shoyer commented 4 years ago

I implemented a second executor just using Python. It's only ~15 lines of code and should be a useful reference.

Looking at the two executors (Dask and Python), it felt like a class would be appropriate to codify the interface. So now we have a (very lightweight) Executor class.
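
For readers following along, here is a rough sketch of what a lightweight executor interface with a pure-Python implementation could look like. It is an illustration only, not the code from this PR: `CopyTask`, the method names `prepare_plan` and `execute_plan`, and the use of plain lists in place of chunked arrays are all assumptions made for the example.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Callable, Generic, Iterable, MutableSequence, Sequence, TypeVar

# Type of the scheduler-specific plan object (a callable here; a task
# graph or pipeline in other hypothetical executors).
Plan = TypeVar("Plan")


@dataclass
class CopyTask:
    """Hypothetical description of one copy step (not rechunker's own type)."""

    source: Sequence[int]
    target: MutableSequence[int]
    chunk_size: int


class Executor(ABC, Generic[Plan]):
    """Assumed interface: build a plan first, run it later."""

    @abstractmethod
    def prepare_plan(self, tasks: Iterable[CopyTask]) -> Plan:
        """Translate copy tasks into a scheduler-specific plan."""

    @abstractmethod
    def execute_plan(self, plan: Plan) -> None:
        """Run a plan produced by prepare_plan."""


class PythonExecutor(Executor[Callable[[], None]]):
    """Reference executor: the plan is a closure that copies chunk by chunk."""

    def prepare_plan(self, tasks: Iterable[CopyTask]) -> Callable[[], None]:
        task_list = list(tasks)

        def run() -> None:
            for task in task_list:
                # Copy one chunk at a time, in order, with no parallelism.
                for start in range(0, len(task.source), task.chunk_size):
                    stop = start + task.chunk_size
                    task.target[start:stop] = task.source[start:stop]

        return run

    def execute_plan(self, plan: Callable[[], None]) -> None:
        plan()


if __name__ == "__main__":
    source = list(range(10))
    target = [0] * 10
    executor = PythonExecutor()
    plan = executor.prepare_plan([CopyTask(source, target, chunk_size=4)])
    executor.execute_plan(plan)  # nothing is copied until this call
    print(target == source)
```

Splitting plan construction from execution is what would let a Dask or Beam executor return its own graph or pipeline object from the first step, while the caller decides when and where to run it.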

shoyer commented 4 years ago

This is ready for a full review.

I'm intentionally not documenting the Executor class interface in the docs for now, because I suspect it will change as we write the next few executors. For now, anyone who wants to explore it should be willing to dive into the source code, and ideally would submit their Executor upstream into Rechunker itself!