Closed shoyer closed 4 years ago
Merging #30 into master will increase coverage by
2.02%
. The diff coverage is96.96%
.
@@ Coverage Diff @@
## master #30 +/- ##
==========================================
+ Coverage 90.30% 92.33% +2.02%
==========================================
Files 2 5 +3
Lines 196 274 +78
Branches 45 57 +12
==========================================
+ Hits 177 253 +76
Misses 10 10
- Partials 9 11 +2
Impacted Files | Coverage Δ | |
---|---|---|
rechunker/api.py | 90.26% <88.88%> (-2.87%) |
:arrow_down: |
rechunker/executors/dask.py | 100.00% <100.00%> (ø) |
|
rechunker/executors/python.py | 100.00% <100.00%> (ø) |
|
rechunker/types.py | 100.00% <100.00%> (ø) |
Continue to review full report at Codecov.
Legend - Click here to learn more
Δ = absolute <relative> (impact)
,ø = not affected
,? = missing data
Powered by Codecov. Last update 270a107...6732366. Read the comment docs.
Stephan it is really great to get this PR. I'm so happy that you have found time to help contribute to rechunker.
Overall, I really like your design. I will try to find time in the next few days (realistically, early next week) for a more thorough review. In the meantime, maybe @TomAugspurger can have a look?
Stephan it is really great to get this PR. I'm so happy that you have found time to help contribute to rechunker.
Thank you for releasing this tool in the first place! This fills an important niche for our current project, so I'm excited to be able to work with you on it.
Just a thought that occurred to me last night: it would be awesome to implement a prefect scheduler as well.
I implemented a second executor just using Python. It's only ~15 lines of code and should be a useful reference.
Looking at the two executors (Dask and Python), it felt like a class would be appropriate to codify the interface. So now we have a (very lightweight) Executor
class.
This is ready for a full review.
I'm intentionally not documenting the Executor class interface in the docs for now, because I suspect it will change as we write the next few executors. For now, anyone who wants to explore it should be willing to dive into the source code, and ideally would submit their Executor upstream into Rechunker itself!
xref #29
This PR moves the dask specific scheduling logic into a separate
dask.py
file, as a first step for adding support for alternative schedulers. (I'm particularly interested in supporting Apache Beam.)The existing tests pass (with minor modifications), but the documentation still needs updating.
Notes:
staged_copy
into a single function, but perhaps there are other generic methods (execute
?) that would justify using a class?Rechunked
no longer inherits fromdask.delayed.Delayed
, and no longer has any dask specific logic at all. I think this is important for generic scheduler support, but it does means make it a little less reusable in larger pipelines._delayed
is currently a private attribute, but we should probably expose the scheduler equivalent of "delayed" objects in some way. I guess this is a use-case for class-based interface from the previous bullet.Rechunked
now always contains zarr arrays/groups rather than dask arrays. This makes the repr a little less informative, e.g., it no longer shows chunk size. This should probably be fixed before merging.staged_copy
supports any number of stages (in theory). That might be useful in the future, or it might be unnecessary complexity.