WIP PR for review. This PR makes all the code paths related to the distributed executors effectively async. Async functions have been limited to compute/iter/run and shuffle-related code, but futures are handed to the `*_any` conversion functions, which require specialization to work properly. Effectively everything is async from the entry point at `Executor::worker(self)`, for both the local and distributed schedulers.
We still operate over regular iterators though (not streams), so this doesn't break our previous sync API dramatically. This should be sufficient, as most data is received in large chunks when deserialized from tasks, cache, shuffle, etc., so there is not much to gain from converting to a stream-based API.
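To illustrate the "async function, regular iterator" shape described above, here is a minimal sketch using only the standard library. All names (`fetch_chunk`, `compute`, `run`, `block_on`) are hypothetical and not taken from this PR; `block_on` is a toy stand-in for whatever async runtime actually drives the futures.

```rust
use std::future::Future;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};
use std::thread::{self, Thread};

// Toy waker that unparks the thread blocked in `block_on`.
struct ThreadWaker(Thread);

impl Wake for ThreadWaker {
    fn wake(self: Arc<Self>) {
        self.0.unpark();
    }
}

// Minimal single-future executor (assumption: a real runtime replaces this).
fn block_on<F: Future>(fut: F) -> F::Output {
    let mut fut = Box::pin(fut);
    let waker = Waker::from(Arc::new(ThreadWaker(thread::current())));
    let mut cx = Context::from_waker(&waker);
    loop {
        match fut.as_mut().poll(&mut cx) {
            Poll::Ready(value) => return value,
            Poll::Pending => thread::park(),
        }
    }
}

// Hypothetical async source: the whole chunk arrives at once,
// as it would after deserializing a task, cache entry or shuffle block.
async fn fetch_chunk(partition: usize) -> Vec<u32> {
    (0..3).map(|i| (partition * 10 + i) as u32).collect()
}

// Hypothetical compute step: async to obtain the data,
// but it hands back a plain (Send) iterator rather than a stream.
async fn compute(partition: usize) -> Box<dyn Iterator<Item = u32> + Send> {
    let chunk = fetch_chunk(partition).await;
    Box::new(chunk.into_iter().map(|x| x + 1))
}

// Sync-looking entry point: callers still consume an ordinary iterator.
fn run(partition: usize) -> Vec<u32> {
    block_on(compute(partition)).collect()
}

fn main() {
    println!("{:?}", run(2));
}
```

Only acquiring the chunk is awaited; everything downstream stays on the `Iterator` trait, which is why the existing sync API is largely unaffected.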
One minor caveat is that iterators must now be `Send` explicitly, but this was already the case implicitly, as all the `Data` being passed around must be `Send` anyway. This allows us to:
1) Interoperate seamlessly across the whole function stack when computing tasks, even though not every function is async (as it is not required).
2) More importantly, avoid intermediate allocations and just chain iterators as before.
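As a rough sketch of the two points above, the following shows boxed `Send` iterators being chained lazily and then moved to another thread. The alias `SendIter` and the functions `parent_compute`, `mapped_compute` and `sum_on_worker` are hypothetical names for illustration, not the types used in this PR.

```rust
use std::thread;

// Hypothetical alias: a boxed iterator that may cross thread boundaries.
type SendIter<T> = Box<dyn Iterator<Item = T> + Send>;

// Hypothetical stand-in for a parent partition's compute step.
fn parent_compute() -> SendIter<i32> {
    Box::new(1..=5)
}

// Hypothetical map-like step: wraps the parent lazily, no intermediate Vec.
fn mapped_compute(parent: SendIter<i32>) -> SendIter<i32> {
    Box::new(parent.map(|x| x * 2))
}

fn sum_on_worker() -> i32 {
    let iter = mapped_compute(parent_compute());
    // Moving the iterator into the spawned thread compiles only
    // because the `Send` bound is stated explicitly on the trait object.
    thread::spawn(move || iter.sum::<i32>()).join().unwrap()
}

fn main() {
    println!("{}", sum_on_worker());
}
```

Without the `+ Send` bound on the trait object, the `thread::spawn` call would be rejected by the compiler, which is the same constraint a multi-threaded async runtime imposes on values held across task boundaries.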
There is still some work to do:
[x] Debug an issue in the `CoGroupedRdd` code path that makes DAGs that depend on this op fail (e.g. join).
[x] Polish the executor initialization itself (there is a FIXME).
[x] (Optional, can be done later) Remove thread pool initialization from the schedulers' job run (this part can also be made async easily now), task polling, etc.
[ ] (Optional, can be done later) Handle/optimize the configuration of the async runtimes, task spawning, etc. a little better.