rajasekarv / vega

A new arguably faster implementation of Apache Spark from scratch in Rust
Apache License 2.0
2.23k stars 206 forks source link

Async executor #67

Closed iduartgomez closed 4 years ago

iduartgomez commented 4 years ago

WIP PR for review. This PR makes all the code paths related to the distributed executors effectively async. Async functions have been limited to compute/iter/run and shuffle related things, but futures are handled to the *_any conversion functions which require specialization to function properly. Effectively everything is async from the entry point at ' Executor::worker(self)`, and both for the local and distributed schedulers.

We still function over regular iterators though (not streams), so this doesn't break our previous sync API dramatically and this should be suffitient as most data will be received in large chunks when deserialized from tasks, cache, shuffle, etc. so there is not much to gain to converting to a stream based API.

One minor caveat is that Iterators must be now Send explicitly, but this was already the case implicitly as all the Data being passed around must be Send anyway. This allow for: 1) Seamlessly interop in the whole function stack when computing tasks, event hough not every func is async (as it not required). 2) More importantly, we don't require intermediate allocations and we can just chain iterators as previously.

There still some work to do: