rajasekarv / vega

A new arguably faster implementation of Apache Spark from scratch in Rust
Apache License 2.0

Count approximate (and partial jobs) #101

Closed iduartgomez closed 4 years ago

iduartgomez commented 4 years ago

WIP for partial jobs (and impl of count approximate)

close #93

iduartgomez commented 4 years ago

@rajasekarv this is ready to go. I tested all changes in distributed mode as well, since I had to do some refactoring around the scheduler. I'll leave it open instead of merging in case you want to check it.

There is one thing that is not yet done properly: the implementation of BoundDouble. In one of the counter's cases it requires the inverse CDF of the Poisson distribution to find the confidence range, and there is no pure Rust statistical library that implements it (and writing the numerical algorithm here would be beyond the scope of the issue/PR). I decided to merge it anyway because I was accumulating way too many changes; the count is computed and returned regardless, and this is not a breaking change. I may open a PR later against the library I pulled in (which looks like a good fit and candidate to use) to implement it.
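For readers unfamiliar with the missing piece, here is a minimal sketch of what an inverse CDF (quantile function) for the Poisson distribution computes, and how a two-sided confidence range could be read off it. This is not the code from this PR nor the API of any existing crate: `poisson_inverse_cdf` and `poisson_interval` are hypothetical names, and the naive summation is an assumption chosen for clarity. It underflows for very large rates, which is part of why a proper numerical implementation belongs in a statistics library.

```rust
/// Smallest `k` such that P(X <= k) >= p for X ~ Poisson(lambda).
///
/// Hypothetical helper for illustration only. It sums the probability
/// mass function term by term, so it is only suitable for moderate
/// `lambda` (exp(-lambda) underflows to zero for very large rates).
fn poisson_inverse_cdf(lambda: f64, p: f64) -> Option<u64> {
    // Reject parameters outside the domain this sketch handles.
    if !(lambda > 0.0 && lambda.is_finite()) || !(0.0..1.0).contains(&p) {
        return None;
    }

    let mut pmf = (-lambda).exp(); // P(X = 0)
    if pmf == 0.0 {
        return None; // underflow: lambda is too large for this approach
    }

    let mut cdf = pmf;
    let mut k: u64 = 0;
    while cdf < p {
        k += 1;
        // Recurrence: P(X = k) = P(X = k - 1) * lambda / k
        pmf *= lambda / k as f64;
        if pmf == 0.0 {
            // The tail vanished numerically before the CDF reached `p`.
            return None;
        }
        cdf += pmf;
    }
    Some(k)
}

/// Two-sided range `(low, high)` covering roughly `confidence` of the
/// probability mass, taken from the quantiles at (1 - confidence) / 2
/// and (1 + confidence) / 2. Hypothetical helper, assumed for this sketch.
fn poisson_interval(lambda: f64, confidence: f64) -> Option<(u64, u64)> {
    let low = poisson_inverse_cdf(lambda, (1.0 - confidence) / 2.0)?;
    let high = poisson_inverse_cdf(lambda, (1.0 + confidence) / 2.0)?;
    Some((low, high))
}
```

With something like this available, a bounded result could carry the estimate together with the low and high values of the range. How the rate is derived from the partial count is left out here, since the thread does not spell it out.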

This PR adds some extras, like the start of the joblistener (which will be useful in the future for adding metrics, etc.). It wasn't strictly necessary, but I got a bit carried away while implementing things following the Spark codebase haha.

rajasekarv commented 4 years ago

@iduartgomez Awesome. As there is quite a substantial amount of additions and changes, let me have a look at it.