rajasekarv / vega

A new arguably faster implementation of Apache Spark from scratch in Rust
Apache License 2.0
2.23k stars 205 forks

Preliminary benchmarks? #33

Open LifeIsStrange opened 4 years ago

LifeIsStrange commented 4 years ago

Everything is in the title. I understand that the project is young and that it needs time to get faster than Spark. I'm just asking about the current state, out of curiosity.

iduartgomez commented 4 years ago

We first need to get easy cluster deployment and job submission working before we can automate and set up some benchmarks. However, preliminary results are very promising: https://medium.com/@rajasekar3eg/fastspark-a-new-fast-native-implementation-of-spark-from-scratch-368373a29a5c

I believe that, in the long run, where we can gain the most is in huge memory savings, which are a big deal for Spark-like applications and workloads. Also, once we start to optimize, a lot of improvements can be applied at a lower level. However, an expert user could already, in theory, exploit a lot of facts about running RDDs in a language compiled to machine instructions: access to fine-tuned memory layout for data structures and algorithms, SIMD instructions, even GPU computation (still a bit early in Rust, but it is there).

You know, all the goodies a closer-to-the-metal language gives you (you could replicate some things on the JVM, but it is probably painful).
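To make the memory layout point concrete, here is a minimal sketch using only the Rust standard library. The `Reading` type and the `payload_bytes` helper are hypothetical names made up for illustration and do not come from the vega code base. The idea shown is that a slice of plain structs is stored inline and contiguously, so its element data costs exactly the size of the struct times the number of elements, with no per-element header or pointer.

```rust
use std::mem::size_of;

/// A hypothetical record type, invented for this example.
/// It holds two 4-byte fields, so one value occupies 8 bytes.
#[derive(Clone, Copy)]
struct Reading {
    sensor_id: u32,
    value: f32,
}

/// Bytes of element data occupied by a slice of readings.
/// Elements are stored inline, one after another.
fn payload_bytes(readings: &[Reading]) -> usize {
    readings.len() * size_of::<Reading>()
}
```

Under these assumptions, a million readings take about 8 MB of element data. How this compares with a given JVM setup depends on the object layout and collection types used there, so treat it as an illustration and not as a measurement.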

rajasekarv commented 4 years ago

As @iduartgomez said, there is no additional cost associated with RDDs here, unlike in Spark. Well, there are still some unnecessary allocations made to keep everything in safe Rust as much as possible, but technically you can already optimize RDDs to achieve pretty good performance.

Task serialization is a bit costly here compared to Spark for now, and shuffle tasks, when spilled to disk, can be a lot slower here because the block manager is pretty naive (no compression, etc.). But for CPU-intensive tasks, you can get very good performance here even at this stage.

A lot of functionality that would make this easy for someone new to Rust is still missing. You can try local mode to see for yourself, but you will need to write a lot of the file-reading logic yourself. Maybe something like a naive PageRank algorithm is a good candidate for benchmarking, since both versions would look almost the same.
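For reference, a naive PageRank can be sketched in plain Rust as below. This is a single-machine version that deliberately does not use the vega API; the function name `page_rank`, the adjacency-list input, and the handling of pages without outgoing links are all assumptions made for this example. It assumes every link target is a valid index into `links`.

```rust
/// Naive PageRank over an adjacency list: `links[i]` lists the pages that
/// page `i` links to. Returns one rank per page; the ranks sum to 1 up to
/// floating-point error. Assumes every target index is less than `links.len()`.
fn page_rank(links: &[Vec<usize>], damping: f64, iterations: usize) -> Vec<f64> {
    let n = links.len();
    if n == 0 {
        return Vec::new();
    }
    // Start from a uniform distribution.
    let mut ranks = vec![1.0 / n as f64; n];
    for _ in 0..iterations {
        // Every page starts the round with the "random jump" share.
        let mut next = vec![(1.0 - damping) / n as f64; n];
        for (page, outgoing) in links.iter().enumerate() {
            if outgoing.is_empty() {
                // Dangling page: spread its rank evenly over all pages.
                let share = damping * ranks[page] / n as f64;
                for r in next.iter_mut() {
                    *r += share;
                }
            } else {
                // Split this page's rank evenly among its link targets.
                let share = damping * ranks[page] / outgoing.len() as f64;
                for &target in outgoing {
                    next[target] += share;
                }
            }
        }
        ranks = next;
    }
    ranks
}
```

A call such as `page_rank(&links, 0.85, 50)` runs 50 rounds with the commonly used damping factor of 0.85. A distributed version would express the inner loop as a join between links and ranks followed by a reduce by key, which is the part where shuffle cost would show up in a benchmark.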

rajasekarv commented 4 years ago

Please let us know if you find the Rust version unreasonably slow. That would be useful for us, and we could probably look into it.

iduartgomez commented 4 years ago

One area we could explore in the future is map-like tasks where shuffling is not required and the computation can be kept local. For those, each node's local executors could just run the computation for their partitions in a rayon-like thread pool. This way we could safely use references, avoid as much allocation as possible, and avoid serialization altogether.

In general, for any sort of task that can be run in parallel but locally, we should exploit this fact, and I believe it would be very beneficial. However, we still have a lot of stuff to cover (both functionally and ergonomically) before we get to make improvements like those (although it is the kind of stuff that makes the project exciting in the long run).
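A rough sketch of the local, reference-based idea, assuming nothing about how vega's executors are actually structured: the hypothetical helper `map_partitions_local` below borrows the partitions and maps over them on scoped threads, so nothing is cloned or serialized to hand work to a thread. It uses `std::thread::scope` from the standard library as a stand-in for a rayon-like pool, and spawns one thread per partition where a real pool would reuse a fixed set of workers.

```rust
use std::thread;

/// Applies `f` to every element of every partition, running one scoped
/// thread per partition. The partitions and the closure are only borrowed,
/// so no input data is copied or serialized.
fn map_partitions_local<T, U, F>(partitions: &[Vec<T>], f: F) -> Vec<Vec<U>>
where
    T: Sync,
    U: Send,
    F: Fn(&T) -> U + Sync,
{
    // Share the closure by reference so every thread can call it.
    let f = &f;
    thread::scope(|scope| {
        // Spawn all workers first so the partitions are processed concurrently.
        let handles: Vec<_> = partitions
            .iter()
            .map(|partition| scope.spawn(move || partition.iter().map(f).collect::<Vec<U>>()))
            .collect();
        // Join in order, so output partition i corresponds to input partition i.
        handles
            .into_iter()
            .map(|handle| handle.join().expect("worker thread panicked"))
            .collect()
    })
}
```

For example, `map_partitions_local(&parts, |x| x * 2)` doubles every element while `parts` stays usable afterwards, because it was only borrowed. The scope guarantees that all threads finish before the borrow ends, which is what makes using references safe here.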