rajasekarv / vega

A new, arguably faster, implementation of Apache Spark from scratch in Rust
Apache License 2.0

Use this as a backend for Java Spark? #49

Closed: Hoeze closed this issue 4 years ago

Hoeze commented 4 years ago

As Spark uses RDDs under the hood, would it be possible, and would it make sense, to use native_spark as the backend for the official Java Spark version?

iduartgomez commented 4 years ago

This is unlikely. The in-memory data structure implementations are wildly different, so it would require adaptation on both sides, but more than anything it would mean working with the JVM, and especially the JNI, to translate the in-memory data layout from Rust into something the JVM can interpret. It would be a lot of work to turn this into something useful. Keep in mind we are already stretching things so we can safely execute distributed code using the same Rust compiler for the same (but not any other!) binary.
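For a sense of what that JVM/JNI boundary would involve, here is a minimal, hypothetical sketch using the `jni` crate; the Java class and method names (`org.example.NativeRdd`, `count`) are invented for illustration and exist in neither project.

```rust
// Hypothetical JNI entry point exposing a Rust-side RDD handle to the JVM.
// Every value crossing this boundary has to be translated between Rust's
// and the JVM's in-memory representations, which is where the cost lies.
use jni::objects::JClass;
use jni::sys::jlong;
use jni::JNIEnv;

#[no_mangle]
pub extern "system" fn Java_org_example_NativeRdd_count(
    _env: JNIEnv,
    _class: JClass,
    handle: jlong, // opaque pointer to a Rust-owned RDD
) -> jlong {
    // Look up the Rust RDD behind `handle`, run the job, and hand back a
    // primitive the JVM understands; anything richer than primitives needs
    // an explicit layout translation.
    let _ = handle;
    0
}
```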

It would be fairly limiting on the Rust side too, because we would require a stable ABI and layout (i.e. usage of repr(C)) so we can safely cast types.
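As a concrete illustration of the layout point (both structs below are hypothetical):

```rust
// Default Rust layout: the compiler is free to reorder and pad fields, so
// no other binary or language can rely on where each field ends up.
struct RecordRustLayout {
    id: u64,
    score: f32,
    flag: bool,
}

// repr(C): field order and padding follow the C rules, giving a stable
// layout that foreign code could read. Committing every shared type to
// this is the limitation mentioned above.
#[repr(C)]
struct RecordStableLayout {
    id: u64,
    score: f32,
    flag: bool,
}
```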

There is hope with the DataFrame API, where we could leverage something like Arrow to make the two compatible. DataFrame typing is easier to work with: it is less flexible and already constrained, so types can be translated from one language to another via a language-neutral in-memory representation (hence Arrow).
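A rough sketch of that idea with the `arrow` crate (the schema and column names are invented for the example): the data is laid out once in Arrow's columnar format, which the Java and Python Arrow implementations can read without a bespoke translation layer.

```rust
use std::sync::Arc;

use arrow::array::{Float64Array, Int64Array};
use arrow::datatypes::{DataType, Field, Schema};
use arrow::error::ArrowError;
use arrow::record_batch::RecordBatch;

fn build_batch() -> Result<RecordBatch, ArrowError> {
    // Schema for a hypothetical two-column DataFrame partition.
    let schema = Arc::new(Schema::new(vec![
        Field::new("id", DataType::Int64, false),
        Field::new("value", DataType::Float64, false),
    ]));

    // Columnar buffers in Arrow's standard in-memory format; the same bytes
    // are interpretable by the Java and Python Arrow libraries.
    let ids = Int64Array::from(vec![1_i64, 2, 3]);
    let values = Float64Array::from(vec![0.5_f64, 1.5, 2.5]);

    RecordBatch::try_new(schema, vec![Arc::new(ids), Arc::new(values)])
}
```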

Coming back to RDDs, we may be able to pull off some magic where, in a fork of Spark, an efficient binary message protocol (like Cap'n Proto) is used to send data between processes. But again, this would require users to specify/describe their data types, which, at that point, would probably be more trouble than it's worth.
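As a loose analogue of what that user-supplied type description might look like, here is a sketch using serde plus bincode rather than Cap'n Proto (the `Record` type is invented); with Cap'n Proto the description would live in a `.capnp` schema and generated code instead of derive attributes.

```rust
use serde::{Deserialize, Serialize};

// The user has to describe every type that crosses the process boundary;
// here the description is a serde derive, with Cap'n Proto it would be a
// schema file.
#[derive(Serialize, Deserialize, Debug, PartialEq)]
struct Record {
    key: String,
    count: u64,
}

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let original = Record { key: "a".into(), count: 3 };

    // Encode into a compact binary message another process could decode,
    // provided it shares the same type description.
    let bytes = bincode::serialize(&original)?;
    let decoded: Record = bincode::deserialize(&bytes)?;

    assert_eq!(original, decoded);
    Ok(())
}
```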

Hoeze commented 4 years ago

Thanks for your comprehensive answer. The reason I'm asking is to find out whether it would be possible to keep all the high-level Spark libraries, like deltalake or Spark SQL, while also benefiting from the low-level advantages of native_spark.

iduartgomez commented 4 years ago

You are welcome! As said, maybe something could be worked out in the long run if you are using the DataFrame API, and by extension Spark SQL. We would probably have to wrap Scala with a helper library, but it may be possible.

On the other hand, if you are using PySpark, the chances that you will be able to keep things as they are and just start using this one as a backend by simply changing the import statement are much higher, at least initially, since we have more flexibility when interacting with Python.

iduartgomez commented 4 years ago

Closing this for now then.