rajasekarv / vega

A new arguably faster implementation of Apache Spark from scratch in Rust
Apache License 2.0
2.23k stars 206 forks

Maybe use DataFusion and Apache Arrow as building blocks? #119

Open constantOut opened 4 years ago

constantOut commented 4 years ago

There is a competing project, https://github.com/ballista-compute/ballista, which uses DataFusion. I don't quite get why the Ballista examples use such unusual syntax for querying. I understand that distributed SQL execution is more complex than just combining results from individual executors, but I think having a single-node SQL engine would be a great help. What do you think?

rajasekarv commented 4 years ago

I plan to integrate with Python and possibly other languages (JVM and Go) using Arrow. Regarding DataFusion, however, the underlying architecture of this framework closely follows that of Spark, and its job execution is quite different from DataFusion's. So, unfortunately, we can't use it. Andy Grove built Ballista, a distributed framework around DataFusion, which is an interesting project to have a look at.