h2oai / db-benchmark

reproducible benchmark of database-like ops
https://h2oai.github.io/db-benchmark
Mozilla Public License 2.0

Add Rust's DataFusion (arrow) #107

Open andygrove opened 4 years ago

andygrove commented 4 years ago

DataFusion is an in-memory query engine that uses Apache Arrow as the memory model. It supports executing SQL queries against CSV and Parquet files as well as querying directly against in-memory data.

DataFusion supports projection, selection, and simple aggregate queries.

https://github.com/apache/arrow/tree/master/rust/datafusion
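For readers unfamiliar with the terms, the three query shapes named above (projection, selection, simple aggregate) can be sketched in SQL as follows. This is only an illustration: it uses Python's built-in `sqlite3` module as a stand-in engine, not DataFusion, and the table and column names (`x`, `id1`, `v1`) and the sample rows are made up for the example.

```python
import sqlite3

# Stand-in engine: SQLite from the standard library, NOT DataFusion.
# Table name, column names, and rows are hypothetical sample data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE x (id1 TEXT, v1 INTEGER)")
conn.executemany(
    "INSERT INTO x VALUES (?, ?)",
    [("a", 1), ("b", 2), ("a", 3), ("b", 4), ("c", 5)],
)

# Projection: keep only some of the columns.
projection = conn.execute("SELECT id1 FROM x ORDER BY rowid").fetchall()

# Selection: keep only the rows that match a predicate.
selection = conn.execute(
    "SELECT id1, v1 FROM x WHERE v1 > 2 ORDER BY v1"
).fetchall()

# Simple aggregate: one summary value per group.
aggregate = conn.execute(
    "SELECT id1, SUM(v1) FROM x GROUP BY id1 ORDER BY id1"
).fetchall()

conn.close()

print(projection)
print(selection)
print(aggregate)
```

The same SQL text should be portable to most engines that accept standard SQL; only the setup code around it is specific to SQLite.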

jangorecki commented 4 years ago

Thanks for filing the request. I would appreciate it if someone could ping me here when it supports joins.

andygrove commented 4 years ago

Here are the latest benchmarks for GROUP BY. I think this is mature enough to consider adding here, but it doesn't support JOIN yet. Is that a prerequisite for getting it on this site?

https://andygrove.io/rust_bigdata_benchmarks/

jangorecki commented 4 years ago

Definitely not a prerequisite. It looks competitive. Should one expect similar performance from other tools that use Arrow as a backend? In that case we would be benchmarking Arrow via its Rust interface. That still makes sense; I am just asking to understand better.

andygrove commented 4 years ago

That's a good question, and I don't really have a good answer. The only other Arrow-based query engine that I know about is Dremio (Java-based), and it would be interesting to see benchmarks for that too. The Arrow project is in the process of building a C++ query engine, but AFAIK that isn't ready yet.

Good to hear that join support is optional. I expect DataFusion will support joins eventually, but it's not the highest priority right now.



andygrove commented 3 years ago

> Thanks for filing the request. I would appreciate it if someone could ping me here when it supports joins.

@jangorecki FYI, DataFusion 3.0.0 (due to be released any day now) supports joins.

jangorecki commented 3 years ago

@andygrove Thanks for the update. Note that another Rust-based solution, Polars, was merged recently. The process was very smooth because the author of Polars submitted groupby and join benchmark scripts in a PR, which helped a lot. Writing those scripts properly is not an easy job, because I need to figure out not only how to answer the questions but also how to answer them in the most performant way.

MrPowers commented 1 year ago

@jangorecki - can I submit a pull request with the DataFusion script to help with the process?

andygrove commented 1 year ago

If it helps, we could even publish a dedicated Rust crate containing the DataFusion h2o benchmarks.

c21 commented 1 year ago

It would be great to add DataFusion to the benchmark!