h2oai / db-benchmark

reproducible benchmark of database-like ops
https://h2oai.github.io/db-benchmark
Mozilla Public License 2.0

DataFusion solution [WIP] #182

Closed Dandandan closed 2 years ago

Dandandan commented 3 years ago

WIP PR to add DataFusion (https://github.com/h2oai/db-benchmark/issues/107) as a solution.

@jangorecki is there any documentation regarding the required output? As this is a Rust solution, it cannot easily reuse the prepared code.

Dandandan commented 3 years ago

Thanks for the feedback. I replaced the variables with ans and combined them in one go. I am not sure I will find enough time for the write_log this week.

@jangorecki if you work on it, here are some tips:

jangorecki commented 3 years ago

Any idea whether DataFusion supports the queries required for the advanced groupby questions (q6-q10)?

Dandandan commented 3 years ago

I added query 7 and query 10. The others, I think, need features that are not implemented yet (median, window functions, etc.). Query 10 is ridiculously slow though; that will improve a bit once a PR has been merged, but it will probably still be slow after that. The benchmarks showed that we have some more work to do! I also have an open PR that will improve performance on the easier queries. I think DataFusion might already be close to clickhouse / data.table or even CuDF for group by queries 1 and 4.

The inner join queries should all work too, I think (I might add them later; it is easy, as I can just reuse the clickhouse queries). There is a known bug for left joins which gives wrong output.
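
As a rough illustration of the two group-by questions mentioned above, here is a minimal sketch of what q7 and q10 compute. It uses Python's built-in sqlite3 purely as a stand-in, not DataFusion, and the table name `x`, the column names (id1..id6, v1..v3), the aliases, and the query wording are assumptions based on the benchmark's usual group-by data layout, not taken from this PR.

```python
import sqlite3

# Tiny in-memory stand-in for the group-by table. Table and column names
# are assumptions, not copied from the PR.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE x (id1 TEXT, id2 TEXT, id3 TEXT, "
    "id4 INT, id5 INT, id6 INT, v1 INT, v2 INT, v3 REAL)"
)
rows = [
    ("a", "x", "p", 1, 1, 1, 5, 2, 1.5),
    ("a", "x", "p", 1, 1, 1, 3, 7, 2.5),
    ("b", "y", "q", 2, 2, 2, 4, 1, 4.0),
]
conn.executemany("INSERT INTO x VALUES (?,?,?,?,?,?,?,?,?)", rows)

# q7 (assumed wording): max(v1) - min(v2) per id3 group.
Q7 = (
    "SELECT id3, MAX(v1) - MIN(v2) AS range_v1_v2 "
    "FROM x GROUP BY id3 ORDER BY id3"
)

# q10 (assumed wording): sum(v3) and row count grouped by all six id columns.
Q10 = (
    "SELECT id1, id2, id3, id4, id5, id6, SUM(v3) AS v3, COUNT(*) AS cnt "
    "FROM x GROUP BY id1, id2, id3, id4, id5, id6 ORDER BY id1"
)

q7_result = conn.execute(Q7).fetchall()
q10_result = conn.execute(Q10).fetchall()
print(q7_result)
print(q10_result)
```

Grouping by all six id columns gives q10 a very large number of groups on real data, which is one plausible reason it is much slower than the other queries.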

matthewmturner commented 2 years ago

@Dandandan anything I can do to help finalize the work on this?

Dandandan commented 2 years ago

Yeah sure, help appreciated!

I think what's missing is:

matthewmturner commented 2 years ago

@Dandandan ok! Will check it out. Any additional info you could provide on what the standard flow is?

matthewmturner commented 2 years ago

@Dandandan do you have a preference for how I push my updates here?

I started the work here https://github.com/matthewmturner/db-benchmark/tree/datafusion/datafusion as a fork of what you were doing.

I've updated how the tables are created and added the join queries. I still need to review them in more detail to make sure they are correct and see if I can add any of the missing group by queries, and I assume we'll want to test the larger datasets as well. Let me know if you have any thoughts.

After the above I'll start looking into the flow more.

right now these are the results i get when running the benchmarks:

group by
q1 took 56 ms
q2 took 289 ms
q3 took 1305 ms
q4 took 69 ms
q5 took 1158 ms
q7 took 1198 ms
q10 took 24691 ms

join
q1 took 261 ms
q2 took 367 ms
q3 took 334 ms
q4 took 507 ms
q5 took 1936 ms
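
The thread does not say how these timings were taken. A minimal sketch of one way to produce lines in the same "qN took X ms" shape is shown below, in Python with only the standard library; `time_query` is a hypothetical helper, not part of db-benchmark or DataFusion, and the lambda stands in for a real query.

```python
import time


def time_query(label, run_query):
    """Run run_query once; return (result, elapsed whole milliseconds)."""
    start = time.perf_counter()
    result = run_query()
    elapsed_ms = round((time.perf_counter() - start) * 1000)
    print(f"{label} took {elapsed_ms} ms")
    return result, elapsed_ms


# Placeholder workload standing in for an actual benchmark query.
result, elapsed_ms = time_query("q1", lambda: sum(range(1000)))
```

A single run like this includes any one-off warm-up cost, so repeated runs would be needed before comparing numbers across solutions.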

Dandandan commented 2 years ago

Thank you 💯. Maybe you could open a new PR with the combined changes so we can continue it there?

matthewmturner commented 2 years ago

@Dandandan Sure, sounds good!

Dandandan commented 2 years ago

Follow-up PR:

https://github.com/h2oai/db-benchmark/pull/240