h2oai / db-benchmark

reproducible benchmark of database-like ops
https://h2oai.github.io/db-benchmark
Mozilla Public License 2.0

Add Datafusion solution [updated] #240

Open matthewmturner opened 2 years ago

matthewmturner commented 2 years ago

Updated PR to get Datafusion added to benchmarks.

Right now it is missing group by queries 6, 8, and 9. I am going to look into those missing queries and then start looking into the flow / required output.

Let me know if there is anything in particular that would make it easier for you to add this :)

One question: can someone confirm that this can be run with cargo? Similar to the work @Dandandan did (I picked up from there), I am running the queries with the commands below:

# Group By
RUSTFLAGS='-C target-cpu=native' cargo +nightly run --bin groupby --release

# Join
RUSTFLAGS='-C target-cpu=native' cargo +nightly run --bin join --release
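For anyone wiring these two commands into a script-driven runner, here is a minimal sketch of launching them from Python. The helper names (`cargo_command`, `cargo_env`, `run_benchmark`) are made up for illustration and are not part of the benchmark's existing utilities; only the cargo arguments and the `RUSTFLAGS` value come from the commands above.

```python
import os
import subprocess

# Hypothetical helpers, not part of db-benchmark: they only rebuild the
# two cargo invocations quoted in this comment.

def cargo_command(binary, toolchain="nightly"):
    """Return the argv for `cargo +<toolchain> run --bin <binary> --release`."""
    return ["cargo", f"+{toolchain}", "run", "--bin", binary, "--release"]

def cargo_env(base=None):
    """Copy an environment and set RUSTFLAGS the same way the shell commands do."""
    env = dict(os.environ if base is None else base)
    env["RUSTFLAGS"] = "-C target-cpu=native"
    return env

def run_benchmark(binary):
    """Run one benchmark binary; raises CalledProcessError on a non-zero exit."""
    return subprocess.run(cargo_command(binary), env=cargo_env(), check=True)

# Usage (requires cargo and a nightly toolchain to be installed):
#   run_benchmark("groupby")
#   run_benchmark("join")
```

Passing the flags through `env` rather than a shell string avoids quoting issues with the space inside `RUSTFLAGS`.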
matthewmturner commented 2 years ago

@Dandandan FYI, I took a first stab at group by q8.

q1 took 62 ms
q2 took 322 ms
q3 took 1230 ms
q4 took 61 ms
q5 took 1242 ms
q7 took 1262 ms
q8 took 2733 ms
q10 took 24071 ms

Results are currently similar to Spark.

Dandandan commented 2 years ago

Nice! The Spark solution uses DESC ordering btw, so I guess that's what we should use.
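To illustrate why the ordering direction matters, here is a small sketch of a "largest two values per group" query, which is what q8 is assumed to be here (the query text itself is not quoted in this thread). It runs against an in-memory SQLite table rather than Datafusion or Spark, purely so it is self-contained; the table name `x` and the columns `id6` and `v3` are assumptions modeled on the benchmark's naming, and the sample rows are invented.

```python
import sqlite3

# In-memory toy table; names and rows are illustrative assumptions.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE x (id6 INTEGER, v3 REAL)")
conn.executemany(
    "INSERT INTO x VALUES (?, ?)",
    [(1, 5.0), (1, 9.0), (1, 7.0), (2, 1.0), (2, None), (2, 3.0)],
)

# Rank rows within each id6 group by v3 in DESCENDING order, then keep the
# first two ranks. With ASC ordering the same filter would return the two
# smallest values instead of the two largest.
query = """
SELECT id6, v3 AS largest2_v3
FROM (
    SELECT id6, v3,
           ROW_NUMBER() OVER (PARTITION BY id6 ORDER BY v3 DESC) AS order_v3
    FROM x
    WHERE v3 IS NOT NULL
) AS ranked
WHERE order_v3 <= 2
ORDER BY id6, largest2_v3 DESC
"""
rows = conn.execute(query).fetchall()
print(rows)
```

Window functions need SQLite 3.25 or newer, which recent Python builds bundle.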

matthewmturner commented 2 years ago

@Dandandan FYI, I migrated to the Python bindings, which should make integrating with their flow easier since I'm using the existing Python helpers.

I still have to migrate the join suite.

Let me know if you have any thoughts.

results below - something odd going on with Q10 maybe?

0.11225258399999993 # q1
0.695109333 # q2
2.932470125 # q3
0.07341450000000016 # q4
3.3075385419999996 # q5
2.9051008750000005 # q7
4.573697916 # q8
68.875322208 # q10
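As a quick sanity check on the q10 question, the two sets of timings in this thread can be compared per query. This sketch assumes the numbers above are in seconds (the unit is not stated) and that both runs used the same data and machine, which the thread does not confirm either.

```python
# Timings copied from this thread. The earlier cargo run reported milliseconds;
# the Python-bindings numbers are ASSUMED to be seconds.
rust_ms = {
    "q1": 62, "q2": 322, "q3": 1230, "q4": 61,
    "q5": 1242, "q7": 1262, "q8": 2733, "q10": 24071,
}
python_s = {
    "q1": 0.11225258399999993, "q2": 0.695109333, "q3": 2.932470125,
    "q4": 0.07341450000000016, "q5": 3.3075385419999996,
    "q7": 2.9051008750000005, "q8": 4.573697916, "q10": 68.875322208,
}

def slowdown(rust_ms, python_s):
    """Ratio of Python-bindings time to cargo-run time for each query."""
    return {q: python_s[q] / (rust_ms[q] / 1000.0) for q in rust_ms}

ratios = slowdown(rust_ms, python_s)
for q, r in ratios.items():
    print(f"{q}: {r:.2f}x")
```

Under those assumptions q10 comes out at roughly 2.9x, against roughly 1.2x to 2.7x for the other queries, so it is the largest ratio but not far outside the range of the rest.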
matthewmturner commented 2 years ago

@jangorecki I've made a number of updates, including adding datafusion to some of your utilities / runners, which will hopefully make your life easier.

Would you be able to check how close this is?

One thing I haven't been able to test locally is running against the larger datasets, so I'm not sure whether we will hit errors on those, or which ones. Do you have a recommendation for how to handle that?

Thanks for your help!

matthewmturner commented 2 years ago

Hi @jangorecki, just checking in on this and whether there is anything I can do to help.

As some additional context, datafusion has, or will soon have, several new features that will improve our query coverage and likely performance. From your perspective, would you rather we submit once those are all completed, or can we get the current submission merged as is and iterate from there?

Thanks!

jangorecki commented 2 years ago

I am no longer a maintainer of this project, as I don't work for H2O anymore. I would start by contacting the current maintainer of the project to make sure the effort you are going to undertake will be merged. H2O support is very helpful, so you should not have problems finding out who now takes care of the project. Aside from the support channel, you should also be able to reach H2O easily on Twitter etc. You can of course always make a fork and publish the results from your fork, as this is an open source project and there are no restrictions against that.

matthewmturner commented 2 years ago

@jangorecki thank you for your work on this and for letting us know :) I will reach out to H2O for support.