Open matthewmturner opened 2 years ago
@Dandandan fyi took a first stab at group by q8.
q1 took 62 ms
q2 took 322 ms
q3 took 1230 ms
q4 took 61 ms
q5 took 1242 ms
q7 took 1262 ms
q8 took 2733 ms
q10 took 24071 ms
results currently similar to spark
Nice! The Spark solution has DESC ordering btw, I guess that's what we should use.
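To illustrate why the ordering direction matters here, below is a minimal pure-Python sketch. It assumes q8 is the "largest two `v3` per `id6`" group-by question; the function name `largest_two_per_group` and the sample rows are hypothetical, and this is not the DataFusion or Spark query itself. With descending order the first two values per group are the largest ones, while ascending order would return the smallest.

```python
from collections import defaultdict

def largest_two_per_group(rows):
    """Return the two largest non-null values for each key (hypothetical helper)."""
    groups = defaultdict(list)
    for key, value in rows:
        # Skip nulls, mirroring a "where v3 is not null" filter.
        if value is not None:
            groups[key].append(value)
    # Sort descending (the DESC ordering) and keep the first two per group.
    return {key: sorted(values, reverse=True)[:2] for key, values in groups.items()}

# Made-up sample data: (id6, v3) pairs.
rows = [("a", 1.0), ("a", 5.0), ("a", 3.0), ("b", 2.0), ("b", None)]
result = largest_two_per_group(rows)
print(result)
```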
@Dandandan FYI I migrated to the Python bindings, which should make integrating with their flow easier as I'm using the existing Python helpers.
I still have to migrate the join suite.
Let me know if you have any thoughts.
results below - something odd going on with Q10 maybe?
0.11225258399999993 # q1
0.695109333 # q2
2.932470125 # q3
0.07341450000000016 # q4
3.3075385419999996 # q5
2.9051008750000005 # q7
4.573697916 # q8
68.875322208 # q10
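The figures above appear to be elapsed seconds per query. As a rough sketch of how such numbers might be collected from Python (an assumption, not the benchmark's actual helper), one could wrap each query in a wall-clock timer. `time_query` and the stand-in workload below are hypothetical; a real run would call into the DataFusion Python bindings instead.

```python
import time

def time_query(run_query, label):
    """Run a zero-argument callable and print its elapsed seconds (hypothetical helper)."""
    start = time.perf_counter()
    result = run_query()
    elapsed = time.perf_counter() - start
    # Same "<seconds> # <label>" layout as the results listed above.
    print(f"{elapsed} # {label}")
    return result, elapsed

# Stand-in workload; replace with the actual query execution.
result, elapsed = time_query(lambda: sum(range(1000)), "q1")
```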
@jangorecki I've made a number of updates, including adding DataFusion to some of your utilities / runners, which will hopefully make your life easier.
Would you be able to see how close this is?
One thing I haven't been able to test locally is running against the larger datasets, so I'm not sure if / what errors we may get on those. Do you have a recommendation for how to handle that?
thanks for your help!
Hi @jangorecki - just checking in on this and whether there is anything I can do to help.
As some additional context, DataFusion has / will soon have several new features that will improve our query coverage and likely performance. From your perspective, would you rather we submit once those are all completed, or can we get the current submission merged as is and iterate from there?
thanks!
I am no longer a maintainer of this project as I don't work for H2O anymore. I would start by contacting the current maintainer of the project to ensure that the effort you are going to undertake will be merged in. H2O support is very helpful, so you should not have problems finding out who now takes care of the project. Aside from the support channel, you should also be able to reach H2O easily on Twitter etc. You can of course always make a fork and publish the results of your fork, as this is an open source project and there are no restrictions on doing so.
@jangorecki thank you for your work on this and for letting us know :) I will reach out to H2O for support.
Updated PR to get DataFusion added to benchmarks.
Right now it is missing group by queries 6, 8, and 9. I am going to look into those missing queries and then start looking into the flow / required output.
Let me know if anything in particular would make your life easier to add this :)
One question - can someone confirm that this can be run with cargo? Similar to the work @Dandandan did (I picked up from there), I am running the queries with the below commands: