Closed Dandandan closed 2 years ago
Thanks for the feedback. I replaced the variables with ans
and combined them in one go. I am not sure if I will find enough time for write_log
this week.
@jangorecki if you work on it, here are some tips:
Run it with cargo run --release. Plain cargo run executes it in debug mode, which makes the program very slow.
The lto = true line in Cargo.toml speeds up the binary but slows the build down, so it is better to remove it while developing.
Any idea if DataFusion supports the queries that are required for the advanced groupby questions (q6-q10)?
I added query 7 and query 10. The others, I think, need features that are not implemented yet (median, window functions, etc). Query 10 is ridiculously slow though; that will improve a bit once a PR has been merged, but it will probably still be slow after that. The benchmarks showed that we have some more work to do! I also have an open PR that will improve performance on the easier queries. I think DataFusion might already be close to clickhouse / data.table or even CuDF for group by queries 1 and 4.
The inner join queries should all work too I think (I might add them later, it is easy as I can just reuse the clickhouse queries). There is a known bug for left joins which gives wrong output.
@Dandandan anything I can do to help finalize the work on this?
Yeah sure, help appreciated!
I think what's missing is:
@Dandandan ok! Will check it out. Any additional info you could provide on what the standard flow is?
@Dandandan do you have a preference for how I push my updates here?
I started the work here https://github.com/matthewmturner/db-benchmark/tree/datafusion/datafusion as a fork of what you were doing.
I've updated how the tables are created and added the join queries. I still need to review them in more detail and make sure they're correct, and see if I can add any of the missing group by queries. I assume we'll want to test the larger datasets as well, but let me know if you have any thoughts.
After the above I'll start looking into the flow more.
Right now these are the results I get when running the benchmarks:
group by
q1 took 56 ms
q2 took 289 ms
q3 took 1305 ms
q4 took 69 ms
q5 took 1158 ms
q7 took 1198 ms
q10 took 24691 ms
join
q1 took 261 ms
q2 took 367 ms
q3 took 334 ms
q4 took 507 ms
q5 took 1936 ms
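For reference, lines in the "qN took X ms" shape above can be produced with a small wall-clock timer around each query. The following is only a minimal sketch under assumptions: the helper names time_query and format_timing are hypothetical and do not come from the PR, and the closure is a stand-in workload rather than a real DataFusion query.

```rust
use std::time::{Duration, Instant};

/// Hypothetical helper (not from the PR): runs `query` once and returns its
/// result together with the elapsed wall-clock time.
fn time_query<T>(query: impl FnOnce() -> T) -> (T, Duration) {
    let start = Instant::now();
    let result = query();
    (result, start.elapsed())
}

/// Formats a timing line in the same shape as the results listed above.
fn format_timing(name: &str, elapsed: Duration) -> String {
    format!("{} took {} ms", name, elapsed.as_millis())
}

fn main() {
    // Stand-in workload; a real benchmark would execute a query here instead.
    let (sum, elapsed) = time_query(|| (0..1_000_000u64).sum::<u64>());
    println!("{}", format_timing("q1", elapsed));
    // Printing the result keeps the compiler from optimizing the work away.
    println!("checksum: {}", sum);
}
```

Numbers from a timer like this are only meaningful when the binary is built with cargo run --release, since a debug build is much slower.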
Thank you 💯. Maybe you could open a new PR with the combined changes so we can continue it there?
@Dandandan Sure sounds good!
Follow up PR:
WIP PR to add DataFusion (https://github.com/h2oai/db-benchmark/issues/107) as a solution.
@jangorecki is there any documentation regarding the required output? As this is a Rust solution, it cannot easily reuse the prepared code.