h2oai / db-benchmark

reproducible benchmark of database-like ops
https://h2oai.github.io/db-benchmark
Mozilla Public License 2.0
321 stars 85 forks source link

name functions used for aggregations in Julia #165

Closed bkamins closed 3 years ago

bkamins commented 3 years ago

In Julia a recommended style is to define a function that is used for aggregation. I hope such a change is OK with the standards these benchmarks want to follow.

bkamins commented 3 years ago

@jangorecki - also the current version of Julia is 1.5.3 (the benchmarks have 1.5.0) - it is nothing major but this also probably can be changed (I would do it myself but I remember that there were several places where this had to be changed and I was afraid that I would forget something).

jangorecki commented 3 years ago

Thanks. Is it just style or there is something more about it? Having one liner is easier and more clear to present syntax on benchmark plot. We have a double liner example for python datatable join where first a key to be set on a table before joining: https://h2oai.github.io/db-benchmark/join/J1_1e7_NA_0_0_basic.png whenever possible I think it is better to use one liners.

bkamins commented 3 years ago

The issue is that you are executing statements in top-level code (global scope). Normally you write performance sensitive code in Julia within a functions (this affects performance a bit as it is related to how compilation is handled).

I have updated the proposal to use "one liner syntax" as originally, but wrapped in the function. Would this work for you (I am not sure how you are generating the labels for the plots).

bkamins commented 3 years ago

@nalimilan - could you just double-check please if I have not made a typo? Thank you!

jangorecki commented 3 years ago

I would prefer to have this function created inside @elapsed block, but then the whole change would be probably pointless.

t = @elapsed(range_fun(x) = combine(groupby(x, :id3), [:v1, :v2] => ((v1, v2) -> maximum(v1)-minimum(v2))) => :range_v1_v2); ANS = range_fun(x)); println(size(ANS)); flush(stdout);
bkamins commented 3 years ago

Yes - the point is not to compile the same thing twice (which happens in global scope). Anyway - if it is not feasible then just please close this PR. This is not something major.

bkamins commented 3 years ago

Also - DataFrames.jl 0.22.0 release is out but I assume package versions are upgraded automatically - right?

jangorecki commented 3 years ago

Yes, julia is upgrading automatically. For current moment cron job to run benchmark is disabled, once it will be enabled julia should get upgraded automatically on the next run.