h2oai / db-benchmark

reproducible benchmark of database-like ops
https://h2oai.github.io/db-benchmark
Mozilla Public License 2.0
320 stars 85 forks

update Julia benchmarks #232

Closed bkamins closed 1 year ago

bkamins commented 2 years ago

The objective of this PR is to finalize the changes in https://github.com/h2oai/db-benchmark/pull/230 and also implement some code changes:

  1. use a loop instead of duplicating code for running the same command twice
  2. use WeakRefStrings.jl instead of String or Symbol (I get repeated requests to add it manually; it will be the default once CSV.jl gets a 1.0 release)
  3. clean up `;` at end of line (so that it is present only for top-level statements)
  4. consistent handling of pooling across join and grouping tasks
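Point 1 can be sketched roughly as follows; this is a minimal Julia illustration, not the actual benchmark script (the toy DataFrame and the grouping question are placeholders):

```julia
using DataFrames

# Toy stand-in for the benchmark input data.
df = DataFrame(id1 = repeat(["a", "b"], 5), v1 = 1:10)

# Run the same command twice via a loop instead of pasting the
# timing block twice verbatim, keeping both timings.
timings = Float64[]
for _ in 1:2
    t = @elapsed combine(groupby(df, :id1), :v1 => sum)
    push!(timings, t)
end
```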
bkamins commented 2 years ago

@jangorecki - I opened this PR as we have not run the benchmarks for DataFrames.jl for some time (mostly because we do not have a CSV.jl 1.0 release yet) and people are asking for the benchmarks to be re-run. Thank you!

CC @nalimilan @quinnj @cmey

jangorecki commented 2 years ago

Thank you for the PR. Could you please revert point 1 of this PR? There were ideas to use similar techniques in other tools, but I would like to keep the scripts reproducible interactively, line by line, wherever possible (it is not possible for dask and clickhouse). This makes it easier to debug and also to match corresponding code between solutions or languages. Note that this project is now maintained by @mattdowle

bkamins commented 2 years ago

Done - thank you!

@mattdowle - if you have any additional questions or comments, please let me know. In probably 1-2 months we will have Julia 1.7 and CSV.jl 1.0 releases, at which point I will open another PR to update the code to reflect these changes.

bkamins commented 2 years ago

@quinnj - can you please review whether the current setup of CSV reading is correct? I still have WeakRefStrings.jl as a dependency, since CSV.jl does not export the string types yet. Thank you!
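For reference, a hedged sketch of what such a setup could look like with CSV.jl 0.9.x, assuming its `stringtype` keyword and the `PosLenString` type imported from WeakRefStrings.jl (the file name is a placeholder; the real benchmark scripts may differ):

```julia
using CSV, DataFrames, WeakRefStrings

# WeakRefStrings.jl is still needed explicitly, because CSV.jl does not
# yet export the string types it produces.
df = CSV.read("data.csv", DataFrame;
              stringtype = PosLenString)  # assumption: CSV.jl 0.9.x keyword
```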

quinnj commented 2 years ago

Yeah, LGTM

bkamins commented 2 years ago

@mattdowle - if you plan to re-run the benchmark, it would be great if this PR were merged first. Thank you!

PallHaraldsson commented 2 years ago

Is it premature to merge this and benchmark?

Julia 1.6.3 is out and 1.7 is close. I can't see here whether it would use CSV.jl 0.9.x, which I believe would provide a speedup. Julia 1.6.3 may not be faster than 1.6.2 (or 1.6.1), used here, but we might as well use the latest, so that people will not be in doubt. Do you know if 1.7-rc1 is faster? If so, maybe wait a bit longer for its release?

bkamins commented 2 years ago

I think the objective of this PR is to show the current state of DataFrames.jl performance (the versions of the packages etc. are always a moving target, which is why in the past the benchmarks were regularly re-run).

> Julia 1.6.3 is out

Indeed this PR requires an update - I will make the changes.

> I can't see here if it would use CSV.jl 0.9.x

It would.

> Do you know if the 1.7-rc1 is faster?

Given the planned changes in 1.7, I do not see any significant reason for it to be faster. Compilation latency will probably be a bit lower, but that does not matter much for large benchmarks (which are the most interesting ones).

oscardssmith commented 2 years ago

Now that Julia 1.7 and DataFrames 1.3 are out, I think it is time to update and merge this.

bkamins commented 2 years ago

I do not think these benchmarks have been maintained since early summer 2021.

oscardssmith commented 2 years ago

@jangorecki can you confirm? If so, would you be willing to add a maintainer to the repo so the benchmark can continue to be updated? This is a really good set of benchmarks, and it would be a shame to lose it.

jangorecki commented 2 years ago

Unfortunately I am no longer working at H2O, so I have access neither to the repository to set a new maintainer nor to the machine to run the benchmark. You may try reaching out to @srisatish about finding a new maintainer.

bkamins commented 2 years ago

Closing, as it seems the benchmarks are no longer maintained.

jangorecki commented 2 years ago

I think having the PR open is fine. If the project is unmaintained, as it seems to be, it would be best to mention that at the top of the README file, as there are multiple developers from different projects awaiting a reply from H2O. I always suggested contacting H2O support, but have no idea what the status is.

bkamins commented 2 years ago

> I always suggested contacting H2O support, but have no idea what the status is.

I have tried contacting H2O with no response.