h2oai / db-benchmark

reproducible benchmark of database-like ops
https://h2oai.github.io/db-benchmark
Mozilla Public License 2.0

Add JuliaDB #63

Closed skanskan closed 4 years ago

skanskan commented 5 years ago

It would be great to also see results for JuliaDB or FastGroupBy.

jangorecki commented 5 years ago

FastGroupBy is unlikely due to https://github.com/xiaodaigh/FastGroupBy.jl/issues/7#issuecomment-425861502. As for JuliaDB, is it faster than DataFrames? We generally want in-memory tools, as they will be faster; out-of-memory tools make sense once we scale the benchmark up to data sizes that cannot fit into memory.

skanskan commented 5 years ago

Sometimes it is faster: https://discourse.julialang.org/t/group-by-performance-benchmarks-and-recommendations/9313

jangorecki commented 5 years ago

I've seen this thread. I expect FastGroupBy to be reused in DataFrames once it is mature enough. The Julia folks are very responsive and well versed in algorithms, so I think they should catch up on speed in coming releases. If someone is willing to PR a script for JuliaDB, I will be happy to try it out. But if we focus on speed, and therefore on in-memory computation, then DataFrames should be preferred over JuliaDB, at least in theory.

skanskan commented 5 years ago

I don't know how it works, but maybe if the data is small enough JuliaDB will use in-memory algorithms.

jangorecki commented 4 years ago

Even if it uses in-memory algorithms, in theory it shouldn't be faster than DataFrames, at least if we assume those projects cooperate with each other. Ideally I would like JuliaDB to be reused internally in DataFrames when it runs out of memory. I will close this request for now; we can always re-open it in the future. What would especially encourage re-opening it are timings showing that it is faster than DataFrames.jl on our db-benchmark questions.