h2oai / db-benchmark

reproducible benchmark of database-like ops
https://h2oai.github.io/db-benchmark
Mozilla Public License 2.0

Integrate DuckDB #200

Closed hannes closed 3 years ago

hannes commented 3 years ago

This PR integrates DuckDB into the benchmark by way of its R package. Packages are installed from CRAN; the queries reuse the Spark SQL queries.

CC @jangorecki

jangorecki commented 3 years ago

Thank you for the PR, it looks great. It will help a lot to have duckdb in db-benchmark. I will make some adjustments; the biggest one I see now is to remove the helper functions so everything can be executed line by line. If I have any questions I will ask them here. We need to wait a couple of days because the benchmark machine is already occupied by another project.

jangorecki commented 3 years ago

I made a new branch so it is easier to push to than the remote "master".

jangorecki commented 3 years ago

Printing the dimensions of an answer has to be included in the timing; this forces evaluation of the lazy computation that some solutions might otherwise use. I have taken the queries out of the helper functions, so it will now look like this: https://github.com/h2oai/db-benchmark/pull/201/commits/ac6c29da3b0eb34d067523c28159e7b870c25bd0 If there is a way to get nrow and ncol in a single query, I could simplify that and reduce the number of queries sent to the db.
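For illustration, a hypothetical way to fetch both dimensions in one round trip, assuming the answer is materialized as a table named ans and that DuckDB exposes information_schema (not what the PR currently does):

dims <- DBI::dbGetQuery(con, "
  SELECT
    (SELECT count(*) FROM ans) AS nrow,
    (SELECT count(*) FROM information_schema.columns
     WHERE table_name = 'ans') AS ncol")
print(dims)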

jangorecki commented 3 years ago

@hannesmuehleisen how are operations like SUM handled in case of integer overflow? If the type used for ints is int32 (which is reasonable, and should be preferred so as not to suffer an extra penalty compared to other tools), then overflow will happen when calculating the checksum on bigger data sizes (1e9); we normally use integer64 in those cases. Is there support for pulling int64 types into R?

hannes commented 3 years ago

DuckDB decides what type the SUM result has based on column statistics. BIGINT/int64 results are currently returned to R as doubles.
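A minimal sketch of that behavior (the exact type promotion may differ by version):

library(DBI)
con <- dbConnect(duckdb::duckdb())
dbExecute(con, "CREATE TABLE t (x INTEGER)")
dbExecute(con, "INSERT INTO t VALUES (2147483647), (1)")  # sum exceeds int32
ans <- dbGetQuery(con, "SELECT SUM(x) AS s FROM t")
class(ans$s)  # "numeric": the widened integer result arrives in R as a double
dbDisconnect(con, shutdown = TRUE)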

jangorecki commented 3 years ago

@hannesmuehleisen duckdb is now running. So far I noticed that groupby q8 gives duckdb the biggest problem. Computing this query on the 1e8 data size takes 1000s, which puts duckdb in the slowest position for this particular query, behind dask, which needs ~700s. Half of the tools solve it in under 30s. If you are looking for things to improve, this one seems a good candidate. Please shout if it doesn't match your local testing.

hannes commented 3 years ago

Yay! Q8 is much improved in our latest dev builds. It should resolve itself once that makes it to CRAN.

jangorecki commented 3 years ago

@hannesmuehleisen Good to hear. If you have a release channel for your development version that is considered stable-ish (well tested, so broken builds should not happen), we can plug that into db-benchmark instead of using CRAN releases. This was the initial idea for this project, but most solutions do not provide stable-ish devel releases. We currently support this for data.table and python datatable.

jangorecki commented 3 years ago

@hannesmuehleisen Benchmark finished.

groupby 1e9 could not be completed due to an out of memory error. The CSV is slightly under 50GB, so a machine with 128GB should generally be able to compute it in-memory. For now I will change duckdb to run groupby 1e9 using on-disk storage. After the next benchmark run this should be reflected in the report. If you reduce memory usage in the future please let me know, and I will revert this change: https://github.com/h2oai/db-benchmark/pull/201/commits/58ab74f6fc3d172214662d197bb807acbe529c64
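For reference, the switch is roughly a matter of pointing the connection at a database file instead of the default in-memory database (the path below is illustrative):

# in-memory (default)
con <- DBI::dbConnect(duckdb::duckdb())
# on-disk storage, illustrative path
con <- DBI::dbConnect(duckdb::duckdb(), dbdir = "/tmp/duckdb-groupby-1e9.db")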

join 1e9 was already attempted using on-disk storage, but all 3 data cases of that size were killed by the OS OOM killer.

$ more out/run_duckdb_join_J1_1e9_*.err
::::::::::::::
out/run_duckdb_join_J1_1e9_NA_0_0.err
::::::::::::::
Killed
::::::::::::::
out/run_duckdb_join_J1_1e9_NA_0_1.err
::::::::::::::
Killed
::::::::::::::
out/run_duckdb_join_J1_1e9_NA_5_0.err
::::::::::::::
Killed

In all 3 cases it happened when doing the first join query, so the data is being "loaded" properly. If there is any debug setting I can switch on to dump more logs somewhere, just tell me how and I will do it.

jangorecki commented 3 years ago

@hannesmuehleisen I updated the exceptions and duckdb is now on the report. If you have any preference about colors we can adjust them. It is currently set as follows: https://github.com/h2oai/db-benchmark/blob/duckdb/_benchplot/benchplot-dict.R#L45

hannes commented 3 years ago

Great, it's now on the list! Wrt color, perhaps something a bit more readable? I don't really have a preference.

jangorecki commented 3 years ago

Does duckdb store column statistics like the mean? I see q4 is much faster than any other solution.

hannes commented 3 years ago

It does not; we have some statistics like min/max or whether there are NULLs, but those are only upper/lower bounds so we can plan better.

jangorecki commented 3 years ago

Any idea why q4 is so much faster?

hannes commented 3 years ago

Probably the perfect hashing we use when we know from the stats that there can only be a few groups.
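Loosely speaking, when min/max statistics bound the keys to a small range, each key maps directly to an aggregate slot, so no general hash table (with collision handling and key comparisons) is needed. An illustrative sketch in R of the idea, not DuckDB internals:

# keys known from stats to lie in [lo, hi]
lo <- 1L; hi <- 100L
keys <- sample(lo:hi, 1e6L, replace = TRUE)
vals <- runif(1e6L)
slot <- keys - lo + 1L                 # the "perfect hash": key -> slot, 1:1
acc <- numeric(hi - lo + 1L)
for (i in seq_along(keys)) acc[slot[i]] <- acc[slot[i]] + vals[i]
# no collisions and no key comparisons; slot s maps back to key lo + s - 1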

hannes commented 3 years ago

Looks great. One thing I don't get is why the 5 GB groupby q3 is so much slower than q5; they should be similar, no?

jangorecki commented 3 years ago

Grouping is done by a different column, so a difference is to be expected.

jangorecki commented 3 years ago

duckdb timings are in some cases already very impressive, congratulations @hannesmuehleisen.

After the 0.2.6 release (or if we decide to switch to devel) I will re-run duckdb; groupby 1e9 will then use on-disk storage, so duckdb 1e9 timings should be displayed on the report as well.

Be sure to check https://h2oai.github.io/db-benchmark/#explore-more-data-cases at the bottom of the report. There are more benchmark plots linked there, covering different cardinalities of the id columns, missing values, and pre-sorted data. You can also obtain all the timing data by appending /time.csv (and /logs.csv) to the report website URL.
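For example, from R (column names below are assumed from the report's csv):

time <- read.csv("https://h2oai.github.io/db-benchmark/time.csv")
logs <- read.csv("https://h2oai.github.io/db-benchmark/logs.csv")
head(subset(time, solution == "duckdb" & task == "groupby"))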

We also have https://h2oai.github.io/db-benchmark/history.html (duckdb will appear there soon) for tracking performance regressions, but it is an internal-only report, not really meant to be published. Only the "latest" timings are meant to be presented publicly on the main report, but the history plots can be very useful for developers when they suspect a performance regression.

Thanks for joining this project!

hannes commented 3 years ago

@jangorecki 0.2.6 made it to CRAN today, curious about the benchmark perf changes!

jangorecki commented 3 years ago

@hannesmuehleisen It is now running. So far I noticed that the NA data case (G1_1e7_1e2_5_0) crashed during q10 with the following error:

Error in duckdb_execute(res) : duckdb_execute_R: Failed to run query
Error: Out of Range Error: SUM is out of range!
Calls: system.time ... dbSendQuery -> .local -> duckdb_result -> duckdb_execute
Timing stopped at: 0.201 0 0.007
Execution halted

jangorecki commented 3 years ago

Sorry, I made a mistake when looking at the logs. 1e7 q10 did finish: q10 run1 was logged to timings, but q10 run2 wasn't, so the error must have occurred during the checksum computation.

Then on the 1e8 NA data case I observed that the second run of q10 never ran; the crash came after the first run of q10. Run1's timing was not logged, so the error again occurred during the checksum computation.
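For context, the checksum is, presumably, a SUM over the answer's numeric columns, so the failing step would have roughly the shape of the hypothetical query below; on an answer of ~1e8 rows such a SUM could itself go out of range, which would match the error:

# hypothetical shape only; table and column names are illustrative
chk <- DBI::dbGetQuery(con, "SELECT SUM(v1) AS v1, SUM(v3) AS v3 FROM ans")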

log outputs for completeness:

1e7

[1] 9216    3            ## q9 run1 dim
[1] 9216    3            ## q9 run2 dim
    id2 id4           r2 ## q9 ans head
1 id084  26 0.0026820621
2 id076  28 0.0006597825
3 id079  48 0.0004708965
    id2 id4          r2  ## q9 ans tail
1 id023  94 0.002061448
2 id029  64 0.003592627
3 id085   7 0.001361118
[1] 9999993       8      ## q10 run1 dim
[1] 9999993       8      ## q10 run2 dim
Error in duckdb_execute(res) : duckdb_execute_R: Failed to run query
Error: Out of Range Error: SUM is out of range!
Calls: system.time ... dbSendQuery -> .local -> duckdb_result -> duckdb_execute
Timing stopped at: 0.201 0 0.007
Execution halted

1e8

[1] 9216    3            ## q9 run1 dim
[1] 9216    3            ## q9 run2 dim
    id2 id4           r2 ## q9 ans head
1 id050  89 8.477036e-05
2 id038   3 4.091803e-06
3 id063  79 6.383826e-05
    id2 id4           r2 ## q9 ans tail
1 id079  77 4.393387e-05
2 id029  64 2.185473e-04
3 id085   7 3.726668e-05
[1] 99999338        8    ## q10 run1 dim
Error in duckdb_execute(res) : duckdb_execute_R: Failed to run query
Error: Out of Range Error: SUM is out of range!
Calls: system.time ... dbSendQuery -> .local -> duckdb_result -> duckdb_execute
Timing stopped at: 1.529 0.01 0.042
Execution halted

jangorecki commented 3 years ago

The 1e9 groupby, which goes on-disk, was killed by OOM during q2.

# groupby-duckdb.R
loading dataset G1_1e9_1e2_0_0
[1] "using disk memory-mapped data storage"
[1] 1e+09
grouping...
[1] 100   2
[1] 100   2
    id1       v1
1 id039 30000168
2 id035 30003917
3 id080 29979880
    id1       v1
1 id067 29996312
2 id022 29994847
3 id086 30003608
Killed

@hannesmuehleisen If you believe the 0.2.6 release could improve handling of the 1e9 data size, which was previously running OOM, I can switch back to in-memory processing.

jangorecki commented 3 years ago

All the joins fail early, after the second question, with the same internal error:

Error in duckdb_execute(res) : duckdb_execute_R: Failed to run query
Error: Out of Range Error: SUM is out of range!
Calls: system.time ... dbSendQuery -> .local -> duckdb_result -> duckdb_execute
Timing stopped at: 1.529 0.01 0.042
Execution halted

jangorecki commented 3 years ago

@hannesmuehleisen I filed 3 issues in the duckdb repo; two of them are regressions in 0.2.6. Sorry I haven't narrowed those down to minimal reproducible reports, but those three reports, with the tracking down and comparison, already took a while.

hannes commented 3 years ago

Thanks for filing them, will address asap.