voltrondata-labs / arrowbench

R package for benchmarking
Other
13 stars 9 forks source link

Update the TPC-H answers to be compatible with the new data from DuckDB > 0.8 #136

Closed jonkeane closed 1 year ago

jonkeane commented 1 year ago

This PR changes the TPC-H answers to conform to data generated from DuckDB 0.8.

In https://github.com/duckdb/duckdb/pull/6535 the TPC-H data generation functions in DuckDB (which arrowbench uses to generate TPC-H data) changed subtly. You can see in that PR that their accepted answers changed for these. We initially only saw this show up on our ephemeral runners because they (re)generate data each run.

Right before we merge this, we should also pause the arrow benchmarks scheduling job and delete the cached tpc-h data on the two runners that have it: ursa-i9-9960x and test-mac-arm (I @jonkeane can handle doing that)

jonkeane commented 1 year ago

@thisisnic Apparently I can't add you as a reviewer directly in the UI?

jonkeane commented 1 year ago

I tested all of the TPC-H benchmarks (well for Arrow, and only multi-core for time, since those are the ones we run in CI) locally at scale factor 1 and 10 and they passed (with the exception of query 21 which runs out of memory on my laptop at scale factor 10, but has always done that!)

jgehrcke commented 1 year ago

:pray: Nice.

In https://github.com/duckdb/duckdb/pull/6535 the TPC-H data generation functions in DuckDB (which arrowbench uses to generate TPC-H data) changed subtly.

Explainable change, great! :)

jonkeane commented 1 year ago

And for posterity, my debugging process was:

Install latest arrow release + arrowbench (using standard devtools::install(".") or install from the github repo for arrowbench)

And then confirmed that I see the error with one of the queries that was failing:

library(arrowbench)

run_benchmark(
    tpc_h,
    query_id = 2,
    scale_factor = 1,
    cpu_count = 8
)

Which had the same validation error.

I then looked at the actual data that was being produced. This is a bit bespoke to the TPC-H benchmarks since we need to both load up the query + ensure the data exists and be able to reference it from a file, but the code below does that. get_query_func() returns a function that takes as an argument an input function (which can be generated using get_input_func() this ensures that you've got a function that runs + returns the result of a specific query using the input data from a specific data format + scale factor setup to use the engine specified.

current_results <- get_query_func(engine = "arrow", query_id = 2)(input_func = get_input_func(query_id = 2, format = "parquet", engine = "arrow", scale_factor = 1))

answer <- tpch_answer(scale_factor = 1, query_id = 2)

And you can see they are indeed differing in the s_comment column:

> waldo::compare(current_results, answer)
old vs new
                                                                                                        s_comment
- old[1, ]   l, ironic instructions cajole                                                                       
+ new[1, ]   uriously regular requests hag                                                                       
- old[2, ]   es. furiously silent deposits among the deposits haggle furiously a                                 
+ new[2, ]   efully express instructions. regular requests against the slyly fin                                 
- old[3, ]   ar, regular requests nag blithely special accounts. final deposits impress carefully. ironic,       
+ new[3, ]   etect about the furiously final accounts. slyly ironic pinto beans sleep inside the furiously       
- old[4, ]   s sleep according to the quick requests. carefully                                                  
+ new[4, ]   ackages boost blithely. blithely regular deposits c                                                 
- old[5, ]   against the slyly daring requests. unusual accounts wake atop the blithely spe                      
+ new[5, ]   etect blithely bold asymptotes. fluffily ironic platelets wake furiously; blit                      
- old[6, ]   into beans haggle at the quickly final asymptotes. unusu                                            
+ new[6, ]    regular accounts. furiously unusual courts above the fi                                            
- old[7, ]   into beans haggle at the quickly final asymptotes. unusu                                            
+ new[7, ]    regular accounts. furiously unusual courts above the fi                                            
- old[8, ]   ly daring excuses unwind carefully above the fu                                                     
+ new[8, ]   rns wake final foxes. carefully unusual depende                                                     
- old[9, ]    slyly ironic, special requests. final instructions above the qu                                    
+ new[9, ]    the special excuses. silent sentiments serve carefully final ac                                    
- old[10, ]  odolites. blithely special requests above the regular foxes sleep unusual sauternes. care           
+ new[10, ]  ges. slyly regular requests are. ruthless, express excuses cajole blithely across the unu           
and 90 more ...

old$s_comment vs new$s_comment
- "l, ironic instructions cajole"
+ "uriously regular requests hag"
- "es. furiously silent deposits among the deposits haggle furiously a"
+ "efully express instructions. regular requests against the slyly fin"
- "ar, regular requests nag blithely special accounts. final deposits impress carefully. ironic,"
+ "etect about the furiously final accounts. slyly ironic pinto beans sleep inside the furiously"
- "s sleep according to the quick requests. carefully "
+ "ackages boost blithely. blithely regular deposits c"
- "against the slyly daring requests. unusual accounts wake atop the blithely spe"
+ "etect blithely bold asymptotes. fluffily ironic platelets wake furiously; blit"
- "into beans haggle at the quickly final asymptotes. unusu"
+ " regular accounts. furiously unusual courts above the fi"
- "into beans haggle at the quickly final asymptotes. unusu"
+ " regular accounts. furiously unusual courts above the fi"
- "ly daring excuses unwind carefully above the fu"
+ "rns wake final foxes. carefully unusual depende"
- " slyly ironic, special requests. final instructions above the qu"
+ " the special excuses. silent sentiments serve carefully final ac"
- "odolites. blithely special requests above the regular foxes sleep unusual sauternes. care"
+ "ges. slyly regular requests are. ruthless, express excuses cajole blithely across the unu"
and 90 more ...

Now that I realized that the query was running + the answers I was getting looked very similar except for the random text, I went to the duckdb repo and git blamed their answers there: https://github.com/duckdb/duckdb/tree/master/extension/tpch/dbgen/answers saw "regenerate answers" as a commit message and that lead to the PR up at the top. So I knew we would also need to regenerate our answers.

To do that, I used the helper script https://github.com/voltrondata-labs/arrowbench/blob/main/inst/tpch-answer-gen.R (where I switched out the sf <- 1 to sf <- 0.01, sf <- 0.1, and sf <- 10 to regenerate answers for different scale factors). That script will also run the queries using arrow + duckdb + dplyr and compare them all for correctness. There are a few gotchas there though (all in comments): there are slight differences for the tiny scale factors between duckdb and arrow. I've never totally dug into it, but I believe it's decimal precision / rounding rules. We don't care too much about those, though they are useful for testing so we keep them around. And dplyr will struggle at scale factor 10 so I comment it out. Then moved those newly regenerated answers over into the correct answer folder in inst (in this case inst/tpch/answers_duckdb_data/)

Once I had new answers in place, I reinstalled arrowbench and then re-ran the benchmark to see if it worked (it did), then ran a full run of all the queries with:

library(arrowbench)

run_benchmark(
    tpc_h,
    scale_factor = c(1, 10),
    cpu_count = 8
)

To confirm that those all ran + validated locally.