When using DuckDB SQL (against Views of parquet files), for TPC-H query 9 at scale factor 1 the ordering is off. USA and UK are interleaved, when they should actually be UK then USA.
result vs answer
nation o_year sum_profit
result[152, ] SAUDI ARABIA 1994 48181673
result[153, ] SAUDI ARABIA 1993 45692556
result[154, ] SAUDI ARABIA 1992 48924913
- result[155, ] UNITED STATES 1998 27826594
+ answer[155, ] UNITED KINGDOM 1998 26366683
- result[156, ] UNITED KINGDOM 1998 26366683
+ answer[156, ] UNITED KINGDOM 1997 44518130
- result[157, ] UNITED KINGDOM 1997 44518130
+ answer[157, ] UNITED KINGDOM 1996 45539730
- result[158, ] UNITED STATES 1997 46638572
+ answer[158, ] UNITED KINGDOM 1995 46845879
- result[159, ] UNITED STATES 1996 46688281
+ answer[159, ] UNITED KINGDOM 1994 43081610
- result[160, ] UNITED KINGDOM 1996 45539730
+ answer[160, ] UNITED KINGDOM 1993 44770147
- result[161, ] UNITED STATES 1995 48951592
+ answer[161, ] UNITED KINGDOM 1992 44123403
- result[162, ] UNITED KINGDOM 1995 46845879
+ answer[162, ] UNITED STATES 1998 27826594
- result[163, ] UNITED STATES 1994 45099092
+ answer[163, ] UNITED STATES 1997 46638572
- result[164, ] UNITED KINGDOM 1994 43081610
+ answer[164, ] UNITED STATES 1996 46688281
- result[165, ] UNITED KINGDOM 1993 44770147
+ answer[165, ] UNITED STATES 1995 48951592
- result[166, ] UNITED STATES 1993 46181601
+ answer[166, ] UNITED STATES 1994 45099092
- result[167, ] UNITED STATES 1992 46168214
+ answer[167, ] UNITED STATES 1993 46181601
- result[168, ] UNITED KINGDOM 1992 44123403
+ answer[168, ] UNITED STATES 1992 46168214
result[169, ] VIETNAM 1998 27281931
result[170, ] VIETNAM 1997 48735914
result[171, ] VIETNAM 1996 47824596
library(duckdb)
library(arrowbench)
query_id <- 9
format <- "parquet"
scale_factor <- 1
Sys.setenv(ARROWBENCH_LOCAL_DIR = "~/repos/ab_store/het")
tpch_files <- ensure_source("tpch", scale_factor = scale_factor)
tpch_tables_needed <- tables_refed(tpc_h_queries[[query_id]])
con <- dbConnect(duckdb())
# set parallelism for duckdb
DBI::dbExecute(con, paste0("PRAGMA threads=10"))
for (name in tpch_tables_needed) {
file <- path.expand(tpch_files[[name]])
# have to create a VIEW in order to reference it by name
# This view is the most accurate comparison to Arrow, however it will
# penalize duckdb since AFAICT `parquet_scan` is not parallelized and
# ends up being the bottleneck
if (format == "parquet") {
sql_query <- paste0("CREATE OR REPLACE VIEW ", name, " AS SELECT * FROM parquet_scan('", file, "');")
} else if (format == "native") {
sql_query <- paste0("CREATE TABLE IF NOT EXISTS ", name, " AS SELECT * FROM parquet_scan('", file, "');")
}
DBI::dbExecute(con, sql_query)
}
tpc_h_queries <- list()
# TODO: should this actually just be query_id?
queries <- 1:22
# get the queries
tpc_h_queries <- lapply(queries, get_sql_query_func, con = con)
names(tpc_h_queries) <- queries
result <- tpc_h_queries[[query_id]](con = con)
result <- dplyr::as_tibble(result)
answer <- tpch_answer(scale_factor, query_id)
waldo::compare(result, answer, x_arg = "result", y_arg = "answer")
When using DuckDB SQL (against Views of parquet files), for TPC-H query 9 at scale factor 1 the ordering is off. USA and UK are interleaved, when they should actually be UK then USA.