Closed paleolimbot closed 2 years ago
Ok, officially stuck on Query 20. Some info that hopefully makes it easier to see what's going on!
Input functions for interactive use (requires an arrowbench setup):
library(dplyr)
input_func <- function(tbl) {
arrow::read_parquet(
glue::glue("{Sys.getenv('ARROWBENCH_LOCAL_DIR')}/data/{tbl}_1.parquet")
)
}
input_func <- function(tbl) {
arrow::open_dataset(
glue::glue("{Sys.getenv('ARROWBENCH_LOCAL_DIR')}/data/{tbl}_1.parquet"),
)
}
Code to get the answer:
con <- DBI::dbConnect(duckdb::duckdb())
scale_factor <- 1
query_id <- 20
answer_psv <- DBI::dbGetQuery(
con,
paste0(
"SELECT answer FROM tpch_answers() WHERE scale_factor = ",
scale_factor,
" AND query_nr = ",
query_id,
";"
)
)
answer <- read.delim(textConnection(answer_psv$answer), sep = "|")
Query:
# SELECT
# s_name,
# s_address
# FROM
# supplier,
# nation
# WHERE
# s_suppkey IN (
# SELECT
# ps_suppkey
# FROM
# partsupp
# WHERE
# ps_partkey IN (
# SELECT
# p_partkey
# FROM
# part
# WHERE
# p_name LIKE 'forest%')
# AND ps_availqty > (
# SELECT
# 0.5 * sum(l_quantity)
# FROM
# lineitem
# WHERE
# l_partkey = ps_partkey
# AND l_suppkey = ps_suppkey
# AND l_shipdate >= CAST('1994-01-01' AS date)
# AND l_shipdate < CAST('1995-01-01' AS date)))
# AND s_nationkey = n_nationkey
# AND n_name = 'CANADA'
# ORDER BY
# s_name;
Current code:
supplier_ca <- input_func("supplier") %>%
inner_join(
input_func("nation") %>% filter(n_name == "CANADA"),
by = c("s_nationkey" = "n_nationkey")
) %>%
select(s_suppkey, s_name, s_address)
part_forest <- input_func("part") %>%
filter(grepl("^forest", p_name))
partsupp_forest_ca <- input_func("partsupp") %>%
semi_join(supplier_ca, c("ps_suppkey" = "s_suppkey")) %>%
semi_join(part_forest, by = c("ps_partkey" = "p_partkey"))
qty_threshold <- input_func("lineitem") %>%
filter(
l_shipdate >= as.Date("1994-01-01"),
l_shipdate < as.Date("1995-01-01")
) %>%
inner_join(partsupp_forest_ca, by = c("l_partkey" = "ps_partkey")) %>%
group_by(l_partkey, ps_suppkey) %>%
summarise(qty_threshold = 0.5 * sum(l_quantity), .groups = "drop")
partsupp_forest_ca_filtered <- partsupp_forest_ca %>%
inner_join(
qty_threshold,
by = c("ps_suppkey", "ps_partkey" = "l_partkey")
) %>%
filter(ps_availqty > qty_threshold)
supplier_ca %>%
semi_join(partsupp_forest_ca_filtered, by = c("s_suppkey" = "ps_suppkey")) %>%
select(s_name, s_address) %>%
arrange(s_name) %>%
collect()
I think what I'm up against is that qty_threshold
is too small and I don't know why (but probably has to do with my turning a correlated subquery into a join). The key line filter(ps_availqty > qty_threshold)
only removes one row, and it seems unlikley that "suppliers that are a good candidate for promotion" are almost all of them.
Thanks for this! I'll take a look at this later today. Do you mind if I push commits to your branch if I find anything?
Yeah go for it!
Wow, 20 was, indeed, a doozy. It took me a lot of staring and reasoning about that correlated subquery to figure out what the issue was, and the change ended up being minor: https://github.com/ursacomputing/arrowbench/pull/52/commits/87e7b8ede661df95d4142e18317115b739026027 though getting there took a lot.
So far, I've tried to harmonize the format of collect()
/saving outputs and a few other things (as well as removing kludges that aren't needed anymore), all of the queries work and validate, which is fantastic. I'll review them more deeply sometime today, and merge / enable them in conbench
Just a work-in-progress PR as I get these updated! (Flagging #50)