voltrondata-labs / arrowbench

R package for benchmarking
Other
13 stars 9 forks source link

TCP-H queries 11-21 #52

Closed paleolimbot closed 2 years ago

paleolimbot commented 3 years ago

Just a work-in-progress PR as I get these updated! (Flagging #50)

paleolimbot commented 2 years ago

Ok, officially stuck on Query 20. Some info that hopefully makes it easier to see what's going on!

Input functions for interactive use (requires an arrowbench setup):

library(dplyr)

input_func <- function(tbl) {
  arrow::read_parquet(
    glue::glue("{Sys.getenv('ARROWBENCH_LOCAL_DIR')}/data/{tbl}_1.parquet")
  )
}

input_func <- function(tbl) {
  arrow::open_dataset(
    glue::glue("{Sys.getenv('ARROWBENCH_LOCAL_DIR')}/data/{tbl}_1.parquet"), 
  )
}

Code to get the answer:

con <- DBI::dbConnect(duckdb::duckdb())
scale_factor <- 1
query_id <- 20
answer_psv <- DBI::dbGetQuery(
        con,
        paste0(
          "SELECT answer FROM tpch_answers() WHERE scale_factor = ",
          scale_factor,
          " AND query_nr = ",
          query_id,
          ";"
        )
      )
answer <- read.delim(textConnection(answer_psv$answer), sep = "|")

Query:

#  SELECT
  #      s_name,
  #      s_address
  #  FROM
  #      supplier,
  #      nation
  #  WHERE
  #      s_suppkey IN (
  #          SELECT
  #              ps_suppkey
  #          FROM
  #              partsupp
  #          WHERE
  #              ps_partkey IN (
  #                  SELECT
  #                      p_partkey
  #                  FROM
  #                      part
  #                  WHERE
  #                      p_name LIKE 'forest%')
  #                  AND ps_availqty > (
  #                      SELECT
  #                          0.5 * sum(l_quantity)
  #                      FROM
  #                          lineitem
  #                      WHERE
  #                          l_partkey = ps_partkey
  #                          AND l_suppkey = ps_suppkey
  #                          AND l_shipdate >= CAST('1994-01-01' AS date)
  #                          AND l_shipdate < CAST('1995-01-01' AS date)))
  #              AND s_nationkey = n_nationkey
  #              AND n_name = 'CANADA'
  #          ORDER BY
  #              s_name;

Current code:

  supplier_ca <- input_func("supplier") %>%
    inner_join(
      input_func("nation") %>% filter(n_name == "CANADA"),
      by = c("s_nationkey" = "n_nationkey")
    ) %>%
    select(s_suppkey, s_name, s_address)

  part_forest <- input_func("part") %>%
    filter(grepl("^forest", p_name))

  partsupp_forest_ca <- input_func("partsupp") %>%
    semi_join(supplier_ca, c("ps_suppkey" = "s_suppkey")) %>%
    semi_join(part_forest, by = c("ps_partkey" = "p_partkey"))

  qty_threshold <- input_func("lineitem") %>%
    filter(
      l_shipdate >= as.Date("1994-01-01"),
      l_shipdate < as.Date("1995-01-01")
    ) %>%
    inner_join(partsupp_forest_ca, by = c("l_partkey" = "ps_partkey")) %>%
    group_by(l_partkey, ps_suppkey) %>%
    summarise(qty_threshold = 0.5 * sum(l_quantity), .groups = "drop")

  partsupp_forest_ca_filtered <- partsupp_forest_ca %>%
    inner_join(
      qty_threshold,
      by = c("ps_suppkey", "ps_partkey" = "l_partkey")
    ) %>%
    filter(ps_availqty > qty_threshold)

  supplier_ca %>%
    semi_join(partsupp_forest_ca_filtered, by = c("s_suppkey" = "ps_suppkey")) %>%
    select(s_name, s_address) %>%
    arrange(s_name) %>%
    collect()

I think what I'm up against is that qty_threshold is too small and I don't know why (but probably has to do with my turning a correlated subquery into a join). The key line filter(ps_availqty > qty_threshold) only removes one row, and it seems unlikley that "suppliers that are a good candidate for promotion" are almost all of them.

jonkeane commented 2 years ago

Thanks for this! I'll take a look at this later today. Do you mind if I push commits to your branch if I find anything?

paleolimbot commented 2 years ago

Yeah go for it!

jonkeane commented 2 years ago

Wow, 20 was, indeed, a doozy. It took me a lot of staring and reasoning about that correlated subquery to figure out what the issue was, and the change ended up being minor: https://github.com/ursacomputing/arrowbench/pull/52/commits/87e7b8ede661df95d4142e18317115b739026027 though getting there took a lot.

jonkeane commented 2 years ago

So far, I've tried to harmonize the format of collect()/saving outputs and a few other things (as well as removing kludges that aren't needed anymore), all of the queries work and validate, which is fantastic. I'll review them more deeply sometime today, and merge / enable them in conbench