voltrondata-labs / arrowbench

R package for benchmarking

Add validation to scale factors 0.01-10 #55

Closed jonkeane closed 2 years ago

jonkeane commented 2 years ago

This also includes a bunch of cleanup / usability enhancements:
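As a rough illustration of the validation in the PR title, a range check on the scale factor might look like the sketch below. The helper name and message wording here are hypothetical, not arrowbench's actual implementation.

```r
# Hypothetical sketch of scale-factor validation; the real helper in
# arrowbench may have a different name, signature, and error message.
validate_scale_factor <- function(scale_factor) {
  stopifnot(is.numeric(scale_factor), length(scale_factor) == 1)
  if (scale_factor < 0.01 || scale_factor > 10) {
    stop("`scale_factor` must be between 0.01 and 10, got: ", scale_factor)
  }
  invisible(scale_factor)
}
```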

paleolimbot commented 2 years ago

The tests that you added pass on my machine as well (although a few others fail on my machine both on main and on this PR). The only thing I can find that's wonky is:

arrowbench::tpch_answer(10, 12)
#> Error: IOError: Failed to open local file ''
#> /Users/deweydunnington/Desktop/rscratch/arrow/cpp/src/arrow/io/file.cc:110  ::arrow::internal::FileOpenReadable(file_name_)
#> /Users/deweydunnington/Desktop/rscratch/arrow/cpp/src/arrow/io/file.cc:472  file_->OpenReadable(path)
#> /Users/deweydunnington/Desktop/rscratch/arrow/cpp/src/arrow/io/file.cc:638  result->memory_map_->Open(path, mode). Detail: [errno 2] No such file or directory
jonkeane commented 2 years ago

Aaah yes, query 12 doesn't work on any machine I've tried at scale factor 10, so I don't have an answer there yet. I probably could + should craft an answer parquet for it so that it's there when the query engine is fixed to run it. I think this is a query engine problem (so it will eventually be an apache/arrow JIRA), but I haven't gotten to the bottom of what the issue is, so I haven't created one yet. Would you mind taking a look, seeing if you can get a reprex for that query, and making a JIRA for it?
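Once the query does run at scale factor 10 (e.g. via another engine), saving its collected result as the reference answer could look like the sketch below. The answer-file path layout and helper names here are assumptions for illustration, not arrowbench's actual conventions.

```r
# Hypothetical path layout for stored TPC-H answers.
answer_path <- function(query_num, scale_factor, dir = "inst/answers") {
  file.path(dir, sprintf("tpch-q%02d-sf%s.parquet", query_num, scale_factor))
}

# Write a collected query result as the reference answer parquet.
save_answer <- function(result_df, query_num, scale_factor,
                        dir = "inst/answers") {
  path <- answer_path(query_num, scale_factor, dir)
  dir.create(dirname(path), recursive = TRUE, showWarnings = FALSE)
  arrow::write_parquet(result_df, path)
  path
}
```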

paleolimbot commented 2 years ago

If it's out of scope for this PR I can make a JIRA, but I modded your query 12 to be this and it runs quite quickly (and validates at scale 1). Does this help to generate the answer you need?

# TPC-H query 12: filter lineitem first, then join orders
tpc_h_queries[[12]] <- function(input_func, collect_func = dplyr::collect) {
  input_func("lineitem") %>%
    filter(
      l_shipmode %in% c("MAIL", "SHIP"),
      l_commitdate < l_receiptdate,
      l_shipdate < l_commitdate,
      l_receiptdate >= as.Date("1994-01-01"),
      l_receiptdate < as.Date("1995-01-01")
    ) %>%
    inner_join(
      input_func("orders"),
      by = c("l_orderkey" = "o_orderkey")
    ) %>%
    group_by(l_shipmode) %>%
    summarise(
      high_line_count = sum(
        if_else(
          (o_orderpriority == "1-URGENT") | (o_orderpriority == "2-HIGH"),
          1L,
          0L
        )
      ),
      low_line_count = sum(
        if_else(
          (o_orderpriority != "1-URGENT") & (o_orderpriority != "2-HIGH"),
          1L,
          0L
        )
      )
    ) %>%
    ungroup() %>%
    arrange(l_shipmode) %>%
    collect_func()
}
jonkeane commented 2 years ago

Oh fantastic, if this works that would be even better. You "just" moved lineitem to be the first table, did all the filtering there, and then joined afterwards, instead of starting with orders and then attempting to join the filtered lineitem to that, yeah?
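For the record, the structural difference boils down to the following toy contrast (plain dplyr on made-up data, not the TPC-H tables themselves): filtering the large table before the join means far fewer rows enter the join, even though both forms return the same rows.

```r
library(dplyr)

# Toy stand-ins for the TPC-H tables (illustrative only).
lineitem <- tibble(
  l_orderkey = 1:4,
  l_shipmode = c("MAIL", "AIR", "SHIP", "RAIL")
)
orders <- tibble(
  o_orderkey = 1:4,
  o_orderpriority = c("1-URGENT", "2-HIGH", "5-LOW", "3-MEDIUM")
)

# Original shape: join first, so the join sees all of lineitem.
a <- orders %>%
  inner_join(lineitem, by = c("o_orderkey" = "l_orderkey")) %>%
  filter(l_shipmode %in% c("MAIL", "SHIP"))

# Restructured shape: filter lineitem first, then join.
b <- lineitem %>%
  filter(l_shipmode %in% c("MAIL", "SHIP")) %>%
  inner_join(orders, by = c("l_orderkey" = "o_orderkey"))
```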

I'll use this new code here, but it's a bit funny that we get a crash in the first case, so we probably still want to dig into that and create a JIRA for it (even if the fix is "just" to add a gate that stops someone from doing whatever is causing the crash in the first place). I don't think it's as simple as the right-hand table being filtered, since we have other examples of that.

paleolimbot commented 2 years ago

I'll put it on my "to reprex" list and open a JIRA! I'm guessing it's an out-of-memory error since it's a big join of two giant tables (or was before I filtered lineitem before the join).

jonkeane commented 2 years ago

Yeah, absolutely. Thanks again for that quick fix + review