Closed jonkeane closed 2 years ago
The tests that you added pass on my machine as well (although a few others fail on my machine both on main and on this PR). The only thing I can find that's wonky is:
arrowbench::tpch_answer(10, 12)
#> Error: IOError: Failed to open local file ''
#> /Users/deweydunnington/Desktop/rscratch/arrow/cpp/src/arrow/io/file.cc:110 ::arrow::internal::FileOpenReadable(file_name_)
#> /Users/deweydunnington/Desktop/rscratch/arrow/cpp/src/arrow/io/file.cc:472 file_->OpenReadable(path)
#> /Users/deweydunnington/Desktop/rscratch/arrow/cpp/src/arrow/io/file.cc:638 result->memory_map_->Open(path, mode). Detail: [errno 2] No such file or directory
Aaah yes, query 12 doesn't work at scale factor 10 on any machine I've tried, so I don't have an answer there yet. (I probably could and should craft an answer parquet for it so that it's there once the query engine is fixed to run it.) I think that is a query engine problem, so it will eventually be an apache/arrow Jira, but I haven't gotten to the bottom of the issue, so I haven't created one yet. Would you mind taking a look, seeing if you can get a reprex for that query, and making a Jira for it?
If it's out of scope for this PR I can make a JIRA, but I modded your query 12 to be this and it runs quite quickly (and validates at scale 1). Does this help to generate the answer you need?
tpc_h_queries[[12]] <- function(input_func, collect_func = dplyr::collect) {
  input_func("lineitem") %>%
    filter(
      l_shipmode %in% c("MAIL", "SHIP"),
      l_commitdate < l_receiptdate,
      l_shipdate < l_commitdate,
      l_receiptdate >= as.Date("1994-01-01"),
      l_receiptdate < as.Date("1995-01-01")
    ) %>%
    inner_join(
      input_func("orders"),
      by = c("l_orderkey" = "o_orderkey")
    ) %>%
    group_by(l_shipmode) %>%
    summarise(
      high_line_count = sum(
        if_else(
          (o_orderpriority == "1-URGENT") | (o_orderpriority == "2-HIGH"),
          1L,
          0L
        )
      ),
      low_line_count = sum(
        if_else(
          (o_orderpriority != "1-URGENT") & (o_orderpriority != "2-HIGH"),
          1L,
          0L
        )
      )
    ) %>%
    ungroup() %>%
    arrange(l_shipmode) %>%
    collect_func()
}
Oh fantastic, if this works that would be even better. You "just" moved lineitem to be the first table, did all the filtering there, and then joined afterwards, instead of starting with orders and then attempting to join the filtered lineitem to that, yeah?
I'll use this new code here, but it's a bit funny that we get a crash in the first case, so we probably still want to dig into that and create a Jira for it (even if the fix is "just" to add a gate that stops someone from doing whatever is causing the crash in the first case). I don't think it's as simple as a right-hand table being filtered, because we have other examples of that.
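For reference, the reordering discussed above can be sanity-checked on toy data. This is only a minimal sketch, assuming plain dplyr on in-memory tibbles rather than arrowbench's input functions and the Arrow query engine; the tiny `orders` and `lineitem` tables and the helpers `keep_lineitem()` and `count_priorities()` are hypothetical and exist only for illustration. It shows that starting from orders and starting from the filtered lineitem give the same answer for an inner join, so the rewrite should only change how the engine executes the query, not the result.

```r
library(dplyr)

# Hypothetical toy tables with only the columns query 12 touches.
orders <- tibble(
  o_orderkey = 1:4,
  o_orderpriority = c("1-URGENT", "3-MEDIUM", "2-HIGH", "5-LOW")
)
lineitem <- tibble(
  l_orderkey = c(1L, 2L, 3L, 4L, 4L),
  l_shipmode = c("MAIL", "SHIP", "MAIL", "AIR", "SHIP"),
  l_shipdate = as.Date(c("1994-01-05", "1994-02-01", "1994-03-01", "1994-04-01", "1993-11-01")),
  l_commitdate = as.Date(c("1994-01-10", "1994-02-10", "1994-03-10", "1994-04-10", "1993-11-10")),
  l_receiptdate = as.Date(c("1994-01-15", "1994-02-15", "1994-03-15", "1994-04-15", "1993-11-15"))
)

# Hypothetical helper: the lineitem filters from query 12.
keep_lineitem <- function(df) {
  df %>% filter(
    l_shipmode %in% c("MAIL", "SHIP"),
    l_commitdate < l_receiptdate,
    l_shipdate < l_commitdate,
    l_receiptdate >= as.Date("1994-01-01"),
    l_receiptdate < as.Date("1995-01-01")
  )
}

# Hypothetical helper: the high/low priority counts per ship mode.
count_priorities <- function(df) {
  df %>%
    group_by(l_shipmode) %>%
    summarise(
      high_line_count = sum(if_else(o_orderpriority %in% c("1-URGENT", "2-HIGH"), 1L, 0L)),
      low_line_count = sum(if_else(!o_orderpriority %in% c("1-URGENT", "2-HIGH"), 1L, 0L))
    ) %>%
    arrange(l_shipmode)
}

# Order A: start from orders and join the filtered lineitem onto it.
orders_first <- orders %>%
  inner_join(keep_lineitem(lineitem), by = c("o_orderkey" = "l_orderkey")) %>%
  count_priorities()

# Order B: start from the filtered lineitem, then join orders.
lineitem_first <- keep_lineitem(lineitem) %>%
  inner_join(orders, by = c("l_orderkey" = "o_orderkey")) %>%
  count_priorities()

# Both orders should produce the same per-ship-mode counts.
stopifnot(isTRUE(all.equal(as.data.frame(orders_first), as.data.frame(lineitem_first))))
```

Because the toy tables fit trivially in memory, this says nothing about the scale factor 10 failure itself; it only illustrates why the two formulations are interchangeable as far as the answer goes.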
I'll put it on my "to reprex" list and open a JIRA! I'm guessing it's an out-of-memory error since it's a big join of two giant tables (or was before I filtered lineitem before the join).
Yeah, absolutely. Thanks again for that quick fix + review
This also has a bunch of cleanup / usability enhancements:
tpch_answer() function for retrieving an answer