voltrondata-labs / arrowbench

R package for benchmarking
Other
13 stars 9 forks source link

Document which dplyr-based queries fail on DuckDB #54

Closed jonkeane closed 2 years ago

jonkeane commented 2 years ago

The queries all should work (since they use dplyr pipelines), though some of them error:

One can do so with:

library(arrowbench)
run_benchmark(
  tpc_h,
  engine = "duckdb",
  scale_factor = 1,
)
Running engine=duckdb query_id=1 format=native scale_factor=1 memory_map=FALSE cpu_count=10 lib_path=latest name=tpc_h

Running engine=duckdb query_id=2 format=native scale_factor=1 memory_map=FALSE cpu_count=10 lib_path=latest name=tpc_h
  [=>---------------------------------------------------]   5% in  9s eta:  3m
Attaching package: ‘dplyr’

The following objects are masked from ‘package:stats’:

    filter, lag

The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union

Error: Garbage collection 15 = 11+1+3 (level 0) ... 
42.7 Mbytes of cons cells used (62%)
10.1 Mbytes of vectors used (16%)
Garbage collection 16 = 12+1+3 (level 0) ... 
44.6 Mbytes of cons cells used (65%)
10.7 Mbytes of vectors used (17%)
Garbage collection 17 = 13+1+3 (level 0) ... 
46.2 Mbytes of cons cells used (67%)
11.2 Mbytes of vectors used (17%)
Garbage collection 18 = 14+1+3 (level 0) ... 
46.8 Mbytes of cons cells used (68%)
11.4 Mbytes of vectors used (18%)
Garbage collection 19 = 15+1+3 (level 0) ... 
47.0 Mbytes of cons cells used (69%)
11.4 Mbytes of vectors used (18%)
Garbage collection 20 = 16+1+3 (level 0) ... 
47.1 Mbytes of cons cells used (69%)
11.5 Mbytes of vectors used (18%)
Garbage collection 21 = 17+1+3 (level 0) ... 
47.2 Mbytes of cons cells used (69%)
11.5 Mbytes of vectors used (18%)
Garbage collection 22 = 18+1+3 (level 0) ... 
47.2 Mbytes of cons cells used (69%)
11.5 Mbytes of vectors used (18%)
Error in .local(conn, statement, ...) : 
  duckdb_prepare
Execution halted
Running engine=duckdb query_id=3 format=native scale_factor=1 memory_map=FALSE cpu_count=10 lib_path=latest name=tpc_h
Running engine=duckdb query_id=4 format=native scale_factor=1 memory_map=FALSE cpu_count=10 lib_path=latest name=tpc_h
Running engine=duckdb query_id=5 format=native scale_factor=1 memory_map=FALSE cpu_count=10 lib_path=latest name=tpc_h
Running engine=duckdb query_id=6 format=native scale_factor=1 memory_map=FALSE cpu_count=10 lib_path=latest name=tpc_h
Running engine=duckdb query_id=7 format=native scale_factor=1 memory_map=FALSE cpu_count=10 lib_path=latest name=tpc_h
  [=============>---------------------------------------]  27% in 50s eta:  2m
Attaching package: ‘dplyr’

The following objects are masked from ‘package:stats’:

    filter, lag

The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union

Error in duckdb_execute(res) : duckdb_execute_R: Failed to run query
Error: Catalog Error: Table with name "nation" already exists!
Calls: run_bm ... dbSendQuery -> .local -> duckdb_result -> duckdb_execute
Execution halted
Running engine=duckdb query_id=8 format=native scale_factor=1 memory_map=FALSE cpu_count=10 lib_path=latest name=tpc_h
  [================>------------------------------------]  32% in  1m eta:  2m
Attaching package: ‘dplyr’

The following objects are masked from ‘package:stats’:

    filter, lag

The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union

Error in duckdb_execute(res) : duckdb_execute_R: Failed to run query
Error: Catalog Error: Table with name "nation" already exists!
Calls: run_bm ... dbSendQuery -> .local -> duckdb_result -> duckdb_execute
Execution halted
Running engine=duckdb query_id=9 format=native scale_factor=1 memory_map=FALSE cpu_count=10 lib_path=latest name=tpc_h
  [==================>----------------------------------]  36% in  1m eta:  2m
Attaching package: ‘dplyr’

The following objects are masked from ‘package:stats’:

    filter, lag

The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union

Error: Garbage collection 15 = 11+1+3 (level 0) ... 
42.7 Mbytes of cons cells used (62%)
10.1 Mbytes of vectors used (16%)
Garbage collection 16 = 12+1+3 (level 0) ... 
44.6 Mbytes of cons cells used (65%)
10.7 Mbytes of vectors used (17%)
Garbage collection 17 = 13+1+3 (level 0) ... 
46.2 Mbytes of cons cells used (67%)
11.2 Mbytes of vectors used (17%)
Garbage collection 18 = 14+1+3 (level 0) ... 
46.9 Mbytes of cons cells used (68%)
11.4 Mbytes of vectors used (18%)
Garbage collection 19 = 15+1+3 (level 0) ... 
47.1 Mbytes of cons cells used (69%)
11.5 Mbytes of vectors used (18%)
Garbage collection 20 = 16+1+3 (level 0) ... 
47.1 Mbytes of cons cells used (69%)
11.5 Mbytes of vectors used (18%)
Error in .local(conn, statement, ...) : 
  duckdb_prepare_R: Failed to prepare query SELECT "nation", "o_year", SUM("amount") AS "sum_profit"
FROM (SELECT "n_name" AS "nation", EXTRACT(year FROM "o_orderdate") AS "o_year", "l_extendedprice" * (1.0 - "l_discount") - "ps_supplycost" * "l_quan
Execution halted
Running engine=duckdb query_id=10 format=native scale_factor=1 memory_map=FALSE cpu_count=10 lib_path=latest name=tpc_h
Running engine=duckdb query_id=11 format=native scale_factor=1 memory_map=FALSE cpu_count=10 lib_path=latest name=tpc_h
Running engine=duckdb query_id=12 format=native scale_factor=1 memory_map=FALSE cpu_count=10 lib_path=latest name=tpc_h
Running engine=duckdb query_id=13 format=native scale_factor=1 memory_map=FALSE cpu_count=10 lib_path=latest name=tpc_h
  [============================>------------------------]  55% in  2m eta:  1m
Attaching package: ‘dplyr’

The following objects are masked from ‘package:stats’:

    filter, lag

The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union

Error: Garbage collection 15 = 11+1+3 (level 0) ... 
42.7 Mbytes of cons cells used (62%)
10.1 Mbytes of vectors used (16%)
Garbage collection 16 = 12+1+3 (level 0) ... 
44.6 Mbytes of cons cells used (65%)
10.7 Mbytes of vectors used (17%)
Garbage collection 17 = 13+1+3 (level 0) ... 
46.2 Mbytes of cons cells used (67%)
11.2 Mbytes of vectors used (17%)
Error in .local(conn, statement, ...) : 
  duckdb_prepare_R: Failed to prepare query SELECT "c_count", COUNT(*) AS "custdist"
FROM (SELECT "c_custkey", SUM(NOT((("o_orderkey") IS NULL))) AS "c_count"
FROM (SELECT "c_custkey", "c_name", "c_address", "c_nationkey", "c_phone", "c_acctbal", "c_mktsegment", "c_comment", "o_orderkey", "o_orderstatus", "o_totalprice", "o_orderdate", "o_orderpriority", "o_clerk", "o_shippriority", "o_comment"
FROM "customer" AS "LHS"
LEFT JOIN (SELECT *
FROM "orders"
WHERE (NOT(grepl('special.*?requests', "o_comment")))) "RHS"
ON ("LHS"."c_custkey" = "RHS"."o_custkey")
) "q01"
GROUP BY "c_custkey") "q02"
GROU
Execution halted
Running engine=duckdb query_id=14 format=native scale_factor=1 memory_map=FALSE cpu_count=10 lib_path=latest name=tpc_h
  [==============================>----------------------]  59% in  2m eta:  1m
Attaching package: ‘dplyr’

The following objects are masked from ‘package:stats’:

    filter, lag

The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union

Error: Garbage collection 15 = 11+1+3 (level 0) ... 
42.7 Mbytes of cons cells used (62%)
10.1 Mbytes of vectors used (16%)
Garbage collection 16 = 12+1+3 (level 0) ... 
44.6 Mbytes of cons cells used (65%)
10.7 Mbytes of vectors used (17%)
Garbage collection 17 = 13+1+3 (level 0) ... 
46.0 Mbytes of cons cells used (67%)
11.1 Mbytes of vectors used (17%)
Error in .local(conn, statement, ...) : 
  duckdb_prepare_R: Failed to prepare query SELECT 100.0 * SUM(CASE WHEN (grepl('^PROMO', "p_type")) THEN ("l_extendedprice" * (1.0 - "l_discount")) WHEN NOT(grepl('^PROMO', "p_type")) THEN (0.0) END) / SUM("l_extendedprice" * (1.0 - "l_discount")) AS "promo_revenue"
FROM (SELECT "l_orderkey", "l_partkey", "l_suppkey", "l_linenumber", "l_quantity", "l_extendedprice", "l_discount", "l_tax", "l_returnflag", "l_linestatus", "l_shipdate", "l_commitdate", "l_receiptdate", "l_shipinstruct", "l_shipmode", "l_comment", "p_name", "p_mfgr", "p_brand", "p_type", "p_size", "p_container", "p_retailprice", "
Execution halted
Running engine=duckdb query_id=15 format=native scale_factor=1 memory_map=FALSE cpu_count=10 lib_path=latest name=tpc_h
Running engine=duckdb query_id=16 format=native scale_factor=1 memory_map=FALSE cpu_count=10 lib_path=latest name=tpc_h
  [===================================>-----------------]  68% in  2m eta:  1m
Attaching package: ‘dplyr’

The following objects are masked from ‘package:stats’:

    filter, lag

The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union

Error: Garbage collection 15 = 11+1+3 (level 0) ... 
42.7 Mbytes of cons cells used (62%)
10.1 Mbytes of vectors used (16%)
Garbage collection 16 = 12+1+3 (level 0) ... 
44.6 Mbytes of cons cells used (65%)
10.7 Mbytes of vectors used (17%)
Garbage collection 17 = 13+1+3 (level 0) ... 
46.7 Mbytes of cons cells used (68%)
11.3 Mbytes of vectors used (18%)
Garbage collection 18 = 14+1+3 (level 0) ... 
47.0 Mbytes of cons cells used (69%)
11.4 Mbytes of vectors used (18%)
Error in .local(conn, statement, ...) : 
  duckdb_prepare_R: Failed to prepare query SELECT "p_brand", "p_type", "p_size", "supplier_cnt"
FROM (SELECT "p_brand", "p_type", "p_size", COUNT(DISTINCT "ps_suppkey") AS "supplier_cnt"
FROM (SELECT "p_partkey", "p_name", "p_mfgr", "p_brand", "p_type", "p_size", "p_container", "p_retailprice", "p_comment", "ps_suppkey"
FROM (SELECT *
FROM "part"
WHERE (("p_brand" != 'Brand#45') AND (NOT(grepl('^MEDIUM POLISHED', "p_type"))) AND ("p_size" IN (49.0, 14.0, 23.0, 45.0, 19.0, 3.0, 3
Execution halted
Running engine=duckdb query_id=17 format=native scale_factor=1 memory_map=FALSE cpu_count=10 lib_path=latest name=tpc_h
Running engine=duckdb query_id=18 format=native scale_factor=1 memory_map=FALSE cpu_count=10 lib_path=latest name=tpc_h
Running engine=duckdb query_id=19 format=native scale_factor=1 memory_map=FALSE cpu_count=10 lib_path=latest name=tpc_h
Running engine=duckdb query_id=20 format=native scale_factor=1 memory_map=FALSE cpu_count=10 lib_path=latest name=tpc_h
  [=============================================>-------]  86% in  2m eta: 22s
Attaching package: ‘dplyr’

The following objects are masked from ‘package:stats’:

    filter, lag

The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union

Error: Garbage collection 15 = 11+1+3 (level 0) ... 
42.7 Mbytes of cons cells used (62%)
10.1 Mbytes of vectors used (16%)
Garbage collection 16 = 12+1+3 (level 0) ... 
44.6 Mbytes of cons cells used (65%)
10.7 Mbytes of vectors used (17%)
Garbage collection 17 = 13+1+3 (level 0) ... 
46.7 Mbytes of cons cells used (68%)
11.4 Mbytes of vectors used (18%)
Garbage collection 18 = 14+1+3 (level 0) ... 
47.0 Mbytes of cons cells used (69%)
11.4 Mbytes of vectors used (18%)
Garbage collection 19 = 15+1+3 (level 0) ... 
47.2 Mbytes of cons cells used (69%)
11.5 Mbytes of vectors used (18%)
Garbage collection 20 = 16+1+3 (level 0) ... 
47.2 Mbytes of cons cells used (69%)
11.5 Mbytes of vectors used (18%)
Error in .local(conn, statement, ...) : 
  duckdb_prepare_R: Failed to prepare query SELECT "s_name", "s_address"
FROM (SELECT * FROM (SELECT "s_suppkey", "s_name", "s_address"
FROM (SELECT "s_suppkey", "s_name", "s_address", "s_nationkey", "s_phone", "s_acctbal", "s_comment", "n_name", "n_
Execution halted
Running engine=duckdb query_id=21 format=native scale_factor=1 memory_map=FALSE cpu_count=10 lib_path=latest name=tpc_h
  [===============================================>-----]  91% in  2m eta: 14s
Attaching package: ‘dplyr’

The following objects are masked from ‘package:stats’:

    filter, lag

The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union

Error in duckdb_execute(res) : duckdb_execute_R: Failed to run query
Error: Catalog Error: Table with name "lineitem" already exists!
Calls: run_bm ... dbSendQuery -> .local -> duckdb_result -> duckdb_execute
Execution halted
Running engine=duckdb query_id=22 format=native scale_factor=1 memory_map=FALSE cpu_count=10 lib_path=latest name=tpc_h
  [==================================================>--]  95% in  3m eta:  7s
Attaching package: ‘dplyr’

The following objects are masked from ‘package:stats’:

    filter, lag

The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union

Error in duckdb_execute(res) : duckdb_execute_R: Failed to run query
Error: Catalog Error: Table with name "customer" already exists!
Calls: run_bm ... dbSendQuery -> .local -> duckdb_result -> duckdb_execute
Execution halted
  [=====================================================] 100% in  3m eta:  0s
10 benchmarks errored:
   engine query_id format scale_factor memory_map cpu_count lib_path
2  duckdb        2 native            1      FALSE        10   latest
7  duckdb        7 native            1      FALSE        10   latest
8  duckdb        8 native            1      FALSE        10   latest
9  duckdb        9 native            1      FALSE        10   latest
13 duckdb       13 native            1      FALSE        10   latest
14 duckdb       14 native            1      FALSE        10   latest
16 duckdb       16 native            1      FALSE        10   latest
20 duckdb       20 native            1      FALSE        10   latest
21 duckdb       21 native            1      FALSE        10   latest
22 duckdb       22 native            1      FALSE        10   latest
Total run time: 2.528341 mins
boshek commented 2 years ago

@jonkeane Is there somewhere specific to document these or is that part of the task?

jonkeane commented 2 years ago

We could upstream them as issues to duckdb if the problem is duckdb, tho it might take a decent amount of digging to confirm that our dplyr implementations aren't doing something maliciously wrong (though they all work with arrow!)

boshek commented 2 years ago

So the following tpc-h queries fail for DuckDB: 2, 9, 13, 14, 16, 20 and 21.

For everything except query 21, the issue is the use of grepl which works in arrow but not in duckdb (and indeed most DBI backends).

For 21 it is this use of any here.

So for this, one fix is an upstream dbplyr/duckdb implementation. Another (shorter term) idea is to modify the valid parameters to exclude running the above queries.

duckdb grepl reprex ``` r library(duckdb, warn.conflicts = FALSE) #> Loading required package: DBI library(dplyr, warn.conflicts = FALSE) data("flights", package = "nycflights13") flights %>% filter(grepl("N", tailnum)) #> # A tibble: 334,264 × 19 #> year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time #> #> 1 2013 1 1 517 515 2 830 819 #> 2 2013 1 1 533 529 4 850 830 #> 3 2013 1 1 542 540 2 923 850 #> 4 2013 1 1 544 545 -1 1004 1022 #> 5 2013 1 1 554 600 -6 812 837 #> 6 2013 1 1 554 558 -4 740 728 #> 7 2013 1 1 555 600 -5 913 854 #> 8 2013 1 1 557 600 -3 709 723 #> 9 2013 1 1 557 600 -3 838 846 #> 10 2013 1 1 558 600 -2 753 745 #> # … with 334,254 more rows, and 11 more variables: arr_delay , #> # carrier , flight , tailnum , origin , dest , #> # air_time , distance , hour , minute , time_hour con <- dbConnect(duckdb()) duckdb_register(con, "flights", flights) tbl(con, "flights") %>% filter(grepl("N", tailnum)) %>% collect() #> Error: duckdb_prepare_R: Failed to prepare query SELECT * #> FROM "flights" #> WHERE (grepl('N', "tailnum")) #> Error: Catalog Error: Scalar Function with name grepl does not exist! #> Did you mean "ceil"? #> LINE 3: WHERE (grepl('N', "tailnum")) #> ^ dbDisconnect(con) ```
jonkeane commented 2 years ago

Oh, this is fantastic, thanks for all this detail!

I wonder if there's something other than grepl we could use that would get translated correctly?

And for the any could we do sum(...) > 1? That might work just fine in arrow|dplyr and duckdb too

boshek commented 2 years ago

And for the any could we do sum(...) > 1? That might work just fine in arrow|dplyr and duckdb too

This was pretty easy to solve.

I actually don't think we have anything easy for grepl that is available in both R and arrow. I opened an issue in dbplyr but so as not to assume anything there, I wonder if the best path is to skip those benchmarks for now?

The other path that seems clunky is to run either grepl or %like% depending on the engine. But that would add quite a bit of additional code and potential maintenance.

jonkeane commented 2 years ago

nods yeah. This might be a bit odd, but one thing we could try temporarily is to add a grep_func argument to the query functions https://github.com/ursacomputing/arrowbench/blob/4022c6e37a8aa2eebd579a8d710ee5e05c1c7372/R/tpch-queries.R#L8 and then have that be grep_func = grepl for non-duckdb cases and grep_func = `%like%` for duckdb and see if that works?

Ultimately, it's ok if we know the issue is grepl and translating that into duckdb's sql in some way — we can run the SQL-based queries for duckdb, which is better anyway!