voltrondata / substrait-r

An R Interface to the 'Substrait' Cross-Language Serialization for Relational Algebra
Other
27 stars 7 forks source link

Provide some TPC-H utilities for more realistic tests #223

Closed paleolimbot closed 1 year ago

paleolimbot commented 2 years ago

I opted to put them as .csv files in inst/; however, another approach would be to use_package_data() such that the current tpch_tables() would reduce to tpch0001 and read_tpch_df("customer") would reduce to tpch0001$customer.

The idea is that you can do stuff like the following to write tests for these:

library(substrait, warn.conflicts = FALSE)
library(dplyr, warn.conflicts = FALSE)

# Use to test some dplyr operations with more realistic tables
substrait:::read_tpch_df("customer") |> 
  arrow_substrait_compiler() |> 
  select(c_custkey:c_address) |> 
  collect()
#> # A tibble: 150 × 3
#>    c_custkey c_name             c_address                              
#>        <int> <chr>              <chr>                                  
#>  1         1 Customer#000000001 "j5JsirBM9PsCy0O1m"                    
#>  2         2 Customer#000000002 "487LW1dovn6Q4dMVymKwwLE9OKf3QG"       
#>  3         3 Customer#000000003 "fkRGN8nY4pkE"                         
#>  4         4 Customer#000000004 "4u58h fqkyE"                          
#>  5         5 Customer#000000005 "hwBtxkoBF qSW4KrIk5U 2B1AU7H"         
#>  6         6 Customer#000000006 " g1s,pzDenUEBW3O,2 pxu0f9n2g64rJrt5E" 
#>  7         7 Customer#000000007 "8OkMVLQ1dK6Mbu6WG9 w4pLGQ n7MQ"       
#>  8         8 Customer#000000008 "j,pZ,Qp,qtFEo0r0c 92qobZtlhSuOqbE4JGV"
#>  9         9 Customer#000000009 "vgIql8H6zoyuLMFNdAMLyE7 H9"           
#> 10        10 Customer#000000010 "Vf mQ6Ug9Ucf5OKGYq fsaX AtfsO7,rwY"   
#> # … with 140 more rows

# Use to back out a substrait plan from a TPC-H query
# https://github.com/duckdb/duckdb/tree/master/extension/tpch/dbgen/queries
plan <- duckdb_get_substrait(
  "
  SELECT
    l_returnflag,
    l_linestatus,
    sum(l_quantity) AS sum_qty,
    sum(l_extendedprice) AS sum_base_price,
    sum(l_extendedprice * (1 - l_discount)) AS sum_disc_price,
    sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)) AS sum_charge,
    avg(l_quantity) AS avg_qty,
    avg(l_extendedprice) AS avg_price,
    avg(l_discount) AS avg_disc,
    count(*) AS count_order
FROM
    lineitem
WHERE
    l_shipdate <= CAST('1998-09-02' AS date)
GROUP BY
    l_returnflag,
    l_linestatus
ORDER BY
    l_returnflag,
    l_linestatus;
  ",
  tables = substrait:::tpch_tables()
)

# look at extension functions used
plan$extensions
#> [[1]]
#> message of type 'substrait.extensions.SimpleExtensionDeclaration' with 1 field set
#> extension_function {
#>   function_anchor: 1
#>   name: "lte"
#> }
#> 
#> [[2]]
#> message of type 'substrait.extensions.SimpleExtensionDeclaration' with 1 field set
#> extension_function {
#>   function_anchor: 2
#>   name: "is_not_null"
#> }
#> 
#> [[3]]
#> message of type 'substrait.extensions.SimpleExtensionDeclaration' with 1 field set
#> extension_function {
#>   function_anchor: 3
#>   name: "and"
#> }
#> 
#> [[4]]
#> message of type 'substrait.extensions.SimpleExtensionDeclaration' with 1 field set
#> extension_function {
#>   function_anchor: 4
#>   name: "*"
#> }
#> 
#> [[5]]
#> message of type 'substrait.extensions.SimpleExtensionDeclaration' with 1 field set
#> extension_function {
#>   function_anchor: 5
#>   name: "-"
#> }
#> 
#> [[6]]
#> message of type 'substrait.extensions.SimpleExtensionDeclaration' with 1 field set
#> extension_function {
#>   function_anchor: 6
#>   name: "sum_no_overflow"
#> }
#> 
#> [[7]]
#> message of type 'substrait.extensions.SimpleExtensionDeclaration' with 1 field set
#> extension_function {
#>   function_anchor: 7
#>   name: "sum"
#> }
#> 
#> [[8]]
#> message of type 'substrait.extensions.SimpleExtensionDeclaration' with 1 field set
#> extension_function {
#>   function_anchor: 8
#>   name: "+"
#> }
#> 
#> [[9]]
#> message of type 'substrait.extensions.SimpleExtensionDeclaration' with 1 field set
#> extension_function {
#>   function_anchor: 9
#>   name: "avg"
#> }
#> 
#> [[10]]
#> message of type 'substrait.extensions.SimpleExtensionDeclaration' with 1 field set
#> extension_function {
#>   function_anchor: 10
#>   name: "count_star"
#> }

# View interactively in RStudio
# View(plan)

Created on 2022-11-28 with reprex v2.0.2

paleolimbot commented 2 years ago

datalogistik seems like an excellent choice for complex scenarios...here I think that implementing the Python + reticulate + virtualenv + test-skipping-if-one-of-those-doesn't-work dance is excessive when all we need is a few .csvs to get started. If our tests become any more complex than "make sure this compiles", that may be a good time to revisit the need for a Python dependency here. I'm also happy to review a PR if you or somebody more familiar with the tool would like to implement it here.

As far as I can tell, the license on the site applies to the software. I think DuckDB rewrote dbgen to avoid the license issue (which says you're not allowed to modify the software), which I believe makes this redistribution more tied to DuckDB's license? I added a README with all the disclaimers I can think of.

paleolimbot commented 1 year ago

Let's revisit this when I have time to investigate the alternatives properly!