voltrondata / substrait-r

An R Interface to the 'Substrait' Cross-Language Serialization for Relational Algebra
Other
27 stars 7 forks source link

Substrait R Interface

R-CMD-check Codecov test
coverage

Substrait is a cross-language specification for data compute operations. The Substrait specification creates a standard for compute expressions—or what should be done to data.

The Substrait R package provides an R interface to Substrait, allowing users to construct a Substrait plan from R for evaluation by a Substrait consumer, such as Arrow or DuckDB. This is an experimental package that is under heavy development!

Installation

You can install the development version of substrait from GitHub with:

# install.packages("remotes")
remotes::install_github("voltrondata/substrait-r")

Example

Basic construction of a Substrait plan and evaluating it using the DuckDB Substrait consumer:

library(substrait)
library(dplyr)

mtcars %>% 
  duckdb_substrait_compiler() %>%
  mutate(mpg_plus_one = mpg + 1) %>% 
  select(mpg, wt, mpg_plus_one) %>% 
  collect()
#> # A tibble: 32 × 3
#>      mpg    wt mpg_plus_one
#>    <dbl> <dbl>        <dbl>
#>  1  21    2.62         22  
#>  2  21    2.88         22  
#>  3  22.8  2.32         23.8
#>  4  21.4  3.22         22.4
#>  5  18.7  3.44         19.7
#>  6  18.1  3.46         19.1
#>  7  14.3  3.57         15.3
#>  8  24.4  3.19         25.4
#>  9  22.8  3.15         23.8
#> 10  19.2  3.44         20.2
#> # … with 22 more rows

You can inspect the Plan that will be generated by saving the result and calling $plan():

compiler <- data.frame(col1 = 1L) %>% 
  duckdb_substrait_compiler() %>%
  mutate(mpg_plus_one = col1 + 1)

compiler$plan()
#> message of type 'substrait.Plan' with 3 fields set
#> extension_uris {
#>   extension_uri_anchor: 1
#> }
#> extensions {
#>   extension_function {
#>     extension_uri_reference: 1
#>     function_anchor: 2
#>     name: "+"
#>   }
#> }
#> relations {
#>   root {
#>     input {
#>       project {
#>         common {
#>           emit {
#>             output_mapping: 1
#>             output_mapping: 2
#>           }
#>         }
#>         input {
#>           read {
#>             base_schema {
#>               names: "col1"
#>               struct_ {
#>                 types {
#>                   i32 {
#>                     nullability: NULLABILITY_NULLABLE
#>                   }
#>                 }
#>               }
#>             }
#>             named_table {
#>               names: "named_table_1"
#>             }
#>           }
#>         }
#>         expressions {
#>           selection {
#>             direct_reference {
#>               struct_field {
#>               }
#>             }
#>             root_reference {
#>             }
#>           }
#>         }
#>         expressions {
#>           scalar_function {
#>             function_reference: 2
#>             output_type {
#>               i32 {
#>                 nullability: NULLABILITY_NULLABLE
#>               }
#>             }
#>             arguments {
#>               value {
#>                 selection {
#>                   direct_reference {
#>                     struct_field {
#>                     }
#>                   }
#>                   root_reference {
#>                   }
#>                 }
#>               }
#>             }
#>             arguments {
#>               value {
#>                 literal {
#>                   fp64: 1
#>                 }
#>               }
#>             }
#>           }
#>         }
#>       }
#>     }
#>     names: "col1"
#>     names: "mpg_plus_one"
#>   }
#> }

You can also construct a Substrait plan and evaluate it using the Acero Substrait consumer. To use the Acero Substrait consumer you will need a special build of the arrow package with Arrow configured using -DARROW_SUBSTRAIT=ON.

library(substrait)
library(dplyr)

mtcars %>% 
  arrow_substrait_compiler() %>%
  mutate(mpg_plus_one = mpg + 1) %>% 
  select(mpg, wt, mpg_plus_one) %>% 
  collect()
#> # A tibble: 32 × 3
#>      mpg    wt mpg_plus_one
#>    <dbl> <dbl>        <dbl>
#>  1  21    2.62         22  
#>  2  21    2.88         22  
#>  3  22.8  2.32         23.8
#>  4  21.4  3.22         22.4
#>  5  18.7  3.44         19.7
#>  6  18.1  3.46         19.1
#>  7  14.3  3.57         15.3
#>  8  24.4  3.19         25.4
#>  9  22.8  3.15         23.8
#> 10  19.2  3.44         20.2
#> # … with 22 more rows

Create ‘Substrait’ proto objects

You can create Substrait proto objects using the substrait base object or using substrait_create():

substrait$Type$Boolean$create()
#> message of type 'substrait.Type.Boolean' with 0 fields set
substrait_create("substrait.Type.Boolean")
#> message of type 'substrait.Type.Boolean' with 0 fields set

You can convert an R object to a Substrait object using as_substrait(object, type):

(msg <- as_substrait(4L, "substrait.Expression"))
#> message of type 'substrait.Expression' with 1 field set
#> literal {
#>   i32: 4
#> }

The type can be either a string of the qualified name or an object (which is needed to communicate certain types like "substrait.Expression.Literal.Decimal" which has a precision and scale in addition to the value).

Restore an R object from a Substrait object using from_substrait(message, prototype):

from_substrait(msg, integer())
#> [1] 4

Substrait objects are list-like (i.e., methods defined for [[ and $), so you can get or set fields. Note that unset is different than NULL (just like an R list).

msg$literal <- substrait$Expression$Literal$create(i32 = 5L)
msg
#> message of type 'substrait.Expression' with 1 field set
#> literal {
#>   i32: 5
#> }

The constructors are currently implemented using about 1000 lines of auto-generated code made by inspecting the nanopb-compiled .proto files and RProtoBuf. This is probably not the best final approach but allows us to get started writing good as_substrait() and from_substrait() methods for various types of R objects.