voltrondata / substrait-r

An R Interface to the 'Substrait' Cross-Language Serialization for Relational Algebra

Inconsistent project behaviour #262

Open thisisnic opened 1 year ago

thisisnic commented 1 year ago

I'm getting inconsistent behaviour depending on whether or not I supply all fields to substrait_project() (and differing behaviour between Arrow and DuckDB), to the point where it's unclear what the expected behaviour should be.

library(dplyr)
library(substrait)

# y/x/z 
tibble::tibble(x = 1:3, y = 4:6) %>%
  arrow_substrait_compiler() %>%
  substrait_project(x, z = x + 1) %>%
  collect()
#> # A tibble: 3 × 3
#>       y     x     z
#>   <int> <int> <dbl>
#> 1     4     1     2
#> 2     5     2     3
#> 3     6     3     4

# x/y/z
tibble::tibble(x = 1:3, y = 4:6) %>%
  arrow_substrait_compiler() %>%
  substrait_project(y, z = x + 1) %>%
  collect()
#> # A tibble: 3 × 3
#>       x     y     z
#>   <int> <int> <dbl>
#> 1     1     4     2
#> 2     2     5     3
#> 3     3     6     4

# x/y/z
tibble::tibble(x = 1:3, y = 4:6) %>%
  arrow_substrait_compiler() %>%
  substrait_project(x, y, z = x + 1) %>%
  collect()
#> # A tibble: 3 × 3
#>       x     y     z
#>   <int> <int> <dbl>
#> 1     1     4     2
#> 2     2     5     3
#> 3     3     6     4

# error
tibble::tibble(x = 1:3, y = 4:6) %>%
  arrow_substrait_compiler() %>%
  substrait_project(z = x + 1) %>%
  collect()
#> Error: Invalid: Invalid emit case
#> /home/nic2/arrow/cpp/src/arrow/engine/substrait/serde.cc:157  FromProto(plan_rel.has_root() ? plan_rel.root().input() : plan_rel.rel(), ext_set, conversion_options)

# error
tibble::tibble(x = 1:3, y = 4:6) %>%
  duckdb_substrait_compiler() %>%
  substrait_project(x, z = x + 1) %>%
  collect()
#> Error: Binder Error: Positional reference 3 out of range (total 2 columns)

# error
tibble::tibble(x = 1:3, y = 4:6) %>%
  duckdb_substrait_compiler() %>%
  substrait_project(y, z = x + 1) %>%
  collect()
#> Error: Binder Error: Positional reference 3 out of range (total 2 columns)

# success
tibble::tibble(x = 1:3, y = 4:6) %>%
  duckdb_substrait_compiler() %>%
  substrait_project(x, y, z = x + 1) %>%
  collect()
#> # A tibble: 3 × 3
#>       x     y     z
#>   <int> <int> <dbl>
#> 1     1     4     2
#> 2     2     5     3
#> 3     3     6     4

# error
tibble::tibble(x = 1:3, y = 4:6) %>%
  duckdb_substrait_compiler() %>%
  substrait_project(z = x + 1) %>%
  collect()
#> Error: Binder Error: Positional reference 2 out of range (total 1 columns)

paleolimbot commented 1 year ago

The behaviour of substrait_project() is definitely strange here! In general, substrait_project() always appends columns (rather than replacing them), but I forget the details, and our emit case does look wrong:

library(dplyr, warn.conflicts = FALSE)
library(substrait, warn.conflicts = FALSE)

projected <- tibble::tibble(x = 1:3, y = 4:6) %>%
  arrow_substrait_compiler() %>%
  substrait_project(x, z = x + 1)

# Seems correct: we append two fields to the output
projected$rel$project$expressions
#> [[1]]
#> message of type 'substrait.Expression' with 1 field set
#> selection {
#>   direct_reference {
#>     struct_field {
#>     }
#>   }
#>   root_reference {
#>   }
#> }
#> 
#> [[2]]
#> message of type 'substrait.Expression' with 1 field set
#> scalar_function {
#>   function_reference: 2
#>   output_type {
#>     fp64 {
#>       nullability: NULLABILITY_NULLABLE
#>     }
#>   }
#>   arguments {
#>     value {
#>       selection {
#>         direct_reference {
#>           struct_field {
#>           }
#>         }
#>         root_reference {
#>         }
#>       }
#>     }
#>   }
#>   arguments {
#>     value {
#>       literal {
#>         fp64: 1
#>       }
#>     }
#>   }
#>   options {
#>     name: "overflow"
#>     preference: "SILENT"
#>   }
#> }

# Incorrect: the emit should be 0, 1, 2, 3
projected$rel$project$common$emit
#> message of type 'substrait.RelCommon.Emit' with 1 field set
#> output_mapping: 1
#> output_mapping: 2
#> output_mapping: 3

# Incorrect: the names should be x, y, x, z
projected$schema$names
#> [1] "y" "x" "z"
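
To make the mismatch concrete, here is a minimal base-R sketch of how the emit mapping appears to interact with the projected fields. It does not call substrait at all: `apply_emit()` is a hypothetical helper, and it assumes that a project relation's direct output is the input fields followed by the projected expressions, with `output_mapping` selecting from that combined list by 0-based index.

```r
# Illustrative only: plain base R, no substrait calls.
# Assumption: the direct output of a project relation is the input fields
# followed by the projected expressions, and emit's output_mapping picks
# columns from that combined list using 0-based indices.
apply_emit <- function(input_names, expression_names, output_mapping) {
  combined <- c(input_names, expression_names)
  combined[output_mapping + 1L]  # shift to R's 1-based indexing
}

input_names <- c("x", "y")
expression_names <- c("x", "z")  # from substrait_project(x, z = x + 1)

# Mapping currently generated (1, 2, 3): the original x is dropped
apply_emit(input_names, expression_names, c(1L, 2L, 3L))
#> [1] "y" "x" "z"

# Mapping suggested above (0, 1, 2, 3): every column is kept
apply_emit(input_names, expression_names, 0:3)
#> [1] "x" "y" "x" "z"
```

Under those assumptions, the 1, 2, 3 mapping reproduces the y/x/z column order seen in the Arrow result, which suggests the emit is off by one input column rather than the expressions being wrong.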

In the meantime, I think maybe you could use substrait_select()?

library(dplyr)
library(substrait)

# x/z
tibble::tibble(x = 1:3, y = 4:6) %>%
  arrow_substrait_compiler() %>%
  substrait_select(x, z = x + 1) %>%
  collect()
#> # A tibble: 3 × 2
#>       x     z
#>   <int> <dbl>
#> 1     1     2
#> 2     2     3
#> 3     3     4

Created on 2023-03-14 with reprex v2.0.2