Open thisisnic opened 1 year ago
The behaviour for substrait_project() is definitely strange here! In general substrait_project() always appends columns (rather than replacing them), but I forget the details, and our emit case does look wrong:
library(dplyr, warn.conflicts = FALSE)
library(substrait, warn.conflicts = FALSE)
projected <- tibble::tibble(x = 1:3, y = 4:6) %>%
  arrow_substrait_compiler() %>%
  substrait_project(x, z = x + 1)
# Seems correct: we append two fields to the output
projected$rel$project$expressions
#> [[1]]
#> message of type 'substrait.Expression' with 1 field set
#> selection {
#>   direct_reference {
#>     struct_field {
#>     }
#>   }
#>   root_reference {
#>   }
#> }
#>
#> [[2]]
#> message of type 'substrait.Expression' with 1 field set
#> scalar_function {
#>   function_reference: 2
#>   output_type {
#>     fp64 {
#>       nullability: NULLABILITY_NULLABLE
#>     }
#>   }
#>   arguments {
#>     value {
#>       selection {
#>         direct_reference {
#>           struct_field {
#>           }
#>         }
#>         root_reference {
#>         }
#>       }
#>     }
#>   }
#>   arguments {
#>     value {
#>       literal {
#>         fp64: 1
#>       }
#>     }
#>   }
#>   options {
#>     name: "overflow"
#>     preference: "SILENT"
#>   }
#> }
# Incorrect: the emit should be 0, 1, 2, 3
projected$rel$project$common$emit
#> message of type 'substrait.RelCommon.Emit' with 1 field set
#> output_mapping: 1
#> output_mapping: 2
#> output_mapping: 3
# Incorrect: the names should be x, y, x, z
projected$schema$names
#> [1] "y" "x" "z"
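To make the mismatch concrete, here is a minimal base-R sketch of how an emit mapping picks fields, assuming the model described above: the project's direct output is the input fields followed by the appended expressions, and output_mapping holds 0-based positions into that list. The helper project_output_names() is hypothetical and is not part of the substrait package.

```r
# Hypothetical helper (not part of the substrait package): apply a
# 0-based, Substrait-style output_mapping to a project relation's fields.
project_output_names <- function(input_names, expr_names, output_mapping) {
  # Assumed model: direct output = input fields, then appended expressions.
  all_fields <- c(input_names, expr_names)
  stopifnot(
    all(output_mapping >= 0),
    all(output_mapping < length(all_fields))
  )
  # output_mapping is 0-based; R vectors are 1-based.
  all_fields[output_mapping + 1L]
}

input_names <- c("x", "y") # columns of the tibble
expr_names <- c("x", "z")  # expressions passed to substrait_project()

# The mapping printed above (1, 2, 3) drops the first input field.
observed <- project_output_names(input_names, expr_names, c(1, 2, 3))

# The mapping the comment above calls correct keeps every field in order.
expected <- project_output_names(input_names, expr_names, c(0, 1, 2, 3))

# Keeping only the appended expressions would be a mapping of 2, 3.
select_like <- project_output_names(input_names, expr_names, c(2, 3))

print(observed)
print(expected)
print(select_like)
```

Under this model the reported mapping yields y, x, z, which matches the schema names printed above, while 0, 1, 2, 3 yields x, y, x, z.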
In the meantime, I think maybe you could use substrait_select()?
library(dplyr)
library(substrait)
# x/z here, rather than the y/x/z that substrait_project() produced above
tibble::tibble(x = 1:3, y = 4:6) %>%
  arrow_substrait_compiler() %>%
  substrait_select(x, z = x + 1) %>%
  collect()
#> # A tibble: 3 × 2
#>       x     z
#>   <int> <dbl>
#> 1     1     2
#> 2     2     3
#> 3     3     4
Created on 2023-03-14 with reprex v2.0.2
I'm getting inconsistent behaviour depending on whether or not I supply all fields in a substrait_project() call (and differing behaviour between Arrow and DuckDB), to the point where it's unclear what the expected behaviour should be.