voltrondata / substrait-r

An R Interface to the 'Substrait' Cross-Language Serialization for Relational Algebra
Other
27 stars 7 forks source link

grepl binding for DuckDB doesn't work with regex included #267

Open thisisnic opened 1 year ago

thisisnic commented 1 year ago
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(substrait)
#> 
#> Attaching package: 'substrait'
#> The following object is masked from 'package:stats':
#> 
#>     filter

tbl <- tibble::tibble(x = c("apple pie", "pork pie", "pork chop"))

# dplyr - all good without regex
tbl %>% 
  filter(grepl("pie", x))
#> # A tibble: 2 × 1
#>   x        
#>   <chr>    
#> 1 apple pie
#> 2 pork pie

# substrait - all good without regex
tbl %>% 
  duckdb_substrait_compiler() %>%
  filter(grepl("pie", x)) %>%
  collect()
#> # A tibble: 2 × 1
#>   x        
#>   <chr>    
#> 1 apple pie
#> 2 pork pie

# dplyr - all good with regex
tbl %>% 
  filter(grepl("pie$", x))
#> # A tibble: 2 × 1
#>   x        
#>   <chr>    
#> 1 apple pie
#> 2 pork pie

# substrait - doesn't work without regex
tbl %>% 
  duckdb_substrait_compiler() %>%
  filter(grepl("pie$", x)) %>%
  collect()
#> # A tibble: 0 × 1
#> # … with 1 variable: x <chr>

Created on 2023-04-05 with reprex v2.0.2

thisisnic commented 1 year ago

The Substrait spec doesn't state whether regex can or cannot be used in the functions, so this is something we want to raise there. I suspect that the contains() function which has been bound to R's grepl() here doesn't translate to something which allows regex; in the Substrait string function spec there are function like count_substring and regexp_count_substring; implying differing version of functions depending on whether regex are allowed. There is currently no regexp_contains. We should probably open an issue on the Substrait repo, and in the meantime use a workaround (e.g. could we perhaps bind regexp_count_substring(x) > 0 to grepl())?