rstudio / pointblank

Data quality assessment and metadata reporting for data frames and database tables
https://rstudio.github.io/pointblank/
Other
862 stars 56 forks source link

A possible switch to tidyeval for `values` and `left`/`right` arguments #494

Closed yjunechoe closed 10 months ago

yjunechoe commented 11 months ago

Prework

Proposal

Currently, the value (and left/right) argument in column comparison validation functions uses vars() to resolve scope ambiguity - values passed to value are evaluated ordinarily unless protected by vars(), in which case it's resolved to a column vector instead.

I think this makes it a prime candidate for integrating tidyeval, which resolves this same scope ambiguity with .data and .env.

Pros

Perhaps the biggest PRO of all, this is the familiar and idiomatic dplyr way to do column comparisons dynamically. For example, the col_vals_gt()-equivalent logic expressed inside dplyr::filter() would look like the following:

library(dplyr)
b <- 0
df <- tibble(a = 1:2, b = 1)

# data scoping
df %>% 
  filter(a > .data$b)
#> # A tibble: 1 × 2
#>       a     b
#>   <int> <dbl>
#> 1     2     1

# env scoping
df %>% 
  filter(a > .env$b)
#> # A tibble: 2 × 2
#>       a     b
#>   <int> <dbl>
#> 1     1     1
#> 2     2     1

# tidyselect bridge pattern with `pick()` + simplify to vector with `pull()`
df %>% 
  filter(a > pull(pick(any_of(c("b", "z")))))
#> # A tibble: 1 × 2
#>       a     b
#>   <int> <dbl>
#> 1     2     1

Cons

1) The biggest hurdle for a tidyeval implementation is that the current masking pattern in value is the other way around compared to in tidyeval: the environment masks the data. This makes it difficult to recycle the existing infrastructure for tidyeval.

Instead, this might require implementing our own "poorman's tidyeval". I've attached code for a toy implementation at the end. It's probably not super ideal to introduce a lot of metaprogramming shenanigans, so I'm also very open to simplifying this (ex: we don't _have_ to support bridge pattern with `pick()`).

2) The fact that vars("x") gets deprecated in favor of .data$x (or .data[["x"]]) in value may be confusing when combined with (#493), because in the context of tidyselect, vars("x") would get deprecated in favor of all_of("x") in columns. One solution is to not necessarily signal the deprecation of vars(x) as used in the values argument - we can simply document the possibility of passing more complex expressions to value by using the .data and .env pronouns and leave it for the advanced users to discover for themselves.

3) This may pose some limitations for the yaml round-trip. I don't know to what extent there exists a "pointblank guarantee" for this, but we can always be stricter about the input.

Proposed behavior:

b <- 0
col_b <- "b"
df <- data.frame(a = 1:2, b = 1)

# backwards compatibility
df$a > values_tidyeval(b, df)
#> [1] TRUE TRUE
df$a > values_tidyeval(vars(b), df)
#> [1] FALSE  TRUE
df$a > values_tidyeval(vars("b"), df)
#> [1] FALSE  TRUE
df$a > values_tidyeval(vars(!!col_b), df)
#> [1] FALSE  TRUE

# new pattern for resolving scope ambiguity
df$a > values_tidyeval(.env$b, df) # or just `b`
#> [1] TRUE TRUE
df$a > values_tidyeval(.data$b, df)
#> [1] FALSE  TRUE

library(tidyselect)
# tidyselect variants for selecting the column "b"
df$a > values_tidyeval(pick(all_of("b")), df)
#> [1] FALSE  TRUE
df$a > values_tidyeval(pick(all_of(col_b)), df)
#> [1] FALSE  TRUE
# tidyselect extras
df$a > values_tidyeval(pick(any_of(c("b", "z"))), df)
#> [1] FALSE  TRUE
df$a > values_tidyeval(pick(2), df)
#> [1] FALSE  TRUE
Mockup implementation of `values_tidyeval()` (with messages for debugging) ```r # `vars(col)` and `vars("col")` translated to `.data$col` vars_as_dotdata <- function(col, ...) { rlang::check_dots_empty0() col_sym <- rlang::ensym(col) eval_expr <- rlang::data_sym(col_sym) message("Translated: ", deparse(eval_expr)) eval.parent(eval_expr) } # tidyselect bridge pattern support pick_pull <- function(col) { data_cols <- eval.parent(quote(names(.data))) col_chr <- rlang::inject(tidyselect::vars_pull(data_cols, {{ col }})) eval_expr <- rlang::data_sym(col_chr) message("Translated: ", deparse(eval_expr)) eval.parent(eval_expr) } values_tidyeval <- function(x, tbl) { x_quo <- rlang::enquo(x) message("Received: ", rlang::expr_deparse(rlang::quo_get_expr(x_quo))) if (!rlang::quo_is_call(x_quo)) { # Bare symbols and constants are evaluated in the env they come from rlang::eval_tidy(x_quo) } else { # If value is complex expression, evaluate in a "tidyeval" context ## 1) Extract expression and environment from quosure x_expr <- rlang::quo_get_expr(x_quo) x_env <- rlang::quo_get_env(x_quo) ## 2) Create a masking environment and introduce pronouns x_masked_env <- rlang::new_environment( data = list(.data = tbl, .env = x_env), parent = x_env ) ## 3) Replace `vars()` and `pick()` with custom implementation ## - inner substitute to mask vars/pick ## - outer substitute to unquote x_expr eval_expr <- eval(substitute(substitute( x_expr, list(vars = quote(vars_as_dotdata), pick = quote(pick_pull)) ))) message("Substituted: ", deparse(eval_expr)) eval(eval_expr, x_masked_env) } } ```

Edit: recycling eval_tidy()

Just for completeness, it turns out that it is also possible to simply swap the data and env roles and still use eval_tidy(). But this is purely an implementational issue that can be decided on later.

Mockup implementation of `reverse_tidyeval()` that still uses `rlang::eval_tidy()` ```r reverse_tidyeval <- function(x, tbl = head(mtcars)) { # 1) Capture expression as quosure x_quo <- rlang::enquo(x) # 2) Extract quosure components x_expr <- rlang::quo_get_expr(x_quo) x_env <- rlang::quo_get_env(x_quo) # 3) Reverse `.data` and `.env` in expression x_expr_reversed <- eval(substitute(substitute( x_expr, rlang::exprs(.data = .env, .env = .data) ))) # 4) Reverse data and env roles data_as_env <- rlang::new_environment(data = tbl) ## env-as-data only grabs the variables used in expression env_as_data <- rlang::env_get_list( x_env, nms = all.names(x_expr, unique = TRUE), default = NULL, inherit = TRUE ) # 5) Eval rlang::eval_tidy(x_expr_reversed, env_as_data, data_as_env) } # New .data syntax for vars() reverse_tidyeval(.data$mpg) #> [1] 21.0 21.0 22.8 21.4 18.7 18.1 hp <- 0 # Default is environment scoping reverse_tidyeval(.data$mpg * hp) #> [1] 0 0 0 0 0 0 # .env pronoun makes this explicit reverse_tidyeval(.data$mpg * .env$hp) #> [1] 0 0 0 0 0 0 # .data to scope in data reverse_tidyeval(.data$mpg * .data$hp) #> [1] 2310.0 2310.0 2120.4 2354.0 3272.5 1900.5 # bang-bang to inline values defined in a different scope local({ hp <- -1 reverse_tidyeval(.data$mpg * !!hp) }) #> [1] -21.0 -21.0 -22.8 -21.4 -18.7 -18.1 ```
rich-iannone commented 10 months ago

Thanks for all the work done in https://github.com/rstudio/pointblank/pull/493 ! Now that it’s merged, I’ll close this issue.