hadley / adv-r

Advanced R: a book
http://adv-r.hadley.nz

A better explanation needed for section "evaluation" in 20.5.1 Quoting and unquoting #1779

Open bayeslearner opened 1 year ago

bayeslearner commented 1 year ago

I understand this is one of the most common errors when using tidyverse functions inside a function, but I'm not sure I really understand it. Sorry, I used GPT to make the writing clearer (it is not necessarily correct).


Title: Clarification Required: Tidy Evaluation's Need for Explicit "Linkage" Between Expression and Data Mask

Background: The confusion arises from the behavior of eval_tidy when evaluating an expression in the context of a provided data mask (e.g., a dataframe). Why is there a need for an explicit "linkage" between the expression and the data mask, even when the dataframe is provided directly as an argument?

Scenario:

Given the functions:

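library(rlang)

# resample() is the helper defined in the same book section
resample <- function(df, n) {
  idx <- sample(nrow(df), n, replace = TRUE)
  df[idx, , drop = FALSE]
}

subset2 <- function(data, rows) {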
  rows <- enquo(rows)
  rows_val <- eval_tidy(rows, data)
  stopifnot(is.logical(rows_val))
  data[rows_val, , drop = FALSE]
}

subsample <- function(df, cond, n = nrow(df)) {
  df <- subset2(df, cond)
  resample(df, n)
}

When calling:

df <- data.frame(x = c(1, 1, 1, 2, 2), y = 1:5)
subsample(df, x == 1)

The error thrown is: Error in eval_tidy(rows, data): object 'x' not found.
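
To see what eval_tidy actually receives, here is a minimal sketch. capture_rows and subsample_peek are hypothetical stand-ins for subset2 and subsample (they are not from the book); they return the captured quosure instead of evaluating it. The sketch assumes no object named x exists in the calling environment.

```r
library(rlang)

# Hypothetical instrumented copy of subset2(): returns the captured
# quosure instead of evaluating it.
capture_rows <- function(data, rows) {
  enquo(rows)
}

# Hypothetical copy of subsample(): passes cond along unquoted,
# exactly like the failing version.
subsample_peek <- function(df, cond) {
  capture_rows(df, cond)
}

df <- data.frame(x = c(1, 1, 1, 2, 2), y = 1:5)
q <- subsample_peek(df, x == 1)

# The quosure holds the symbol `cond`, not the call `x == 1`.
quo_get_expr(q)

# Evaluating it forces the promise behind `cond`, which evaluates
# `x == 1` in the environment where subsample_peek() was called,
# outside the data mask. The error is trapped so the script keeps running.
msg <- tryCatch(
  eval_tidy(q, df),
  error = function(e) conditionMessage(e)
)
msg  # expected to report that `x` was not found (assumes no global `x`)
```

If this reading is right, the data mask is applied correctly, but only to the symbol cond, which is why x is never looked up in df.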

Inferred Understanding from the Error:

Primary Concern:

  1. Why does eval_tidy, inside subset2, require a quosure (with embedded environment information) to evaluate the rows expression correctly, even when the dataframe is provided directly?
  2. Would traditional eval exhibit similar behavior?

Explanation:

  1. Lazy Evaluation in R: R evaluates function arguments lazily. When subsample(df, x == 1) is called, the expression x == 1 is not evaluated right away. It is stored as a promise together with the environment of the caller (here, the global environment) and is evaluated there the first time cond is forced.

  2. Execution Inside subset2 and What the Quosure Contains: The rows_val <- eval_tidy(rows, data) line is where cond (passed along as rows) is finally forced.

    • Because subsample passes cond on without quoting it, enquo(rows) inside subset2 captures the symbol cond together with subsample's execution environment, not the expression x == 1. The data frame is applied as a data mask, but only to that symbol. cond is not a column of data, so it is found in subsample's environment as a still-unevaluated promise.

    • A quosure captures an expression together with the environment it came from. eval_tidy does not require the data to live in that environment: it places the data mask in front of it, so symbols are looked up first in the data mask, then in the quosure's environment, and then in its parent environments. The error arises because forcing the cond promise evaluates x == 1 as ordinary R code in the environment where subsample was called. The data mask is not involved in that step, so x is not found.

  3. Role of enquo and Immediate Unquoting: enquo captures the expression supplied to an argument, together with the environment where it was written, as a quosure. In the corrected subsample below, enquo(cond) captures x == 1 before it is forced, and !!cond inlines that quosure into the call to subset2. enquo(rows) then sees x == 1 rather than the symbol cond, and eval_tidy evaluates it with data as the mask, so x resolves to the column.

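    subsample <- function(df, cond, n = nrow(df)) {
      cond <- enquo(cond)
      df <- subset2(df, !!cond)
      resample(df, n)
    }

  4. Comparison with Traditional eval: A base R version of subset2 that uses substitute(rows) and eval(rows, data, parent.frame()) fails in the same way when called from the original subsample. substitute returns the symbol cond, and forcing that promise again evaluates x == 1 outside the data frame. The difference is in the fix: with substitute and eval the wrapper has to reconstruct the call itself (for example with bquote), whereas a quosure carries its environment along and can be passed through with !!.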
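
The points above can be checked with a small sketch. subsample_fixed, subset_base and subsample_base are hypothetical names introduced here for illustration. The resample step is left out so the result is deterministic, and the sketch assumes no object named x exists in the calling environment.

```r
library(rlang)

subset2 <- function(data, rows) {
  rows <- enquo(rows)
  rows_val <- eval_tidy(rows, data)
  stopifnot(is.logical(rows_val))
  data[rows_val, , drop = FALSE]
}

# Fixed wrapper: quote cond first, then unquote it into subset2().
subsample_fixed <- function(df, cond) {
  cond <- enquo(cond)
  subset2(df, !!cond)
}

# Hypothetical base R analogue of subset2(), for comparison only.
subset_base <- function(data, rows) {
  rows <- substitute(rows)
  rows_val <- eval(rows, data, parent.frame())
  data[rows_val, , drop = FALSE]
}

# Wrapper that passes cond along unquoted, like the failing subsample().
subsample_base <- function(df, cond) {
  subset_base(df, cond)
}

df <- data.frame(x = c(1, 1, 1, 2, 2), y = 1:5)

# Tidy eval with quote-and-unquote: x resolves to the column.
fixed <- subsample_fixed(df, x == 1)
fixed

# Base eval through the unquoting-free wrapper: cond is forced in the
# calling environment, so the lookup of x fails there. The error is trapped.
base_msg <- tryCatch(
  subsample_base(df, x == 1),
  error = function(e) conditionMessage(e)
)
base_msg
```

Under these assumptions, the fixed wrapper keeps the three rows where x is 1, while the base R wrapper hits the same kind of lookup failure as the original subsample.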