tidyverse / dplyr

dplyr: A grammar of data manipulation
https://dplyr.tidyverse.org/

Variable scoping issue with .data inside lambda function used in across() #7016

Open bakaburg1 opened 4 months ago

bakaburg1 commented 4 months ago

Hello,

This error was driving me crazy, and it took me a while to isolate. If you use .data inside a lambda function passed to across(), you cannot index .data with a variable created inside the lambda itself; otherwise R complains that the variable doesn't exist!

example:

iris |> mutate(
    across("Sepal.Length", \(x) {
        other_col_name <- "Sepal.Width"

        other_col_val <- .data[[other_col_name]]

        x + other_col_val
    })
)
> Error in .f(.x[[i]], ...) : object 'other_col_name' not found

This was baffling, since the variable clearly exists in the lambda's scope.

But if other_col_name is defined outside the lambda, there is no problem:

other_col_name <- "Sepal.Width"
iris |> mutate(
    across("Sepal.Length", \(x) {

        other_col_val <- .data[[other_col_name]]

        x + other_col_val
    })
)
# All fine

If this is by design and you don't plan to fix it, it would be useful to have a clearer error message and some extra documentation somewhere!

There are cases in which, for example, the index name is defined dynamically based on the value of x or cur_column(). Now that I know about the issue I'll use pick()[1], unless you have better solutions.

MrFlick commented 4 months ago

I noticed a possibly related problem from this Stack Overflow question. This code results in an error

iris |>
  group_nest(Species) |>
  mutate(boo = map(data, function(x) {
    colsym <- sym("Sepal.Length")
    x %>% mutate(newcol = !!colsym)
  }))
# Error: object 'colsym' not found

The same issue occurs when using .data:

iris |>
  group_nest(Species) |>
  mutate(boo = map(data, function(x) {
    colname <- "Sepal.Length"
    x %>% mutate(uncle=.data[[colname]])
  }))

But if you define the function first rather than inline, it will run

helper <- \(x) {
  colsym <- sym("Sepal.Length")
  x %>% mutate(newcol=!!colsym)
}

iris |>
  group_nest(Species) |>
  mutate(data = map(data, helper))

Looking at the trace, it seems the problem is actually coming from rlang::quos. Something else that will trigger the error is

quos(\(x) {colsym <- rlang::sym("a"); mutate(x, newcol=!!colsym)})

Basically the !! part is being evaluated when defining the function, not when calling the function. Is there a way to delay the evaluation of the !! or .data[[]] when applied to functions? Tested with rlang_1.1.3, purrr_1.0.2, dplyr_1.1.4
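
To make the timing concrete, here is a minimal sketch using rlang alone (no dplyr or purrr). The names `outer` and `inner` are made up for illustration. It assumes that `quo()` processes every `!!` in the captured expression at capture time, including those inside a function body, which is the behaviour described above.

```r
library(rlang)

# A `colsym` that exists where the quosure is created.
colsym <- sym("outer")

# The `!!` below sits inside the function body, but it is processed
# when quo() captures the expression, not when the function is called.
q <- quo(function(x) {
  colsym <- sym("inner")
  expr(!!colsym)
})

# Inspect the captured expression: the unquoting has already happened.
print(quo_get_expr(q))

# Build the function and call it; the inner `colsym` assignment
# comes too late to influence the already-substituted symbol.
f <- eval_tidy(q)
result <- f(1)
print(result)
```

If this runs as I expect, `result` is the symbol `outer` rather than `inner`. Removing the outer `colsym` definition should reproduce the "object 'colsym' not found" error from the examples above.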

moodymudskipper commented 4 months ago

You'd have the same issue if you nested bquote() calls: the .() placeholders are substituted by the outermost call. Something similar happens if you nest substitute() calls.
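
A base R sketch of the bquote() case, with made-up values `outer` and `inner`. It relies on bquote() walking the whole expression, function bodies included, and replacing every .() it finds.

```r
# Value visible to the outer bquote() call.
x <- quote(outer)

# The inner bquote(.(x)) is meant to use the local `x`, but the outer
# bquote() substitutes .(x) while building the expression.
e <- bquote(function() {
  x <- quote(inner)
  bquote(.(x))
})

# The function body now contains bquote(outer).
print(e)

f <- eval(e)
nested_result <- f()
print(nested_result)
```

Here `nested_result` should be the symbol `outer`: the local assignment to `x` is never consulted, which mirrors what happens with `!!` inside the lambda.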

You can use the following trick:

protect <- function(expr) call("!", call("!", substitute(expr)))
rlang::expr(!!protect(hello))
#> !!hello
library(tidyverse)
#> Warning: package 'ggplot2' was built under R version 4.2.3
#> Warning: package 'stringr' was built under R version 4.2.3
iris |>
  group_nest(Species) |>
  mutate(boo = map(data, function(x) {
    colsym <- sym("Sepal.Length")
    x %>% mutate(newcol = !!protect(colsym))
  }))
#> # A tibble: 3 × 3
#>   Species                  data boo              
#>   <fct>      <list<tibble[,4]>> <list>           
#> 1 setosa               [50 × 4] <tibble [50 × 5]>
#> 2 versicolor           [50 × 4] <tibble [50 × 5]>
#> 3 virginica            [50 × 4] <tibble [50 × 5]>

For .data it's a bit different, I think. In the given examples .data shouldn't be used: it is not to be considered an object in scope, but a special operator at the top level of mutate(), although that isn't very clear from the dot prefix. It seems a macro is run whenever we use .data, where "using" means calling it with brackets (so if you set a browser() in the function, it won't be triggered, for instance). This works around it:

iris |>
  group_nest(Species) |>
  mutate(boo = map(data, ~ {
    colname <- "Sepal.Length"
    .x %>% mutate(uncle= (.data)[[colname]])
  }))
#> # A tibble: 3 × 3
#>   Species                  data boo              
#>   <fct>      <list<tibble[,4]>> <list>           
#> 1 setosa               [50 × 4] <tibble [50 × 5]>
#> 2 versicolor           [50 × 4] <tibble [50 × 5]>
#> 3 virginica            [50 × 4] <tibble [50 × 5]>