markfairbanks / tidytable

Tidy interface to 'data.table'
https://markfairbanks.github.io/tidytable/
Other
450 stars 32 forks source link

Group_by bug with purrr::reduce #808

Closed sousaru closed 6 months ago

sousaru commented 6 months ago

I was having a bug while using purrr::reduce to compute lags in a data.frame and realized it came down to the use of tidytable::group_by.

I provide a MRE:

data = base::data.frame(x = 1:10, y = 1:10, w = c(rep(1,5),rep(0,5)))

#for variable naming
my_sign_adj = function(k) paste0(base::ifelse(k>=0,'p','n'),abs(k))

#function to compute lags or leads
my_lead = function(x,n) { 
  #should be grouped already if needed
  if(n>0) { return(dplyr::lead(x,n)) } else { return(dplyr::lag(x,-1*n)) }
}

#Calculate multiple lags/leads
calculate_lags = function(df, var, lags){
  #should be grouped already if needed

  #lags are negative
  #leads are positive

  #function mapping for each    
  map_lead  =  purrr::map(lags, ~ purrr::partial(my_lead, n = .x))

  #adjust df
  df = dplyr::mutate(df, across(.cols = {{var}}, 
                         .fns = map_lead, 
                         .names = "{.col}_t{my_sign_adj(lags)}"))

  return(df)
}

purrr::reduce(c('x','y')
  ,calculate_lags
  ,1:3
  ,.init = tidytable::group_by(data, w)
)

# # A tidytable: 10 × 9
# # Groups:     
# x     y     w x_tp1 x_tp2 x_tp3 y_tp1 y_tp2 y_tp3
# <int> <int> <dbl> <int> <int> <int> <int> <int> <int>
#   1     1     1     1     2     3     4     2     3     4
# 2     2     2     1     3     4     5     3     4     5
# 3     3     3     1     4     5     6     4     5     6
# 4     4     4     1     5     6     7     5     6     7
# 5     5     5     1     6     7     8     6     7     8
# 6     6     6     0     7     8     9     7     8     9
# 7     7     7     0     8     9    10     8     9    10
# 8     8     8     0     9    10    NA     9    10    NA
# 9     9     9     0    10    NA    NA    10    NA    NA
# 10    10    10     0    NA    NA    NA    NA    NA    NA

purrr::reduce(c('x','y')
              ,calculate_lags
              ,1:3
              ,.init = dplyr::group_by(data, w)
)

# # A tibble: 10 × 9
# # Groups:   w [2]
# x     y     w x_tp1 x_tp2 x_tp3 y_tp1 y_tp2 y_tp3
# <int> <int> <dbl> <int> <int> <int> <int> <int> <int>
#   1     1     1     1     2     3     4     2     3     4
# 2     2     2     1     3     4     5     3     4     5
# 3     3     3     1     4     5    NA     4     5    NA
# 4     4     4     1     5    NA    NA     5    NA    NA
# 5     5     5     1    NA    NA    NA    NA    NA    NA
# 6     6     6     0     7     8     9     7     8     9
# 7     7     7     0     8     9    10     8     9    10
# 8     8     8     0     9    10    NA     9    10    NA
# 9     9     9     0    10    NA    NA    10    NA    NA
# 10    10    10     0    NA    NA    NA    NA    NA    NA
markfairbanks commented 6 months ago

Hmm it looks like the issue isn't tidytable::group_by() and it isn't purrr:reduce() either.

Part of the issue is in calculate_lags() you're using dplyr::mutate()/dplyr::across() on a grouped tidytable - which doesn't work.

library(tidytable)

df <- data.frame(x = 1:3, y = c("a", "a", "b"))

df %>%
  tidytable::group_by(y) %>%
  tidytable::mutate(x_lag_correct = lag(x)) %>%
  dplyr::mutate(x_lag_wrong = lag(x))
#> # A tidytable: 3 × 4
#> # Groups:     
#>       x y     x_lag_correct x_lag_wrong
#>   <int> <chr>         <int>       <int>
#> 1     1 a                NA          NA
#> 2     2 a                 1           1
#> 3     3 b                NA           2

However if you change this to the tidytable versions you would run into another issue - you pass map_lead to across(), where map_lead is a list of functions you are dynamically creating. Unfortunately that doesn't work in tidytable::across(), and there is no way to make it work given the translation constraints of tidytable::across().

sousaru commented 6 months ago

Thanks! Should have added the fact that using tidytable for across/mutate yields a different error.

map_lead creates a list of purr-style lambda functions, which I thought worked with tidytable::mutate(across(...))`. Could you elaborate on why it does not? Is this flagged in the documentation?

Thanks for your time!

markfairbanks commented 5 months ago

A list() call works, but a list passed as a variable doesn't.

library(tidytable)

df <- tidytable(x = 1:3, y = 4:6)

# Works
df %>%
  summarize(
    across(c(x, y), list(mean, sum))
  )
#> # A tidytable: 1 × 4
#>     x_1   x_2   y_1   y_2
#>   <dbl> <int> <dbl> <int>
#> 1     2     6     5    15

# Fails
funs <- list(mean, sum)

df %>%
  summarize(
    across(c(x, y), funs)
  )
#> Error in funs(x): could not find function "funs"

I think you're right I need to document that difference somewhere.

markfairbanks commented 5 months ago

Could you elaborate on why it does not?

Essentially the idea of tidytable::across() is it gets expanded to a regular function call - no evaluation occurs until the expansion is completed.

So this...

df %>%
  summarize(across(c(x, y), mean))

becomes this...

df %>%
  summarize(x = mean(x),
            y = mean(y))

To get this to work you have to account for 3 cases

  1. .fns = mean assumes a function name and gets converted to mean(col)
  2. .fns = ~ mean(.x) replaces the .x and gets converted to mean(col)
  3. .fns = list(mean, max) combines the assumptions from the other 2 - either straight function conversions or converting a lambda function.

Since I'm directly altering the expressions in the background it's tough to know what the intent is when passing a list of functions.