tidyverse / dtplyr

Data table backend for dplyr
https://dtplyr.tidyverse.org

case_when conditional always length-1 #453

Closed r2evans closed 1 year ago

r2evans commented 1 year ago

Motivated by https://stackoverflow.com/q/77327623/3358272, I wanted to play with the dtplyr version.

library(dplyr)
library(data.table)
library(TTR)
library(dtplyr)
set.seed(0)
df <- data.frame(par = rep(paste("Par", 1:5), each = 100), cat = rep(LETTERS[1:5], each = 100), val = rnorm(500, 100, 20))
head(df)
#     par cat       val
# 1 Par 1   A 125.25909
# 2 Par 1   A  93.47533
# 3 Par 1   A 126.59599
# 4 Par 1   A 125.44859
# 5 Par 1   A 108.29283
# 6 Par 1   A  69.20100

The dplyr-native function works without warning/error:

new_df <- df %>%
  group_by(par, cat) %>%
  mutate(acute = EMA(val, ratio = 2 / (1 + 7))) %>%
  mutate(chron = case_when(
    cat == "A" ~ EMA(val, ratio = 2 / (1 + 42)),
    cat == "B" ~ EMA(val, ratio = 2 / (1 + 28)),
    cat %in% c("C", "D", "E") ~ EMA(val, ratio = 2 / (1 + 14))
  ))

If we bring in dtplyr, we get an error:

lazy_dt(df) %>%
  group_by(par, cat) %>%
  mutate(acute = EMA(val, ratio = 2 / (1 + 7))) %>%
  mutate(chron = case_when(
    cat == "A" ~ EMA(val, ratio = 2 / (1 + 42)),
    cat == "B" ~ EMA(val, ratio = 2 / (1 + 28)),
    cat %in% c("C", "D", "E") ~ EMA(val, ratio = 2 / (1 + 14))
  ))
# Error in fcase(cat == "A", EMA(val, ratio = 2/(1 + 42)), cat == "B", EMA(val,  : 
#   Length of output value #2 must either be 1 or length of logical condition.
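The error message names fcase, which suggests dtplyr translates case_when into data.table's fcase. One way to check is show_query(), which prints the generated data.table call without evaluating it. The sketch below uses a made-up table (toy, with columns g and v) and cumsum()/rev() as stand-ins for EMA() so that TTR is not needed; those names are illustrative only.

```r
library(dplyr)
library(dtplyr)

# Hypothetical toy data: two groups of three rows each.
toy <- data.frame(g = rep(c("a", "b"), each = 3), v = 1:6)

# Build the lazy query; nothing is evaluated yet, so no error is raised here.
q <- lazy_dt(toy) %>%
  group_by(g) %>%
  mutate(out = case_when(g == "a" ~ cumsum(v), g == "b" ~ rev(v)))

# Print the data.table expression that dtplyr would run.
show_query(q)
```

If the translation is what the error implies, the printed call should contain fcase(...) evaluated with by = on the grouping column.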

Debugging in the first conditional reveals the problem:

lazy_dt(df) %>%
  group_by(par, cat) %>%
  mutate(acute = EMA(val, ratio = 2 / (1 + 7))) %>%
  mutate(chron = case_when(
    { browser(); cat == "A"; } ~ EMA(val, ratio = 2 / (1 + 42)),
    cat == "B" ~ EMA(val, ratio = 2 / (1 + 28)),
    cat %in% c("C", "D", "E") ~ EMA(val, ratio = 2 / (1 + 14))
  ))
# Called from: fcase({
#     browser()
#     cat == "A"
#   ...
debug: cat == "A"
cat
# [1] "A"

whereas in dplyr, cat has the full group length, which is more intuitive:

df %>%
  group_by(par, cat) %>%
  mutate(acute = EMA(val, ratio = 2 / (1 + 7))) %>%
  mutate(chron = case_when(
    { browser(); cat == "A"; } ~ EMA(val, ratio = 2 / (1 + 42)),
    cat == "B" ~ EMA(val, ratio = 2 / (1 + 28)),
    cat %in% c("C", "D", "E") ~ EMA(val, ratio = 2 / (1 + 14))
  ))
# Called from: eval_tidy(pair$lhs, env = default_env)
debug at #5: cat == "A"
cat
#   [1] "A" "A" "A" "A" "A" "A" "A" "A" "A" "A" "A" "A" "A" "A" "A" "A" "A" "A" "A" "A" "A" "A" "A" "A" "A" "A" "A" "A" "A" "A" "A" "A" "A" "A" "A" "A" "A" "A" "A" "A" "A" "A" "A" "A" "A" "A" "A" "A" "A" "A" "A"
#  [52] "A" "A" "A" "A" "A" "A" "A" "A" "A" "A" "A" "A" "A" "A" "A" "A" "A" "A" "A" "A" "A" "A" "A" "A" "A" "A" "A" "A" "A" "A" "A" "A" "A" "A" "A" "A" "A" "A" "A" "A" "A" "A" "A" "A" "A" "A" "A" "A" "A"

(Note that the group has 100 rows, and the EMA(..) call in both implementations returns a vector of length 100.)

I believe this is because in data.table, grouping variables are reduced to length 1 within each group. Looking at the data.table-native code,

new_dt <- as.data.table(df)
new_dt[, `:=`(
  acute = {browser();EMA(val, ratio=2 / (1 + 7));},
  chron = EMA(val, ratio=2 / (1 + fcase(cat == 'A', 42, cat == 'B', 28, TRUE, 14)))
), by=list(par, cat)]
# debug at #2: EMA(val, ratio = 2/(1 + 7))
cat
# [1] "A"
.BY
# $par
# [1] "Par 1"
# $cat
# [1] "A"
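The same behavior can be seen without a debugger. This is a minimal sketch using only data.table and a made-up table (dt, with columns g and v): inside the grouped expression, the grouping column is length 1, while a non-grouping column has one element per row of the group.

```r
library(data.table)

# Hypothetical toy data: two groups of three rows each.
dt <- data.table(g = rep(c("a", "b"), each = 3), v = 1:6)

# Within each group, compare the length of the grouping column `g`
# with the length of the ordinary column `v` and the group size .N.
res <- dt[, .(len_g = length(g), len_v = length(v), n = .N), by = g]
print(res)
```

Here len_g is 1 for every group, whereas len_v and n both equal the group size.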

I think this is somewhat "intentional" (or at least "known") in data.table. The underlying behavior (cat is length 1 because it is a grouping variable) cannot be changed here, but something else is going on: the dplyr and data.table-native implementations work without error while the lazy_dt(df) version fails, requiring one of two (or more?) odd-looking workarounds:

lazy_dt(df) %>%
  group_by(par, cat) %>%
  mutate(acute = EMA(val, ratio = 2 / (1 + 7))) %>%
  mutate(chron = case_when(
    rep(cat == "A", n()) ~ EMA(val, ratio = 2 / (1 + 42)),
    rep(cat == "B", n()) ~ EMA(val, ratio = 2 / (1 + 28)),
    rep(cat %in% c("C", "D", "E"), n()) ~ EMA(val, ratio = 2 / (1 + 14))
  ))

lazy_dt(df) %>%
  group_by(par, cat) %>%
  mutate(cat2 = cat, acute = EMA(val, ratio = 2 / (1 + 7))) %>%
  mutate(chron = case_when(
    cat2 == "A" ~ EMA(val, ratio = 2 / (1 + 42)),
    cat2 == "B" ~ EMA(val, ratio = 2 / (1 + 28)),
    cat2 %in% c("C", "D", "E") ~ EMA(val, ratio = 2 / (1 + 14))
  ))
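Another possible shape, mirroring the data.table-native code above, is to let case_when choose only the scalar parameter and keep the per-row computation outside it, so that every condition and every value is length 1. The sketch below is an assumption-laden illustration rather than a tested fix for the original pipeline: it uses a made-up table (toy) and cumsum(v) times a per-group factor in place of EMA() with a per-group ratio.

```r
library(dplyr)
library(dtplyr)

# Hypothetical toy data: two groups of three rows each.
toy <- data.frame(g = rep(c("a", "b"), each = 3), v = 1:6)

# case_when() only picks a length-1 factor per group; the length-n work
# (cumsum here, EMA in the original) happens outside of it.
res <- lazy_dt(toy) %>%
  group_by(g) %>%
  mutate(out = cumsum(v) * case_when(g == "a" ~ 2, g == "b" ~ 10)) %>%
  as.data.frame()

print(res)
```

Because both the conditions and the values passed to case_when are length 1 within a group, the length check that fails in the original call should not come into play.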
r2evans commented 1 year ago

Erp, I didn't initially realize that the data.table version moved fcase inside the call to EMA, side-stepping the problem. The data.table-native version shows the same error when the calls to EMA are inside the call to fcase.
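A data.table-only sketch of that situation, assuming a made-up table (dt, columns g and v) and using cumsum()/rev() as stand-ins for EMA(): the first call puts length-n values behind length-1 conditions, which on the data.table version used in this report is expected to raise the error shown earlier (other versions may behave differently, so the call is wrapped in tryCatch). The second call repeats each condition to the group size, the native analogue of the rep(..., n()) workaround.

```r
library(data.table)

# Hypothetical toy data: two groups of three rows each.
dt <- data.table(g = rep(c("a", "b"), each = 3), v = c(1, 2, 3, 4, 5, 6))

# Conditions are length 1 (grouping column), values are length 3.
# Capture the error message if one is raised, rather than stopping.
attempt <- tryCatch(
  dt[, .(out = fcase(g == "a", cumsum(v), g == "b", rev(v))), by = g],
  error = function(e) conditionMessage(e)
)
print(attempt)

# Repeat each condition to the group size (.N) so lengths match.
ok <- dt[, .(out = fcase(rep(g == "a", .N), cumsum(v),
                         rep(g == "b", .N), rev(v))), by = g]
print(ok)
```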