tidyverse / funs

Collection of low-level functions for working with vctrs
Other
34 stars 7 forks source link

Implement case_when #38

Open hadley opened 4 years ago

romainfrancois commented 3 years ago

Related: imported issue https://github.com/tidyverse/dplyr/issues/5871 by @jzadra:

Having used case_when() for years, I was quite worried today when I discovered that case_when() handles NA values in the test differently than if_else() and ifelse():

image

After discovering this behavior, I see that there is a note about this buried in the case_when() documentation examples section:

# Note that NA values in the vector x do not get special treatment. If you want
# to explicitly handle NA values you can use the `is.na` function:
x[2:4] <- NA_real_
case_when(
  x %% 35 == 0 ~ "fizz buzz",
  x %% 5 == 0 ~ "fizz",
  x %% 7 == 0 ~ "buzz",
  is.na(x) ~ "nope",
  TRUE ~ as.character(x)
)

So now I know, but polling my team members at work reveals that none of them were aware of this behavior either.

My assumption has been that case_when()handles NA values in the test the same way if_else() does and return an NA UNLESS the NA values are specifically dealt with, rather than the other way around where they get the TRUE value if left unspecified. Consider that a basic test like NA == TRUE yields NA, and NA == NA also yields NA, further adding to my feeling that this is unexpected behavior: NA test values should yield an NA result for the TRUE ~ x line in case_when().

But regardless of whether this behavior is better for some reason (I'd love to understand if so!), it's obviously way too late to change it. I think it might be a good idea to include a warning/message when NAs exist in the tests but are treated as TRUE, ie "NA values in the data were handled by TRUE ~. Did you mean to specify a case for NA values?". There is precedent for this in other functions such as converting a character vector of mostly numeric values when there are some non-numeric values.

romainfrancois commented 3 years ago

(moved here from https://github.com/tidyverse/dplyr/issues/5876)

case_when() and if_else() perform type-checks and it is for the best in most cases.

However, some data type are usually fully exchangeable, e.g. integers and doubles.

Therefore, you can experience some unwanted type-checking that makes little to no sense:

library(tidyverse)

df = read.table(header=TRUE, text="
  id check count_x
1  8     0       0
2  8     1       1
3  8     1       2
8  8     0       3
9  8     0       3")

case_when(
  df$count_x==0 ~ df$check,
  df$count_x>0 ~ 1,
)
#> Error: must be an integer vector, not a double vector.

if_else(df$count_x==0, df$check, 1)
#> Error: `false` must be an integer vector, not a double vector.

Created on 2021-05-06 by the reprex package (v2.0.0)

Of course, casting the variable to plain numeric will work, using as.numeric() or simply +0, but it seems quite unnecessary:

case_when(
  df$count_x==0 ~ df$check+0,
  df$count_x>0 ~ 1,
)
#> [1] 0 1 1 1 1

Created on 2021-05-06 by the reprex package (v2.0.0)

Maybe the type-checking could be a bit more flexible in some cases, such as this one.

eutwt commented 2 years ago

feature request: check the size of condition vectors and warn or error if they are not all equal, possibly with an exception for the last case so that a default can be specified by putting TRUE ~ 'default_val' at the end.

For reference, here's how data.table::fcase handles conditions with differing lengths:

data.table::fcase(
  c(FALSE, FALSE), 'a',
  TRUE, 'b',
  c(FALSE, TRUE), 'c'
)
#> Error in data.table::fcase(c(FALSE, FALSE), "a", TRUE, "b", c(FALSE, TRUE), : Argument #3 has a different length than argument #1. Please make sure all logical conditions have the same length.

Created on 2021-10-12 by the reprex package (v2.0.1)

Currently dplyr::case_when silently recycles:

dplyr::case_when(
  c(FALSE, FALSE) ~ 'a',
  TRUE ~ 'b',
  c(FALSE, TRUE) ~ 'c'
)
#> [1] "b" "b"

Created on 2021-10-12 by the reprex package (v2.0.1)

hadley commented 2 years ago

@eutwt I think case_when() would be consistent with the standard tidyverse recycling rules, which would not yield an error here.