Open hadley opened 4 years ago
(moved here from https://github.com/tidyverse/dplyr/issues/5876)
case_when() and if_else() perform type checks, which is for the best in most cases.
However, some data types are usually fully interchangeable, e.g. integers and doubles.
Therefore, you can run into type-check errors that make little to no sense:
library(tidyverse)
df = read.table(header=TRUE, text="
id check count_x
1 8 0 0
2 8 1 1
3 8 1 2
8 8 0 3
9 8 0 3")
case_when(
df$count_x==0 ~ df$check,
df$count_x>0 ~ 1,
)
#> Error: must be an integer vector, not a double vector.
if_else(df$count_x==0, df$check, 1)
#> Error: `false` must be an integer vector, not a double vector.
Created on 2021-05-06 by the reprex package (v2.0.0)
Of course, casting the variable to plain numeric will work, using as.numeric() or simply +0, but it seems quite unnecessary:
case_when(
df$count_x==0 ~ df$check+0,
df$count_x>0 ~ 1,
)
#> [1] 0 1 1 1 1
Maybe the type-checking could be a bit more flexible in some cases, such as this one.
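For completeness, a minimal sketch of the other direction: instead of widening df$check to double, keep everything integer by writing the literal as 1L. The data frame here is rebuilt by hand (only the two columns that matter) so the snippet runs on its own; that reconstruction is an assumption, not part of the original reprex.

```r
library(dplyr)

# Hand-built stand-in for the two relevant columns of the data frame above.
df <- data.frame(
  check   = c(0L, 1L, 1L, 0L, 0L),
  count_x = c(0L, 1L, 2L, 3L, 3L)
)

# `1` is a double literal, `1L` is an integer literal.
typeof(1)   # "double"
typeof(1L)  # "integer"

# With an integer literal every branch has the same type as df$check.
res_if <- if_else(df$count_x == 0, df$check, 1L)
res_cw <- case_when(
  df$count_x == 0 ~ df$check,
  df$count_x > 0 ~ 1L
)
```

This avoids the error but still requires the caller to know the column type, which is the inconvenience the request is about.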
Feature request: check the size of condition vectors and warn or error if they are not all equal, possibly with an exception for the last case so that a default can be specified by putting TRUE ~ 'default_val' at the end.
For reference, here's how data.table::fcase handles conditions with differing lengths:
data.table::fcase(
c(FALSE, FALSE), 'a',
TRUE, 'b',
c(FALSE, TRUE), 'c'
)
#> Error in data.table::fcase(c(FALSE, FALSE), "a", TRUE, "b", c(FALSE, TRUE), : Argument #3 has a different length than argument #1. Please make sure all logical conditions have the same length.
Created on 2021-10-12 by the reprex package (v2.0.1)
Currently dplyr::case_when silently recycles:
dplyr::case_when(
c(FALSE, FALSE) ~ 'a',
TRUE ~ 'b',
c(FALSE, TRUE) ~ 'c'
)
#> [1] "b" "b"
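A rough sketch of what the requested check could look like, written in base R. The helper name check_condition_lengths() and its allow_scalar_last argument are hypothetical and exist only for illustration; nothing like this is part of dplyr.

```r
# Hypothetical helper (not part of dplyr): error unless all conditions share
# one length. With allow_scalar_last = TRUE, a final length-1 condition such
# as TRUE is exempt so it can act as the default case.
check_condition_lengths <- function(..., allow_scalar_last = TRUE) {
  conditions <- list(...)
  lens <- lengths(conditions)
  n <- length(lens)
  checked <- lens
  if (allow_scalar_last && n > 1 && lens[n] == 1) {
    checked <- lens[-n]
  }
  if (length(unique(checked)) > 1) {
    stop(
      "Conditions have differing lengths: ",
      paste(lens, collapse = ", "),
      call. = FALSE
    )
  }
  invisible(TRUE)
}

# A trailing scalar TRUE is accepted as the default case.
ok <- check_condition_lengths(c(FALSE, TRUE), c(TRUE, FALSE), TRUE)

# A scalar in the middle is rejected, as in the fcase example.
err <- tryCatch(
  check_condition_lengths(c(FALSE, FALSE), TRUE, c(FALSE, TRUE)),
  error = function(e) conditionMessage(e)
)
```

Whether this should be an error or only a warning is a design choice left open here.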
@eutwt I think case_when() would be consistent with the standard tidyverse recycling rules, which would not yield an error here.
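To make the recycling rules concrete, here is a small sketch using vctrs::vec_recycle_common(), which implements them: size-1 inputs are recycled to the common size, and any other mismatch is an error. Using vctrs directly is my choice for illustration; it says nothing about how case_when() is implemented internally.

```r
library(vctrs)

# The three conditions from the example above: sizes 2, 1, 2.
# The size-1 TRUE is recycled to size 2, so no error is raised.
recycled <- vec_recycle_common(c(FALSE, FALSE), TRUE, c(FALSE, TRUE))

# Sizes 2 and 3 have no common size, so this is an error.
mismatch <- tryCatch(
  vec_recycle_common(1:2, 1:3),
  error = function(e) "incompatible sizes"
)
```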
Related: imported issue https://github.com/tidyverse/dplyr/issues/5871 by @jzadra:
Having used case_when() for years, I was quite worried today when I discovered that case_when() handles NA values in the test differently than if_else() and ifelse().
After discovering this behavior, I see that there is a note about this buried in the case_when() documentation examples section.
So now I know, but polling my team members at work reveals that none of them were aware of this behavior either.
My assumption has been that case_when() handles NA values in the test the same way if_else() does and returns an NA UNLESS the NA values are specifically dealt with, rather than the other way around where they get the TRUE value if left unspecified. Consider that a basic test like NA == TRUE yields NA, and NA == NA also yields NA, further adding to my feeling that this is unexpected behavior: NA test values should yield an NA result for the TRUE ~ x line in case_when().
But regardless of whether this behavior is better for some reason (I'd love to understand if so!), it's obviously way too late to change it. I think it might be a good idea to include a warning/message when NAs exist in the tests but are treated as TRUE, i.e. "NA values in the data were handled by TRUE ~. Did you mean to specify a case for NA values?". There is precedent for this in other functions, such as converting a character vector of mostly numeric values when there are some non-numeric values.
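A small sketch of the difference being described, with made-up data (the vector x and the labels are illustrative only). An NA in a case_when() condition is treated as "no match", so the row falls through to the TRUE ~ line, while if_else() propagates the NA.

```r
library(dplyr)

x <- c(1, NA, 3)

# x > 2 is FALSE, NA, TRUE. The NA row is not matched by the first case,
# so it is picked up by the TRUE ~ default.
cw <- case_when(
  x > 2 ~ "big",
  TRUE ~ "small"
)

# if_else() returns NA where the condition is NA.
ie <- if_else(x > 2, "big", "small")

# Handling NA explicitly in case_when() reproduces the if_else() result.
cw_na <- case_when(
  is.na(x) ~ NA_character_,
  x > 2 ~ "big",
  TRUE ~ "small"
)

# The kind of check the proposed message would rely on:
# does any condition evaluate to NA?
has_na_test <- anyNA(x > 2)
```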