tidyverse / dplyr

dplyr: A grammar of data manipulation
https://dplyr.tidyverse.org/

`case_when()` coerces the multiplicity of a returned value to match that from conditions not satisfied #7088

Closed ChrisHIV closed 1 month ago

ChrisHIV commented 1 month ago

case_when() appears to first decide how many values it will return by considering all conditions, and then coerce the value specified by the satisfied condition to that multiplicity, rather than returning the multiplicity of the value specified by the satisfied condition. Is this a bug or intentional? It leads to unexpected behaviour when conditions depend on the multiplicity of values.

Example

library(dplyr)
tibble(x = c("A", "A", "B", "B"),
       y = c("I", "I", "J", "K")) %>%
  summarise(.by = x,
            summary = case_when(
              length(unique(y)) == 1L ~ paste("This x has one y:", unique(y)),
              TRUE ~ "This x has several ys"
            ))

output:

  x     summary              
  <chr> <chr>                
1 A     This x has one y: I  
2 B     This x has several ys
3 B     This x has several ys

together with a warning message about returning more (or less) than 1 row per summarise() group.

output I expected

  x     summary              
  <chr> <chr>                
1 A     This x has one y: I  
2 B     This x has several ys
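A minimal base R sketch of the sizes involved for group B may help explain the extra row. It only inspects lengths and assumes, as the output above suggests, that the result takes the common size of all inputs, including the right-hand side of the condition that is not satisfied.

```r
# Values of y within group B
y <- c("J", "K")

cond <- length(unique(y)) == 1L                # FALSE, length 1
rhs1 <- paste("This x has one y:", unique(y))  # length 2: one string per unique y
rhs2 <- "This x has several ys"                # length 1

# Length of each piece that case_when() receives for this group
sizes <- lengths(list(cond = cond, rhs1 = rhs1, rhs2 = rhs2))
sizes
```

The branch that is not selected has length 2 here, which matches the two rows returned for group B.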

DavisVaughan commented 1 month ago

I think you really just need a basic if statement. You aren't doing anything vectorized, so you don't need case_when().

library(dplyr)

do_it <- function(y) {
  if (length(unique(y)) == 1L) {
    paste("This x has one y:", unique(y))
  } else {
    "This x has several ys"
  }
}

tibble(x = c("A", "A", "B", "B"),
       y = c("I", "I", "J", "K")) %>%
  summarise(.by = x,
            summary = do_it(y))
#> # A tibble: 2 × 2
#>   x     summary              
#>   <chr> <chr>                
#> 1 A     This x has one y: I  
#> 2 B     This x has several ys
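The same idea can probably be written inline, without a helper function. This is a sketch rather than something from the thread: it assumes dplyr >= 1.1.0 (for `.by`) and swaps in `n_distinct()` and `first()` for `length(unique(y))` and `unique(y)`; the names `df` and `res` are only for illustration.

```r
library(dplyr)

df <- tibble(x = c("A", "A", "B", "B"),
             y = c("I", "I", "J", "K"))

# `if`/`else` is evaluated once per group and only the chosen branch is run,
# so each group contributes exactly one value
res <- df %>%
  summarise(.by = x,
            summary = if (n_distinct(y) == 1L) {
              paste("This x has one y:", first(y))
            } else {
              "This x has several ys"
            })
res
```

Because the branch that is not taken is never evaluated, its length cannot affect the size of the result.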

A simpler example of what you are trying to demonstrate is:

dplyr::case_when(
  FALSE ~ c(1, 2),
  TRUE ~ 3
)
#> [1] 3 3

I actually think this should be an error. See https://github.com/tidyverse/dplyr/issues/7082#issuecomment-2334173589 where I talk about this in more detail. Each RHS of case_when() should have either size 1 or size `size`, where `size` is the common size of the LHS conditions. The underlying engine already throws an error here:

dplyr:::vec_case_when(
  conditions = list(FALSE, TRUE),
  values = list(c(1, 2), 3)
)
#> Error in `dplyr:::vec_case_when()`:
#> ! `values[[1]]` must have size 1, not size 2.
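The size rule described above can be sketched in base R. `check_value_sizes()` is a hypothetical helper written for illustration, not part of dplyr or vctrs, and it simplifies "size of the LHS" to the longest condition vector.

```r
# Hypothetical helper: every value must have length 1 or the condition size
check_value_sizes <- function(conditions, values) {
  # Assumption: the longest condition defines the target size
  size <- max(lengths(conditions))
  for (i in seq_along(values)) {
    n <- length(values[[i]])
    if (n != 1L && n != size) {
      stop(sprintf(
        "`values[[%d]]` must have size 1 or the condition size (%d), not size %d.",
        i, size, n
      ), call. = FALSE)
    }
  }
  invisible(TRUE)
}

# Conditions of size 2: a size-2 value and a size-1 value both pass
ok <- check_value_sizes(
  conditions = list(c(TRUE, FALSE), c(FALSE, TRUE)),
  values = list(c(1, 2), 3)
)

# Conditions of size 1 with a size-2 value: the check signals an error
err <- tryCatch(
  check_value_sizes(list(FALSE, TRUE), list(c(1, 2), 3)),
  error = function(e) e
)
```

With scalar conditions, the size-2 value is rejected, which mirrors the error raised by the underlying engine in the example above.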

But anyways, for your use case of having 2 conditions that basically amount to:

I think you are much better served by an if statement
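If case_when() is still preferred, one possible alternative (a sketch that is not from the thread) is to reduce each group to single values first and then apply case_when() across groups, where it is genuinely vectorized. The column names `n_y` and `first_y` are made up for illustration, and dplyr >= 1.1.0 is assumed for `.by`.

```r
library(dplyr)

df <- tibble(x = c("A", "A", "B", "B"),
             y = c("I", "I", "J", "K"))

res <- df %>%
  # Step 1: one row per group, holding only size-1 summaries
  summarise(.by = x,
            n_y = n_distinct(y),
            first_y = first(y)) %>%
  # Step 2: case_when() over the group rows; every RHS has size 1 or nrow
  mutate(summary = case_when(
    n_y == 1L ~ paste("This x has one y:", first_y),
    TRUE ~ "This x has several ys"
  ))
res
```

Here both right-hand sides are compatible with the number of groups, so no recycling surprise arises.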