tidyverse / forcats

🐈🐈🐈🐈: tools for working with categorical variables (factors)
https://forcats.tidyverse.org/

Add function that creates factor in order of case_when matches #298

Open dchiu911 opened 2 years ago

dchiu911 commented 2 years ago

A common workflow of mine is to map one vector to another using some (possibly complex) conditions, then coerce the result to a factor whose levels follow the order in which they appear in dplyr::case_when(). It would be helpful to have a wrapper that creates the factor without having to specify the levels manually. Currently, I'd do something like this:

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

set.seed(2022)
x <- sample(
  c("low", "intermediate", "high"),
  prob = c(0.5, 0.2, 0.3),
  size = 100,
  replace = TRUE
)
z <- rbinom(
  n = 100,
  size = 100,
  prob = 0.3
)
y <- case_when(
  x == "intermediate" | (x == "low" & z < 30) ~ "B",
  x == "low" ~ "A",
  x == "high" ~ "C",
  TRUE ~ NA_character_
) %>%
  factor(levels = c("B", "A", "C"))
str(y)
#>  Factor w/ 3 levels "B","A","C": 1 3 2 3 2 3 2 1 1 3 ...

Created on 2022-02-01 by the reprex package (v2.0.1)

Can we add a function that makes y into a factor with the level order the same as specified in the case_when()? For example,

y <- fct_case(
  x == "intermediate" | (x == "low" & z < 30) ~ "B",
  x == "low" ~ "A",
  x == "high" ~ "C",
  TRUE ~ NA_character_
)
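
A minimal sketch of how such a wrapper might look, assuming every right-hand side is restricted to a single string (or NA). The name fct_case() and its behaviour are hypothetical; nothing like it exists in forcats. It relies on rlang::f_rhs() and rlang::f_env() to read each formula's right-hand side, and uses small hand-written data so the result is easy to check by eye:

```r
library(dplyr)

# Hypothetical helper, not part of forcats: behaves like case_when(), but
# every right-hand side must be a single string (or NA), and the result is a
# factor whose levels follow the order of the formulas.
fct_case <- function(...) {
  formulas <- rlang::list2(...)
  # Evaluate each right-hand side in the environment where the formula was written.
  rhs <- lapply(formulas, function(f) eval(rlang::f_rhs(f), rlang::f_env(f)))
  ok <- vapply(
    rhs,
    function(v) length(v) == 1 && (is.character(v) || is.na(v)),
    logical(1)
  )
  if (!all(ok)) {
    stop("Each right-hand side must be a single string or NA.", call. = FALSE)
  }
  # Levels are the distinct non-missing right-hand sides, in order of appearance.
  lvls <- unlist(lapply(rhs, as.character))
  lvls <- unique(lvls[!is.na(lvls)])
  factor(case_when(...), levels = lvls)
}

x <- c("low", "intermediate", "high", "low", "other")
z <- c(25, 40, 31, 35, 10)
y <- fct_case(
  x == "intermediate" | (x == "low" & z < 30) ~ "B",
  x == "low" ~ "A",
  x == "high" ~ "C",
  TRUE ~ NA_character_
)
levels(y)
#> [1] "B" "A" "C"
```

Restricting each right-hand side to a constant sidesteps the question of how data-dependent values should map to levels, at the cost of being less general than case_when().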

hadley commented 2 years ago

I think we'd need to make the syntax more restrictive than case_when()'s, because the RHS of a case_when() formula can itself use data values, and reasoning through how those values should interact between conditions seems hard.

Since we'd want to restrict each expression to a single character level, we could put the level on the LHS of =, something like:

something(
  "B" = x == "intermediate" | (x == "low" & z < 30),
  "A" = x == "low",
  "C" = x == "high",
)

But I don't know if any existing tidyverse function uses similar syntax.
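
For illustration, the named-argument form above could be prototyped in base R roughly as follows. The name fct_when() is hypothetical, and this is only a sketch under the assumption that every argument is a logical vector of the same length and that the first matching condition wins, as in case_when():

```r
# Hypothetical helper, not part of forcats: each argument name is a level,
# each argument value is a logical condition; the first TRUE condition wins.
fct_when <- function(...) {
  conds <- list(...)
  lvls <- names(conds)
  stopifnot(length(conds) > 0, !is.null(lvls), all(nzchar(lvls)))
  stopifnot(all(vapply(conds, is.logical, logical(1))))
  n <- length(conds[[1]])
  stopifnot(all(lengths(conds) == n))

  # idx[i] holds the position of the first condition that is TRUE for element i.
  idx <- rep(NA_integer_, n)
  for (i in seq_along(conds)) {
    hit <- is.na(idx) & !is.na(conds[[i]]) & conds[[i]]
    idx[hit] <- i
  }
  # Elements matched by no condition stay NA.
  factor(lvls[idx], levels = unique(lvls))
}

x <- c("low", "intermediate", "high", "low", "other")
z <- c(25, 40, 31, 35, 10)
y <- fct_when(
  "B" = x == "intermediate" | (x == "low" & z < 30),
  "A" = x == "low",
  "C" = x == "high"
)
levels(y)
#> [1] "B" "A" "C"
```

Because the levels come from the argument names, the level order is fixed by the call itself and never depends on the data.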

dchiu911 commented 2 years ago

I do think removing the usage of ~ would make it more consistent, as case_when() syntax is quite unique.

DavisVaughan commented 2 years ago

> But I don't know if any existing tidyverse function uses similar syntax.

FWIW this is basically how fct_recode() works (the name represents the new level, the value the old level), so it wouldn't be unheard of to let the name represent the new level and the value be the logical condition.
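
A small example of the existing fct_recode() convention, for comparison (the data here is made up for illustration). Note that fct_recode() renames levels in place, so the level order still follows the original factor, not the order of the arguments:

```r
library(forcats)

f <- factor(c("low", "intermediate", "high"))
levels(f)
#> [1] "high"         "intermediate" "low"

# new level = "old level"
g <- fct_recode(f, B = "intermediate", A = "low", C = "high")
as.character(g)
#> [1] "A" "B" "C"
levels(g)
#> [1] "C" "B" "A"
```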

hadley commented 1 year ago

Will wait until lower level functions are exposed by vctrs.

brianmsm commented 11 months ago

I think it would be convenient to solve this within case_when() itself.

Something like this:

set.seed(2022)
x <- sample(
  c("low", "intermediate", "high"),
  prob = c(0.5, 0.2, 0.3),
  size = 100,
  replace = TRUE
)
z <- rbinom(
  n = 100,
  size = 100,
  prob = 0.3
)
y <- case_when(
  x == "intermediate" | (x == "low" & z < 30) ~ "B",
  x == "low" ~ "A",
  x == "high" ~ "C",
  TRUE ~ NA_character_,
  .ptype = "factor"
)
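
For comparison with the proposal above: in dplyr 1.1.0 and later, case_when() has a .ptype argument, but it expects a prototype object, not a type name, so the levels still have to be written out. A sketch with made-up data, assuming dplyr >= 1.1.0 is installed:

```r
library(dplyr)  # .ptype requires dplyr >= 1.1.0

x <- c("low", "intermediate", "high", "low", "other")
z <- c(25, 40, 31, 35, 10)

y <- case_when(
  x == "intermediate" | (x == "low" & z < 30) ~ "B",
  x == "low" ~ "A",
  x == "high" ~ "C",
  # The prototype fixes both the output type and the level order;
  # elements matched by no condition become NA.
  .ptype = factor(levels = c("B", "A", "C"))
)
levels(y)
#> [1] "B" "A" "C"
```

This avoids the separate factor() call but does not infer the levels from the formulas, which is the part the issue asks for.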