tidyverse / forcats

🐈🐈🐈🐈: tools for working with categorical variables (factors)
https://forcats.tidyverse.org/

Feature request: modify values of a factor without changing levels #253

Closed jburos closed 4 years ago

jburos commented 4 years ago

This request is very related to this question on the community board: https://community.rstudio.com/t/recoding-factors-using-if-else/14014

It has two parts:

  1. a desire to update the data for a factor variable naturally, as you would for a character or other field, while also
  2. retaining existing factor levels

Let's say I have some data on which I want to fit a model, and then I want to modify this existing data to support a prediction scenario. In the course of this preparation, I want/need to ensure that the factor levels and ideally their orders haven't changed compared to the original dataset. If I retain the factor levels, then my model.matrix construction -- and any ancillary mapping from values of fields to vectors of parameters -- will be consistent.
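To illustrate why the levels matter for model.matrix, here is a minimal base R sketch with a made-up three-country factor (not the trial data from this issue). Subsetting a factor keeps its levels, so the design matrix keeps the same columns; once the unused level is dropped, a column disappears.

```r
# Toy factor; levels are sorted alphabetically: "China", "France", "USA".
country <- factor(c("USA", "China", "France", "USA"))

# `[` keeps all three levels even though "France" no longer occurs.
subset_kept <- country[country != "France"]
# droplevels() removes the unused level, leaving "China" and "USA".
subset_dropped <- droplevels(subset_kept)

cols_kept <- colnames(model.matrix(~ country, data.frame(country = subset_kept)))
cols_dropped <- colnames(model.matrix(~ country, data.frame(country = subset_dropped)))

print(cols_kept)     # includes a "countryFrance" column
print(cols_dropped)  # the "countryFrance" column is gone
```

Any vector of coefficients indexed by column position would line up with the first matrix but not the second.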

To be specific, let's say I have some training data for a set of patients in a clinical trial:

library(tidyverse)
training_data <- tibble::tibble(treatment = sample(c('drug-a', 'drug-b'), size = 20, replace = T),
                                age = runif(n = 20, min = 45, max = 80),
                                country = sample(c('USA', 'China', 'France'), size = 20, replace = T),
                                sex = sample(c('m','f'), size = 20, replace = T))
glimpse(training_data)
#> Observations: 20
#> Variables: 4
#> $ treatment <chr> "drug-b", "drug-a", "drug-b", "drug-a", "drug-b", "dru…
#> $ age       <dbl> 79.38279, 47.21515, 59.21253, 59.10700, 79.81601, 72.7…
#> $ country   <chr> "USA", "USA", "China", "France", "USA", "China", "Fran…
#> $ sex       <chr> "f", "f", "m", "m", "f", "f", "m", "m", "f", "f", "m",…  

I typically convert all characters to factors in order to retain their "training" levels, using mutate_if.

training_data %>% 
     dplyr::mutate_if(is.character, factor)

Let's say I now want to generate a prediction for a counterfactual scenario for drug-b, i.e. the subset of patients who received drug-b, with the data modified to look as if they had received drug-a.

Ideally, I would do something like this:

counterfactual_for_drug_b <- training_data %>% 
     dplyr::mutate_if(is.character, factor) %>%
     # set up the prediction scenario
     dplyr::filter(treatment == 'drug-b') %>%
     dplyr::mutate(treatment = 'drug-a')
# However, I lost my factor levels.
str(counterfactual_for_drug_b$treatment)
#> chr [1:8] "drug-a" "drug-a" "drug-a" "drug-a" "drug-a" "drug-a" "drug-a" "drug-a"

Or,

counterfactual_for_drug_b <- training_data %>% 
     # convert all characters to factors, to retain their levels
     dplyr::mutate_if(is.character, factor) %>%
     # set up the prediction scenario
     dplyr::filter(treatment == 'drug-b') %>%
     dplyr::mutate(treatment = forcats::fct_recode(treatment, 'drug-a' = 'drug-b'))
# still lose my factor levels.
str(counterfactual_for_drug_b$treatment)
#>  Factor w/ 1 level "drug-a": 1 1 1 1 1 1 1 1

Right now, I have a work-around that can "apply" factor levels & labels from the training data to the final prediction dataset, but being able to modify values of factors seems like a core function & so I thought I would post here.

Proposal

I don't have a great proposal for what this solution looks like, but one option might be to have a function with signature: fct_modify(.factor, .x, .allow_new_levels = TRUE)

This would allow for a natural syntax in the simple case:

counterfactual_for_drug_b <- training_data %>% 
     # convert all characters to factors, to retain their levels
     dplyr::mutate_if(is.character, factor) %>%
     # set up the prediction scenario
     dplyr::filter(treatment == 'drug-b') %>%
     dplyr::mutate(treatment = forcats::fct_modify(treatment, 'drug-a', .allow_new_levels = FALSE))
str(counterfactual_for_drug_b$treatment)
#> $ treatment: Factor w/ 2 levels "drug-a","drug-b":  1 1 1 1 1 1 1 1 1 1 ...

But it could also allow more complicated workflows, e.g.:

counterfactual_for_drug_b <- training_data %>% 
     # convert all characters to factors, to retain their levels
     dplyr::mutate_if(is.character, factor) %>%
     # set up the prediction scenario
     dplyr::filter(treatment == 'drug-b') %>%
     dplyr::mutate(treatment = forcats::fct_modify(treatment, dplyr::case_when(...)))

Within the function, it would operate much like as_factor except that it would apply factor levels & check any logic against the .factor object before returning.
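A rough base R sketch of what such a function might look like, under stated assumptions: fct_modify() is hypothetical and does not exist in forcats, and the recycling and error behaviour below are guesses at the intent of the proposal rather than a specification.

```r
# Hypothetical helper: NOT part of forcats. It replaces the values of
# `.factor` with `.x` while keeping the original levels and their order.
fct_modify <- function(.factor, .x, .allow_new_levels = TRUE) {
  stopifnot(is.factor(.factor))
  # Recycle a single replacement value to the length of the factor.
  new_values <- rep_len(as.character(.x), length(.factor))
  # Values that are not among the existing levels.
  unknown <- setdiff(unique(new_values[!is.na(new_values)]), levels(.factor))
  if (length(unknown) > 0 && !.allow_new_levels) {
    stop("New levels not allowed: ", paste(unknown, collapse = ", "))
  }
  # Existing levels keep their order; any permitted new ones are appended.
  factor(new_values,
         levels = c(levels(.factor), unknown),
         ordered = is.ordered(.factor))
}

treatment <- factor(c("drug-b", "drug-b", "drug-b"),
                    levels = c("drug-a", "drug-b"))
counterfactual <- fct_modify(treatment, "drug-a", .allow_new_levels = FALSE)
str(counterfactual)
```

With .allow_new_levels = FALSE, passing a value such as "drug-c" would raise an error instead of silently adding a level.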

topepo commented 4 years ago

I think that you can do what you are interested in without modifying existing functionality (or adding more). For example, for your second counterfactual, I believe that you can get the desired effect using factor():

library(tidyverse)
training_data <-
  tibble::tibble(
    treatment = sample(c('drug-a', 'drug-b'), size = 20, replace = T),
    age = runif(n = 20, min = 45, max = 80),
    country = sample(
      c('USA', 'China', 'France'),
      size = 20,
      replace = T
    ),
    sex = sample(c('m', 'f'), size = 20, replace = T)
  )

training_data %>% 
  # convert all characters to factors, to retain their levels
  dplyr::mutate_if(is.character, factor) %>%
  # set up the prediction scenario
  dplyr::filter(treatment == 'drug-b') %>%
  dplyr::mutate(treatment = factor(treatment, levels = rev(levels(treatment)))) %>% 
  str()
#> Classes 'tbl_df', 'tbl' and 'data.frame':    12 obs. of  4 variables:
#>  $ treatment: Factor w/ 2 levels "drug-b","drug-a": 1 1 1 1 1 1 1 1 1 1 ...
#>  $ age      : num  59.4 61.5 52.9 79.8 46.3 ...
#>  $ country  : Factor w/ 3 levels "China","France",..: 1 2 1 2 2 3 3 2 1 1 ...
#>  $ sex      : Factor w/ 2 levels "f","m": 2 2 2 2 1 2 1 2 1 2 ...

Created on 2020-02-28 by the reprex package (v0.3.0)

In the first counterfactual, you lose the factor levels because you explicitly redefine the column as a character vector:

dplyr::mutate(treatment = 'drug-a')

instead of

dplyr::mutate(treatment = factor('drug-a', levels = levels(treatment)))

For the case with:

dplyr::mutate(treatment = forcats::fct_recode(treatment, 'drug-a' = 'drug-b'))

you could also do:

dplyr::mutate(
  reversed = ifelse(treatment == 'drug-a', 'drug-b', 'drug-a'),
  treatment = factor(reversed, levels = levels(treatment))
  )

As for new factor levels, most models don't react well when the data passed to a fitted model contain levels that were absent at fitting time:

library(tidyverse)
training_data <-
  tibble::tibble(
    treatment = sample(c('drug-a', 'drug-b'), size = 20, replace = T),
    age = runif(n = 20, min = 45, max = 80),
    country = sample(
      c('USA', 'China', 'France'),
      size = 20,
      replace = T
    ),
    sex = sample(c('m', 'f'), size = 20, replace = T)
  )

smol_iris <- iris %>% 
  slice(1:99) %>% 
  mutate(Species = fct_drop(Species))

lm_fit <- lm(Sepal.Length ~ ., data = smol_iris)

predict(lm_fit, iris[100:105,])
#> Error in model.frame.default(Terms, newdata, na.action = na.action, xlev = object$xlevels): factor Species has new levels virginica

Created on 2020-02-28 by the reprex package (v0.3.0)

An option like .allow_new_levels makes sense if you give it a new level prior to model building. That's what recipes::step_novel() does.
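The idea of reserving a level before model building can be sketched in base R. This is an illustration only, not the recipes implementation; the placeholder label "new" and the toy values are assumptions made for the example.

```r
# Illustration only: base R, not how recipes does it internally.
train_country <- factor(c("USA", "China", "France"))

# Reserve a placeholder level before the model is built.
# The label "new" is an arbitrary choice for this sketch.
all_levels <- c(levels(train_country), "new")
train_country <- factor(train_country, levels = all_levels)

# At prediction time, map any unseen value to the placeholder
# instead of letting it become NA or a brand-new level.
test_values <- c("USA", "Germany")
mapped <- ifelse(test_values %in% all_levels, test_values, "new")
test_country <- factor(mapped, levels = all_levels)

print(levels(test_country))
print(as.character(test_country))
```

Because the training and prediction factors share the same level set, code that depends on the levels sees no surprise at prediction time. Whether predictions for the placeholder level are meaningful is a separate question.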

jburos commented 4 years ago

Thanks @topepo - yes this is what I do in practice. I use some variation of this when changing values of a factor:

dplyr::mutate(
  reversed = ifelse(treatment == 'drug-a', 'drug-b', 'drug-a'),
  treatment = factor(reversed, levels = levels(treatment))
  )

although it's often referencing the levels in the original data, ie: treatment = factor(reversed, levels = levels(training_data$treatment)), and similarly mirroring the is-ordered attribute from the previous factor.
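A minimal sketch of that workaround, generalised to every factor column. The helper name apply_reference_levels() and the toy data are made up for illustration; it uses base R only.

```r
# Hypothetical helper: re-apply the levels (and orderedness) of each factor
# column in `reference` to the same-named column in `data`.
apply_reference_levels <- function(data, reference) {
  factor_cols <- names(reference)[vapply(reference, is.factor, logical(1))]
  for (col in intersect(factor_cols, names(data))) {
    data[[col]] <- factor(as.character(data[[col]]),
                          levels = levels(reference[[col]]),
                          ordered = is.ordered(reference[[col]]))
  }
  data
}

training <- data.frame(
  treatment = factor(c("drug-a", "drug-b", "drug-b")),
  dose = factor(c("low", "high", "low"),
                levels = c("low", "high"), ordered = TRUE)
)

# Build the scenario naively, which turns `treatment` into a character column.
scenario <- training[training$treatment == "drug-b", ]
scenario$treatment <- "drug-a"

# Restore the training levels afterwards.
scenario <- apply_reference_levels(scenario, training)
str(scenario)
```

Values that are not among the reference levels become NA here, which is one way to surface unexpected new levels early.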

I haven't found a way to work recipes into this workflow yet; this is something I'll investigate further. I almost never want new levels since, as you point out, they will break most scenarios for this training/test use case.

But I agree this usage is generally simple enough that it may not warrant a new function. Thanks for the examples 🙏