Closed jburos closed 4 years ago
I think that you can do what you are interested in without modifying existing functionality (or adding more). For example, for your second counterfactual, I believe that you can get the desired effect using factor():
library(tidyverse)

training_data <-
  tibble::tibble(
    treatment = sample(c('drug-a', 'drug-b'), size = 20, replace = TRUE),
    age = runif(n = 20, min = 45, max = 80),
    country = sample(
      c('USA', 'China', 'France'),
      size = 20,
      replace = TRUE
    ),
    sex = sample(c('m', 'f'), size = 20, replace = TRUE)
  )

training_data %>%
  # convert all characters to factors, to retain their levels
  dplyr::mutate_if(is.character, factor) %>%
  # set up the prediction scenario
  dplyr::filter(treatment == 'drug-b') %>%
  dplyr::mutate(treatment = factor(treatment, levels = rev(levels(treatment)))) %>%
  str()
#> Classes 'tbl_df', 'tbl' and 'data.frame': 12 obs. of 4 variables:
#> $ treatment: Factor w/ 2 levels "drug-b","drug-a": 1 1 1 1 1 1 1 1 1 1 ...
#> $ age : num 59.4 61.5 52.9 79.8 46.3 ...
#> $ country : Factor w/ 3 levels "China","France",..: 1 2 1 2 2 3 3 2 1 1 ...
#> $ sex : Factor w/ 2 levels "f","m": 2 2 2 2 1 2 1 2 1 2 ...
Created on 2020-02-28 by the reprex package (v0.3.0)
In the first counterfactual, you lose the factor levels because you explicitly redefine the column as a character vector:
dplyr::mutate(treatment = 'drug-a')
instead of
dplyr::mutate(treatment = factor('drug-a', levels = levels(treatment)))
For the case with:
dplyr::mutate(treatment = forcats::fct_recode(treatment, 'drug-a' = 'drug-b'))
you could also do:
dplyr::mutate(
reversed = ifelse(treatment == 'drug-a', 'drug-b', 'drug-a'),
treatment = factor(reversed, levels = levels(treatment))
)
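As a quick base-R check of the swap above, here is a minimal sketch on a made-up five-element vector (the data are hypothetical, not the trial data). It shows that ifelse() hands back a character vector, and that wrapping it in factor() with the original levels keeps both levels in their original order:

```r
# Hypothetical toy data, not taken from the trial example
treatment <- factor(c("drug-a", "drug-b", "drug-b", "drug-a", "drug-b"))

# ifelse() on a factor comparison returns a plain character vector
reversed <- ifelse(treatment == "drug-a", "drug-b", "drug-a")
is.character(reversed)

# Re-apply the original levels so the level set and order are unchanged
swapped <- factor(reversed, levels = levels(treatment))
levels(swapped)

# Cross-tabulate to confirm every value was flipped
table(original = treatment, swapped = swapped)
```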
As for new factor levels, most models don't react well when new levels are added to a data set after the model has been fitted:
library(tidyverse)

training_data <-
  tibble::tibble(
    treatment = sample(c('drug-a', 'drug-b'), size = 20, replace = TRUE),
    age = runif(n = 20, min = 45, max = 80),
    country = sample(
      c('USA', 'China', 'France'),
      size = 20,
      replace = TRUE
    ),
    sex = sample(c('m', 'f'), size = 20, replace = TRUE)
  )

# keep only rows 1:99 so that 'virginica' is never seen during fitting
smol_iris <- iris %>%
  slice(1:99) %>%
  mutate(Species = fct_drop(Species))

lm_fit <- lm(Sepal.Length ~ ., data = smol_iris)

predict(lm_fit, iris[100:105,])
#> Error in model.frame.default(Terms, newdata, na.action = na.action, xlev = object$xlevels): factor Species has new levels virginica
Created on 2020-02-28 by the reprex package (v0.3.0)
An option like .allow_new_levels makes sense if you give it a new level prior to model building. That's what recipes::step_novel() does.
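To illustrate, here is a rough sketch of how step_novel() could be applied to the iris subset from the reprex above. It assumes the recipes package is installed and relies on the default name of the reserved level ("new"); treat it as an outline rather than a recommended workflow:

```r
library(recipes)

# Same training subset as in the reprex: only setosa and versicolor remain
smol_iris <- droplevels(iris[1:99, ])

rec <- recipe(Sepal.Length ~ ., data = smol_iris)
# Reserve an extra factor level (named "new" by default) before model building
rec <- step_novel(rec, Species)
prepped <- prep(rec, training = smol_iris)

# Row 100 is versicolor; rows 101 to 105 are virginica, unseen in training
baked <- bake(prepped, new_data = iris[100:105, ])

# Unseen values are mapped to the reserved level instead of raising an error
baked$Species
```

A model fitted on the baked training data would still have no information about the reserved level, so this only avoids the error; it does not make predictions for unseen levels meaningful.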
Thanks @topepo - yes this is what I do in practice. I use some variation of this when changing values of a factor:
dplyr::mutate(
reversed = ifelse(treatment == 'drug-a', 'drug-b', 'drug-a'),
treatment = factor(reversed, levels = levels(treatment))
)
although it's often referencing the levels in the original data, i.e. treatment = factor(reversed, levels = levels(training_data$treatment)), and similarly mirroring the is-ordered attribute from the previous factor.
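Mirroring both the levels and the is-ordered attribute from the original data might look like the following sketch. The training_dose vector and its values are hypothetical and stand in for any ordered column in the training data:

```r
# Hypothetical ordered factor standing in for a training-data column
training_dose <- factor(
  c("low", "high", "medium", "low"),
  levels = c("low", "medium", "high"),
  ordered = TRUE
)

# New values for the prediction scenario, as plain characters
new_values <- c("high", "high", "low", "medium")

# Copy the level set, level order and ordered-ness from the training column
scenario_dose <- factor(
  new_values,
  levels = levels(training_dose),
  ordered = is.ordered(training_dose)
)

is.ordered(scenario_dose)
levels(scenario_dose)
```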
I haven't found a way to work recipes into this workflow yet; this is something I'll investigate further. I almost never want new levels since, as you point out, they will break most scenarios for this training/test use case.
But I agree this usage is generally simple enough that it may not warrant a new function. Thanks for the examples!
This request is closely related to this question on the community board: https://community.rstudio.com/t/recoding-factors-using-if-else/14014
It has two parts to it -
Let's say I have some data on which I want to fit a model, and then I want to modify this existing data to support a prediction scenario. In the course of this preparation, I want/need to ensure that the factor levels and ideally their orders haven't changed compared to the original dataset. If I retain the factor levels, then my model.matrix construction -- and any ancillary mapping from values of fields to vectors of parameters -- will be consistent.
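To make the model.matrix point concrete, here is a small base-R sketch on made-up data (the four patients and the formula are assumptions for illustration). When the scenario data keep the training levels, the design matrix has the same columns as the one built from the training data:

```r
# Hypothetical training data: four patients, two treatments
train <- data.frame(
  treatment = factor(c("drug-a", "drug-b", "drug-b", "drug-a")),
  age = c(50, 60, 70, 65)
)
mm_train <- model.matrix(~ treatment + age, data = train)

# Counterfactual: patients who received drug-b, recoded as drug-a,
# while keeping the level set of the training data
scenario <- train[train$treatment == "drug-b", ]
scenario$treatment <- factor(
  rep("drug-a", nrow(scenario)),
  levels = levels(train$treatment)
)
mm_scenario <- model.matrix(~ treatment + age, data = scenario)

# Same columns in the same order, so fitted coefficients still line up
identical(colnames(mm_train), colnames(mm_scenario))

# The drug-b indicator is zero for every counterfactual row
mm_scenario[, "treatmentdrug-b"]
```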
To be specific, let's say I have some training data for a set of patients in a clinical trial:
I typically convert all characters to factors in order to retain their "training" levels, using mutate_if.
Let's say I now want to generate a prediction for a counterfactual-of-drugB scenario, i.e. the subset of patients who received drug-b, but modified so that the data look like they had received drug-a.
Ideally, I would do something like this:
Or,
Right now, I have a work-around that can "apply" factor levels & labels from the training data to the final prediction dataset, but being able to modify values of factors seems like a core function & so I thought I would post here.
Proposal
I don't have a great proposal for what this solution looks like, but one option might be to have a function with signature:
fct_modify(.factor, .x, .allow_new_levels = TRUE)
This would allow for a natural syntax in the simple case:
But it could also allow more complicated workflows, e.g.:
Within the function, it would operate much like as_factor, except that it would apply factor levels & check any logic against the .factor object before returning.
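For discussion purposes, one possible reading of that description is sketched below in base R. This fct_modify() is hypothetical and does not exist in forcats. The recycling of a single value, the error message, and the choice to append new levels at the end are all assumptions of this sketch:

```r
# Hypothetical helper, NOT part of forcats: replace the values of a factor
# while re-using the levels (and ordered-ness) of the original factor
fct_modify <- function(.factor, .x, .allow_new_levels = TRUE) {
  stopifnot(is.factor(.factor))
  new_values <- as.character(.x)

  # Assumption: a single value is recycled to the length of .factor
  if (length(new_values) == 1L) {
    new_values <- rep(new_values, length(.factor))
  }

  # Values that are not among the existing levels
  unseen <- setdiff(unique(new_values[!is.na(new_values)]), levels(.factor))
  if (length(unseen) > 0L && !.allow_new_levels) {
    stop("values not among the existing levels: ",
         paste(unseen, collapse = ", "))
  }

  # Assumption: any permitted new levels are appended after the old ones
  factor(
    new_values,
    levels = c(levels(.factor), unseen),
    ordered = is.ordered(.factor)
  )
}

treatment <- factor(rep("drug-b", 3), levels = c("drug-a", "drug-b"))
fct_modify(treatment, "drug-a")
```

With .allow_new_levels = FALSE, a call such as fct_modify(treatment, "drug-c", .allow_new_levels = FALSE) stops with an error rather than silently adding a level.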