tidymodels / workflows

Modeling Workflows
https://workflows.tidymodels.org/

add variables and special model formulas #34

Closed topepo closed 3 years ago

topepo commented 4 years ago

With upcoming hierarchical models, GAMs, and others, we need to make the workflow interface smoother.

Currently, it is not intuitive in a few ways:

Historically, the model formula has always done many things: specify the variables in the model, create encodings for them, and then hand them off to the model with the appropriate analysis roles (e.g., outcome, predictor, etc.).
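As a quick illustration of the encoding step (a minimal sketch using the sleepstudy data from lme4, which is also used below), the standard formula machinery expands a factor such as Subject into one indicator column per level:

data(sleepstudy, package = "lme4")

# model.matrix() is roughly what the default formula method does:
# the Subject factor becomes a set of indicator (dummy) columns.
head(colnames(model.matrix(Reaction ~ Days + Subject, data = sleepstudy)))
#> e.g. "(Intercept)" "Days" "Subject309" "Subject310" ...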

Example

For example, if there was a parsnip hierarchical model to fit via stan or lme4, a user's initial stab would be:

library(tidymodels)

data(sleepstudy, package = "lme4")

mod <- linear_reg() %>% set_engine("stan glmer")

wflow_0 <- 
  workflow() %>% 
  # Won't work since the basic formula method makes dummy variables
  add_formula(Reaction ~ Days + (Days || Subject)) %>% 
  add_model(mod)

fit() will generate the error:

Error in Days || Subject : invalid 'y' type in 'x || y'

(which could be better)
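For reference, the call that triggers it would be along these lines, using the sleepstudy data loaded above:

fit(wflow_0, data = sleepstudy)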

Looking around, one finds the formula argument to add_model():

wflow_1 <- 
  workflow() %>% 
  # Make a simple formula for processing the data 
  add_formula(Reaction ~ Days + Subject) %>% 
  # Then add another formula to give to the model: 
  add_model(mod, formula = Reaction ~ Days + (Days || Subject))

That ends in an error of

Error in eval(predvars, data, env) : object 'Subject' not found 

because add_formula() makes dummy variables.
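To see why (a minimal sketch of roughly the pre-processing that add_formula() performs by default), the formula blueprint expands the Subject factor into indicator columns, so no bare Subject column is left for the model formula to find:

library(hardhat)

# The default formula blueprint turns the Subject factor into indicator
# columns, so the processed predictors no longer contain `Subject` itself.
processed <- mold(Reaction ~ Days + Subject, sleepstudy)
head(names(processed$predictors))
#> e.g. "Days" "Subject308" "Subject309" ...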

Current solution

After searching a lot more, there are two options that are kludgy but work:

bp <- hardhat::default_formula_blueprint(indicators = FALSE)
wflow_2 <- 
  workflow() %>% 
  add_formula(Reaction ~ ., blueprint = bp) %>% 
  add_model(mod, formula = Reaction ~ Days + (Days || Subject))

wflow_3 <- 
  workflow() %>% 
  add_recipe(recipe(Reaction ~ ., data = sleepstudy)) %>% 
  add_model(mod, formula = Reaction ~ Days + (Days || Subject))

We can make this interface a lot better and more intuitive.

Proposals

Some straw-man proposals:

First, let's make a function where users can tell the model what data to use, and maybe their limited roles, without doing any pre-processing:

wflow_4 <- 
  workflow() %>% 
  # Add in the data by processing through only `model.frame()` or equivalent. 
  # No other in-line functions used; just as-is:
  add_variables_asis(Reaction ~ .) %>% 
  add_model(mod, formula = Reaction ~ Days + (Days || Subject))

Having two formulas might be confusing. Basic tidyselect tools could be used instead:

wflow_5 <- 
  workflow() %>% 
  # If formulas are confusing, we could use tidyselect functions
  add_variables(one_of(Reaction, Days, Subject)) %>% 
  add_model(mod, formula = Reaction ~ Days + (Days || Subject))

Even though the endpoint could be achieved using current code, the existing methods are not intuitive and also not well documented in workflows.

Second, even though the model formula is tied to the model, it might be better to have a separate add function that attaches a model formula to a model specification:

wflow_6 <- 
  workflow() %>% 
  add_variables(one_of(Reaction, Days, Subject)) %>% 
  add_model(mod) %>% 
  add_model_formula(Reaction ~ Days + (Days || Subject))

A few people might want to add input: @jaredlander, @beckmart, @monicathieu, @billdenney, @emitanaka, and @Athanasiamo

EmilHvitfeldt commented 4 years ago

I like the second proposal (wflow_6). I think it feels better to separate the model and the fit.

Side note: when making these changes, we need to make sure they also work with Surv() for survival regression models.

wflow_surv <- 
  workflow() %>% 
  add_variables(one_of(time, status, x)) %>% 
  add_model(mod) %>% 
  add_model_formula(Surv(time, status) ~ x)

monicathieu commented 4 years ago

Of the straw man proposals you've provided, I think I +1 on wflow_6.

A question I have: at least in lme4, I'm pretty sure two model matrices get created, one for the across-units/fixed effects and one for the within-units/random effects. If the issue with wflow_0 is that the basic formula method makes dummy variables in a way that's inconsistent with what, say, lme4 expects, could an alternate method be something like the below?

wflow_0prime <- 
  workflow() %>% 
  add_formula(Reaction ~ Days) %>% 
  add_formula_special(Days | Subject) %>%
  add_model(mod)

where the hierarchical model term gets treated differently, possibly getting around the dummy-variable issue. It also has the added UI benefit of getting people to think in a more compartmental manner about the different sections of their model.
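As a quick aside on the two-model-matrix point (a sketch assuming lme4 and the sleepstudy data), lme4's own formula processing does build separate design matrices for the fixed and random effects:

library(lme4)

# lFormula() performs the formula processing used by lmer():
# X is the fixed-effects design; reTrms$Zt is the transposed
# random-effects design.
parts <- lFormula(Reaction ~ Days + (Days | Subject), data = sleepstudy)
dim(parts$X)
dim(parts$reTrms$Zt)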

billdenney commented 4 years ago

I also like wflow_6.

I especially like @monicathieu's suggestion of something like add_formula_special(), which would add to a normal formula interface the parts that are specific to the model type. The challenge with add_formula_special() is that, as written, the model would not know how to use the "special" part of the formula.

The model type is added after, so the object would not know that it was a random effect model.

Tangentially related to this, @topepo, you may want to check out my formulops package, which allows many types of modifications of formulas as though they were standard math (as opposed to statistical math for linear effects). It also introduces the concept of a substituting formula, which can make substitutions into existing formulae easier:

library(formulops)

formula_sub <-
  substituting_formula(
    a~b,
    b~c*d,
    c~e/(f+g),
    d~h|j
  )
as.formula(formula_sub)
#> a ~ e/(f + g) * (h | j)

Created on 2020-04-10 by the reprex package (v0.3.0)

topepo commented 4 years ago

Does anyone have a preference between these two approaches?

  1. add_variables_asis(formula)
  2. add_variables(selectors)

The main downsides to each are:

jaredlander commented 4 years ago

I think I like add_variables() better. What if it had arguments for inputs and outputs?

topepo commented 4 years ago

What if it had arguments for inputs and outputs?

Probably not. I'd err on the side of simplicity.

emitanaka commented 4 years ago

For mixed models, you can specify the fixed and random effects via formulae but there needs to be another step in which you define the covariance structure.

workflow() %>% 
  add_formula(Reaction ~ Days + Subject + Subject:Days) %>% 
  set_covariance(~Subject + Subject:Days, ~us:id) 

I'm not saying the above is the way to go, but when I write the model equation mathematically, I'll write the model and then define the covariance structure (see the image below from this paper). So it makes sense to me that a prototype workflow would have some resemblance to what I do mathematically.

Alternatively, we could separate the fixed and random effects, although that feels too verbose to me (and lme4 doesn't support flexible covariance structures via formula anyhow):

workflow() %>% 
  set_mean_formula(Reaction ~ Days) %>% 
  set_random_formula(Subject + Subject:Days) %>%
  set_covariance(~Subject + Subject:Days, ~us:id) 

(image: symbolic specification of a linear mixed model, from the paper referenced above)

drmowinckels commented 4 years ago

I think there are some excellent suggestions here. I also think workflow 6 looks the most promising, and prefer add_variables with selectors. To me, this feels the most intuitive, and I think I could more easily teach the logic of that to others.

I also think @emitanaka makes a very good point regarding the covariance structure specifications. While I have not personally used this option, I know many researchers who actively do and would need to specify it. Having a good think about how to achieve that now rather than later will, I think, benefit development. I believe having this option would also cover the random vs. fixed effects model matrices that @monicathieu discusses.

topepo commented 4 years ago

For the extra specifications (like covariance structures in the asreml example), those would just come along for the ride; we just want to make sure that the right data are there with the right encoding.

I don't know that we would want to have workflow functions for every type of ancillary formula; those would/could be given to set_engine() (but now that I write that, it seems pretty kludgy). I'll think that over a bit.
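For what it's worth, a purely hypothetical sketch of that set_engine() route (there is no asreml engine in parsnip; the engine name and the random argument are illustrative only):

# Hypothetical: ancillary formulas passed through set_engine()'s `...`,
# which forwards extra arguments to the underlying engine function.
mod_cov <- 
  linear_reg() %>% 
  set_engine("asreml", random = ~ Subject + Subject:Days)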

mdancho84 commented 4 years ago

@topepo Chiming in because I'm running into a similar issue (Issue #39) in an attempt to extend parsnip to time series. My issue is related to Date and Datetime variables being converted to numeric during the workflow fitting process.

While I have no comments on your proposed solution, I do think a simple indicators = FALSE is a nice way to handle this in parsnip. My solution was to create a fit.arima_reg() function that wraps parsnip::fit.model_spec(indicators = FALSE).

I tried workflows::add_formula(blueprint = hardhat::default_formula_blueprint(indicators = FALSE)), but it didn't work for me.

I'm very interested in getting this resolved because this minor issue is really a pain for workflows, and tuning ARIMA models will be super beneficial once it gets resolved.

mdancho84 commented 4 years ago

Thinking about this a little further: workflows has a fit.workflow() method. For consistency with parsnip, we should be able to provide fit.workflow(indicators = FALSE) so it operates the same as fit.model_spec(indicators = FALSE) and performs no preprocessing/dummy variable creation internally.

What do you think of that option?

jaredlander commented 4 years ago

Sitting in the Stan workshop right now, and they just covered stan_glm() and stan_glmer(). Where in the {parsnip} workflow would the priors go? In linear_reg()? In set_engine()?

juliasilge commented 4 years ago

The priors go in set_engine(); you can see that in action here on the new tidymodels.org site.
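For example, a minimal sketch following the pattern in that article (the specific priors are illustrative):

library(tidymodels)
library(rstanarm)

# Prior distributions are passed as engine-specific arguments to
# set_engine("stan"), which hands them off to rstanarm.
bayes_mod <- 
  linear_reg() %>% 
  set_engine("stan",
             prior_intercept = rstanarm::student_t(df = 1),
             prior = rstanarm::student_t(df = 1))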

jaredlander commented 4 years ago

Thanks @juliasilge

jaredlander commented 4 years ago

Haven't been able to read that yet, so I'm just saying this out loud: it will be cool if we can tune over both the choice of prior and the parameters of the prior.

jaredlander commented 4 years ago

A few more thoughts came to mind.

First, vocab. The model formula can be thought of as a relationship. So instead of add_model_formula(), maybe add_relationship() or some similar verb-noun combination?

Second, roles, specifically for multilevel models. The grouping variables have a role, so maybe they are best specified in the recipe. But that does lead to the question of how to define which variables are subject level vs group level, unless there can be a nice way to put that in the recipe too.
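With current recipes tooling, that might look something like the following rough sketch (the role name is just illustrative):

library(tidymodels)
data(sleepstudy, package = "lme4")

# Give the grouping variable a custom role so it is carried along with
# the data without being treated as an ordinary predictor.
rec <- 
  recipe(Reaction ~ ., data = sleepstudy) %>% 
  update_role(Subject, new_role = "grouping variable")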

Third, tuning. The way the formula is built is part of the tuning process. Which variables are subject level and which are group level. Which variables get a smoothing term and which are linear in GAMs. So people will want to try numerous formulations. How do they "tune" over different formulations?

topepo commented 4 years ago

Which variables get a smoothing term and which are linear in GAMs. So people will want to try numerous formulations. How do they "tune" over different formulations?

Specifically for GAMs, the first iteration of that might follow caret's lead: equal smoothing for all parameters (unless otherwise specified). I might need to make version zero untunable and let people specify it in the formula. For example:

y ~ s(a, df = 2) + s(b) + c

The formulas are a bit of an API nightmare for the developer. I'd like to allow the user to write:

y ~ s(a, df = tune("a")) + s(b, df = tune("b")) + c

but that is pretty hard to do.

mdancho84 commented 4 years ago

@topepo - I just submitted modeltime 0.0.1 to CRAN and expect it to be available this week. https://github.com/business-science/modeltime

I'll be following this issue; just keep me posted on the indicators and set_encodings() changes.

Once parsnip and workflows are updated, I'll remove the special fit() methods to get around date conversion to numeric.

DavisVaughan commented 3 years ago

Closed in #68

github-actions[bot] commented 3 years ago

This issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex: https://reprex.tidyverse.org) and link to this issue.