Closed shah-in-boots closed 4 years ago
I think it's unlikely that we'll add functions for this kind of "parallel" or "sequential" modeling to workflows itself, but the pieces of workflows are flexible and composable, and they lend themselves to building up these kinds of models in a fluent way.
Instead of a formula, think about using a recipe or the new-ish add_variables() function, where you can supply a vector of predictors. For example, you could set up a "sequential" set of models for all the predictors in mtcars (cyl, then cyl + disp, then cyl + disp + hp, etc.) like this:
library(tidymodels)
library(vctrs)
#>
#> Attaching package: 'vctrs'
#> The following object is masked from 'package:tibble':
#>
#> data_frame
#> The following object is masked from 'package:dplyr':
#>
#> data_frame
outcome <- "mpg"
predictors <- setdiff(names(mtcars), outcome)
lm_spec <- linear_reg() %>% set_engine("lm")
## make a little function to create a workflow with `mpg` as outcome and our set of predictors
wf_seq <- function(preds) {
  workflow() %>%
    add_model(lm_spec) %>%
    add_variables(outcomes = mpg, predictors = !!preds)
}
## set up the "sequential" set of predictors and create each workflow, then fit
tibble(num_preds = seq_along(predictors)) %>%
  mutate(preds = map(num_preds, ~ vec_slice(predictors, seq_len(.x)))) %>%
  mutate(wf = map(preds, wf_seq),
         fitted_wf = map(wf, fit, mtcars))
#> # A tibble: 10 x 4
#> num_preds preds wf fitted_wf
#> <int> <list> <list> <list>
#> 1 1 <chr [1]> <workflow> <workflow>
#> 2 2 <chr [2]> <workflow> <workflow>
#> 3 3 <chr [3]> <workflow> <workflow>
#> 4 4 <chr [4]> <workflow> <workflow>
#> 5 5 <chr [5]> <workflow> <workflow>
#> 6 6 <chr [6]> <workflow> <workflow>
#> 7 7 <chr [7]> <workflow> <workflow>
#> 8 8 <chr [8]> <workflow> <workflow>
#> 9 9 <chr [9]> <workflow> <workflow>
#> 10 10 <chr [10]> <workflow> <workflow>
Created on 2020-11-12 by the reprex package (v0.3.0.9001)
The workflows in the wf column are unfitted, and the ones in fitted_wf are fitted. You could do something similar for what you are calling "parallel" models, or with sets of outcomes and predictors. I don't think we are going to support this kind of modeling directly with a dedicated function, but the infrastructure here lets you flexibly put together what you want to do.
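As a rough sketch of the "parallel" variant mentioned above (one predictor per model, crossed with a set of outcomes), something like the following should work. The outcomes and predictors are arbitrary mtcars columns chosen for illustration, wf_par() is a hypothetical helper name, and all_of() is used to pass column names stored as strings; treat it as one possible arrangement, not a supported workflows feature.

```r
library(tidymodels)

lm_spec <- linear_reg() %>% set_engine("lm")

## hypothetical choices for illustration: two outcomes, four predictors
outcomes   <- c("mpg", "qsec")
predictors <- c("cyl", "disp", "hp", "wt")

## build a workflow for one outcome and one predictor, both given as strings
wf_par <- function(outcome, predictor) {
  workflow() %>%
    add_model(lm_spec) %>%
    add_variables(outcomes = all_of(outcome), predictors = all_of(predictor))
}

## one row per outcome/predictor pair; each row gets its own workflow and fit
parallel_models <- expand_grid(outcome = outcomes, predictor = predictors) %>%
  mutate(wf = map2(outcome, predictor, wf_par),
         fitted_wf = map(wf, fit, data = mtcars))
```

Each row of parallel_models then describes one single-predictor model, so the tibble itself documents which models were pre-specified.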
I think this is an excellent response - with an excellent example. I had done something similar with mutate and broom for nested datasets to get repeated analyses, but this is much more flexible in that I can build a modeling "tibble" to describe how the models should be arranged, and workflows allows me to put together any number of pre-specified models from parsnip.
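A minimal sketch of how the modeling tibble and broom could be combined to keep the coefficients from every pre-specified model. It assumes a recent workflows release that provides extract_fit_parsnip(); the three predictors are an arbitrary subset of mtcars, and the object names models and coef_table are made up for this example.

```r
library(tidymodels)

lm_spec <- linear_reg() %>% set_engine("lm")
predictors <- c("cyl", "disp", "hp")  # hypothetical subset of mtcars columns

## same idea as wf_seq() in the answer, using all_of() for the string vector
wf_seq <- function(preds) {
  workflow() %>%
    add_model(lm_spec) %>%
    add_variables(outcomes = mpg, predictors = all_of(preds))
}

models <- tibble(num_preds = seq_along(predictors)) %>%
  mutate(preds = map(num_preds, ~ predictors[seq_len(.x)]),
         wf = map(preds, wf_seq),
         fitted_wf = map(wf, fit, data = mtcars))

## pull the parsnip fit out of each workflow and tidy it into one long table
coef_table <- models %>%
  mutate(coefs = map(fitted_wf, ~ tidy(extract_fit_parsnip(.x)))) %>%
  select(num_preds, coefs) %>%
  unnest(coefs)
```

Because nothing is filtered out, coef_table keeps every term from every model, including ones that turn out to be null results.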
Thanks for this answer! I'm closing the "issue" with this comment.
Feature
I'm wondering if there is a place for a feature to pre-specify a moderate number of models, and then run these hypotheses with the intent of keeping every specified model. I come from a clinical/epidemiological context and routinely work with models that are not meant to be tuned or adjusted until they reach a certain threshold for being an effective model. Every specified model is reported, including negative results. I also work with many outcomes and predictors that are interpretable (thus feature reduction is not as helpful).
I have been using the tidymodels approach to do a majority of my modeling, but one issue I kept coming up against is that I have to respecify my formulas over and over. There are probably many other methodologies for clinical/epi modeling, but two general concepts that I use are "sequential" and "parallel" modeling.*
*This is not referencing the computational requirements of sequential vs. parallel processing.
For an analysis that had >5-6 outcomes and >10-12 predictors, there would be dozens of models. The goal would be to simplify this process. I wrote a rudimentary function build_models() to solve this problem for myself, which worked great for straightforward models (e.g. lm()). I then added the option to work with linear mixed models and circular statistical models by allowing myself to specify a "model engine".
Thinking about how simplified tidymodels has made most regression packages, I was thinking that there must be a way to specify a workflow() that allows the aggregation of multiple formulas with model specifications and allows them to be run together.
I thought to see if this is an idea that perhaps could use model specifications from parsnip, and perhaps create a type of specific workflow to fit a fair number of models at once. For example, the formula could be...
The intent here would be to have a workflow that is specified such that the appropriate models are built.
This would give an output of specific sequentially built models for both outcomes using the 4 predictors. Additional features or options could include exposures held constant between models, but these would likely depend on the underlying models. I think this would really only apply to the more traditional "supervised learning" with linear/logistic/polynomial/harmonic regressions.
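To make the request concrete, here is a hedged sketch of what "sequentially built models for both outcomes, with an exposure held constant" could look like using existing workflows pieces. The variable roles are invented for illustration (wt as the exposure, mpg and qsec as outcomes), wf_build() is a hypothetical helper, and this is not the build_models() function described above.

```r
library(tidymodels)

lm_spec <- linear_reg() %>% set_engine("lm")

outcomes   <- c("mpg", "qsec")        # hypothetical outcomes
exposure   <- "wt"                    # hypothetical exposure kept in every model
covariates <- c("cyl", "disp", "hp")  # hypothetical covariates added one at a time

wf_build <- function(outcome, preds) {
  workflow() %>%
    add_model(lm_spec) %>%
    add_variables(outcomes = all_of(outcome), predictors = all_of(preds))
}

## predictor sets: exposure alone, then exposure plus the first k covariates
pred_sets <- map(0:length(covariates), ~ c(exposure, covariates[seq_len(.x)]))

## cross outcomes with the sequential predictor sets, then build and fit
sequential_models <- expand_grid(outcome = outcomes, preds = pred_sets) %>%
  mutate(n_preds = lengths(preds),
         wf = map2(outcome, preds, wf_build),
         fitted_wf = map(wf, fit, data = mtcars))
```

With two outcomes and four nested predictor sets this yields eight fitted workflows, and swapping lm_spec for another parsnip specification would change the underlying model without changing the layout.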