tidymodels / workflows

Modeling Workflows
https://workflows.tidymodels.org/

Pre-specified models with multiple outcomes and predictors #85

Closed shah-in-boots closed 4 years ago

shah-in-boots commented 4 years ago

Feature

I'm wondering if there is a place for a feature to pre-specify a moderate number of models, and then run these hypotheses with the intent of keeping every specified model. I come from a clinical/epidemiological context, and I routinely work with models that are not meant to be tuned or adjusted to reach a certain threshold for being an effective model; negative results are kept as well. I also work with many outcomes and predictors that are interpretable (thus feature reduction is not as helpful).

I have been using the tidymodels approach for the majority of my modeling, but one issue I keep coming up against is that I have to respecify my formulas over and over. There are probably many other methodologies for clinical/epi modeling, but two general concepts that I use are:

  1. Sequential model building, such that y ~ x1, y ~ x1 + x2 are created, each with an intentional reason why a variable was added and in what order. The effect size, and the change in effect size as predictors are added, is helpful for understanding causality.
  2. Parallel model building*, such that y ~ x1 and y ~ x2 are created, usually to demonstrate the individual effects side by side. This makes it possible to compare the relative effects in a transparent way.

*this is not referencing computational requirements of sequential v. parallel processing
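
As a minimal sketch of the two concepts above, the formulas can be generated with base R's reformulate(). The helper names sequential_formulas() and parallel_formulas() and the variables y, x1, x2, x3 are hypothetical placeholders for illustration, not part of any package.

```r
# Hypothetical outcome and predictor names, used only for illustration
outcome    <- "y"
predictors <- c("x1", "x2", "x3")

# Sequential: y ~ x1, y ~ x1 + x2, ... adding predictors in the order given
sequential_formulas <- function(outcome, predictors) {
  lapply(seq_along(predictors), function(j) {
    reformulate(predictors[seq_len(j)], response = outcome)
  })
}

# Parallel: y ~ x1, y ~ x2, ... one predictor per model
parallel_formulas <- function(outcome, predictors) {
  lapply(predictors, function(p) reformulate(p, response = outcome))
}

seq_f <- sequential_formulas(outcome, predictors)
par_f <- parallel_formulas(outcome, predictors)
```

Each element of seq_f and par_f is an ordinary formula object, so it can be passed to lm() or any other formula-based fitting function.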

For an analysis that had >5-6 outcomes, >10-12 predictors, there would be dozens of models. The goal would be to simplify this process. I wrote a rudimentary function build_models() to solve this problem for myself, which worked great for straightforward models (e.g. lm()). I then added the option to work with linear mixed models, and circular statistical models, by allowing myself to specify a "model engine".

build_models <- function(formula, data, type, engine = "linear", exposure = NULL) {

  # Get terms
  o <- all.vars(lme4::nobars(formula)[[2]])
  p <- all.vars(lme4::nobars(formula)[[3]])
  m <- lme4::findbars(formula) # Random-effects terms, if any (NULL when the model is not mixed)
  no <- length(o)
  np <- length(p)
  nm <- length(m)
  mixed <- nm >= 1

  # Type of model to build
  l <- list()
  switch(
    type,
    parallel = {
      for(i in 1:no) {
        for(j in 1:np) {
          # Add mixed effects here
          if(mixed) {
            mix <- paste0("(", m, ")", collapse = " + ")
            predictors <- paste0(p[j], " + ", mix)
          } else {
            predictors <- p[j]
          }

          # Create formulas
          f <- stats::formula(paste0(o[[i]], " ~ ", predictors))
          l[[o[[i]]]][[j]] <- f
        }
      }
    },
    sequential = {
      # Ensure exposure is maintained if sequential
      if(!is.null(exposure) && exposure %in% p) {
        # Move the exposure to the front so it appears in every model
        p <- c(exposure, p[p != exposure])
      }

      # Creating formulas
      for(i in 1:no) {
        for(j in 1:np) {
          # Add mixed effects here if needed
          predictors <- paste0(p[1:j], collapse = " + ")
          if(mixed) {
            mix <- paste0("(", m, ")", collapse = " + ")
            f <- stats::formula(paste0(o[[i]], " ~ ", predictors, " + ", mix))
          } else {
            f <- stats::formula(paste0(o[[i]], " ~ ", predictors))
          }

          # Save them
          l[[o[[i]]]][[j]] <- f
        }
      }
    }
  )

  models <- list()
  switch(
    engine,
    linear = {
      for(i in 1:no) {
        for(j in 1:np) {
          if(mixed) {
            m <- lme4::lmer(formula = l[[i]][[j]], data = data)
            models[[o[[i]]]][[j]] <- broom.mixed::tidy(m, conf.int = TRUE)
          } else {
            m <- stats::lm(formula = l[[i]][[j]], data = data)
            models[[o[[i]]]][[j]] <- broom::tidy(m, conf.int = TRUE)
          }
        }
      }
    },
    circular = {
      for(i in 1:no) {
        for(j in 1:np) {
          # Generate data
          f <- l[[i]][[j]]
          mat <- stats::model.frame(f, data = data)
          x <- stats::model.matrix(f, data = mat)
          y <- mat[[o[[i]]]] # Index by the outcome name directly; `[[` does not accept a symbol
          m <- circular::lm.circular(y = y, x = x, type = "c-l", init = rep(0, ncol(x)), tol = 1e-3, verbose = FALSE)

          # Tidy it (assuming intercept is first); tidy.circular() is presumably a user-defined helper not shown here
          models[[o[[i]]]][[j]] <- tidy.circular(m, conf.int = TRUE)

        }
      }
    }
  )
  # The models above were created using the formulas in `l`

  # Tidy it up
  res <-
    dplyr::as_tibble(models) %>%
    tidyr::pivot_longer(
      cols = tidyr::everything(),
      names_to = "outcomes",
      values_to = "models"
    ) %>%
    dplyr::mutate(covar = purrr::map_dbl(models, nrow) - 1) %>%
    tidyr::unnest(cols = "models")

  # Return
  return(res)
}

Given how much tidymodels has simplified working with most regression packages, I was thinking that there must be a way to specify a workflow() that aggregates multiple formulas with model specifications and allows them to be run together.

I wanted to see whether this idea could build on model specifications from parsnip, perhaps as a specific type of workflow that fits a fair number of models at once. For example, the formula could be...

f <- plant_height + greenness ~ soil + rainfall + co2 + sunshine
rec <- recipe(f, data = x) %>%
  step_normalize(all_predictors())

The intent here would be to have a workflow that is specified such that the appropriate models are built.

# Type of models
lm_mods <- linear_reg()

# Theoretical workflow
workflow() %>%
  add_recipe(rec) %>%
  add_type_of_modeling("sequential") %>%
  add_model(lm_mods)

This would give an output of sequentially built models for both outcomes using the 4 predictors. Additional features or options could include exposures held constant between models.

These options would likely depend on the underlying models, though. I think this would really only apply to the more traditional "supervised learning" with linear/logistic/polynomial/harmonic regressions.

juliasilge commented 4 years ago

I think it's unlikely that we'll work on setting up functions to support this kind of "parallel" or "sequential" modeling within workflows itself, but the pieces of workflows are very flexible and composable, and they lend themselves to building up these kinds of models in a fluent way.

Instead of a formula, think about using a recipe or the new-ish add_variables() function, where you can supply a vector of predictors. For example, you could set up a "sequential" set of models for all the predictors in mtcars (cyl, then cyl + disp, then cyl + disp + hp, etc.) like this:

library(tidymodels)
library(vctrs)
#> 
#> Attaching package: 'vctrs'
#> The following object is masked from 'package:tibble':
#> 
#>     data_frame
#> The following object is masked from 'package:dplyr':
#> 
#>     data_frame

outcome <- "mpg"
predictors <- setdiff(names(mtcars), outcome)

lm_spec <- linear_reg() %>% set_engine("lm")

## make a little function to create a workflow with `mpg` as outcome and our set of predictors
wf_seq <- function(preds) {
  workflow() %>%
    add_model(lm_spec) %>%
    add_variables(outcomes = mpg, predictors = !!preds)
}

## set up the "sequential" set of predictors and create each workflow, then fit
tibble(num_preds = 1:length(predictors)) %>%
  mutate(preds     = map(num_preds, ~vec_slice(predictors, 1:.))) %>%
  mutate(wf        = map(preds, wf_seq),
         fitted_wf = map(wf, fit, mtcars))
#> # A tibble: 10 x 4
#>    num_preds preds      wf         fitted_wf    
#>        <int> <list>     <list>     <list>    
#>  1         1 <chr [1]>  <workflow> <workflow>
#>  2         2 <chr [2]>  <workflow> <workflow>
#>  3         3 <chr [3]>  <workflow> <workflow>
#>  4         4 <chr [4]>  <workflow> <workflow>
#>  5         5 <chr [5]>  <workflow> <workflow>
#>  6         6 <chr [6]>  <workflow> <workflow>
#>  7         7 <chr [7]>  <workflow> <workflow>
#>  8         8 <chr [8]>  <workflow> <workflow>
#>  9         9 <chr [9]>  <workflow> <workflow>
#> 10        10 <chr [10]> <workflow> <workflow>

Created on 2020-11-12 by the reprex package (v0.3.0.9001)

The workflows in the wf column are unfitted, and the ones in fitted_wf are fitted. You could do something similar for what you are calling "parallel" models, or with sets of outcomes and predictors. I don't think we are going to support this kind of modeling directly with a dedicated function, but we do have the infrastructure here that allows you to flexibly put together what you want to do.
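
A rough sketch of how the "parallel" variant with several outcomes might look, following the same pattern as the example above. The choice of outcomes (mpg, qsec) and predictors (wt, hp, disp) from mtcars and the helper name wf_one() are assumptions made for illustration; extract_fit_engine() assumes a reasonably recent version of workflows.

```r
library(tidymodels)

# Illustrative choice of outcomes and predictors from mtcars
outcomes   <- c("mpg", "qsec")
predictors <- c("wt", "hp", "disp")

lm_spec <- linear_reg() %>% set_engine("lm")

## hypothetical helper: one workflow per outcome/predictor pair
wf_one <- function(outcome, pred) {
  workflow() %>%
    add_model(lm_spec) %>%
    add_variables(outcomes = all_of(outcome), predictors = all_of(pred))
}

## every outcome crossed with every single predictor, then fit and tidy
parallel_fits <- crossing(outcome = outcomes, pred = predictors) %>%
  mutate(
    wf        = map2(outcome, pred, wf_one),
    fitted_wf = map(wf, fit, data = mtcars),
    coefs     = map(fitted_wf, ~ tidy(extract_fit_engine(.x), conf.int = TRUE))
  )
```

Unnesting the coefs column (for example with unnest(coefs)) would give one row per term per model, which is close to the tidy table that build_models() returns.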

shah-in-boots commented 4 years ago

I think this is an excellent response - with an excellent example. I had done something similar with mutate and broom for nested datasets to get repeat analyses, but this is much more flexible in that I can build a modeling "tibble" to describe how the models should be arranged, and workflows allows me to put together any number of pre-specified models from parsnip.

Thanks for this answer! I'm closing the "issue" with this comment.
