spsanderson / healthyR.ai

healthyR.ai - AI package for the healthyverse
http://www.spsanderson.com/healthyR.ai/

hai_glmnet_data_prepper() #247

Closed spsanderson closed 2 years ago

spsanderson commented 2 years ago

Function:

#' Prep Data for glmnet - Recipe
#'
#' @family Preprocessor
#' @family glmnet
#'
#' @author Steven P. Sanderson II, MPH
#'
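#' @details This function will automatically prep your data.frame/tibble for
#' use in the glmnet algorithm. It expects a data set that `recipes::recipe()`
#' can accept and a formula naming the outcome. Character columns are converted
#' to factors, nominal predictors are dummy encoded, zero-variance predictors
#' are removed, and numeric predictors are centered and scaled.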
#'
#' This function will output a recipe specification.
#'
#' @description Automatically prep a data.frame/tibble for use in the glmnet algorithm.
#'
#' @param .data The data that you are passing to the function. Can be any type
#' of data that is accepted by the `data` parameter of the `recipes::recipe()`
#' function.
#' @param .recipe_formula The formula that is passed to the recipe. For example,
#' if you are using the `iris` data then the formula would most likely be
#' something like `Species ~ .`
#'
#' @examples
#' hai_glmnet_data_prepper(.data = Titanic, .recipe_formula = Survived ~ .)
#' rec_obj <- hai_glmnet_data_prepper(.data = Titanic, .recipe_formula = Survived ~ .)
#' get_juiced_data(rec_obj)
#'
#' @return
#' A recipe object
#'
#' @export
#'

hai_glmnet_data_prepper <- function(.data, .recipe_formula){

  # Recipe ----
  rec_obj <- recipes::recipe(.recipe_formula, data = .data) %>% 
    ## For modeling, it is preferred to encode qualitative data as factors 
    ## (instead of character). 
    recipes::step_string2factor(tidyselect::vars_select_helpers$where(is.character)) %>% 
    recipes::step_novel(recipes::all_nominal_predictors()) %>% 
    ## This model requires the predictors to be numeric. The most common 
    ## method to convert qualitative predictors to numeric is to create 
    ## binary indicator variables (aka dummy variables) from these 
    ## predictors. 
    recipes::step_dummy(recipes::all_nominal_predictors()) %>% 
    ## Regularization methods sum up functions of the model slope 
    ## coefficients. Because of this, the predictor variables should be on 
    ## the same scale. Before centering and scaling the numeric predictors, 
    ## any predictors with a single unique value are filtered out. 
    recipes::step_zv(recipes::all_predictors()) %>% 
    recipes::step_normalize(recipes::all_numeric_predictors()) 

  # Return ----
  return(rec_obj)

}
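
Because the function returns an untrained recipe specification, its steps still have to be estimated and applied before any transformed data exists. The following is a minimal sketch of that follow-up, assuming the `recipes` package is installed. The shortened recipe is a stand-in for illustration and not the function's full step list, and `titanic_df`, `prepped`, and `baked` are hypothetical names.

```r
library(recipes)  # provides recipe(), step_*(), prep(), bake() and re-exports %>%

# as.data.frame() on the built-in Titanic table yields factor columns
# (Class, Sex, Age, Survived) plus a numeric Freq column.
titanic_df <- as.data.frame(Titanic)

# Stand-in recipe (assumption): dummy-encode nominal predictors, then
# center and scale every numeric predictor.
rec_obj <- recipe(Survived ~ ., data = titanic_df) %>%
  step_dummy(all_nominal_predictors()) %>%
  step_normalize(all_numeric_predictors())

# prep() estimates what each step needs (factor levels, means, sds);
# bake(new_data = NULL) returns the processed training data.
prepped <- prep(rec_obj, training = titanic_df)
baked <- bake(prepped, new_data = NULL)

dim(baked)
```

The same `prepped` object can later be applied to unseen rows with `bake(prepped, new_data = ...)`, so the centering and scaling values come from the training data only.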

Example:

> hai_glmnet_data_prepper(.data = Titanic, .recipe_formula = Survived ~ .)
Recipe

Inputs:

      role #variables
   outcome          1
 predictor          4

Operations:

Factor variables from tidyselect::vars_select_helpers$where(is.character)
Novel factor level assignment for recipes::all_nominal_predictors()
Dummy variables from recipes::all_nominal_predictors()
Zero variance filter on recipes::all_predictors()
Centering and scaling for recipes::all_numeric_predictors()
> rec_obj <- hai_glmnet_data_prepper(.data = Titanic, .recipe_formula = Survived ~ .)
> get_juiced_data(rec_obj)
# A tibble: 32 x 7
        n Survived Class_X2nd Class_X3rd Class_Crew Sex_Male Age_Child
    <dbl> <fct>         <dbl>      <dbl>      <dbl>    <dbl>     <dbl>
 1 -0.506 No           -0.568     -0.568     -0.568    0.984     0.984
 2 -0.506 No            1.70      -0.568     -0.568    0.984     0.984
 3 -0.248 No           -0.568      1.70      -0.568    0.984     0.984
 4 -0.506 No           -0.568     -0.568      1.70     0.984     0.984
 5 -0.506 No           -0.568     -0.568     -0.568   -0.984     0.984
 6 -0.506 No            1.70      -0.568     -0.568   -0.984     0.984
 7 -0.381 No           -0.568      1.70      -0.568   -0.984     0.984
 8 -0.506 No           -0.568     -0.568      1.70    -0.984     0.984
 9  0.362 No           -0.568     -0.568     -0.568    0.984    -0.984
10  0.627 No            1.70      -0.568     -0.568    0.984    -0.984
# ... with 22 more rows
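
The values in the dummy columns above are no longer 0 and 1 because the dummy variables are centered and scaled along with the other numeric predictors. As a rough check, the `Class_X2nd` column can be reproduced by hand in base R. This sketch assumes scaling uses the sample standard deviation, which is consistent with the printed output; `titanic_df`, `class_2nd`, and `class_2nd_scaled` are names introduced here for illustration.

```r
# Base R only: rebuild the Class_X2nd column by hand.
titanic_df <- as.data.frame(Titanic)  # 32 rows: Class, Sex, Age, Survived, Freq

# Dummy indicator: 1 when Class is "2nd", 0 otherwise (8 of 32 rows are 1)
class_2nd <- as.numeric(titanic_df$Class == "2nd")

# Center on the mean (0.25) and divide by the sample sd (sqrt(6 / 31))
class_2nd_scaled <- (class_2nd - mean(class_2nd)) / sd(class_2nd)

round(unique(class_2nd_scaled), 3)
# -0.568  1.705
```

These match the -0.568 and 1.70 shown in the tibble, which prints three significant digits.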