tidymodels / recipes

Pipeable steps for feature engineering and data preprocessing to prepare for modeling
https://recipes.tidymodels.org

Replace missing values with a constant #473

Closed LordRudolf closed 4 years ago

LordRudolf commented 4 years ago

It would be useful to be able to impute NA values with a constant. For some models, constant imputation usually works better than median or other types of imputation. For example, in neural network models, replacing NA values with zeros works well because nodes with input values equal to 0 do not send any further signal. In tree-based models, imputing an extreme value (such as -99999) is suitable because the trees will make separate splits for these extreme values.

Could a constant imputation step be implemented in the R recipes package?

Thanks.

topepo commented 4 years ago

I think that using an arbitrary constant (like zero or -9999) is a really bad idea that can be implemented easily in step_mutate().
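
As a minimal sketch of the step_mutate() route mentioned above (the choice of airquality, the Solar.R column, and the constant 0 are illustrative assumptions, not part of the original comment), a missing value can be replaced inside the mutate expression itself:

```r
library(recipes)

# Replace missing Solar.R values with a constant (0) inside step_mutate().
# The column and the constant are illustrative assumptions.
rec <- recipe(Ozone ~ ., data = airquality) %>%
  step_mutate(Solar.R = ifelse(is.na(Solar.R), 0, Solar.R))

prepped <- prep(rec, training = airquality)
baked <- bake(prepped, new_data = airquality)

# Check whether any missing values remain in the imputed column.
anyNA(baked$Solar.R)
```

The replacement is re-applied whenever the recipe is baked on new data, so no separate imputation step is needed for this simple case.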

Bijaelo commented 4 years ago

Although I agree with topepo, both in the sense that it is often a bad idea and that it is simple enough to do through add_role and step_mutate or step_mutate_at, implementing step_constantimpute is rather simple following the "Create your own recipe step function" guide. It might also be useful in edge cases where a missing value by definition means a certain value. For example, the Finnish statistics bureau publishes statistics for various ZIP codes, but its datasets usually only contain positive values, with any 0 value recorded as missing (although the data should be 0).
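
A rough sketch of the step_mutate_at variant, assuming that every predictor should receive the same constant (the airquality data and the constant 0 are illustrative assumptions):

```r
library(recipes)

# Apply the same NA -> 0 replacement to every predictor at once.
# The outcome (Ozone) is not selected, so its missing values are left alone.
rec <- recipe(Ozone ~ ., data = airquality) %>%
  step_mutate_at(all_predictors(), fn = ~ ifelse(is.na(.x), 0, .x))

baked <- rec %>%
  prep(training = airquality) %>%
  bake(new_data = airquality)

# Count the remaining missing values per column.
colSums(is.na(baked))
```

Because the selector decides which columns are touched, the same lambda can be reused with role-based selectors or explicit column names.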

If for some odd reason one wanted to have a specific step that is more "explicitly" saying that it imputes a constant value (making the readability argument), a step could be made such as the one below

#' Impute a constant value to correct for missingness
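#' @param recipe A recipe object. The step will be added to the sequence of operations for this recipe.
#' @param ... One or more selector functions to choose which variables are imputed.
#' @param role Not used by this step since no new variables are created.
#' @param trained A logical to indicate if the step has been prepared.
#' @param constant A numeric value, or a possibly named vector (or list) with the same length as the result from the selector functions, specifying which constant should be imputed for each selected column
#' @param skip A logical. Should the step be skipped when the recipe is baked?
#' @param id A character string that is unique to this step to identify it.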
#' 
#' @description This method can be used for constant imputation. 
#' 
#' @examples 
#' data(airquality)
#' library(recipes)
#' library(dplyr)
#' # Only Solar.R and Ozone have missing values
#' tibble(airquality)
#' # We can specify by roles, predictors etc.
#' # Use a named vector (or list) to specify what value to impute.
#' airquality_rec <- recipe(airquality, Solar.R ~ .) %>%
#'   step_constantimpute(all_predictors(), all_outcomes(), 
#'                       constant = c(Solar.R = 0, 
#'                                    Ozone = 30, 
#'                                    Wind = 0, 
#'                                    Temp = 50, 
#'                                    Month = 1, 
#'                                    Day = 7)) %>%
#'   prep() %>%
#'   juice() 
#' airquality_rec
#' 
#' # If every column should be imputed with the same value, 
#' # we can simply use a single value for constant (default = 0)
#' recipe(airquality, Solar.R ~ .) %>%
#'   step_constantimpute(all_predictors(), all_outcomes(), 
#'                       constant = 0) %>% 
#'   prep() %>% juice() %>%
#'   print() 
#' 
#' @note Be careful when using constants for imputing values in any dataset. 
#' While this seems like a simple fix, it has many drawbacks. 
#' In most parametric models (including (generalized) linear regression) and
#' non-parametric models (including SVM, neural networks etc.) this tends to cause
#' a reduction in variance and explanatory power by the imputed variables. 
#' Certain models (including tree-based models) are less affected by this, but
#' one should nonetheless be aware of potential drawbacks and possibly explore 
#' other options before using constant imputation.
#' 
#' @export
step_constantimpute <- function(
    recipe, 
    ..., 
    role = NA, 
    trained = FALSE,
    constant = 0,
    skip = FALSE,
    id = rand_id("constantimpute")){
  # Import terms
  terms <- ellipse_check(...) 
  # Check that "constant" is the correct format (first vector, then numeric)
  if(is.list(constant)){
    if(any(lengths(constant) != 1))
      rlang::abort('One or more elements of `constant` has a length different from 1.\n`constant` should be a single numeric vector or a possibly named vector or list with the same length as the selector function.')
    constant <- unlist(constant, recursive = TRUE, use.names = TRUE)
  }
  # After conversion from a list, `constant` should be a numeric vector
  if(!is.numeric(constant))
    rlang::abort('`constant` should be a single numeric vector or a possibly named vector or list with the same length as the selector function.') 
  add_step(
    recipe, 
    step(
      subclass = "constantimpute", 
      terms = terms,
      trained = trained,
      role = role,
      constant = constant,
      skip = skip,
      id = id
    )
  )
}
prep.step_constantimpute <- function(x, training, info = NULL, ...){
  # Import the names that should be transformed.
  col_names <- terms_select(terms = x$terms, info = info) 
  # Make sure all selected columns are numeric.
  # Base indexing avoids depending on dplyr::select() being attached.
  recipes::check_type(training[, col_names], quant = TRUE)
  if(!is.null(nm <- names(x$constant))){
    if(any(!nm %in% col_names))
      rlang::abort('`constant` has more elements than specified by the selector functions.')
    if(any(!col_names %in% nm))
      rlang::abort('One or more columns are missing from named `constant` vector.')
  }else{
    if((n <- length(x$constant)) != 1 && n != length(col_names))
      rlang::abort('`constant` should be a single numeric vector or a possibly named vector or list with the same length as the selector function.') 
    else if(n == 1)
      x$constant <- rep(x$constant, length(col_names))
    names(x$constant) <- col_names
  }
  step(
    subclass = "constantimpute", 
    terms = x$terms,
    role = x$role,
    trained = TRUE,
    constant = x$constant,
    skip = x$skip,
    id = x$id
  )
}
bake.step_constantimpute <- function(object, new_data, ...){
  # Import the variables that should be baked
  vars <- names(object$constant)
  # Iterate over each variable and impute the constant. 
  for(i in vars){
    isn <- is.na(new_data[[i]])
    new_data[[i]][isn] <- object$constant[[i]]
  }
  # Return the result as a tibble.
  tibble::as_tibble(new_data)
}
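
The core of the prep and bake logic above can also be tried without recipes. The following is only a sketch in base R; the helper name impute_constant and its cols argument are hypothetical and not part of the recipes package:

```r
# Hypothetical helper mirroring the prep/bake logic above, without recipes.
impute_constant <- function(data, cols, constant = 0) {
  stopifnot(is.numeric(constant))
  if (is.null(names(constant))) {
    # Recycle a single value across all selected columns.
    if (length(constant) == 1) constant <- rep(constant, length(cols))
    stopifnot(length(constant) == length(cols))
    names(constant) <- cols
  } else {
    # A named vector must cover exactly the selected columns.
    stopifnot(setequal(names(constant), cols))
  }
  for (col in cols) {
    is_missing <- is.na(data[[col]])
    data[[col]][is_missing] <- constant[[col]]
  }
  data
}

# Named constants: one value per column.
filled <- impute_constant(airquality, c("Ozone", "Solar.R"),
                          constant = c(Ozone = 30, Solar.R = 0))
# Single constant (default 0) recycled across the columns.
zeroed <- impute_constant(airquality, c("Ozone", "Solar.R"))
colSums(is.na(filled))
```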

LordRudolf commented 4 years ago

@topepo, considering that @Bijaelo has provided additional arguments for introducing constant imputation and has prepared ready-to-paste functions, can it be added to recipes?

Bijaelo commented 4 years ago

I will note that I do not necessarily think that it should be included in the package, but one could easily create a recipe_extensions package and include it there. In general I believe extending the formula interface is time better spent.

github-actions[bot] commented 3 years ago

This issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex https://reprex.tidyverse.org) and link to this issue.