Closed LordRudolf closed 4 years ago
I think that using an arbitrary constant (like zero or -9999) is a really bad idea, and one that can be implemented easily with step_mutate().
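For reference, a minimal sketch of what that might look like with step_mutate(), using the built-in airquality data purely as an illustration (the choice of column and the fill value 0 are assumptions, not a recommendation):

```r
library(recipes)

# Illustrative only: replace missing Solar.R values with the constant 0.
rec <- recipe(Ozone ~ ., data = airquality) %>%
  step_mutate(Solar.R = ifelse(is.na(Solar.R), 0, Solar.R))

baked <- rec %>%
  prep() %>%
  bake(new_data = NULL)

# Check that Solar.R no longer contains missing values
anyNA(baked$Solar.R)
```

Because the expression is re-evaluated at bake time, the same replacement would also be applied to new data.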
Although I agree with topepo, both in the sense that it is often a bad idea and that it is likely simple enough to do through add_role and step_mutate or step_mutate_at, implementing step_constantimpute is rather simple following the "Create your own recipe step function" guide, and it might be useful in edge cases where a missing value by definition means a certain value. For example, the Finnish statistics bureau publishes statistics for various ZIP codes, but the datasets usually contain only positive values, with any 0 value recorded as missing (although this data should be a 0).
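As a rough sketch of the step_mutate_at route for several columns at once (impute_zero is a hypothetical helper named here for illustration, and airquality stands in for real data):

```r
library(recipes)

# Hypothetical helper: replace NA with a fixed value (0 here).
impute_zero <- function(x) ifelse(is.na(x), 0, x)

# Apply the helper to every predictor; the outcome (Ozone) is left untouched.
rec <- recipe(Ozone ~ ., data = airquality) %>%
  step_mutate_at(all_predictors(), fn = impute_zero)

baked <- bake(prep(rec), new_data = NULL)

# Count remaining missing values per column
colSums(is.na(baked))
```

This keeps the imputation inside the recipe, but it does not allow a different constant per column without writing one helper per value, which is the gap a dedicated step would fill.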
If, for some odd reason, one wanted a specific step that says more "explicitly" that it imputes a constant value (the readability argument), a step could be made such as the one below:
#' Impute a constant value to correct for missingness
#' @param recipe
#' @param ...
#' @param role
#' @param trained
#' @param constant A numeric value or a possibly named vector with the same length as the result from the selector functions, specifying which constant should be imputed for each selected column
#' @param skip
#' @param id
#'
#' @description This method can be used for constant imputation.
#'
#' @examples
#' data(airquality)
#' library(recipes)
#' library(dplyr)
#' # Only Solar.R and Ozone have missing values
#' tibble(airquality)
#' # We can specify by roles, predictors etc.
#' # Use a named vector (or list) to specify what value to impute.
#' airquality_rec <- recipe(airquality, Solar.R ~ .) %>%
#'   step_constantimpute(all_predictors(), all_outcomes(),
#'                       constant = c(Solar.R = 0,
#'                                    Ozone = 30,
#'                                    Wind = 0,
#'                                    Temp = 50,
#'                                    Month = 1,
#'                                    Day = 7)) %>%
#'   prep() %>%
#'   juice()
#' airquality_rec
#'
#' # If every column should be imputed with the same value,
#' # we can simply use a single value for constant (default = 0)
#' recipe(airquality, Solar.R ~ .) %>%
#'   step_constantimpute(all_predictors(), all_outcomes(),
#'                       constant = 0) %>%
#'   prep() %>% juice() %>%
#'   print()
#'
#' @note Be careful when using constants for imputing values in any dataset.
#' While this seems like a simple fix, it has many drawbacks.
#' In most parametric models (including (generalized) linear regression) and
#' non-parametric models (including SVM, neural networks etc.) this tends to cause
#' a reduction in variance and explanatory power by the imputed variables.
#' Certain models (including tree-based models) are less affected by this, but
#' one should nonetheless be aware of potential drawbacks and possibly explore
#' other options before using constant imputation.
#'
#' @export
step_constantimpute <- function(
  recipe,
  ...,
  role = NA,
  trained = FALSE,
  constant = 0,
  skip = FALSE,
  id = rand_id("constantimpute")) {
  # Import terms
  terms <- ellipse_check(...)
  # Check that `constant` has the correct format
  # (first flatten a list to a vector, then check that it is numeric)
  if (is.list(constant)) {
    if (any(lengths(constant) != 1))
      rlang::abort('One or more elements of `constant` has a length different from 1.\n`constant` should be a single numeric value or a possibly named vector or list with the same length as the selector function.')
    constant <- unlist(constant, recursive = TRUE, use.names = TRUE)
  }
  # After conversion from a list, the constant should still be numeric
  if (!is.numeric(constant))
    rlang::abort('`constant` should be a single numeric value or a possibly named vector or list with the same length as the selector function.')
  add_step(
    recipe,
    step(
      subclass = "constantimpute",
      terms = terms,
      trained = trained,
      role = role,
      constant = constant,
      skip = skip,
      id = id
    )
  )
}
#' @export
prep.step_constantimpute <- function(x, training, info = NULL, ...) {
  # Import the names of the columns that should be transformed.
  col_names <- terms_select(terms = x$terms, info = info)
  # Make sure all selected columns are numeric
  recipes::check_type(training[, col_names], quant = TRUE)
  if (!is.null(nm <- names(x$constant))) {
    # Named constants: the names must match the selected columns exactly
    if (any(!nm %in% col_names))
      rlang::abort('`constant` has more elements than specified by the selector functions.')
    if (any(!col_names %in% nm))
      rlang::abort('One or more columns are missing from named `constant` vector.')
  } else {
    # Unnamed constants: either a single value or one value per selected column
    if ((n <- length(x$constant)) != 1 && n != length(col_names))
      rlang::abort('`constant` should be a single numeric value or a possibly named vector or list with the same length as the selector function.')
    else if (n == 1)
      x$constant <- rep(x$constant, length(col_names))
    names(x$constant) <- col_names
  }
  step(
    subclass = "constantimpute",
    terms = x$terms,
    role = x$role,
    trained = TRUE,
    constant = x$constant,
    skip = x$skip,
    id = x$id
  )
}
#' @export
bake.step_constantimpute <- function(object, new_data, ...) {
  # Import the variables that should be baked
  vars <- names(object$constant)
  # Iterate over each variable and impute the constant.
  for (i in vars) {
    isn <- is.na(new_data[[i]])
    new_data[[i]][isn] <- object$constant[[i]]
  }
  # Return the result as a tibble.
  tibble::as_tibble(new_data)
}
@topepo, considering that @Bijaelo has provided additional arguments for introducing constant imputation and has prepared ready-to-paste functions, can it be added to recipes?
I will note that I do not necessarily think that it should be included in the package, but one could easily create a recipe_extensions package and include it there. In general, I believe extending the formula interface is time better spent.
This issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex https://reprex.tidyverse.org) and link to this issue.
There is a need for imputing NA values with a constant. For some models, constant imputation usually works better than median or other types of imputation. For example, in neural network models, replacing NA values with zeros works well, as nodes with input values equal to 0 do not send any further signals. And in tree-based models, imputing an extreme value (such as "-99999") is suitable, as the tree models will make separate splits for these extreme values.
Please, can a constant number imputation step be implemented in the R recipes package?
Thanks.