topepo / caret

caret (Classification And Regression Training): an R package containing miscellaneous functions for training and plotting classification and regression models
http://topepo.github.io/caret/index.html

Imputing mixed numeric/categorical data within train preProc? #1344

Open jarbet opened 1 year ago

jarbet commented 1 year ago

Is it possible to impute mixed numeric/categorical data within train's preProc argument? I want to impute within train's cross validation, thereby accounting for how uncertainty in imputations affects estimation of generalization error.

The ?preProcess help page suggests it is not possible to impute categorical variables:

x : a matrix or data frame. Non-numeric predictors are allowed but will be ignored.

However, the bagImpute method can handle mixed data, in theory. The following code runs, but I am not sure if it is actually imputing the missing factor or simply removing patients with missing factor values:

library(caret);
#> Loading required package: ggplot2
#> Loading required package: lattice
data(iris);

nrow(iris);
#> [1] 150

iris.miss <- iris;
iris.miss[1,'Species'] <- NA;
iris.miss[2,'Petal.Length'] <- NA;
set.seed(1);
fit <- train(
    Sepal.Length ~ .,
    data = iris.miss,
    method = 'lm',
    preProc = 'bagImpute',
    na.action = na.pass
    );
fit
#> Linear Regression 
#> 
#> 150 samples
#>   4 predictor
#> 
#> Pre-processing: bagged tree imputation (5) 
#> Resampling: Bootstrapped (25 reps) 
#> Summary of sample sizes: 150, 150, 150, 150, 150, 150, ... 
#> Resampling results:
#> 
#>   RMSE       Rsquared   MAE      
#>   0.3176759  0.8587222  0.2604171
#> 
#> Tuning parameter 'intercept' was held constant at a value of TRUE

Notice that the printed fit reports all 150 patients, which suggests the missing factor value was imputed. However, I suspect the patient with the missing factor is simply being dropped from the model rather than imputed. Is that what is happening?
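One way to check this outside of train is to rebuild the predictors by hand and run preProcess on them directly. The following is only a sketch. It assumes that the formula interface dummy-codes Species via model.matrix before pre-processing (my reading of the "(5)" in the printed output, not something I have confirmed in the source), and that the ipred package is installed, since bagImpute needs it. The names mf, x, pp and x.imp are just for illustration.

```r
library(caret);  # preProcess(); method = 'bagImpute' also requires the ipred package

data(iris);
iris.miss <- iris;
iris.miss[1, 'Species'] <- NA;
iris.miss[2, 'Petal.Length'] <- NA;

# Assumption: approximate what the formula interface does with na.action = na.pass
# by building the dummy-coded predictor matrix by hand
mf <- model.frame(Sepal.Length ~ ., data = iris.miss, na.action = na.pass);
x <- model.matrix(Sepal.Length ~ ., data = mf);
x <- as.data.frame(x[, colnames(x) != '(Intercept)']);

# Before imputation: rows 1 and 2 are still present and contain NA
nrow(x);
x[1:2, ];

# Fit the bagged-tree imputation models and apply them to the same data
set.seed(1);
pp <- preProcess(x, method = 'bagImpute');
x.imp <- predict(pp, x);

# After imputation: inspect the same two rows and check for remaining NA values
x.imp[1:2, ];
anyNA(x.imp);
```

If row 1 comes back with numeric values in the Speciesversicolor and Speciesvirginica columns, that would suggest the factor is being imputed on the dummy-coded scale (so the imputed values need not be exactly 0 or 1) rather than the row being removed. It would not show whether train follows exactly the same path internally.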

Created on 2023-07-23 by the reprex package (v2.0.1)