stekhoven / missForest

missForest is a nonparametric, mixed-type imputation method for basically any type of data for the statistical software R.
http://stat.ethz.ch/CRAN/web/packages/missForest/index.html

How do I use missForest to impute NAs in test data? #12

Closed: abhiML closed this issue 5 years ago

abhiML commented 6 years ago

We can use the missForest package to impute missing values in R (both categorical and numeric). But this approach requires a complete response variable for training the forest. So how do I impute missing values in the test data set using missForest, given that the test data set has no response variable?

stephematician commented 6 years ago

Imputation is not prediction; you may have the two confused.

It does not matter whether a variable is a predictor or a response; they are almost always treated the same in imputation procedures. missForest will replace every NA regardless of whether you identify its column as a predictor or a response.
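To illustrate the point above, here is a minimal sketch, assuming the missForest package is installed. The choice of the built-in `iris` data and of `prodNA()` to knock out values at random is mine, purely for illustration. No column is marked as a response, yet NAs in any column get filled.

```r
# Assumes the package is installed: install.packages("missForest")
library(missForest)

set.seed(42)  # missForest is stochastic; fix the seed for repeatability

# iris has four numeric columns and one factor (Species).
# prodNA() removes about 10% of all entries completely at random,
# so NAs can land in any column, numeric or categorical.
iris_mis <- prodNA(iris, noNA = 0.1)
colSums(is.na(iris_mis))   # NA count per column before imputation

# No column is declared as "response": each incomplete column is
# regressed on all the others in turn, and this is repeated
# until the stopping criterion is met.
fit <- missForest(iris_mis)

iris_imp <- fit$ximp       # the completed data frame
anyNA(iris_imp)            # check whether any NA remains
fit$OOBerror               # out-of-bag error estimates (NRMSE, PFC)
```

Here `Species` would be the natural "response" in a classification setting, but its missing entries are imputed in exactly the same way as those in the four measurement columns.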

I recommend having a look through http://www.stefvanbuuren.name/fimd

abhiML commented 6 years ago

No, I don't think you got the question. Say I use missForest to impute the missing values in my training set. Now my test data comes along with a single data point: x1 = 6.7, x2 = g, x3 = NA, x4 = 9, where x1 is numeric, x2 is categorical, x3 is numeric and x4 is an integer. My question is: how do I use the trees that I trained for imputation on the training set to impute the NA value in the test data?

stephematician commented 6 years ago

Training and test data are usually terms applied to prediction problems, not imputation.

If you want to train a classifier on data that may have missing values, and then test the performance of that classifier on data that may also have missing values, then maybe you want to encode missing values as a special category and train as usual. (For non-categorical predictors, it is less obvious how to encode the missing value.)
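As a rough sketch of the "special category" idea for a categorical predictor, in base R only: the helper `encode_missing()`, the label `"(missing)"`, and the toy data frames are all hypothetical names made up for this example, not part of missForest or any other package.

```r
# Hypothetical helper: recode NA in a categorical vector as its own level.
# `levels_ref` lets new data reuse the levels fixed on the training data.
encode_missing <- function(x, levels_ref = NULL, label = "(missing)") {
  x <- as.character(x)
  x[is.na(x)] <- label
  if (is.null(levels_ref)) {
    # Always include the label, even if the training data has no NA.
    levels_ref <- sort(unique(c(x, label)))
  }
  factor(x, levels = levels_ref)
}

# Toy training data with an NA in the categorical predictor x2.
train <- data.frame(
  x1 = c(6.7, 5.1, 7.3, 4.9),
  x2 = c("g", NA, "h", "g"),
  stringsAsFactors = FALSE
)
train$x2 <- encode_missing(train$x2)

# A new observation whose x2 is missing, encoded with the training levels.
new_row <- data.frame(x1 = 6.1, x2 = NA_character_, stringsAsFactors = FALSE)
new_row$x2 <- encode_missing(new_row$x2, levels_ref = levels(train$x2))

table(train$x2)   # "(missing)" is now an ordinary category
```

Any classifier that accepts factors can then be trained on `train` and applied to `new_row` without seeing an NA in `x2`. This does nothing for a numeric column such as x3 in the question above, which is the less obvious case.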

missForest and other iterative (multiple) imputation procedures are typically used to generate a complete data set, which can then be used in further statistical analyses. They are not really intended for prediction, which is what you seem to be describing.