mayer79 / missRanger

Fast multivariate imputation by random forests.
https://mayer79.github.io/missRanger/
GNU General Public License v2.0

Question #3

Closed by thierrygosselin 7 years ago

thierrygosselin commented 7 years ago

Quick question Michael...

Scenario where you have more than 1 response variable missing:

E.g. with the iris dataset, let's say Sepal.Length and Sepal.Width both have missing values. We know that these two variables are correlated with each other and with Species.

Your implementation imputes column by column. Is the correlation between columns still accounted for in the model? We don't want imputed values that, taken together after imputation, don't "fit" the species...

Best, Thierry

mayer79 commented 7 years ago

Hello Thierry

The algorithm tries to take into account all statistical associations between all variables. So, at least in theory, the answer is yes. In practice, if you have too little data or if the values are not missing at random, it generally does not work too well.
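To make the idea of "imputing by column while still using the other columns" concrete, here is a toy sketch of chained imputation in base R. It is not missRanger's code: it swaps the random forests for plain linear models (lm), and the variable names (dat, miss, vars) and the choice of five iterations are assumptions made for illustration only.

```r
# Toy chained imputation with linear models (illustration only,
# NOT missRanger's implementation, which uses random forests).
set.seed(1)
dat <- iris
dat$Sepal.Length[sample(150, 20)] <- NA
dat$Sepal.Width[sample(150, 40)] <- NA

miss <- is.na(dat)                        # remember where the gaps are
vars <- c("Sepal.Length", "Sepal.Width")  # columns that need imputation

# Step 0: crude starting values (column means)
for (v in vars) {
  dat[miss[, v], v] <- mean(dat[[v]], na.rm = TRUE)
}

# Iterate: re-predict each incomplete column from ALL other columns,
# i.e. Species, the petal measurements and the current imputations
# of the other incomplete column.
for (iter in 1:5) {
  for (v in vars) {
    fit <- lm(as.formula(paste(v, "~ .")), data = dat[!miss[, v], ])
    dat[miss[, v], v] <- predict(fit, newdata = dat[miss[, v], ])
  }
}

anyNA(dat)  # all gaps are filled
```

Because each model is fitted on all other columns, an imputed Sepal.Length is informed by Species and by the (possibly imputed) Sepal.Width of the same row, and vice versa. That is how the association between columns enters even though the columns are handled one at a time.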

Let us see what happens to our iris data:

set.seed(398745)
# Replace some values by NA
iris2 <- iris
iris2$Sepal.Length[sample(150, 20)] <- NA
iris2$Sepal.Width[sample(150, 40)] <- NA
table(is.na(iris2$Sepal.Length), is.na(iris2$Sepal.Width))

# Output
       FALSE TRUE
  FALSE    94   36
  TRUE     16    4

So there are 20 missing values in Sepal.Length and 40 in Sepal.Width.

Now let's fill those values again by running

  library(missRanger)
  iris3 <- missRanger(iris2, pmm.k = 3, seed = 3483)

and compare the joint distribution of the two variables stratified by Species (= color) in the original data set (left) and after imputation (right).

par(mfrow = 1:2)
plot(Sepal.Length ~ Sepal.Width, data = iris, col = Species, main = "original")
plot(Sepal.Length ~ Sepal.Width, data = iris3, col = Species, main = "imputed")

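[Figure: Sepal.Length plotted against Sepal.Width, colored by Species; left panel "original", right panel "imputed"]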

Of course, the pictures are not identical, but the structure seems to be retained.
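If you prefer a number over a picture, one possible check is to compare the within-species correlation of the two sepal measurements before and after imputation. The sketch below repeats the steps from this comment so that it runs on its own; the helper cor_by_species is a name made up here, and the full argument name pmm.k is spelled out.

```r
library(missRanger)

# Recreate the data with missing values
set.seed(398745)
iris2 <- iris
iris2$Sepal.Length[sample(150, 20)] <- NA
iris2$Sepal.Width[sample(150, 40)] <- NA

# Impute; pmm.k = 3 uses predictive mean matching with 3 candidates
iris3 <- missRanger(iris2, pmm.k = 3, seed = 3483, verbose = 0)

# Hypothetical helper: correlation of Sepal.Length and Sepal.Width
# computed separately for each species
cor_by_species <- function(d) {
  sapply(split(d, d$Species),
         function(s) cor(s$Sepal.Length, s$Sepal.Width))
}

round(rbind(original = cor_by_species(iris),
            imputed  = cor_by_species(iris3)), 2)
```

If the two rows of the resulting table are close, the imputed values fit the species-specific relationship reasonably well. This only looks at one pair of variables, so it complements the plots rather than replacing them.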

thierrygosselin commented 7 years ago

Related to this, check out what this guy does for the iris dataset... http://www.markvanderloo.eu/yaRb/2016/09/13/announcing-the-simputation-package-make-imputation-simple/