Closed thierrygosselin closed 7 years ago
Hello Thierry
The algorithm tries to take into account all statistical associations between all variables. So, at least in theory, the answer will be positive. In practice, if you have e.g. too little data or if the values are not missing at random, then it does not work too well in general.
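To make the "imputes by column, but still uses all other columns" idea concrete, here is a minimal base R sketch of chained imputation. It is not the missRanger code: it swaps the random forests for plain linear models (`lm`) and leaves out predictive mean matching. The helper name `chained_lm_impute` and its `n_iter` argument are made up for illustration, and the sketch assumes that only numeric columns contain missing values.

```r
# Toy chained imputation with linear models as a stand-in for random forests.
# chained_lm_impute() is a hypothetical helper, not part of missRanger.
# Assumption: only numeric columns contain NA.
chained_lm_impute <- function(data, n_iter = 5) {
  miss <- is.na(data)                          # remember where the gaps were
  to_fill <- names(which(colSums(miss) > 0))   # columns that need imputation
  filled <- data

  # Step 0: crude start, fill each gap with the observed column mean
  for (v in to_fill) {
    filled[[v]][miss[, v]] <- mean(data[[v]], na.rm = TRUE)
  }

  # Repeat: re-predict every incomplete column from ALL other columns,
  # including the current fill-ins of the other incomplete columns
  for (i in seq_len(n_iter)) {
    for (v in to_fill) {
      obs <- !miss[, v]
      fit <- lm(as.formula(paste(v, "~ .")), data = filled[obs, , drop = FALSE])
      filled[[v]][!obs] <- predict(fit, newdata = filled[!obs, , drop = FALSE])
    }
  }
  filled
}

# Usage on iris with artificially removed values
set.seed(1)
iris_na <- iris
iris_na$Sepal.Length[sample(150, 20)] <- NA
iris_na$Sepal.Width[sample(150, 40)] <- NA
iris_fill <- chained_lm_impute(iris_na)
```

Each model for Sepal.Length sees Sepal.Width and Species as predictors (and vice versa), which is why the associations between columns can carry over to the imputed values.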
Let us see what happens to our iris data:
set.seed(398745)
# Replace some values by NA
iris2 <- iris
iris2$Sepal.Length[sample(150, 20)] <- NA
iris2$Sepal.Width[sample(150, 40)] <- NA
table(is.na(iris2$Sepal.Length), is.na(iris2$Sepal.Width))
# Output (rows: Sepal.Length missing, columns: Sepal.Width missing)
#         FALSE TRUE
#   FALSE    94   36
#   TRUE     16    4
So there are 20 missing values in Sepal.Length and 40 in Sepal.Width, with 4 rows missing both.
Now let's fill those values again by running
iris3 <- missRanger(iris2, pmm = 3, seed = 3483)
and compare the joint distribution of the two variables stratified by Species (= color) in the original data set (left) and after imputation (right).
par(mfrow = 1:2)
plot(Sepal.Length ~ Sepal.Width, data = iris, col = Species, main = "original")
plot(Sepal.Length ~ Sepal.Width, data = iris3, col = Species, main = "imputed")
Of course, the pictures are not identical, but the structure seems to be retained.
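If you prefer a number to a picture, one rough check is to compare the within-species correlation of the two variables before and after imputation. The helper `species_cor` below is a made-up name and only a sketch; it assumes the data has the iris column names.

```r
# Correlation between Sepal.Length and Sepal.Width within each species.
# species_cor() is a hypothetical helper for this comparison.
species_cor <- function(data) {
  sapply(split(data, data$Species), function(d) {
    cor(d$Sepal.Length, d$Sepal.Width, use = "complete.obs")
  })
}

cor_original <- species_cor(iris)
round(cor_original, 2)

# After running the imputation above, compare with:
# round(species_cor(iris3), 2)
```

If the imputed values "fit" the species, the three correlations for iris3 should be in the same ballpark as those of the original data.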
Related to this, check out what Mark van der Loo does with the iris dataset in his simputation announcement: http://www.markvanderloo.eu/yaRb/2016/09/13/announcing-the-simputation-package-make-imputation-simple/
Quick question Michael...
Scenario where more than one response variable is missing:
e.g. with the iris dataset, let's say Sepal.Length and Sepal.Width are missing. We know that both of these values are correlated with each other and with Species. Your implementation imputes by column; is the correlation between columns still accounted for in the model? Because we don't want imputed values that, taken together after imputation, don't "fit" the species...
Best, Thierry