mayer79 / missRanger

Fast multivariate imputation by random forests.
https://mayer79.github.io/missRanger/
GNU General Public License v2.0

Question on PMM #7

Closed thierrygosselin closed 6 years ago

thierrygosselin commented 6 years ago

Hi Michael, my question is regarding PMM.

e.g. a dataset with 10000 variables with different levels of missingness. Is there a potential for bias if PMM is carried out after the model for one variable?

Since the kNN will only be on that variable's values, not accounting for all the variables. If all variables were accounted for in the distance, the neighbours would be different, I suppose...

And, not related to missRanger: I see you've started working with LightGBM. Have you tried imputations with it?

Best, Thierry

mayer79 commented 6 years ago

Hello Thierry. Great to hear from you. How is your package developing?

On LightGBM: I don't think that gradient boosters are very helpful for imputation. The reason is that they have about ten parameters that have to be chosen carefully, unlike random forests, which usually work acceptably well even without any tuning. But for other purposes, LightGBM is ingenious.

On PMM: I am not sure if I got your question. The idea is as follows: with k-PMM, the missing value in variable x for observation i is not directly filled in by the OOB prediction of the random forest. Instead, the OOB prediction of observation i is compared with the OOB predictions of all observations without missing x. Among the k nearest OOB predictions, an observation is picked at random, and that observation's x value is used for imputation. Since the OOB predictions are a function of all variables (except x), the match is actually done implicitly on all variables (except x), not unlike propensity score matching.

Let me give you an example:

library(missRanger)

# x1 has one missing value; x2 carries the information on x1's group
crazyData <- data.frame(x1 = c(NA, 1, 1, 1, 1, 2, 2, 2, 2), x2 = c(1, 2, 3, 2, 3, 5, 6, 5, 6))
filledData <- missRanger(crazyData, pmm.k = 1)
filledData

#   x1 x2
# 1  1  1
# 2  1  2
# 3  1  3
# 4  1  2
# 5  1  3
# 6  2  5
# 7  2  6
# 8  2  5
# 9  2  6

The first observation's x2 value is close to the x2 values of observations 2-4. Thus, their OOB predictions for x1 will be quite close. Consequently, the missing value in the first row is filled with one of their x1 values (which is 1).

thierrygosselin commented 6 years ago

Hi Michael, thanks for the quick reply, very helpful! For the package, we are now working on simulated genomic data to test different imputation methods. So far I want to have 4-5 methods to test, including missRanger and the on-the-fly imputation proposed in randomForestSRC. I was thinking of integrating your PMM approach after LightGBM and XGBoost... but like you said, there are numerous arguments to tune, and it's not as simple as the RF approaches...

Best, Thierry

mayer79 commented 6 years ago

Wow - I am very much looking forward to seeing the results! My PMM code is actually very hard to read, but only because I wanted to be able to deal with categorical variables. For purely numeric data, it is much simpler.
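
To illustrate, the core of it boils down to something like the sketch below (an illustration, not the actual package code; pmm_numeric is just a name I use here):

# k-PMM for purely numeric data (illustrative sketch, not the package code):
# for each test prediction, find the k closest training predictions and
# impute with the observed value of a randomly picked donor.
pmm_numeric <- function(xtrain, xtest, ytrain, k = 1L) {
  vapply(xtest, function(pred) {
    donors <- order(abs(xtrain - pred))[seq_len(k)]  # indices of the k nearest predictions
    ytrain[donors[sample.int(k, 1L)]]                # observed value of a random donor
  }, FUN.VALUE = numeric(1))
}

# The prediction 2.0 is matched to the closest training prediction 1.9,
# so that donor's observed value 20 is used:
pmm_numeric(xtrain = c(1.1, 1.9, 3.2), xtest = 2.0, ytrain = c(10, 20, 30), k = 1)
# [1] 20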

thierrygosselin commented 6 years ago

It's fine, and the data is categorical.

The problem faced with imputation and RADseq data is with low-frequency genotypes.

The problem I was first facing with XGBoost or LightGBM is that I have to create training/test sets by first splitting my non-missing genotypes, and then I do the imputations.

In missRanger::pmm, the xtrain argument requires running the prediction model back on all the data (training + test set).

The xtest argument takes the model predictions for the data to impute.
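
For reference, the workflow looks roughly like the sketch below, with a plain lm() standing in for XGBoost/LightGBM (and assuming pmm() also takes ytrain for the observed values and k for the number of donors):

library(missRanger)

# Toy data: y has missing values, x is fully observed
set.seed(1)
n <- 100
x <- rnorm(n)
y <- x + rnorm(n)
y[sample(n, 10)] <- NA
obs <- !is.na(y)

# Stand-in for the boosting model, fitted on the non-missing part only
fit <- lm(y ~ x, data = data.frame(x = x, y = y), subset = obs)

# Predictions for the training part (observed y) and for the part to impute
pred_train <- predict(fit, newdata = data.frame(x = x[obs]))
pred_test <- predict(fit, newdata = data.frame(x = x[!obs]))

# PMM step: each missing y gets the observed value of one of the k donors
# whose predictions are closest to its own prediction
y[!obs] <- pmm(xtrain = pred_train, xtest = pred_test, ytrain = y[obs], k = 3)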

The couple of tests conducted show an increase in variance: more low-frequency genotypes are reintroduced in the imputed data, which is good; otherwise, they were dropped by the model.