Closed thierrygosselin closed 6 years ago
Hello Thierry, great to hear from you. How is your package developing?
On LightGBM: I don't think that gradient boosters are very helpful for imputation. The reason is that they have about ten parameters that have to be chosen carefully, unlike random forests, which usually work acceptably well even without any tuning. But for other purposes, LightGBM is ingenious.
On PMM: I am not sure if I got your question. The idea is as follows: with k-PMM, the missing value in variable x and observation i is not directly filled in by the OOB prediction of the random forest. Instead, the OOB prediction of observation i is compared with the OOB predictions of all observations without missing x. Among the k nearest OOB predictions, one observation is picked at random, and its x value is used for imputation. Since the OOB predictions are a function of all variables (except x), the match is actually done implicitly on all variables (except x), not unlike propensity score matching.
Let me give you an example
library(missRanger)
crazyData <- data.frame(x1 = c(NA, 1, 1, 1, 1, 2, 2, 2, 2),
                        x2 = c(1, 2, 3, 2, 3, 5, 6, 5, 6))
filledData <- missRanger(crazyData, pmm.k = 1)
filledData
# x1 x2
# 1 1 1
# 2 1 2
# 3 1 3
# 4 1 2
# 5 1 3
# 6 2 5
# 7 2 6
# 8 2 5
# 9 2 6
The first observation's x2 value is close to the x2 values of observations 2-4. Thus, their OOB predictions for x1 will be quite close. Consequently, the first missing value is imputed with one of their x1 values (which is 1).
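For purely numeric data, the matching step can be sketched in a few lines of base R. This is only an illustration of the idea described above, not the package's actual implementation (which also handles categorical variables); the function name `pmm_sketch` and its arguments are made up for this example.

```r
# k-PMM sketch for numeric data:
# pred_mis:   OOB predictions for observations with missing x
# pred_donor: OOB predictions for observations with observed x
# x_donor:    observed x values of those donors
pmm_sketch <- function(pred_mis, pred_donor, x_donor, k = 1) {
  vapply(pred_mis, function(p) {
    nn <- order(abs(pred_donor - p))[seq_len(k)]  # k nearest OOB predictions
    x_donor[nn[sample.int(length(nn), 1)]]        # pick one donor at random
  }, FUN.VALUE = numeric(1))
}

# Mimicking the crazyData example: the prediction for the missing row (1.1)
# is closest to the donors predicted near 1, so their x value (1) is used.
pmm_sketch(pred_mis = 1.1,
           pred_donor = c(0.9, 1.2, 1.9, 2.1),
           x_donor = c(1, 1, 2, 2), k = 2)
# -> 1
```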
Hi Michael, thanks for the quick reply, very helpful! For the package, we are now working on simulated genomic data to test different imputation methods. So far I want to have 4-5 methods to test, including missRanger and the on-the-fly imputation proposed in randomForestSRC. I was thinking of integrating your PMM approach after LightGBM and XGBoost... but as you said, there are numerous arguments to tune, and it's not as simple as the RF approaches...
Best Thierry
Wow - I am very much looking forward to seeing the results! My PMM code is actually very hard to read, but only because I wanted to be able to deal with categorical variables. For purely numeric data, it is actually much simpler.
It's fine, and the data is categorical.
The problem faced with imputation and RADseq data is with low-frequency genotypes.
The problem I first faced with XGBoost or LightGBM is that I have to use a training/test set: I first split on my non-missing genotypes, then I do the imputations.
In missRanger::pmm, the xtrain argument requires running the prediction model on all the data (training + test set). The xtest argument is the imputed data generated by the model prediction.
The couple of tests conducted show an increase in variance. More low-frequency genotypes are reintroduced in the imputed data, which is good; otherwise, they were dropped by the model.
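That workflow can be sketched as follows, using lm() as a simple stand-in for the XGBoost/LightGBM model and the crazyData example from above. This assumes missRanger's exported pmm() helper with arguments xtrain, xtest, ytrain, and k (check the package documentation for the exact signature); the variable names are made up for illustration.

```r
library(missRanger)

crazyData <- data.frame(x1 = c(NA, 1, 1, 1, 1, 2, 2, 2, 2),
                        x2 = c(1, 2, 3, 2, 3, 5, 6, 5, 6))
obs <- !is.na(crazyData$x1)                 # rows with observed x1 (donors)

fit  <- lm(x1 ~ x2, data = crazyData[obs, ])  # stand-in prediction model
pred <- predict(fit, newdata = crazyData)     # predictions for ALL rows

filled <- crazyData
filled$x1[!obs] <- pmm(xtrain = pred[obs],    # donor predictions
                       xtest  = pred[!obs],   # predictions for missing rows
                       ytrain = crazyData$x1[obs],  # observed donor values
                       k = 1)
filled$x1[1]
# -> 1 (an observed donor value, not a raw model prediction)
```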
Hi Michael, my question is regarding PMM.
E.g., for a dataset with 10000 variables with different levels of missingness: is there a potential for bias if PMM is carried out after the model, one variable at a time?
Since the kNN matching would only be on that variable's values and not account for all the variables. If all variables were accounted for in the distance, the neighbours would be different, I suppose...
And, not related to missRanger: I see you've started working with LightGBM. Have you tried imputations with it?
Best Thierry