statistikat / VIM

Visualization and Imputation of Missing Values
http://statistikat.github.io/VIM/

New imputation framework #75

Open alexkowa opened 1 year ago

alexkowa commented 1 year ago

looking at #73 and #74, maybe what we really should do is consolidate them into a new function and deprecate some functions (irmi, rangerImpute and regressionImpute)? @matthias-da @GregorDeCillia @JohannesGuss

# model based imputation framework function
# general idea: a function that incorporates a framework for model based imputation.
# Options should be
# - sequential modelling
# - using PMM or not
# - bootstrap the model error
# - drawing the predicted value from the "posterior" distribution
# - model options: (robust) regression, ranger, XGBoost, some kind of transformer model

vimpute <- function(data,
                    variable = colnames(data),
                    sequential = FALSE,
                    bootstrap = FALSE,
                    pmm_k = NULL, # if integer value, use kNN on predicted values
                    xvar = colnames(data),
                    model = c("robust", "regression", "ranger", "XGBoost", "GPT"),
                    formula = NULL, # possibility to override the individual models
                    imp_var = TRUE, imp_suffix = "imp",
                    verbose = FALSE) {

}
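As a point of reference for the PMM option above, predictive mean matching could work roughly like this. A minimal sketch; `pmm_draw` is a hypothetical helper, not existing VIM API:

```r
# Sketch of predictive mean matching (PMM): for each missing case, find
# the pmm_k observed cases whose model predictions are closest to the
# prediction for the missing case, then draw the imputed value from the
# observed values of a randomly chosen donor.
pmm_draw <- function(y_obs, yhat_obs, yhat_mis, pmm_k = 5) {
  vapply(yhat_mis, function(p) {
    donors <- order(abs(yhat_obs - p))[seq_len(pmm_k)]
    sample(y_obs[donors], 1)
  }, numeric(1))
}
```

This is also where the `pmm_k` argument would plug in: `pmm_k = NULL` means plain prediction, an integer activates donor matching on the predicted values.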
matthias-da commented 1 year ago

Yeah, it would be great to discuss this.

There are some difficulties/complexities when doing sequential imputation with different kinds of variables, such as numeric and categorical. Sometimes other methods (or parametrisations) are used depending on whether a variable is categorical or numeric (and in irmi we also consider semi-continuous and count variables). So, like in mice::mice, there must then be default methods for the different kinds of variables in a data set. The argument method can then be either a single string, or a vector of strings with one entry per variable.
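A default-method lookup along the lines of mice could be sketched like this (the method names and the `default_method` helper are assumptions for illustration, not an agreed API):

```r
# Sketch: pick a default imputation method per variable by its class,
# similar to what mice does with its defaultMethod argument.
default_method <- function(x) {
  if (is.factor(x) && nlevels(x) == 2) "logreg"    # binary variable
  else if (is.factor(x))               "multinom"  # categorical variable
  else if (is.integer(x))              "count"     # count variable
  else                                 "lm"        # continuous numeric
}

# one method string per variable, overridable by the user
sapply(iris, default_method)
```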

When bootstrapping, one has to ensure that all categories of a factor variable are actually sampled, since certain imputation methods otherwise throw an error. Is a Bayesian bootstrap a way out, or tricking the factor levels?
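To illustrate why a Bayesian bootstrap avoids the problem: instead of resampling rows (which can drop a rare factor level entirely), every observation stays in the data and only receives a random weight. A sketch, assuming a weight-aware fitter such as `lm(..., weights = )`:

```r
# Sketch of a Bayesian bootstrap: Dirichlet(1, ..., 1) weights obtained
# by normalising standard exponentials. No row is ever dropped, so every
# factor level remains present in the weighted model fit.
bayes_boot_weights <- function(n) {
  g <- rexp(n)
  g / sum(g) * n  # scaled so the weights average to 1
}

set.seed(1)
w <- bayes_boot_weights(nrow(iris))
fit <- lm(Sepal.Length ~ Species, data = iris, weights = w)
```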

I like the idea of a robust bootstrap, because if the model is robust one should also take care that not considerably more outliers are accidentally sampled than are present in the original data; but it is not straightforward to use with a mix of differently scaled variables. The idea is to divide the observations into strata depending on their "outlyingness" and to sample from each stratum independently.
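The stratified variant could be sketched as follows (`robust_boot` is a hypothetical helper; a real implementation would use robust location/scatter estimates such as the MCD instead of the classical ones used here for brevity):

```r
# Sketch of the stratified "robust bootstrap": split observations into
# strata by an outlyingness measure, then bootstrap within each stratum
# so the share of outliers matches the original data.
robust_boot <- function(X, breaks = c(0, 0.9, 0.99, 1)) {
  # classical Mahalanobis distance for brevity; use robust estimates
  # (e.g. MCD) in practice
  d <- sqrt(mahalanobis(X, colMeans(X), cov(X)))
  stratum <- cut(rank(d) / nrow(X), breaks = breaks, include.lowest = TRUE)
  idx <- unlist(lapply(split(seq_len(nrow(X)), stratum),
                       function(i) sample(i, length(i), replace = TRUE)))
  X[idx, , drop = FALSE]
}
```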

# model based imputation framework function
# general idea: a function that incorporates a framework for model based imputation.
# Options should be
# - sequential modelling
# - using PMM or not
# - bootstrap the model error
# - drawing the predicted value from the "posterior" distribution
# - model options: (robust) regression, ranger, XGBoost, some kind of transformer model
# - complex formula (with `formula`) for each variable possible. 

vimpute <- function(data,
                    variable = colnames(data),
                    sequential = TRUE,
                    # how to best deal with it that each method has its own parameters?
                    modeluncertainty = c("robustBootstrap", "none", "bootstrap",
                                         "robustBootstrap-stratified",
                                         "robustBootstrap-xyz", "BayesianBootstrap"),
                    imputationuncertainty = c("PMM", "midastouch", "normal", "residual"),
                    pmm_k = NULL, # if integer value, use kNN on predicted values.
                                  # Should it be of length one or of length
                                  # number of variables?
                    xvar = colnames(data), # delete this?
                    # here I would use default methods for each kind of variable -
                    # as in mice - that one can override. Supported methods:
                    # "lm", "MM", "ranger", "XGBoost", "GPT", "gam", "robGam"
                    method = c("lm", "regression", "ranger", "XGBoost", "GPT"),
                    formula = NULL, # possibility to override the individual models.
                                    # A list (one formula for each variable).
                    imp_var = FALSE, imp_suffix = "imp",
                    verbose = FALSE) {
}

So all in all, the real pain is to pack everything into a sequential approach when variables are of different scales.
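The sequential (chained-equations) loop itself is conceptually simple; the pain is in the per-variable dispatch. A rough sketch assuming numeric variables only and plain `lm` (a real implementation would pick a fitter per variable type, as discussed above):

```r
# Sketch of sequential model-based imputation: initialise missing cells,
# then repeatedly re-impute each variable conditional on all others.
sequential_impute <- function(data, n_iter = 5) {
  miss <- lapply(data, is.na)
  # crude initialisation: column median (numeric variables assumed)
  for (v in names(data)) {
    if (any(miss[[v]])) {
      data[[v]][miss[[v]]] <- median(data[[v]], na.rm = TRUE)
    }
  }
  for (it in seq_len(n_iter)) {
    for (v in names(data)) {
      if (!any(miss[[v]])) next
      # the method would be chosen per variable type here (lm, MM, ranger, ...)
      fit <- lm(reformulate(setdiff(names(data), v), v),
                data = data[!miss[[v]], , drop = FALSE])
      data[[v]][miss[[v]]] <- predict(fit, newdata = data[miss[[v]], , drop = FALSE])
    }
  }
  data
}
```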

GregorDeCillia commented 1 year ago

Just a technical note. Deprecating the low-level functions (irmi, rangerImpute and regressionImpute) is not really necessary. We could just present the new high-level vimpute() in the docs as the "recommended way" instead. The advantage is that the low-level functions keep their own man pages, which can go into detail about how each specific algorithm works and document parameters that are only applicable to that method (using ... to pass them down from vimpute()).
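The `...` pass-through could look roughly like this (a sketch with simplified signatures; the actual low-level functions take a formula plus further method-specific arguments documented on their own man pages):

```r
# Sketch: vimpute() as a thin dispatcher that forwards method-specific
# arguments to the existing low-level VIM functions via ...
vimpute <- function(formula, data, method = c("regression", "ranger"), ...) {
  method <- match.arg(method)
  # extra parameters (e.g. robust = TRUE for regressionImpute, or
  # num.trees for rangerImpute) are simply forwarded
  switch(method,
         regression = regressionImpute(formula, data, ...),
         ranger     = rangerImpute(formula, data, ...))
}
```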