alexkowa opened 1 year ago
Yeah, it would be great to discuss this.
There is some difficulty/complexity in doing sequential imputation with different kinds of variables, such as numeric and categorical ones. Different methods (or parametrisations) are sometimes used depending on whether a variable is categorical or numeric (and in irmi we also consider semi-continuous and count variables). So, as in mice::mice, there must then be default methods for each kind of variable in a data set.
The argument method can then be either a single string or a vector of strings, one for each variable, as sketched below.
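A minimal sketch of how such per-variable defaults could be derived, analogous to mice::mice(); the helper name and the method labels are assumptions for illustration:

```r
# Hypothetical helper: pick a default method from each column's class,
# overridable by passing an explicit `method` vector.
default_methods <- function(data) {
  vapply(data, function(x) {
    if (is.factor(x) && nlevels(x) == 2) "logistic"     # binary
    else if (is.factor(x))               "multinomial"  # categorical
    else if (is.integer(x))              "count"        # count variable
    else                                 "lm"           # continuous
  }, character(1))
}

default_methods(iris)  # one method label per column
```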
When bootstrapping, one has to ensure that all categories of a factor variable are actually sampled, since certain imputation methods otherwise fail with an error. A Bayesian bootstrap as a way out, or tricking the factor levels? The weights-based idea is sketched below.
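A minimal sketch of the Bayesian bootstrap idea: instead of resampling rows (which can drop rare factor levels entirely), draw Dirichlet(1, ..., 1) observation weights, so every row, and hence every factor level, keeps positive weight. The helper name is an assumption:

```r
bayesian_bootstrap_weights <- function(n) {
  g <- rexp(n)   # Dirichlet(1, ..., 1) via normalised exponentials
  g / sum(g)
}

set.seed(1)
w <- bayesian_bootstrap_weights(nrow(iris))
# every Species level stays in the weighted fit, unlike a plain resample
fit <- lm(Sepal.Length ~ Species, data = iris, weights = w)
```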
I like the idea of a robust bootstrap: if the models are robust, one should also take care that not considerably more outliers are accidentally sampled than are present in the original data. But it is not straightforward to use with a mix of differently scaled variables. The idea is to divide the observations into strata depending on their "outlyingness" and to sample from each stratum independently, as in the sketch below.
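A minimal sketch of that stratified sampling, assuming numeric data and robustbase for the outlyingness measure; the helper name and the stratum boundaries are arbitrary illustrations:

```r
library(robustbase)

stratified_boot_idx <- function(x, probs = c(0.5, 0.9)) {
  d      <- sqrt(covMcd(x)$mah)                       # robust distances
  strata <- cut(d, c(-Inf, quantile(d, probs), Inf))  # outlyingness strata
  unlist(lapply(split(seq_len(nrow(x)), strata), function(i)
    i[sample.int(length(i), length(i), replace = TRUE)]))
}

set.seed(1)
idx <- stratified_boot_idx(iris[, 1:4])  # resample within each stratum
```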
```r
# Model-based imputation framework function.
# General idea: one function that provides a framework for model-based
# imputation. Options should be:
# - sequential modelling
# - using PMM or not
# - bootstrapping the model error
# - drawing the predicted value from the "posterior" distribution
# - model options: (robust) regression, ranger, XGBoost, some kind of
#   transformer model
# - a complex formula (via `formula`) possible for each variable
vimpute <- function(data,
                    variable = colnames(data),
                    sequential = TRUE,
                    modeluncertainty = c("robustBootstrap", "none", "bootstrap",
                                         "robustBootstrap-stratified",
                                         "robustBootstrap-xyz",
                                         "BayesianBootstrap"),
                    # how to best deal with each method having its own parameters?
                    imputationuncertainty = c("PMM", "midastouch", "normal",
                                              "residual"),
                    pmm_k = NULL,  # if integer, use kNN on the predicted values;
                                   # length one, or one value per variable?
                    xvar = colnames(data),  # delete this?
                    # default methods for each kind of variable - as in mice -
                    # that one can override. Supported methods: "lm", "MM",
                    # "ranger", "XGBoost", "GPT", "gam", "robGam"
                    method = c("lm", "regression", "ranger", "XGBoost", "GPT"),
                    formula = NULL,  # possibility to override the individual
                                     # models; a list (one formula per variable)
                    imp_var = FALSE, imp_suffix = "imp",
                    verbose = FALSE) {
  modeluncertainty      <- match.arg(modeluncertainty)
  imputationuncertainty <- match.arg(imputationuncertainty)
  # implementation to be discussed
}
```
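For reference, a minimal sketch of what imputationuncertainty = "PMM" with a kNN donor pool (pmm_k) could mean; the helper and its signature are assumptions for illustration:

```r
pmm <- function(yhat_obs, yhat_mis, y_obs, k = 5) {
  vapply(yhat_mis, function(p) {
    donors <- order(abs(yhat_obs - p))[seq_len(k)]  # k closest predictions
    y_obs[sample(donors, 1)]                        # draw one observed donor
  }, numeric(1))
}

set.seed(1)
mis <- is.na(airquality$Ozone)
fit <- lm(Ozone ~ Temp + Wind, data = airquality)
imputed <- pmm(predict(fit),                               # fitted, observed rows
               predict(fit, newdata = airquality[mis, ]),  # predicted, missing rows
               airquality$Ozone[!mis])
```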
So, all in all, the real pain is packing everything into a sequential approach when variables are on different scales; see the sketch below for the all-numeric case.
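A minimal sketch of the sequential loop, fitting lm() throughout; dispatching per variable class is exactly where the complexity above comes in. The function name is an assumption:

```r
sequential_impute <- function(data, maxit = 5) {
  miss <- lapply(data, is.na)
  vars <- names(data)[vapply(miss, any, logical(1))]
  for (v in vars)                                   # crude mean start values
    data[[v]][miss[[v]]] <- mean(data[[v]], na.rm = TRUE)
  for (it in seq_len(maxit)) {
    for (v in vars) {                               # re-fit and re-impute
      fit <- lm(reformulate(setdiff(names(data), v), v),
                data = data[!miss[[v]], , drop = FALSE])
      data[[v]][miss[[v]]] <-
        predict(fit, newdata = data[miss[[v]], , drop = FALSE])
    }
  }
  data
}

head(sequential_impute(airquality))  # Ozone and Solar.R imputed jointly
```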
Just a technical note: deprecating the low-level functions (irmi, rangerImpute and regressionImpute) is not really necessary. We could just present the new high-level vimpute() in the docs as the "recommended way" instead. The advantage here is that the low-level functions keep their own man pages, which can go into detail about how the specific algorithm works and document parameters that are only applicable to that specific method (using ... to pass them down from vimpute()), roughly as sketched below.
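A hedged sketch of that ... pass-through in a stripped-down vimpute(); the dispatch glosses over the low-level functions' actual signatures, so treat it as the shape of the idea, not code to copy:

```r
vimpute <- function(data, method = "ranger", ...) {
  imputer <- switch(method,
                    ranger     = rangerImpute,
                    regression = regressionImpute,
                    irmi       = irmi)
  # method-specific options travel through ... and stay documented
  # on each low-level function's own man page
  imputer(data, ...)
}
```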
Looking at #73 and #74, maybe what we really should do is consolidate them into a new function and deprecate some functions (irmi, rangerImpute and regressionImpute)? @matthias-da @GregorDeCillia @JohannesGuss