sylvaticus / BetaML.jl

Beta Machine Learning Toolkit
MIT License

Rename/Alias `GeneralImputer` to `MICE` #59

Open ParadaCarleton opened 11 months ago

ParadaCarleton commented 11 months ago

The algorithm listed as GeneralImputer here is more widely known as MICE (Multiple Imputation by Chained Equations) in statistics. I'm not sure if the name used here is standard in ML, but the lack of a solid MICE implementation is a common complaint in the Julia statistics ecosystem, so I was very surprised to stumble across this pure-Julia implementation of MICE under a completely different name. Would it make sense to either rename or alias GeneralImputer to make this easier to discover?

sylvaticus commented 11 months ago

Hmmm... I am aware of the MICE package in R, but there the idea is that the multiple imputations are "chained" along the whole statistical procedure. Also, I am not a super fan of their usage in ML models in general. The issue is that there is no guarantee on the origin of the differences between the various imputations; there is no probabilistic model determining them. Sometimes they even depend on parameters of the imputation algorithm. So the variance between imputations cannot be taken as a measure of the quality of, or trust in, the imputation. But for sure I should add MICE to the models' docstring...

ParadaCarleton commented 10 months ago

> Hmmm... I am aware of the MICE package in R, but there the idea is that the multiple imputations are "chained" along the whole statistical procedure.

I'm not sure what you mean here; sorry :sweat_smile: Is this different from GeneralImputer? The docstring is a bit vague.

> The issue is that there is no guarantee on the origin of the differences between the various imputations; there is no probabilistic model determining them. Sometimes they even depend on parameters of the imputation algorithm. So the variance between imputations cannot be taken as a measure of the quality of, or trust in, the imputation.

If you're doing cross-validation or some other resampling strategy, shouldn't that give a good estimate of the model-based uncertainty? Although you could try something fancier (like a Bayesian bootstrap or other ensemble model).

sylvaticus commented 10 months ago

You may be interested in this new package: https://github.com/tom-metherell/Mice.jl

Compared to the imputers in BetaML, it provides pooling of the analyses you perform using the imputed values, which you don't have here (here you just get the multiple imputations in a vector).
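For context, the pooling step that Mice.jl provides on top of the raw imputations usually means combining per-imputation analysis results with Rubin's rules. A minimal sketch of that combination step (plain numpy, not BetaML's or Mice.jl's API; the function name `pool` is just illustrative):

```python
import numpy as np

def pool(estimates, variances):
    """Combine per-imputation point estimates and their sampling
    variances using Rubin's rules."""
    m = len(estimates)
    q_bar = float(np.mean(estimates))        # pooled point estimate
    w = float(np.mean(variances))            # within-imputation variance
    b = float(np.var(estimates, ddof=1))     # between-imputation variance
    t = w + (1 + 1 / m) * b                  # total variance of q_bar
    return q_bar, t
```

The total variance adds the between-imputation spread (inflated by `1 + 1/m`) to the average within-imputation variance, which is exactly the step that is missing if you only receive the imputed datasets in a vector.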

Conversely, BetaML supports random forests, which in my (limited) experience do a better job than PMM (predictive mean matching) on real datasets from which I erased some data at random and then checked the quality of the imputation.
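The evaluation procedure described above (erase known values at random, impute, compare against the truth) can be sketched as follows. This is a generic numpy illustration, not BetaML code; the helper name `masked_rmse` and the trivial column-mean fallback imputer are assumptions for the example:

```python
import numpy as np

def masked_rmse(X, frac=0.2, seed=0, imputer=None):
    """Erase a random fraction of entries, impute them, and return the
    RMSE between the imputed and the true (held-out) values."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    mask = rng.random(X.shape) < frac
    if not mask.any():
        mask.flat[0] = True  # guarantee at least one held-out entry
    Xm = X.copy()
    Xm[mask] = np.nan
    # Trivial stand-in imputer (column means); swap in any real one,
    # e.g. a random-forest or PMM based imputer.
    if imputer is None:
        imputer = lambda A: np.where(np.isnan(A), np.nanmean(A, axis=0), A)
    Xi = imputer(Xm)
    return float(np.sqrt(np.mean((Xi[mask] - X[mask]) ** 2)))
```

Running this with two competing imputers on the same dataset and mask gives a direct quality comparison of the kind mentioned above.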

ParadaCarleton commented 10 months ago

> Compared to the imputers in BetaML, it provides pooling of the analyses you perform using the imputed values, which you don't have here (here you just get the multiple imputations in a vector).

As in, BetaML just performs one imputation per missing data point, by randomly sampling a possible imputed value?

sylvaticus commented 10 months ago

> As in, BetaML just performs one imputation per missing data point, by randomly sampling a possible imputed value?

No. Let's consider some tabular data with records as N rows and dimensions as C columns. For each imputation, BetaML builds C supervised models, one per column c, each predicting c as a function of the remaining C−1 columns, and then uses these models to predict the missing values. There is no "sampling" of the missing values: each imputation is an independent set of models and their predictions, and the output is a vector of the imputed tables. What distinguishes each imputation is the randomness specific to each supervised model. For example, in random forests it comes from the records used to train each individual decision tree and the subset of dimensions employed for that tree; for a neural network estimator it would be the initial weights of the network's layers; etc.
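The scheme described above can be sketched in a few lines. This is an illustrative re-implementation in numpy, not BetaML's actual code: here each per-column model is a linear least-squares fit on a bootstrap resample of the observed rows, standing in for the random-forest estimators (the bootstrap plays the role of the per-model randomness that differentiates the imputations):

```python
import numpy as np

def impute(X, n_imputations=3, seed=0):
    """Return a list of imputed copies of X, where NaN marks missing.
    Each imputation fits one model per column (predicting it from the
    other columns) and fills the missing entries with its predictions."""
    rng = np.random.default_rng(seed)
    out = []
    for _ in range(n_imputations):
        Xi = np.asarray(X, dtype=float).copy()
        miss = np.isnan(Xi)
        # Initial fill with column means so every model sees complete inputs.
        Xi[miss] = np.take(np.nanmean(Xi, axis=0), np.nonzero(miss)[1])
        for c in range(Xi.shape[1]):
            rows = miss[:, c]
            if not rows.any():
                continue
            others = [j for j in range(Xi.shape[1]) if j != c]
            obs = np.nonzero(~rows)[0]
            # Bootstrap the observed rows: the only source of variation
            # between imputations, mimicking the forest's randomness.
            boot = rng.choice(obs, size=obs.size, replace=True)
            A = np.c_[np.ones(boot.size), Xi[boot][:, others]]
            coef, *_ = np.linalg.lstsq(A, Xi[boot, c], rcond=None)
            Xi[rows, c] = np.c_[np.ones(rows.sum()), Xi[rows][:, others]] @ coef
        out.append(Xi)
    return out
```

Note there is no sampling of the imputed values themselves; each element of the returned vector is a deterministic prediction from an independently trained set of models, matching the description above.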