Doesn't this carry the danger that the imputation method is chosen based on the performance on the test set?
Instead, we could combine the most frequently used imputation methods for numerical and categorical variables and create a report with the CV scores.
The users can then see which method works and use that one in the final pipeline. What do you think?
Indeed, if the user uses the test data to select that, it will introduce bias. However, the user can also use randomly sampled validation data (and not touch the test holdout) in order to perform this experiment. Probably the best way would indeed be cross-validation.
When it comes to the report, I think it is very dependent on the dataset you use and the type of model. It might be better to provide a module that simply runs an experiment. But we could indeed use the methods that are found to be most efficient in the literature, present them in a similar plot, and let the user choose from them.
1. The cross-validated performance should be used to evaluate the imputation method, as in the sketch below.
2. Yes, an experiment and a corresponding report should be the outcome. The users can then decide which method they want to go for.
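For reference, a minimal sketch of that workflow (illustrative only, plain scikit-learn on made-up data, not the proposed probatus code): the imputation method is selected with cross-validation on the training data, while the holdout is never touched.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer, KNNImputer
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import Pipeline

# Toy data with ~10% of the values knocked out to simulate missingness.
X, y = make_classification(n_samples=1000, n_features=10, random_state=42)
X[np.random.RandomState(42).rand(*X.shape) < 0.1] = np.nan

# Reserve a holdout that is never used for choosing the imputation method.
X_train, X_holdout, y_train, y_holdout = train_test_split(X, y, test_size=0.25, random_state=42)

for name, imputer in {"Simple (mean)": SimpleImputer(strategy="mean"),
                      "KNN": KNNImputer(n_neighbors=5)}.items():
    pipe = Pipeline([("imputer", imputer), ("clf", RandomForestClassifier(random_state=42))])
    scores = cross_val_score(pipe, X_train, y_train, cv=5, scoring="roc_auc")
    print(f"{name}: AUC {scores.mean():.3f} +/- {scores.std():.3f}")
```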
We can create a new module probatus.impute for comparing the performance of various imputation strategies.
The pseudocode can be as follows:
import pandas as pd
from sklearn.impute import SimpleImputer, KNNImputer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline


class CompareImputationStrategies:
    """Compare the performance of various imputation strategies."""

    def __init__(self):
        self.results = {}

    def fit(self, X, y, clf, strategies=('No Imputation', 'Simple', 'KNN'),
            cv=5, scoring='roc_auc'):
        """
        The fit method mainly takes in the parameters and checks that the
        data is correct, etc.

        As of now, we will deal with missing data in categorical variables
        with KNN and a missing-value indicator. Later we can use more complex
        methods like [MCA](https://napsterinblue.github.io/notes/stats/techniques/mca/).

        X : training set
        y : target
        clf : classifier used to evaluate
        strategies : list of imputation strategies to compare
        cv : cross-validation to use
        scoring : scoring parameter
        """
        self.X, self.y, self.clf = X, y, clf
        self.strategies, self.cv, self.scoring = list(strategies), cv, scoring
        return self

    def compute(self):
        """The major calculation is done in the compute method."""
        imputers = {'No Imputation': 'passthrough',
                    'Simple': SimpleImputer(),
                    'KNN': KNNImputer()}
        for strategy in self.strategies:
            # Create a pipeline with the imputation strategy and the classifier.
            pipeline = Pipeline([('imputer', imputers[strategy]), ('clf', self.clf)])
            # Evaluate the model performance with cross_val_score and store the results.
            self.results[strategy] = cross_val_score(
                pipeline, self.X, self.y, cv=self.cv, scoring=self.scoring)
        self.plot()
        return self.results

    def fit_compute(self, *args, **kwargs):
        """Fit and compute in one call."""
        self.fit(*args, **kwargs)
        return self.compute()

    def plot(self):
        """Plot the results of the comparison (very similar to the sklearn example)."""
        pd.DataFrame(self.results).boxplot()
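To make the intended interface concrete, here is a hypothetical usage example, assuming the class sketched above (LightGBM's LGBMClassifier is used only because it accepts NaNs, which makes the 'No Imputation' strategy meaningful; any sklearn-compatible classifier would do):

```python
import numpy as np
from lightgbm import LGBMClassifier
from sklearn.datasets import make_classification

# Toy data with some injected missing values.
X, y = make_classification(n_samples=500, n_features=8, random_state=0)
X[np.random.RandomState(0).rand(*X.shape) < 0.1] = np.nan

cmp = CompareImputationStrategies()
results = cmp.fit_compute(X, y, clf=LGBMClassifier(),
                          strategies=['No Imputation', 'Simple', 'KNN'],
                          cv=5, scoring='roc_auc')
```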
Thanks to this the users can compare different imputation strategies on their dataset, see which one performs best in cross-validation, and use that one in the final pipeline.
Overall looks good!
Some comments:
The class should inherit from BaseFitComputePlotClass to ensure consistency.

Thanks for the quick comments. Good point about the sklearn objects.
Hi Anil, a couple more points that just popped into my mind:

- Consider allowing multiple models to be passed via the clf parameter. This way you can compute the imputation on X only once per imputation method, and then try out multiple models on those datasets, instead of rerunning the whole computation for each model you want to try. You can also try to use the same cross-validation splits. The logic would be: clf=[XGBClassifier, Pipeline(OneHotEncoding+LogisticRegression)]. One issue that needs to be solved is how to plot the names of the models correctly. Maybe we can add an optional parameter clf_name=None in the init, just for convenience in the report and plotting.
- Consider adding verbose for printing warnings and random_state to ensure reproducibility of the results. Please have a look at the other features to see how these are used.

Good points. With the current implementation we would be able to achieve most of the above points. Plus it is in line with the Probatus interfaces.
The point about passing multiple models is a good idea, however it will complicate the implementation and may confuse the users. To keep the implementation simple and make it do only one thing, as of now the users can pass a single classifier and multiple strategies to test. In case a user plans to test many classifiers, they can run the comparison within a loop. In that case the users can keep track of the models and the imputation results.
It might complicate the clf parameter indeed. However, I think if we allow this as an option, next to just passing the model normally, it should not be that bad. Maybe we can pass it as a dict, the same way as we pass imputation strategies now.
The main advantage of having it like this, instead of the loop, is that you only have to apply each imputation strategy once. If you use it in the loop for every model, then e.g. KNN or iterative imputation has to be run several times, and it is very costly. What do you think?
We could also have it as a possible future improvement.
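For illustration, a rough sketch of that idea (a hypothetical helper, not the proposed probatus interface; X and y are assumed to be numpy arrays): the imputer is fitted once per CV fold and the imputed folds are reused for every model, so a costly imputer such as KNNImputer runs only n_splits times no matter how many models are compared.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import KNNImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold


def compare_models_on_imputed_folds(X, y, imputer, models, n_splits=5, random_state=0):
    """Fit the (costly) imputer once per fold and reuse the imputed folds for every model."""
    cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=random_state)
    folds = []
    for train_idx, test_idx in cv.split(X, y):
        imp = clone(imputer).fit(X[train_idx])
        folds.append((imp.transform(X[train_idx]), y[train_idx],
                      imp.transform(X[test_idx]), y[test_idx]))
    results = {}
    for name, model in models.items():
        scores = [roc_auc_score(y_te, clone(model).fit(X_tr, y_tr).predict_proba(X_te)[:, 1])
                  for X_tr, y_tr, X_te, y_te in folds]
        results[name] = (np.mean(scores), np.std(scores))
    return results


# Both models share the same KNN-imputed folds, so KNNImputer runs only n_splits times.
X, y = make_classification(n_samples=500, n_features=8, random_state=0)
X[np.random.RandomState(0).rand(*X.shape) < 0.1] = np.nan
print(compare_models_on_imputed_folds(X, y, KNNImputer(), {
    "logreg": LogisticRegression(max_iter=1000),
    "rf": RandomForestClassifier(random_state=0),
}))
```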
One of the frequently encountered questions in modelling is how to impute missing values.
An interesting approach to tackle this is the following: the user provides the data, the metrics, the models, and the imputation methods to compare.
Then probatus would run the metric volatility analysis to compute the mean and std on train and test for these metrics, models and different imputation methods. Thanks to this the user can select which method to use for a given dataset and model, and understand how the volatility of the metrics is affected by it.
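As a rough illustration of the described experiment (plain scikit-learn and numpy, not the actual probatus metric-volatility API): for each imputation method, the train/test split is repeated several times and the mean and std of the metric are reported on both train and test.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer, KNNImputer
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

X, y = make_classification(n_samples=800, n_features=10, random_state=1)
X[np.random.RandomState(1).rand(*X.shape) < 0.1] = np.nan  # inject missing values

for name, imputer in {"Simple": SimpleImputer(), "KNN": KNNImputer()}.items():
    train_scores, test_scores = [], []
    for seed in range(10):  # repeated random splits to measure metric volatility
        X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=seed)
        pipe = Pipeline([("imputer", imputer),
                         ("clf", RandomForestClassifier(random_state=seed))]).fit(X_tr, y_tr)
        train_scores.append(roc_auc_score(y_tr, pipe.predict_proba(X_tr)[:, 1]))
        test_scores.append(roc_auc_score(y_te, pipe.predict_proba(X_te)[:, 1]))
    print(f"{name}: train AUC {np.mean(train_scores):.3f} +/- {np.std(train_scores):.3f}, "
          f"test AUC {np.mean(test_scores):.3f} +/- {np.std(test_scores):.3f}")
```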