Doesn't this carry the danger that the imputation method is chosen based on the performance on the test set?
Instead, we could combine the most frequently used imputation methods for numerical and categorical variables and create a report with the CV scores.
The users can then see which method works and use that one in the final pipeline. What do you think?
Indeed, if the user uses the test data to select that, it will introduce bias. However, the user can also use randomly sampled validation data (and not touch the test holdout) in order to perform this experiment. Probably the best way would indeed be cross-validation.
When it comes to the report, I think it is very dependent on the dataset you use and the type of model. It might be better to provide a module that simply runs an experiment. But we could indeed use the methods that are found to be most efficient in the literature, present them in a similar plot, and let the user choose from them.
1. The cross-validated performance should be used to evaluate the imputation method, as in the sketch below.
2. Yes, an experiment and a corresponding report should be the outcome. The users can then decide which method they want to go for.
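For reference, a minimal sketch of that workflow (illustrative only, plain scikit-learn on made-up data, not the proposed probatus code): the imputation method is selected with cross-validation on the training data, while the holdout is never touched.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer, KNNImputer
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import Pipeline

# Toy data with ~10% of the values knocked out to simulate missingness.
X, y = make_classification(n_samples=1000, n_features=10, random_state=42)
X[np.random.RandomState(42).rand(*X.shape) < 0.1] = np.nan

# Reserve a holdout that is never used for choosing the imputation method.
X_train, X_holdout, y_train, y_holdout = train_test_split(X, y, test_size=0.25, random_state=42)

for name, imputer in {"Simple (mean)": SimpleImputer(strategy="mean"),
                      "KNN": KNNImputer(n_neighbors=5)}.items():
    pipe = Pipeline([("imputer", imputer), ("clf", RandomForestClassifier(random_state=42))])
    scores = cross_val_score(pipe, X_train, y_train, cv=5, scoring="roc_auc")
    print(f"{name}: AUC {scores.mean():.3f} +/- {scores.std():.3f}")
```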
We can create a new module probatus.impute for comparing the performance of various imputation strategies.
The pseudocode can be as follows:
import pandas as pd
from sklearn.impute import SimpleImputer, KNNImputer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline


class CompareImputationStrategies:
    """Compare the performance of various imputation strategies."""

    def __init__(self):
        self.results = {}

    def fit(self, X, y, clf, strategies=('No Imputation', 'Simple', 'KNN'),
            cv=5, scoring='roc_auc'):
        """
        The fit method mainly takes in the parameters and checks that the
        data is correct, etc.

        As of now, we will deal with missing data in categorical variables
        with KNN and a missing-value indicator. Later we can use more complex
        methods like [MCA](https://napsterinblue.github.io/notes/stats/techniques/mca/).

        X : training set
        y : target
        clf : classifier used to evaluate
        strategies : list of imputation strategies to compare
        cv : cross-validation to use
        scoring : scoring parameter
        """
        self.X, self.y, self.clf = X, y, clf
        self.strategies, self.cv, self.scoring = list(strategies), cv, scoring
        return self

    def compute(self):
        """The major calculation is done in the compute method."""
        imputers = {'No Imputation': 'passthrough',
                    'Simple': SimpleImputer(),
                    'KNN': KNNImputer()}
        for strategy in self.strategies:
            # Create a pipeline with the imputation strategy and the classifier.
            pipeline = Pipeline([('imputer', imputers[strategy]), ('clf', self.clf)])
            # Evaluate the model performance with cross_val_score and store the results.
            self.results[strategy] = cross_val_score(
                pipeline, self.X, self.y, cv=self.cv, scoring=self.scoring)
        self.plot()
        return self.results

    def fit_compute(self, *args, **kwargs):
        """Fit and compute in one call."""
        self.fit(*args, **kwargs)
        return self.compute()

    def plot(self):
        """Plot the results of the comparison (very similar to the sklearn example)."""
        pd.DataFrame(self.results).boxplot()
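To make the intended interface concrete, here is a hypothetical usage example, assuming the class sketched above (LightGBM's LGBMClassifier is used only because it accepts NaNs, which makes the 'No Imputation' strategy meaningful; any sklearn-compatible classifier would do):

```python
import numpy as np
from lightgbm import LGBMClassifier
from sklearn.datasets import make_classification

# Toy data with some injected missing values.
X, y = make_classification(n_samples=500, n_features=8, random_state=0)
X[np.random.RandomState(0).rand(*X.shape) < 0.1] = np.nan

cmp = CompareImputationStrategies()
results = cmp.fit_compute(X, y, clf=LGBMClassifier(),
                          strategies=['No Imputation', 'Simple', 'KNN'],
                          cv=5, scoring='roc_auc')
```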
Thanks to this the users can compare different imputation strategies on their dataset, see which one performs best in cross-validation, and use that one in the final pipeline.
Overall looks good!
Some comments:
The class should inherit from BaseFitComputePlotClass to ensure consistency.

Thanks for the quick comments. Good point about the sklearn objects.
Hi Anil, a couple more points that just popped into my mind:

- Consider allowing multiple models to be passed via the clf parameter. This way you can compute the imputation on X only once per imputation method, and then try out multiple models on those datasets, instead of rerunning the whole computation for each model you want to try. You can also try to use the same cross-validation splits. The logic would be: clf=[XGBClassifier, Pipeline(OneHotEncoding+LogisticRegression)]. One issue that needs to be solved is how to plot the names of the models correctly. Maybe we can add an optional parameter clf_name=None in the init, just for convenience in the report and plotting.
- Consider adding verbose for printing warnings and random_state to ensure reproducibility of the results. Please have a look at the other features to see how these are used.

Good points. With the current implementation we would be able to achieve most of the above points. Plus it is in line with the Probatus interfaces.
The point about passing multiple models is a good idea, however it will complicate the implementation and may confuse the users. To keep the implementation simple and make it do only one thing, as of now the users can pass a single classifier and multiple strategies to test. In case a user plans to test many classifiers, they can run the comparison within a loop. In that case the users can keep track of the models and the imputation results.
It might complicate the clf parameter indeed. However, I think if we allow this as an option, next to just passing the model normally, it should not be that bad. Maybe we can pass it as a dict, the same way as we pass imputation strategies now.
The main advantage of having it like this, instead of the loop, is that you only have to apply each imputation strategy once. If you use it in the loop for every model, then e.g. KNN or iterative imputation has to be run several times, and it is very costly. What do you think?
We could also have it as a possible future improvement.
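For illustration, a rough sketch of that idea (a hypothetical helper, not the proposed probatus interface; X and y are assumed to be numpy arrays): the imputer is fitted once per CV fold and the imputed folds are reused for every model, so a costly imputer such as KNNImputer runs only n_splits times no matter how many models are compared.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import KNNImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold


def compare_models_on_imputed_folds(X, y, imputer, models, n_splits=5, random_state=0):
    """Fit the (costly) imputer once per fold and reuse the imputed folds for every model."""
    cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=random_state)
    folds = []
    for train_idx, test_idx in cv.split(X, y):
        imp = clone(imputer).fit(X[train_idx])
        folds.append((imp.transform(X[train_idx]), y[train_idx],
                      imp.transform(X[test_idx]), y[test_idx]))
    results = {}
    for name, model in models.items():
        scores = [roc_auc_score(y_te, clone(model).fit(X_tr, y_tr).predict_proba(X_te)[:, 1])
                  for X_tr, y_tr, X_te, y_te in folds]
        results[name] = (np.mean(scores), np.std(scores))
    return results


# Both models share the same KNN-imputed folds, so KNNImputer runs only n_splits times.
X, y = make_classification(n_samples=500, n_features=8, random_state=0)
X[np.random.RandomState(0).rand(*X.shape) < 0.1] = np.nan
print(compare_models_on_imputed_folds(X, y, KNNImputer(), {
    "logreg": LogisticRegression(max_iter=1000),
    "rf": RandomForestClassifier(random_state=0),
}))
```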
One of the frequently encountered questions in modelling is how to impute missing values.
An interesting approach to tackle this is the following: the user provides the data, the metrics, the models, and the imputation methods to compare.
Then probatus would run the metric volatility analysis to compute the mean and std on train and test for these metrics, models and different imputation methods. Thanks to this the user can select which method to use for a given dataset and model, and understand how the volatility of the metrics is affected by it.
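As a rough illustration of the described experiment (plain scikit-learn and numpy, not the actual probatus metric-volatility API): for each imputation method, the train/test split is repeated several times and the mean and std of the metric are reported on both train and test.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer, KNNImputer
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

X, y = make_classification(n_samples=800, n_features=10, random_state=1)
X[np.random.RandomState(1).rand(*X.shape) < 0.1] = np.nan  # inject missing values

for name, imputer in {"Simple": SimpleImputer(), "KNN": KNNImputer()}.items():
    train_scores, test_scores = [], []
    for seed in range(10):  # repeated random splits to measure metric volatility
        X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=seed)
        pipe = Pipeline([("imputer", imputer),
                         ("clf", RandomForestClassifier(random_state=seed))]).fit(X_tr, y_tr)
        train_scores.append(roc_auc_score(y_tr, pipe.predict_proba(X_tr)[:, 1]))
        test_scores.append(roc_auc_score(y_te, pipe.predict_proba(X_te)[:, 1]))
    print(f"{name}: train AUC {np.mean(train_scores):.3f} +/- {np.std(train_scores):.3f}, "
          f"test AUC {np.mean(test_scores):.3f} +/- {np.std(test_scores):.3f}")
```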