JustinKurland opened this issue 2 years ago
@JustinKurland thank you! It looks like all the metrics are available in sklearn, so they should be easy to add. In the meantime, the custom metric parameter can be used (for details please check https://github.com/mljar/mljar-supervised/issues/390#issuecomment-830049603).
@pplonski Many thanks, this was helpful. Along the same vein, I am wondering what, if anything, can be done for imbalanced classification problems. I have seen here (and in the Slack exchange with Diogo) mention of the function `_check_imbalanced`, and later saw you reference `_handle_drastic_imbalance` in git; however, I do not see either in the codebase. I am presently dealing with an imbalanced classification problem and would like to leverage mljar, but see no way to do so given the problem set. At present, results are, as you might expect, less than optimal, because I am uncertain how to address the imbalance in mljar.
Typically, using xgboost and Optuna, I would leverage `scale_pos_weight` when only AUC/class ranking was of importance, or `max_delta_step` (if this was solely xgboost) when predicting the correct probabilities mattered, and then focus on the `subsample`, `colsample_bytree`, `max_depth`, `min_child_weight`, `eta`, and `gamma` hyperparameters.
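For reference, a minimal sketch of how `scale_pos_weight` is commonly derived: the standard XGBoost guidance is the ratio of negative to positive examples. The labels below are purely illustrative:

```python
# Derive scale_pos_weight as (# negative) / (# positive),
# the usual XGBoost heuristic for imbalanced binary targets.
y = [0] * 950 + [1] * 50  # illustrative 95/5 class imbalance

n_pos = sum(1 for label in y if label == 1)
n_neg = len(y) - n_pos
scale_pos_weight = n_neg / n_pos

print(scale_pos_weight)  # 19.0
```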
Any suggestions/guidance on how I might take advantage of mljar for my use case would be greatly appreciated. Thanks again for this great work, @pplonski!
Right now, you can manually under/over-sample your dataset and then run MLJAR AutoML. There is a nice package for this: https://github.com/scikit-learn-contrib/imbalanced-learn
Maybe you can even do this in a simple loop, where you apply the transformation and then run AutoML. If you find a nice way of doing this, it would be very helpful for others if you could provide some code samples (in a separate issue with a proper title so they can be found). Maybe they should be added to the docs?
On the other hand, it would be nice to have methods for handling imbalanced problems built into AutoML.
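A minimal sketch of the oversample-then-train loop described above, using plain random oversampling with stdlib only. The `random_oversample` helper and the data are illustrative; the AutoML call is shown in a comment since running it requires mljar-supervised:

```python
import random

def random_oversample(X, y, seed=42):
    """Duplicate minority-class rows (with replacement) until both classes match in size."""
    rng = random.Random(seed)
    pos = [(x, t) for x, t in zip(X, y) if t == 1]
    neg = [(x, t) for x, t in zip(X, y) if t == 0]
    minority, majority = (pos, neg) if len(pos) < len(neg) else (neg, pos)
    # Sample with replacement from the minority class up to the majority size.
    resampled = majority + [rng.choice(minority) for _ in range(len(majority))]
    rng.shuffle(resampled)
    X_res = [x for x, _ in resampled]
    y_res = [t for _, t in resampled]
    return X_res, y_res

X = [[i] for i in range(10)]
y = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]  # 1 positive, 9 negatives

X_res, y_res = random_oversample(X, y)
print(sum(y_res), len(y_res) - sum(y_res))  # 9 9 -- classes now balanced

# On the balanced data, MLJAR AutoML could then be run as usual, e.g.:
# automl = AutoML(mode="Optuna", eval_metric="f1")
# automl.fit(X_res, y_res)
```

In practice, the hypothetical `random_oversample` helper would be replaced by imbalanced-learn's `RandomOverSampler` or `SMOTE`.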
@pplonski absolutely. In thinking about this, when initiating AutoML as follows:
```python
automl = AutoML(mode="Optuna", eval_metric="accuracy", optuna_time_budget=60*10)
```

it would be really beneficial to have an additional param, e.g., `imbalanced=True`, where the default is `False` and the current MLJAR AutoML functionality remains the same, but in the event there is imbalance it could be enabled simply by:

```python
automl = AutoML(mode="Optuna", eval_metric="accuracy", imbalanced=True, optuna_time_budget=60*10)
```
such that when this param is `True`, the hyperparameter tuning, specifically for the `Optuna` mode, is modified so that the hyperparameters of xgboost, lightgbm, catboost, etc. that matter for imbalanced problems receive greater focus. I may be able to help with this; as noted, I am presently working on this problem and have begun to create Optuna functions for this in xgboost, but this would obviously need to be extended to all learners, so it will take some time.
I actually believe such a change would give MLJAR a big advantage over other current AutoML packages, as I have not found either h2o or autogluon to be particularly strong in this area.
> such that when this param is True the hyperparameter tuning, specifically for the Optuna mode, is modified and hyperparameters for xgboost, lightgbm, catboost, etc. that are indeed necessary for such imbalanced problems are addressed with greater focus.
Maybe you can prepare a list of which parameters should be tuned for imbalanced problems, and what ranges should be applied if `imbalanced=True`? With such a list, the code changes to add `imbalanced=True` should be easy. For the first version, it will be nice to cover xgboost, lightgbm, and catboost.
> Maybe you can prepare a list of which parameters should be tuned for imbalanced problems, and what ranges should be applied if `imbalanced=True`? With such a list, the code changes to add `imbalanced=True` should be easy. For the first version, it will be nice to cover xgboost, lightgbm, and catboost.
I will be doing some work on lightgbm and catboost in the next several weeks, but for xgboost via Optuna:

- `subsample`: lower ratios help avoid overfitting, e.g., `trial.suggest_float("subsample", 0.5, 1.0)`
- `colsample_bytree`: lower ratios help avoid overfitting
- `max_depth`: lower values help avoid overfitting, e.g., `trial.suggest_int("max_depth", 1, 10, step=1)`
- `min_child_weight`: higher values help avoid overfitting
- `eta`: lower values help avoid overfitting, e.g., `trial.suggest_float("eta", 1e-8, 0.2, log=True)`
- `gamma`: higher values help avoid overfitting

Obviously each of these comes with tradeoffs, but they are the xgboost hyperparameters that need attention for imbalanced classification problems. I will provide a more complete list and ranges for these parameters. 😄
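The ranges above could be collected into a single search-space spec that a hypothetical `imbalanced=True` code path might consume. This is only a sketch: the name `XGBOOST_IMBALANCED_SPACE`, the dict format, and the ranges for `colsample_bytree`, `min_child_weight`, and `gamma` (which the list does not give) are assumptions, not mljar's actual configuration:

```python
# Hypothetical search-space spec for xgboost when imbalanced=True,
# mirroring the trial.suggest_* calls listed above. Ranges without a
# source in the list (colsample_bytree, min_child_weight, gamma) are guesses.
XGBOOST_IMBALANCED_SPACE = {
    "subsample":        {"type": "float", "low": 0.5,  "high": 1.0},
    "colsample_bytree": {"type": "float", "low": 0.5,  "high": 1.0},
    "max_depth":        {"type": "int",   "low": 1,    "high": 10, "step": 1},
    "min_child_weight": {"type": "int",   "low": 1,    "high": 100},
    "eta":              {"type": "float", "low": 1e-8, "high": 0.2, "log": True},
    "gamma":            {"type": "float", "low": 1e-8, "high": 10.0, "log": True},
}

def suggest_params(trial, space=XGBOOST_IMBALANCED_SPACE):
    """Translate the spec into Optuna trial.suggest_int / suggest_float calls."""
    params = {}
    for name, s in space.items():
        if s["type"] == "int":
            params[name] = trial.suggest_int(name, s["low"], s["high"], step=s.get("step", 1))
        else:
            params[name] = trial.suggest_float(name, s["low"], s["high"], log=s.get("log", False))
    return params
```

Keeping the spec as plain data means the same ranges could be reused across mljar's per-learner tuning code without duplicating suggest calls.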
> …dealing with an imbalanced classification problem and would like to leverage mljar, but see no way to do so given the problem set. At present results as you might ex…

Did you succeed?
There are several evaluation metrics that would be particularly beneficial for (binary) imbalanced classification problems and would be greatly appreciated additions. In terms of prioritizing implementation (and likely ease of implementation), I will rank-order these:
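As context for why plain accuracy misleads on imbalanced data, a small sketch computing a few imbalance-aware metrics by hand in pure Python (the labels are illustrative; in practice sklearn's `f1_score` and `balanced_accuracy_score` could be passed through the custom eval metric hook mentioned earlier):

```python
def confusion(y_true, y_pred):
    """Return (tp, fp, fn, tn) counts for binary labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn

# 90/10 imbalance; a degenerate classifier that always predicts the majority class.
y_true = [0] * 90 + [1] * 10
y_pred = [0] * 100

tp, fp, fn, tn = confusion(y_true, y_pred)
accuracy = (tp + tn) / len(y_true)                # 0.9 -- looks great
recall = tp / (tp + fn)                           # 0.0 -- misses every positive
specificity = tn / (tn + fp)                      # 1.0
balanced_accuracy = (recall + specificity) / 2    # 0.5 -- no better than chance
f1 = 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 0.0  # 0.0

print(accuracy, balanced_accuracy, f1)  # 0.9 0.5 0.0
```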