JustinKurland opened this issue 2 years ago
@JustinKurland thank you! It looks like all the metrics are available in sklearn, so they should be easy to add. In the meantime, the custom metric parameter can be used (for details please check https://github.com/mljar/mljar-supervised/issues/390#issuecomment-830049603).
@pplonski Many thanks, this was helpful. Along the same vein, I am wondering what, if anything, can be done for imbalanced classification problems. I have seen here (and in the Slack exchange with Diogo) mention of the function `_check_imbalanced`, and later saw you reference `_handle_drastic_imbalance` in git; however, I do not see either in the codebase. I am presently dealing with an imbalanced classification problem and would like to leverage mljar, but see no way to do so given the problem set. At present, results are, as you might expect, less than optimal, because I am uncertain how to address the imbalance in mljar.
Typically, using xgboost and Optuna, I would leverage `scale_pos_weight` when only AUC/class ranking was of importance, or `max_delta_step` (if this was solely xgboost) when predicting the correct probabilities mattered, and then focus on the `subsample`, `colsample_bytree`, `max_depth`, `min_child_weight`, `eta`, and `gamma` hyperparameters.
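For reference, a minimal sketch of how `scale_pos_weight` is commonly derived: the standard XGBoost guidance is the ratio of negative to positive examples. The labels below are purely illustrative:

```python
# Derive scale_pos_weight as (# negative) / (# positive),
# the usual XGBoost heuristic for imbalanced binary targets.
y = [0] * 950 + [1] * 50  # illustrative 95/5 class imbalance

n_pos = sum(1 for label in y if label == 1)
n_neg = len(y) - n_pos
scale_pos_weight = n_neg / n_pos

print(scale_pos_weight)  # 19.0
```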
Any suggestions/guidance on how I might take advantage of mljar for my use case would be greatly appreciated. Thanks again for this great work, @pplonski!
Right now, you can manually under/over-sample your dataset and then run MLJAR AutoML. There is a nice package for this: https://github.com/scikit-learn-contrib/imbalanced-learn
Maybe you can even do this in a simple loop, where you apply the transformation and then run AutoML. If you find a nice way of doing this, it would be very helpful for others if you could provide some code samples (in a separate issue with a proper title so they can be found). Maybe they should be added to the docs?
On the other hand, it would be nice to have methods for handling imbalanced problems built into AutoML.
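A minimal sketch of the oversample-then-train loop described above, using plain random oversampling with stdlib only. The `random_oversample` helper and the data are illustrative; the AutoML call is shown in a comment since running it requires mljar-supervised:

```python
import random

def random_oversample(X, y, seed=42):
    """Duplicate minority-class rows (with replacement) until both classes match in size."""
    rng = random.Random(seed)
    pos = [(x, t) for x, t in zip(X, y) if t == 1]
    neg = [(x, t) for x, t in zip(X, y) if t == 0]
    minority, majority = (pos, neg) if len(pos) < len(neg) else (neg, pos)
    # Sample with replacement from the minority class up to the majority size.
    resampled = majority + [rng.choice(minority) for _ in range(len(majority))]
    rng.shuffle(resampled)
    X_res = [x for x, _ in resampled]
    y_res = [t for _, t in resampled]
    return X_res, y_res

X = [[i] for i in range(10)]
y = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]  # 1 positive, 9 negatives

X_res, y_res = random_oversample(X, y)
print(sum(y_res), len(y_res) - sum(y_res))  # 9 9 -- classes now balanced

# On the balanced data, MLJAR AutoML could then be run as usual, e.g.:
# automl = AutoML(mode="Optuna", eval_metric="f1")
# automl.fit(X_res, y_res)
```

In practice, the hypothetical `random_oversample` helper would be replaced by imbalanced-learn's `RandomOverSampler` or `SMOTE`.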
@pplonski absolutely. In thinking about this, when initiating AutoML as follows:
```python
automl = AutoML(mode="Optuna", eval_metric="accuracy", optuna_time_budget=60*10)
```

it would be really beneficial to have an additional param, e.g., `imbalanced=True`, where the default is `False` and the current MLJAR AutoML functionality remains the same, but in the event there is imbalance it could be enabled simply by:

```python
automl = AutoML(mode="Optuna", eval_metric="accuracy", imbalanced=True, optuna_time_budget=60*10)
```
such that when this param is `True`, the hyperparameter tuning, specifically for the `Optuna` mode, is modified so that the hyperparameters of xgboost, lightgbm, catboost, etc. that matter for imbalanced problems receive greater focus. I may be able to help with this; as noted, I am presently working on this problem and have begun to create Optuna functions for this in xgboost, but this would obviously need to be extended to all learners, so it will take some time.
I actually believe such a change would give MLJAR a big advantage over other current AutoML packages, as I have not found either h2o or autogluon to be particularly strong in this area.
> such that when this param is True the hyperparameter tuning, specifically for the Optuna mode, is modified and hyperparameters for xgboost, lightgbm, catboost, etc. that are indeed necessary for such imbalanced problems are addressed with greater focus.
Maybe you can prepare a list of which parameters should be tuned for imbalanced problems, and what ranges should be applied if `imbalanced=True`? With such a list, the code changes to add `imbalanced=True` should be easy. For the first version, it will be nice to cover xgboost, lightgbm, and catboost.
> Maybe you can prepare a list of which parameters should be tuned for imbalanced problems, and what ranges should be applied if `imbalanced=True`? With such a list, the code changes to add `imbalanced=True` should be easy. For the first version, it will be nice to cover xgboost, lightgbm, and catboost.
I will be doing some work on lightgbm and catboost in the next several weeks, but for xgboost via Optuna:

- `subsample`: lower ratios help avoid overfitting, e.g., `trial.suggest_float("subsample", 0.5, 1.0)`
- `colsample_bytree`: lower ratios help avoid overfitting
- `max_depth`: lower values help avoid overfitting, e.g., `trial.suggest_int("max_depth", 1, 10, step=1)`
- `min_child_weight`: higher values help avoid overfitting
- `eta`: lower values help avoid overfitting, e.g., `trial.suggest_float("eta", 1e-8, 0.2, log=True)`
- `gamma`: higher values help avoid overfitting

Obviously each of these comes with tradeoffs, but they are the xgboost hyperparameters that need attention for imbalanced classification problems. I will provide a more complete list and ranges for these parameters. 😄
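The ranges above could be collected into a single search-space spec that a hypothetical `imbalanced=True` code path might consume. This is only a sketch: the name `XGBOOST_IMBALANCED_SPACE`, the dict format, and the ranges for `colsample_bytree`, `min_child_weight`, and `gamma` (which the list does not give) are assumptions, not mljar's actual configuration:

```python
# Hypothetical search-space spec for xgboost when imbalanced=True,
# mirroring the trial.suggest_* calls listed above. Ranges without a
# source in the list (colsample_bytree, min_child_weight, gamma) are guesses.
XGBOOST_IMBALANCED_SPACE = {
    "subsample":        {"type": "float", "low": 0.5,  "high": 1.0},
    "colsample_bytree": {"type": "float", "low": 0.5,  "high": 1.0},
    "max_depth":        {"type": "int",   "low": 1,    "high": 10, "step": 1},
    "min_child_weight": {"type": "int",   "low": 1,    "high": 100},
    "eta":              {"type": "float", "low": 1e-8, "high": 0.2, "log": True},
    "gamma":            {"type": "float", "low": 1e-8, "high": 10.0, "log": True},
}

def suggest_params(trial, space=XGBOOST_IMBALANCED_SPACE):
    """Translate the spec into Optuna trial.suggest_int / suggest_float calls."""
    params = {}
    for name, s in space.items():
        if s["type"] == "int":
            params[name] = trial.suggest_int(name, s["low"], s["high"], step=s.get("step", 1))
        else:
            params[name] = trial.suggest_float(name, s["low"], s["high"], log=s.get("log", False))
    return params
```

Keeping the spec as plain data means the same ranges could be reused across mljar's per-learner tuning code without duplicating suggest calls.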
> …dealing with an imbalanced classification problem and would like to leverage mljar, but see no way to do so given the problem set. At present results as you might ex…

Did you succeed?
There are several evaluation metrics that would be particularly beneficial for (binary) imbalanced classification problems and would be greatly appreciated additions. In terms of prioritizing implementation (and likely ease of implementation), I will rank-order these:
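As context for why plain accuracy misleads on imbalanced data, a small sketch computing a few imbalance-aware metrics by hand in pure Python (the labels are illustrative; in practice sklearn's `f1_score` and `balanced_accuracy_score` could be passed through the custom eval metric hook mentioned earlier):

```python
def confusion(y_true, y_pred):
    """Return (tp, fp, fn, tn) counts for binary labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn

# 90/10 imbalance; a degenerate classifier that always predicts the majority class.
y_true = [0] * 90 + [1] * 10
y_pred = [0] * 100

tp, fp, fn, tn = confusion(y_true, y_pred)
accuracy = (tp + tn) / len(y_true)                # 0.9 -- looks great
recall = tp / (tp + fn)                           # 0.0 -- misses every positive
specificity = tn / (tn + fp)                      # 1.0
balanced_accuracy = (recall + specificity) / 2    # 0.5 -- no better than chance
f1 = 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 0.0  # 0.0

print(accuracy, balanced_accuracy, f1)  # 0.9 0.5 0.0
```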