mljar / mljar-supervised

Python package for AutoML on Tabular Data with Feature Engineering, Hyper-Parameters Tuning, Explanations and Automatic Documentation
https://mljar.com
MIT License

Support for r2 metric in Optuna mode #340

Closed Possums closed 3 years ago

Possums commented 3 years ago

Currently, the r2 metric is not supported in tuner/optuna/tuner.py:

if eval_metric.name not in ["auc", "logloss", "rmse", "mae", "mape"]:
    raise AutoMLException(f"Metric {eval_metric.name} is not supported")

When I manually add 'r2' to the list, I encounter the following error.

Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/supervised/base_automl.py", line 1054, in _fit
    trained = self.train_model(params)
  File "/usr/local/lib/python3.8/dist-packages/supervised/base_automl.py", line 356, in train_model
    mf.train(results_path, model_subpath)
  File "/usr/local/lib/python3.8/dist-packages/supervised/model_framework.py", line 185, in train
    self.learner_params = optuna_tuner.optimize(
  File "/usr/local/lib/python3.8/dist-packages/supervised/tuner/optuna/tuner.py", line 106, in optimize
    objective = LightgbmObjective(
  File "/usr/local/lib/python3.8/dist-packages/supervised/tuner/optuna/lightgbm.py", line 61, in __init__
    self.eval_metric_name = metric_name_mapping[ml_task][self.eval_metric.name]
KeyError: 'r2'

Is this a known limitation, and if so, is there a way to work around it?

pplonski commented 3 years ago

Currently only the ["auc", "logloss", "rmse", "mae", "mape"] metrics are supported in Optuna mode. R2 is not supported because there is no native support for the R2 metric in Xgboost, CatBoost, or LightGBM.

This can be fixed by adding a custom eval_metric.
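For illustration, a custom R2 eval metric for LightGBM might look like the sketch below (this is not the exact code added to mljar-supervised; the function name and training setup are illustrative):

import lightgbm as lgb
import numpy as np
from sklearn.metrics import r2_score

def r2_eval(preds, train_data):
    # LightGBM custom metric signature: (preds, Dataset) -> (name, value, is_higher_better)
    return "r2", r2_score(train_data.get_label(), preds), True

# synthetic data just to exercise the metric
rng = np.random.default_rng(123)
X = rng.normal(size=(200, 5))
y = X[:, 0] + 0.1 * rng.normal(size=200)

dtrain = lgb.Dataset(X, label=y)
booster = lgb.train(
    {"objective": "regression", "metric": "None", "verbosity": -1},
    dtrain,
    num_boost_round=20,
    valid_sets=[dtrain],
    feval=r2_eval,
)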

@Possums do you have other metrics in mind worth adding?

Possums commented 3 years ago

Thanks for the quick response!

Other metrics I would find useful are Spearman correlation and RMSLE for regression, and perhaps AUCPR and F1 for classification.

pplonski commented 3 years ago

@Possums sounds good, let me check the details and see which metrics can be added soon

pplonski commented 3 years ago

I've added support for the r2 metric in Optuna mode. The code is pushed to the dev branch.

To try it, please install mljar-supervised from the dev branch:

pip install -U git+https://github.com/mljar/mljar-supervised.git@dev

I will add the rest of the metrics as well (spearman, rmsle, aucpr, f1). For the f1 metric there can be two possible implementations.

Possums commented 3 years ago

Thanks for the quick update! I'm currently training a model and haven't run into any errors so far.

pplonski commented 3 years ago

I've added the remaining metrics: spearman, aucpr, and f1.

I didn't implement RMSLE - the target preprocessing needs to be updated first to ensure positive values only. I will add a ticket for this metric (https://github.com/mljar/mljar-supervised/issues/346).
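For reference, here is a quick sketch of why RMSLE requires non-negative values (illustrative only, not mljar code):

import numpy as np

def rmsle(y_true, y_pred):
    # np.log1p(x) is undefined for x <= -1, hence the target preprocessing requirement
    return np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2))

print(rmsle(np.array([1.0, 2.0, 3.0]), np.array([1.1, 1.9, 3.2])))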

All changes are in the dev branch. To install the package directly from the dev branch:

pip install -U git+https://github.com/mljar/mljar-supervised.git@dev
Possums commented 3 years ago

My model just finished training but encountered the following errors.

## Error for 2_Optuna_Xgboost

list index out of range
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/supervised/base_automl.py", line 1056, in _fit
    params["final_loss"] = self._models[-1].get_final_loss()
IndexError: list index out of range

Please set a GitHub issue with above error message at: https://github.com/mljar/mljar-supervised/issues/new

## Error for 3_Optuna_CatBoost

list index out of range
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/supervised/base_automl.py", line 1056, in _fit
    params["final_loss"] = self._models[-1].get_final_loss()
IndexError: list index out of range

Please set a GitHub issue with above error message at: https://github.com/mljar/mljar-supervised/issues/new

## Error for 4_Optuna_RandomForest

list index out of range
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/supervised/base_automl.py", line 1056, in _fit
    params["final_loss"] = self._models[-1].get_final_loss()
IndexError: list index out of range

Please set a GitHub issue with above error message at: https://github.com/mljar/mljar-supervised/issues/new

## Error for 5_Optuna_ExtraTrees

list index out of range
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/supervised/base_automl.py", line 1056, in _fit
    params["final_loss"] = self._models[-1].get_final_loss()
IndexError: list index out of range

Please set a GitHub issue with above error message at: https://github.com/mljar/mljar-supervised/issues/new

Thank you for your help!

pplonski commented 3 years ago

Can you reproduce the error? Could you post a code example that reproduces it?

The output looks like no model was trained at all.

Possums commented 3 years ago
import pandas as pd
from supervised.automl import AutoML

automl = AutoML(mode='Optuna',
                optuna_time_budget=259200,
                ml_task='regression',
                eval_metric='r2')

X_train = pd.read_csv('data.csv')
y_train = X_train.pop('target_column')

automl.fit(X_train, y_train)

Here's the code I'm currently using. A model was definitely being trained, as the process ran at max CPU load for the entire 3-day period. Below is the final output from training.

[I 2021-03-23 16:36:45,359] Trial 2994 finished with value: 0.0051917264869625335 and parameters: {'learning_rate': 0.025, 'num_leaves': …62, 'lambda_l1': 6.989708473974178, 'lambda_l2': 0.05916462162625953, 'feature_fraction': 0.8764481602033438, 'bagging_fraction': 0.9570273673237645, 'bagging_freq': 3, 'min_data_in_leaf': 52, 'cat_l2': 66.81630949750212, 'cat_smooth': 57.28175791910296}. Best is trial 1889 with value: 0.00591867047625072.
1_Optuna_LightGBM not trained. Stop training after the first fold. Time needed to train on the first fold 153.0 seconds. The time estimate for training on all folds is larger than total_time_limit.
There was an error during 2_Optuna_Xgboost training.                                                                                      
Please check AutoML_1/errors.md for details.                         
There was an error during 3_Optuna_CatBoost training.                                                                                     
Please check AutoML_1/errors.md for details.                         
There was an error during 4_Optuna_RandomForest training.                                                                                 
Please check AutoML_1/errors.md for details.                         
There was an error during 5_Optuna_ExtraTrees training.                                                                                   
Please check AutoML_1/errors.md for details.                         
Skip golden_features because no parameters were generated.                                                                                
Skip insert_random_feature because no parameters were generated.                                                                          
Skip features_selection because no parameters were generated.                                                                             
Skip boost_on_errors because no parameters were generated.                                                                                
2021-03-23 16:39:26,539 supervised.exceptions ERROR No models produced.                                                                   
Please check your data or submit a Github issue at https://github.com/mljar/mljar-supervised/issues/new.                                  
Traceback (most recent call last):                                   
  File "<stdin>", line 1, in <module>                                
  File "<string>", line 12, in <module>                              
  File "/usr/local/lib/python3.8/dist-packages/supervised/automl.py", line 323, in fit                                                    
    return self._fit(X, y, sample_weight)                            
  File "/usr/local/lib/python3.8/dist-packages/supervised/base_automl.py", line 1091, in _fit                                             
    raise e                       
  File "/usr/local/lib/python3.8/dist-packages/supervised/base_automl.py", line 1010, in _fit                                             
    raise AutoMLException(                                           
supervised.exceptions.AutoMLException: No models produced.                                                                                
Please check your data or submit a Github issue at https://github.com/mljar/mljar-supervised/issues/new.
pplonski commented 3 years ago

@Possums I'm afraid that you were running hyperparameter optimization with Optuna for 3 days. After 3 days you had only tuned hyperparameters and no model was trained. The good news is that you should have an optuna/optuna.json file with the tuned parameters, and you will not need to tune them again - the training will be faster.

To reuse the optuna/optuna.json parameters, you need to pass them to AutoML:

import json

init_params = json.load(open("dir_with_params/optuna/optuna.json", "r"))

automl = AutoML(
    #
    # the config params ...,
    #
    optuna_init_params=init_params
)

What can we do?

  1. Let's try to debug where the problem is. Maybe try to run AutoML in the Compete mode with a small total_time_limit=120?
  2. Are you running AutoML with the newest code from the dev branch?
  3. Please try different combinations of AutoML in Optuna mode but with optuna_time_budget=10, so it will fail quickly. For example, run it with algorithms=["Xgboost"]; see the sketch after this list.
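For example, a quick-fail run could look like this (a sketch reusing the data-loading code from the earlier comment):

import pandas as pd
from supervised.automl import AutoML

X_train = pd.read_csv('data.csv')
y_train = X_train.pop('target_column')

automl = AutoML(mode='Optuna',
                ml_task='regression',
                eval_metric='r2',
                optuna_time_budget=10,   # fail fast
                algorithms=['Xgboost'])  # test one algorithm at a time
automl.fit(X_train, y_train)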
Possums commented 3 years ago

Thank you again for the support.

Here are the contents of optuna.json:

{
    "original_LightGBM": {
        "learning_rate": 0.025,
        "num_leaves": 180,
        "lambda_l1": 3.818029469419764,
        "lambda_l2": 4.779917528983006e-05,
        "feature_fraction": 0.7777459702498414,
        "bagging_fraction": 0.9485177139222203,
        "bagging_freq": 4,
        "min_data_in_leaf": 74,
        "cat_l2": 57.416326938339566,
        "cat_smooth": 41.70937114259941,
        "metric": "custom",
        "custom_eval_metric_name": "r2",
        "num_boost_round": 1000,
        "early_stopping_rounds": 50,
        "cat_feature": [
            1
        ],
        "feature_pre_filter": false,
        "seed": 123
    }
}

Your first suggestion (Compete mode, total_time_limit=120) works fine.

* Step not_so_random will try to check up to 54 models
11_LightGBM r2 0.000189 trained in 55.45 seconds
Skip golden_features because no parameters were generated.
Skip insert_random_feature because no parameters were generated.
Skip features_selection because no parameters were generated.
Skip hill_climbing_1 because of the time limit.
Skip hill_climbing_2 because of the time limit.
* Step ensemble will try to check up to 1 model
Ensemble r2 0.000658 trained in 0.04 seconds
AutoML fit time: 131.9 seconds
AutoML best model: 1_DecisionTree

I am running the newest code from the dev branch.

Here's the output for the 3rd suggestion (optuna_time_budget=10, algorithms=['Xgboost'])

There was an error during 1_Optuna_Xgboost training.
Please check AutoML_3/errors.md for details.
Skip golden_features because no parameters were generated.
Skip insert_random_feature because no parameters were generated.
Skip features_selection because no parameters were generated.
Skip boost_on_errors because no parameters were generated.
2021-03-24 14:22:38,485 supervised.exceptions ERROR No models produced. 
Please check your data or submit a Github issue at https://github.com/mljar/mljar-supervised/issues/new.
Traceback (most recent call last):
  File "train.py", line 14, in <module>
    automl.fit(X_train, y_train)
  File "/usr/local/lib/python3.8/dist-packages/supervised/automl.py", line 323, in fit
    return self._fit(X, y, sample_weight)
  File "/usr/local/lib/python3.8/dist-packages/supervised/base_automl.py", line 1092, in _fit
    raise e
  File "/usr/local/lib/python3.8/dist-packages/supervised/base_automl.py", line 1011, in _fit
    raise AutoMLException(
supervised.exceptions.AutoMLException: No models produced. 
Please check your data or submit a Github issue at https://github.com/mljar/mljar-supervised/issues/new.

And here's errors.md

name 'xgboost_objective' is not defined
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/supervised/base_automl.py", line 1055, in _fit
    trained = self.train_model(params)
  File "/usr/local/lib/python3.8/dist-packages/supervised/base_automl.py", line 356, in train_model
    mf.train(results_path, model_subpath)
  File "/usr/local/lib/python3.8/dist-packages/supervised/model_framework.py", line 188, in train
    self.learner_params = optuna_tuner.optimize(
  File "/usr/local/lib/python3.8/dist-packages/supervised/tuner/optuna/tuner.py", line 133, in optimize
    objective = XgboostObjective(
  File "/usr/local/lib/python3.8/dist-packages/supervised/tuner/optuna/xgboost.py", line 54, in __init__
    self.objective = xgboost_objective(ml_task, eval_metric.name)
NameError: name 'xgboost_objective' is not defined

When running in Optuna mode, is total_time_limit also necessary? I'm thinking this could be the issue, since I didn't set that value. Could the library be spending all its time on parameter tuning and not actually training models?

pplonski commented 3 years ago

@Possums you are a genius! Of course, the problem is with total_time_limit! I fixed this today in #347 - the Optuna optimization time is no longer counted as the model training time. Please try to run it once again with the newest code from the dev branch.

The example code:

import json

init_params = json.load(open("dir_with_params/optuna/optuna.json", "r"))

automl = AutoML(
    #
    # the config params ...,
    #
    total_time_limit=4*3600,
    optuna_time_budget=1800,
    optuna_init_params=init_params
)

Please run it with a small optuna_time_budget first, maybe 60 seconds, and total_time_limit=1800.

Possums commented 3 years ago

You're the best! Really appreciate the constant improvements that you're making to mljar. I will try the latest version with a shorter time limit and report back.

Possums commented 3 years ago
AutoML directory: AutoML_2
The task is regression with evaluation metric r2
AutoML will use algorithms: ['Random Forest', 'Extra Trees', 'LightGBM', 'Xgboost', 'CatBoost']
AutoML will stack models
AutoML will ensemble available models
AutoML steps: ['simple_algorithms', 'default_algorithms', 'golden_features', 'insert_random_feature', 'features_selection', 'boost_on_errors', 'ensemble', 'stack', 'ensemble_stacked']
Skip simple_algorithms because no parameters were generated.
* Step default_algorithms will try to check up to 5 models
1_Optuna_LightGBM not trained. Stop training after the first fold. Time needed to train on the first fold 220.0 seconds. The time estimate for training on all folds is larger than total_time_limit.
Optuna optimizes Xgboost with time budget 60 seconds eval_metric r2 (maximize)
[I 2021-03-24 15:18:40,405] A new study created in memory with name: no-name-17a54d29-78cb-452b-987c-1decaa7a780e
[I 2021-03-24 15:19:37,287] Trial 0 finished with value: 0.003899558467413411 and parameters: {'eta': 0.1, 'max_depth': 5, 'lambda': 0.003971722885615567, 'alpha': 8.70060897787558e-05, 'colsample_bytree': 0.8497510164532243, 'subsample': 0.8459830734829203, 'min_child_weight': 24}. Best is trial 0 with value: 0.003899558467413411.
[I 2021-03-24 15:20:17,429] Trial 1 finished with value: 0.0015173661332023025 and parameters: {'eta': 0.025, 'max_depth': 12, 'lambda': 6.118836254358573e-07, 'alpha': 0.21700333406861005, 'colsample_bytree': 0.41117074881725907, 'subsample': 0.3812964804943624, 'min_child_weight': 93}. Best is trial 0 with value: 0.003899558467413411.
2_Optuna_Xgboost r2 0.003116 trained in 1013.01 seconds
Skip golden_features because no parameters were generated.
Skip insert_random_feature because no parameters were generated.
Skip features_selection because no parameters were generated.
Skip boost_on_errors because no parameters were generated.
* Step ensemble will try to check up to 1 model
Skip stack because no parameters were generated.
Skip ensemble_stacked because no parameters were generated.
AutoML fit time: 1388.33 seconds
AutoML best model: 2_Optuna_Xgboost

Using total_time_limit, optuna_time_budget, and optuna_init_params together generated a working model! Now I'll experiment with a longer time limit - fingers crossed it works.

Possums commented 3 years ago
## Error for 3_Optuna_CatBoost

Bad value for num_feature[non_default_doc_idx=0,feature_idx=0]="COLUMN_ONE": Cannot convert 'b'COLUMN_ONE'' to float
Traceback (most recent call last):
  File "_catboost.pyx", line 1980, in _catboost.get_float_feature
  File "_catboost.pyx", line 1085, in _catboost._FloatOrNan
  File "_catboost.pyx", line 917, in _catboost._FloatOrNanFromString
TypeError: Cannot convert 'b'COLUMN_ONE'' to float

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/supervised/base_automl.py", line 1055, in _fit
    trained = self.train_model(params)
  File "/usr/local/lib/python3.8/dist-packages/supervised/base_automl.py", line 356, in train_model
    mf.train(results_path, model_subpath)
  File "/usr/local/lib/python3.8/dist-packages/supervised/model_framework.py", line 188, in train
    self.learner_params = optuna_tuner.optimize(
  File "/usr/local/lib/python3.8/dist-packages/supervised/tuner/optuna/tuner.py", line 146, in optimize
    objective = CatBoostObjective(
  File "/usr/local/lib/python3.8/dist-packages/supervised/tuner/optuna/catboost.py", line 42, in __init__
    self.eval_set = Pool(
  File "/usr/local/lib/python3.8/dist-packages/catboost/core.py", line 455, in __init__
    self._init(data, label, cat_features, text_features, embedding_features, pairs, weight, group_id, group_weight, subgroup_id, pairs_weight, baseline, feature_names, thread_count)
  File "/usr/local/lib/python3.8/dist-packages/catboost/core.py", line 966, in _init
    self._init_pool(data, label, cat_features, text_features, embedding_features, pairs, weight, group_id, group_weight, subgroup_id, pairs_weight, baseline, feature_names, thread_count)
  File "_catboost.pyx", line 3550, in _catboost._PoolBase._init_pool
  File "_catboost.pyx", line 3597, in _catboost._PoolBase._init_pool
  File "_catboost.pyx", line 3438, in _catboost._PoolBase._init_features_order_layout_pool
  File "_catboost.pyx", line 2477, in _catboost._set_features_order_data_pd_data_frame
  File "_catboost.pyx", line 2021, in _catboost.create_num_factor_data
  File "_catboost.pyx", line 1982, in _catboost.get_float_feature
_catboost.CatBoostError: Bad value for num_feature[non_default_doc_idx=0,feature_idx=0]="COLUMN_ONE": Cannot convert 'b'COLUMN_ONE'' to float

Just ran into this issue during the longer model test - any ideas what could be causing it? The other models ran fine with Optuna; only CatBoost seems to be doing this.

pplonski commented 3 years ago

@Possums that might be a problem with the categorical column "COLUMN_ONE" in the data. CatBoost uses categorical features directly, without encoding them into numbers. I've checked the CatBoost repo and found something similar: https://github.com/catboost/catboost/issues/934

Could you send me the data from this column? Maybe you could replace the original values with anonymized ones, or simulate similar data, so I can try to reproduce the problem?

Possums commented 3 years ago

Actually, I just realized I didn't anonymize the output correctly. COLUMN_ONE is actually a data point, period_1, in the column time_period, which, like you said, is a categorical column with strings ranging from period_1 to period_3000. As such, the conversion error appears to be with the specific values in the column rather than the column name.

Example data would be "period_1", "period_2", "period_3". This column isn't actually that important to me, so I'll probably end up dropping it (see the sketch below), but it would be good to have a way to specify categorical variables.
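For reference, dropping the column before fitting would be a one-liner (a sketch using the column name from this comment):

X_train = X_train.drop(columns=['time_period'])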

pplonski commented 3 years ago

@Possums MLJAR AutoML automatically detects categorical features. For algorithms other than CatBoost, categorical features are converted into numbers. In your results_path directory there should be a data_info.json file - you can check the details about your data there.
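For illustration, a quick way to inspect that file (the path assumes the AutoML_1 results directory from earlier in this thread; the exact schema may vary by version):

import json

with open('AutoML_1/data_info.json') as f:
    print(json.dumps(json.load(f), indent=2))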

I would rather bet that there is some problem with CatBoost itself ...

Possums commented 3 years ago

Got it - I checked the data_info.json file, and indeed the column is marked as categorical.

pplonski commented 3 years ago

You can try to optimize only CatBoost after upgrading CatBoost to the latest version, 0.25; MLJAR is using 0.24.4 (I haven't updated it yet) - maybe this will help ...

The other option is to create a minimal reproducible example and try to catch the bug.
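For example, a minimal reproduction attempt might look like this sketch (it assumes the setup from earlier in this thread, after pip install -U catboost==0.25):

import pandas as pd
from supervised.automl import AutoML

X_train = pd.read_csv('data.csv')
y_train = X_train.pop('target_column')

automl = AutoML(mode='Optuna',
                ml_task='regression',
                eval_metric='r2',
                total_time_limit=1800,
                optuna_time_budget=60,
                algorithms=['CatBoost'])  # optimize only CatBoost
automl.fit(X_train, y_train)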

pplonski commented 3 years ago

@Possums I'm closing this issue. Thank you for all the help and feedback. If you run into more problems, please open a new issue.