microsoft / FLAML

A fast library for AutoML and tuning. Join our Discord: https://discord.gg/Cppx2vSPVP.
https://microsoft.github.io/FLAML/
MIT License
3.87k stars 504 forks

Overfitting when using AutoML #1140

Open leelew opened 1 year ago

leelew commented 1 year ago

Hi,

We used FLAML to perform a regression task and found that the AutoML model overfits easily. In the same task, other ML models (e.g., LightGBM, RF) could avoid overfitting by grid-searching for the best parameters. We tried adding 'cv=5' to the AutoML model, but it did not work in our case.

So could you give me some suggestions on how to avoid overfitting when using FLAML AutoML models?

BTW: We also used flaml.default.LGBMRegressor() to auto-search the hyperparameters of a LightGBM model, but that model still overfits, while a LightGBM model tuned by grid search does not. So I think maybe I am misusing FLAML.

Lu Li


The code of the FLAML AutoML model:

```python
from flaml import AutoML

am = AutoML()
am.fit(x_train, y_train, task="regression")
```

The performance on training data: [scatter plot attachment]

The performance on test data: [scatter plot attachment]

sonichi commented 1 year ago

By default, "r2" is used as the optimization metric for regression tasks. Looking at your plots, the model doesn't overfit the r2 or KGE metric. The model overfits RMSE. If you'd like to use RMSE as the optimize metric, please set metric="rmse".

leelew commented 1 year ago

Hi Chi,

Thanks for your reply.

I think our model overfits not only RMSE but also R2 and KGE (i.e., the performance on training data is much better than on test data). We will try setting metric="rmse" and split_ratio=0.2.

The code is shown as:

```python
automl.fit(
    x_train, y_train,
    task="regression",
    metric="rmse",
    split_ratio=0.2,
    ensemble={"final_estimator": MLPRegressor(), "passthrough": True},
    time_budget=3600,
)
```

We will contact you again if this does not work. Thanks again for your help!

Best, Lu Li

leelew commented 1 year ago

Hi Chi,

We set metric="rmse" and used the holdout strategy (split_ratio=0.2), but we still see overfitting. Although AutoML performs better than the other ML models on the test data, its training performance is still much better than its test performance.

Is there any further suggestion to avoid overfitting when using AutoML?

Best, Lu


The code is:

```python
automl.fit(
    x_train, y_train,
    task="regression",
    metric="rmse",
    split_ratio=0.2,
    ensemble={"final_estimator": LGBMRegressor(), "passthrough": True},
    time_budget=3600,
)
```

The train performance is: [scatter plot attachment]

The test performance is: [scatter plot attachment]
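One FLAML-level option for the train/test gap described above is a custom metric: `AutoML.fit` accepts a callable for `metric`, and the callable receives both the validation and training data, so it can penalize configurations whose validation loss is much worse than their training loss. A hedged sketch (the penalty weight 0.5 and the function name are illustrative choices, not from FLAML or the thread; the standalone check uses sklearn names that are likewise illustrative):

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

def rmse_with_gap_penalty(X_val, y_val, estimator, labels,
                          X_train, y_train,
                          weight_val=None, weight_train=None,
                          *args, **kwargs):
    """Validation RMSE plus a penalty on the train/validation gap, steering
    the search away from configurations that memorize the training set."""
    val_rmse = np.sqrt(mean_squared_error(y_val, estimator.predict(X_val)))
    train_rmse = np.sqrt(mean_squared_error(y_train, estimator.predict(X_train)))
    gap = max(val_rmse - train_rmse, 0.0)
    # FLAML minimizes the first return value; the dict is logged alongside it
    return val_rmse + 0.5 * gap, {"val_rmse": val_rmse, "train_rmse": train_rmse}

# quick standalone check of the metric with a plain sklearn estimator
X, y = load_diabetes(return_X_y=True)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, random_state=0)
est = RandomForestRegressor(random_state=0).fit(X_tr, y_tr)
loss, logged = rmse_with_gap_penalty(X_va, y_va, est, None, X_tr, y_tr)
```

The function would then be passed as `metric=rmse_with_gap_penalty` in `automl.fit(...)` in place of `metric="rmse"`.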