microsoft / FLAML

A fast library for AutoML and tuning. Join our Discord: https://discord.gg/Cppx2vSPVP.
https://microsoft.github.io/FLAML/
MIT License

FLAML/LightGBM - Shouldn't I get better/faster or equal results from FLAML than from direct LightGBM? #785

Open wil70 opened 1 year ago

wil70 commented 1 year ago

Hello

I have the following training code with LightGBM:

import lightgbm as lgb

model2 = lgb.LGBMRegressor(learning_rate=0.09, max_depth=-5, random_state=42, n_estimators=20000)  # use_missing=False
model2.fit(x_train, y_train, eval_set=[(x_test, y_test), (x_train, y_train)], verbose=20, eval_metric='logloss')  # init_model=filename2_

When I look at the results it returns this:

print('Training accuracy {:.4f}'.format(model2.score(x_train,y_train)))
print('Testing accuracy {:.4f}'.format(model2.score(x_test,y_test)))

lgb.plot_importance(model2)
lgb.plot_metric(model2)
Training accuracy 1.0000
Testing accuracy 0.7628

I'm trying with FLAML (please keep in mind that I'm new to Python and these libraries). I would expect the results to improve more quickly?

from flaml import AutoML
automl = AutoML()
automl.fit(X_train=x_train, y_train=y_train, time_budget=60*60*7, estimator_list=['lgbm'], task='regression')
[1.76959344 0.11876356 1.6142814  ... 1.79535421 0.55507853 1.04489782]
LGBMRegressor(colsample_bytree=0.7114672034208275,
              learning_rate=0.013851620085123617, max_bin=1023,
              min_child_samples=9, n_estimators=1877, num_leaves=262,
              reg_alpha=0.009443894441159862, reg_lambda=1.8437202926962308,
              verbose=-1)
Training accuracy 0.9081
Testing accuracy 0.0693

TY!

sonichi commented 1 year ago

In your first code snippet, x_test and y_test are leaked to the fit() function. In the second code snippet, they are not, so the comparison is not fair. Try removing the leak?
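
A leak-free version of the first snippet (a minimal sketch, same variables as above) would look like:

import lightgbm as lgb

# Train on the training split only; x_test/y_test are never seen by fit().
model2 = lgb.LGBMRegressor(learning_rate=0.09, max_depth=-5,
                           random_state=42, n_estimators=20000)
model2.fit(x_train, y_train)

# Evaluate on the held-out split only after training.
print('Training score {:.4f}'.format(model2.score(x_train, y_train)))
print('Testing score {:.4f}'.format(model2.score(x_test, y_test)))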

wil70 commented 1 year ago

oh, I didn't realize this. TY!

LGBMClassifier

model = lgb.LGBMClassifier(learning_rate=0.09, max_depth=-5, random_state=42, n_estimators=1000)  # use_missing=False
model.fit(x_train, y_train, eval_set=[(x_test, y_test)], verbose=20, eval_metric='logloss')  # init_model=filename_

print('Training accuracy {:.4f}'.format(model.score(x_train,y_train)))
print('Testing accuracy {:.4f}'.format(model.score(x_test,y_test)))
## it took 18 minutes
Training accuracy 0.9970
Testing accuracy 0.5203

AutoML

automl = AutoML()
automl.fit(X_train=x_train, y_train=y_train, time_budget=60*7*10, estimator_list=['lgbm'])

print(automl.model.estimator)
print('Training accuracy {:.4f}'.format(automl.score(x_train,y_train)))
print('Testing accuracy {:.4f}'.format(automl.score(x_test,y_test)))
LGBMClassifier(colsample_bytree=0.8171067273589889,
               learning_rate=0.05450135785484785, max_bin=255,
               min_child_samples=2, n_estimators=46, num_leaves=82,
               reg_alpha=0.011572343074847936, reg_lambda=0.1101875342844144,
               verbose=-1)
## it took 80 minutes
Training accuracy 0.6013
Testing accuracy 0.4260
sonichi commented 1 year ago

> [quoting the previous comment: the LGBMClassifier (18 min, testing accuracy 0.5203) vs AutoML (80 min, testing accuracy 0.4260) comparison above]

(x_test, y_test) is still given to LGBMClassifier via eval_set.
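
Note that eval_set changes the trained model only when early stopping is enabled; otherwise LightGBM just records metrics on it. A sketch of the variant where it would genuinely feed back (x_valid/y_valid here are a hypothetical separate validation split, since stopping on x_test would reintroduce the leak):

import lightgbm as lgb

model = lgb.LGBMClassifier(learning_rate=0.09, max_depth=-5,
                           random_state=42, n_estimators=1000)
# With an early-stopping callback, the eval_set feeds back into training:
# boosting halts once the validation logloss stops improving.
model.fit(x_train, y_train,
          eval_set=[(x_valid, y_valid)],
          eval_metric='logloss',
          callbacks=[lgb.early_stopping(stopping_rounds=50)])
print('Best iteration:', model.best_iteration_)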

wil70 commented 1 year ago

@sonichi I also tried this way

model = lgb.LGBMClassifier(learning_rate=0.09, max_depth=-5, random_state=42, n_estimators=1000)
model.fit(x_train, y_train, verbose=20, eval_metric='logloss')

but the result is the same, since the test set is not used for training per se in fit(), only for metrics? The testing error doesn't seem to be fed back into the training algorithm, only the training error is. The test set is just for metrics, I'm guessing?

so I basically have the same results:

Training accuracy 0.9970
Testing accuracy 0.5203

The only thing I can think of that could explain why direct LightGBM is so much faster than AutoML and why its results were better is that the settings I happened to pick for LGBMClassifier (learning_rate=0.09, max_depth=-5, eval_metric='logloss') are a better starting point than what AutoML starts from (even after a day of training)?

sonichi commented 1 year ago

The result is surprising. How large is the dataset? How many iterations did AutoML finish? Is the task multi-class or binary classification?

wil70 commented 1 year ago

Hi @sonichi, the file is small (7GB). The labels are multi-class (not binary).

I started AutoML with 1000 iterations: automl.fit(X_train=x_train, y_train=y_train, time_budget=-1, max_iter=1000, metric='log_loss', estimator_list=['lgbm'], task='classification'). I will post as soon as this is done, let's see.

TY

sonichi commented 1 year ago

> [quoting the previous comment: 7GB multi-class dataset, rerunning AutoML with max_iter=1000]

  • LGBMClassifier, 18 min, 1000 iterations
  • AutoML, 80 min, not sure how many iterations

The max_iter in AutoML.fit() means the number of trials, not n_estimators. Training one LGBMClassifier model takes 18 min to finish, so 1000 trials could take roughly 18000 mins (not exactly, as different configurations cost different amounts of time to train).
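
Rough budget arithmetic for that run (a sketch, assuming every trial costs about as much as the one observed 18-minute fit):

per_trial_minutes = 18   # observed cost of one LGBMClassifier fit
trials = 1000            # the max_iter requested above
total = per_trial_minutes * trials
print(total, 'minutes =', total / 60, 'hours =', total / 60 / 24, 'days')
# 18000 minutes = 300.0 hours = 12.5 days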

  1. Since your evaluation metric is accuracy, set metric="accuracy" in AutoML.fit().
  2. Since you already know one good starting point, set starting_points={"lgbm": {"learning_rate": 0.09, "n_estimators": 1000, "num_leaves": 30000}}. Here I'm assuming max_depth=-5 corresponds to num_leaves=30000. Please verify that before you try this way (see the sketch after this list).
  3. Set max_iter between 10-100 for a reasonable turnaround time. Of course, if you can afford longer time you can set it larger.
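
A hypothetical way to verify that num_leaves assumption: LightGBM treats max_depth <= 0 as "no depth limit", so you can inspect how many leaves the depth-unlimited model actually grew (assumes model is the fitted LGBMClassifier from above):

# Dump the fitted booster's trees and count leaves per tree.
tree_df = model.booster_.trees_to_dataframe()
leaf_rows = tree_df[tree_df['split_feature'].isna()]   # leaf rows have no split
leaves_per_tree = leaf_rows.groupby('tree_index').size()
print(leaves_per_tree.describe())  # realized leaf counts to base num_leaves on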
wil70 commented 1 year ago

Thanks @sonichi

So I put this

automl.fit(X_train=x_train, y_train=y_train, time_budget=-1, max_iter=50, metric="accuracy",
    estimator_list=['lgbm'], task='classification',
    starting_points={"lgbm": {"learning_rate": 0.09, "n_estimators": 1000, "num_leaves": 30000}})

Hopefully I didn't make any mistakes, but it keeps crashing. Here is the output:

[flaml.automl: 11-09 16:33:36] {2600} INFO - task = classification
[flaml.automl: 11-09 16:33:36] {2602} INFO - Data split method: stratified
[flaml.automl: 11-09 16:33:36] {2605} INFO - Evaluation method: holdout
[flaml.automl: 11-09 16:33:47] {2727} INFO - Minimizing error metric: 1-accuracy
[flaml.automl: 11-09 16:33:47] {2869} INFO - List of ML learners in AutoML Run: ['lgbm']
[flaml.automl: 11-09 16:33:47] {3174} INFO - iteration 0, current learner lgbm
Canceled future for execute_request message before replies were done
The Kernel crashed while executing code in the the current cell or a previous cell. Please review the code in the cell(s) to identify a possible cause of the failure. Click [here](https://aka.ms/vscodeJupyterKernelCrash) for more info. View Jupyter [log](command:jupyter.viewOutput) for further details.

Python output log

error 16:35:23.631: Disposing session as kernel process died ExitCode: 3221225477, Reason: c:\Program Files (x86)\Microsoft Visual Studio\Shared\Python39_64\lib\site-packages\traitlets\traitlets.py:2412: FutureWarning: Supporting extra quotes around strings is deprecated in traitlets 5.0. You can use 'hmac-sha256' instead of '"hmac-sha256"' if you require traitlets >=5.
  warn(
c:\Program Files (x86)\Microsoft Visual Studio\Shared\Python39_64\lib\site-packages\traitlets\traitlets.py:2366: FutureWarning: Supporting extra quotes around Bytes is deprecated in traitlets 5.0. Use '97d0e048-e0f1-41ad-bc2b-b3ae577572b4' instead of 'b"97d0e048-e0f1-41ad-bc2b-b3ae577572b4"'.
  warn(

info 16:35:23.634: Dispose Kernel process 5660.
error 16:35:23.635: Raw kernel process exited code: 3221225477
error 16:35:23.648: Error in waiting for cell to complete [Error: Canceled future for execute_request message before replies were done
    at t.KernelShellFutureHandler.dispose (c:\Users\Wilhelm\.vscode\extensions\ms-toolsai.jupyter-2022.9.1202862440\out\extension.node.js:2:32353)
    at c:\Users\Wilhelm\.vscode\extensions\ms-toolsai.jupyter-2022.9.1202862440\out\extension.node.js:2:51405
    at Map.forEach (<anonymous>)
    at y._clearKernelState (c:\Users\Wilhelm\.vscode\extensions\ms-toolsai.jupyter-2022.9.1202862440\out\extension.node.js:2:51390)
    at y.dispose (c:\Users\Wilhelm\.vscode\extensions\ms-toolsai.jupyter-2022.9.1202862440\out\extension.node.js:2:44872)
    at c:\Users\Wilhelm\.vscode\extensions\ms-toolsai.jupyter-2022.9.1202862440\out\extension.node.js:2:2218254
    at t.swallowExceptions (c:\Users\Wilhelm\.vscode\extensions\ms-toolsai.jupyter-2022.9.1202862440\out\extension.node.js:7:130943)
    at p.dispose (c:\Users\Wilhelm\.vscode\extensions\ms-toolsai.jupyter-2022.9.1202862440\out\extension.node.js:2:2218232)
    at t.RawSession.dispose (c:\Users\Wilhelm\.vscode\extensions\ms-toolsai.jupyter-2022.9.1202862440\out\extension.node.js:2:2223340)
    at process.processTicksAndRejections (node:internal/process/task_queues:96:5)]
warn 16:35:23.649: Cell completed with errors {
  message: 'Canceled future for execute_request message before replies were done'

Thanks!

sonichi commented 1 year ago

Let's try something simple first.

automl.fit(
    X_train=x_train, y_train=y_train, max_iter=1,
    estimator_list=['lgbm'],
    # Pin these hyperparameters by taking them out of the search space
    # (domain=None means they are not tuned and keep their defaults).
    custom_hp={
        "lgbm": {
            "min_child_samples": {"domain": None},
            "log_max_bin": {"domain": None},
            "colsample_bytree": {"domain": None},
            "reg_alpha": {"domain": None},
            "reg_lambda": {"domain": None},
            "num_leaves": {"domain": None},
        }
    },
    # Start the single trial from the known manual configuration.
    starting_points={
        "lgbm": {
            "learning_rate": 0.09,
            "n_estimators": 1000,
        }
    }
)

This should get you the same LGBMClassifier model. If not, please let me know and I can provide further guidance.
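
To confirm the reproduction, a quick comparison against the manual run (a sketch using the attributes shown earlier in this thread):

# Inspect FLAML's chosen estimator; it should mirror the manual LGBMClassifier.
print(automl.model.estimator)
print('Training accuracy {:.4f}'.format(automl.score(x_train, y_train)))
print('Testing accuracy {:.4f}'.format(automl.score(x_test, y_test)))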

sonichi commented 1 year ago

> [quoting the previous comment: the max_iter=50 run with num_leaves=30000 crashed the Jupyter kernel with exit code 3221225477]

Sorry, I just realized that "num_leaves" should be set to 31, as that's LightGBM's default. (For what it's worth, exit code 3221225477 is 0xC0000005, an access violation; growing trees with 30000 leaves on a dataset this size plausibly exhausted memory and killed the kernel.)

wil70 commented 1 year ago

Thanks @sonichi

Cool, got it working - TY

1) So I used what you advised:

automl = AutoML()
automl.fit(X_train=x_train, y_train=y_train, time_budget=-1, max_iter=13, metric="accuracy",
    estimator_list=['lgbm'], task='classification',
    starting_points={"lgbm": {"learning_rate": 0.09, "n_estimators": 1000, "num_leaves": 31}})

The result (it took a bit less than 48 h):

Training accuracy 1.0000
Testing accuracy 0.5591

which is about 0.04 higher (after 48 h) than direct LightGBM's testing accuracy of 0.5203 (18 min) https://github.com/microsoft/FLAML/issues/785#issuecomment-1304597243

2) I can't find how to persist the model so I can restart training from it. I'm able to persist the best config, i.e. automl.save_best_config(filename_), and it saves {"class": "lgbm", "hyperparameters": {"n_estimators": 14904, "num_leaves": 139, "min_child_samples": 13, "learning_rate": 0.008057765488837305, "log_max_bin": 8, "colsample_bytree": 0.5449277239599435, "reg_alpha": 0.001538995396757784, "reg_lambda": 0.006215147139705481, "FLAML_sample_size": 75835}}, but I couldn't find how to persist the model itself.

qingyun-wu commented 1 year ago

Dear @wil70, you could use pickle to persist the model. Please find a code example in this notebook (in code block [7]).
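
A minimal sketch of that approach (mirroring the notebook's pattern):

import pickle

# Persist the whole AutoML object; automl.model.estimator can be pickled on its own too.
with open('automl.pkl', 'wb') as f:
    pickle.dump(automl, f, pickle.HIGHEST_PROTOCOL)

# Reload later and keep using the trained model.
with open('automl.pkl', 'rb') as f:
    automl = pickle.load(f)
print(automl.model.estimator)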

Let me know if you need further help. Thank you!