mljar / mljar-supervised

Python package for AutoML on Tabular Data with Feature Engineering, Hyper-Parameters Tuning, Explanations and Automatic Documentation
https://mljar.com
MIT License

Fit best model on new data in Optuna mode #429

Open huanvo88 opened 3 years ago

huanvo88 commented 3 years ago

Hello,

Is there a way to fit the best AutoML model (using the best hyperparameters from the Optuna grid search + ensemble, etc.) on new data? To give a concrete example, let's say I run AutoML in Optuna mode with a custom validation set to obtain the best model, and now I would like to fit that model on the train + validation sets and look at the results on an independent test set.

pplonski commented 3 years ago

For the Optuna mode it should be possible. You need to create AutoML with the optuna_init_params argument pointing to the file with the Optuna parameters. For example:

1. Train AutoML on the first dataset:
automl_1 = AutoML(mode='Optuna', results_path='AutoML_1')
automl_1.fit(X, y)

The best parameters from the Optuna tuning will be saved in the file AutoML_1/optuna/optuna.json.

2. Train on the second dataset, but with the parameters from step 1:
automl_2 = AutoML(mode='Optuna', results_path='AutoML_2', optuna_init_params='AutoML_1/optuna/optuna.json')
automl_2.fit(X_new, y_new)

In this step, the models will be trained with the parameters from step 1 but on the new data. All models from this step will be saved in the AutoML_2 path.

Please let me know if it works.

It is good to test it with small data and a small optuna_time_budget, for example 60 seconds; the default tuning time per model in Optuna mode is 3600 seconds.
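For example, a quick smoke test could look like this (a minimal sketch; X and y are any small toy dataset and the results_path name is arbitrary):

from supervised import AutoML

# limit Optuna tuning to 60 seconds per algorithm instead of the default 3600
automl = AutoML(mode='Optuna', results_path='AutoML_smoke_test', optuna_time_budget=60)
automl.fit(X, y)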

This is only available in the Optuna mode. In other modes you will need to train models from scratch.

huanvo88 commented 3 years ago

Thanks @pplonski, so I don't think I can pass the path directly to optuna_init_params; I have to do something like this:

import json
optuna_init = json.loads(open('previous_AutoML_training/optuna/optuna.json').read())

Also, when I follow your suggestion, it does another Optuna grid search for automl_2 on the new data again, which is not what I was asking. I just want to fit the best AutoML model found in the previous Optuna grid search (with validation) on the new data.

pplonski commented 3 years ago

Yes, you are right, you need to load the params and pass them as a dict. Sorry for the confusion.

I just ran a simple toy example, and it should work as expected:

import json
import numpy as np
import pandas as pd
from supervised import AutoML

# some toy data
X = np.random.rand(100, 10)
y = np.random.rand(100)

# first training
# tuning with Optuna + training
automl = AutoML(mode='Optuna', results_path='automl_first_run', optuna_time_budget=1)
automl.fit(X, y)

# load the best params found by Optuna in the first run
my_params = json.load(open('automl_first_run/optuna/optuna.json'))

# second training
# just train with the best params from the first run (no tuning)
automl_2 = AutoML(mode='Optuna', results_path='automl_second_run', optuna_init_params=my_params)
automl_2.fit(X, y)

Is your code similar?
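For the exact use case from your first post (refit on train + validation, then check an independent test set), a sketch along the same lines could be the following; X_train, y_train, X_valid, y_valid, X_test are placeholders for your own splits and the paths are just examples:

import json
import numpy as np
from supervised import AutoML

# best parameters found during the tuning run
my_params = json.load(open('automl_first_run/optuna/optuna.json'))

# refit on train + validation combined, reusing the tuned parameters (no new tuning)
X_full = np.concatenate([X_train, X_valid])
y_full = np.concatenate([y_train, y_valid])

automl_final = AutoML(mode='Optuna', results_path='automl_final_run', optuna_init_params=my_params)
automl_final.fit(X_full, y_full)

# evaluate on the independent test set
predictions = automl_final.predict(X_test)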

huanvo88 commented 3 years ago

Thanks @pplonski, I think I figured out the mistake in my code. In the first run I set n_jobs=40, so it did not run the grid search for the neural network. However, for the second run I did not specify n_jobs, which is why it spins up an Optuna grid search for neural networks. When I specify n_jobs for both runs it is fine :)
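Roughly, the working setup now looks like this (a sketch; the paths are just examples and X_new, y_new are the new data):

import json
from supervised import AutoML

# first run: tune with Optuna, with n_jobs fixed
automl_1 = AutoML(mode='Optuna', results_path='AutoML_1', n_jobs=40)
automl_1.fit(X, y)

# second run: reuse the tuned parameters on the new data, with the same n_jobs
# so that the same set of algorithms is considered in both runs
my_params = json.load(open('AutoML_1/optuna/optuna.json'))
automl_2 = AutoML(mode='Optuna', results_path='AutoML_2', optuna_init_params=my_params, n_jobs=40)
automl_2.fit(X_new, y_new)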

pplonski commented 3 years ago

Great that it works. However, the Neural Network should still be trained when n_jobs is specified; it just doesn't use it (the sklearn MLP implementation doesn't have n_jobs). Maybe it is a bug that Optuna mode disables the NN when n_jobs is specified?

huanvo88 commented 3 years ago

Right, when training with Optuna, the first line it prints is "Neural Network algorithm was disabled because it does not support n_jobs parameters".

Maybe we should fix it to include the Neural Network algorithm when n_jobs is specified as well.

On the other hand, I'm not sure how useful MLP is for tabular data; maybe we are just wasting time doing a grid search over it. Also, it is faster to train an MLP on a GPU anyway.

pplonski commented 3 years ago

@huanvo88 thanks for pasting the output. I remember now that I disabled it on purpose. Let's leave it as it is. I will update the docs with your use case.