microsoft / FLAML

A fast library for AutoML and tuning. Join our Discord: https://discord.gg/Cppx2vSPVP.
https://microsoft.github.io/FLAML/
MIT License

AutoML scikit-learn best estimator #1191

Closed lucazav closed 1 year ago

lucazav commented 1 year ago

I trained a classifier using AutoML. Then I ran this code to get the best estimator:

best_estimator = model.best_model_for_estimator(model.best_estimator)

I noticed that this estimator is of flaml.automl.model.LGBMEstimator type. I expected a scikit-learn custom estimator. As I need a scikit-learn estimator as output, I tried this way:

best_estimator.estimator

but I get a NoneType object.

Any hint, please? I'm using FLAML 2.0.0

sonichi commented 1 year ago

Could you try setting model_history=True in AutoML.fit()? Otherwise only the best model of all the trials is kept for space efficiency.

lucazav commented 1 year ago

@sonichi I'll try your suggestion. However, I only want the best estimator as a scikit-learn object, so I assumed no history is needed.

sonichi commented 1 year ago

Oh right. Then you shouldn't need that hint. Could you share a minimal code example to reproduce this issue?

lucazav commented 1 year ago

Here is a repro:

# %%
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from flaml import AutoML
import pickle
import os

# %%
main_path = r'C:\<your-path>'

# %%

dataset = pd.read_csv(os.path.join(main_path, 'titanic-imputed.csv'))
dataset

# %%
# Let's split the dataframe in a small part to be kept for test purpose and
# a large part for training.
X = dataset.drop('Survived',axis=1)
y = dataset[['Survived']]

# Force the float values of Pclass to integer, as Power BI imports it as an int column
X['Pclass'] = X['Pclass'].astype('int')

X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.05)

# %%
# Setup the FLAML AutoML experiment properly
automl = AutoML()

settings = {
    "time_budget": 600,  # total running time in seconds
    "metric": 'roc_auc', # check the documentation for options of metrics (https://microsoft.github.io/FLAML/docs/Use-Cases/Task-Oriented-AutoML#optimization-metric)
    "task": 'classification',  # task type
    "log_file_name": 'titanic.log',  # flaml log file
    "seed": 7654321,    # random seed
}

# Get a Pandas series from the single-column y_train dataframe,
# as automl.fit requires a series for its y_train parameter
y_train_series = y_train.squeeze()

automl.fit(X_train=X_train, y_train=y_train_series, **settings)

# %%
'''retrieve best config and best learner'''
print('Best ML learner:', automl.best_estimator)
print('Best AUC on validation data: {0:.4g}'.format(1-automl.best_loss))

# %%
best_estimator = automl.best_model_for_estimator(automl.best_estimator).estimator

type(best_estimator)

You can find the CSV file used as training dataset here:

https://1drv.ms/u/s!AtgrmeCPhKh7lM1zLmPdpUBd4bAHCQ

sonichi commented 1 year ago

I can't download the CSV file, but I think I know the issue. Please use automl.model.estimator to get the best model's estimator. The approach you are using, best_model_for_estimator(), requires model_history=True.
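
As a minimal sketch of that suggestion: the helper below takes a fitted AutoML object, reads its model attribute (the FLAML wrapper, e.g. flaml.automl.model.LGBMEstimator) and returns the wrapper's estimator attribute, which is the underlying fitted model. The helper name get_sklearn_estimator is hypothetical and not part of FLAML, and the check for a missing model assumes that automl.model is None when nothing has been trained.

```python
def get_sklearn_estimator(automl):
    """Return the underlying fitted estimator of the best model.

    `automl` is expected to be a fitted flaml.AutoML instance.
    This helper is illustrative only; it is not part of FLAML.
    """
    # automl.model is FLAML's wrapper around the best trained model.
    wrapper = getattr(automl, "model", None)
    if wrapper is None:
        # Assumption: no wrapper is available before fit() has run.
        raise ValueError("No trained model found; call automl.fit() first.")
    # The wrapper's .estimator attribute holds the underlying model object.
    return wrapper.estimator
```

With the repro above, this would be called as best_estimator = get_sklearn_estimator(automl) after automl.fit(...), without needing model_history=True.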

lucazav commented 1 year ago

Thank you @sonichi, automl.model.estimator is what I was looking for. Clearer documentation on this would be really useful to users.