interpretml / interpret

Fit interpretable models. Explain blackbox machine learning.
https://interpret.ml/docs
MIT License

How to speed up the EBM model? Unbelievably slow. #515

Closed · wanglu2014 closed this 2 months ago

wanglu2014 commented 2 months ago

# Create a synthetic dataset
# (imports added for completeness; synthetic_data is assumed to come from an
#  external helper -- see the question about its contents below)
import pandas as pd
from interpret.glassbox import ExplainableBoostingClassifier
from sklearn.model_selection import train_test_split

y, X, treatment, _, _, _ = synthetic_data(mode=1, n=20, p=5, sigma=1.0)

ebm_treated = ExplainableBoostingClassifier(
    learning_rate=0.01,
    max_leaves=5,
    max_bins=4,
    min_samples_leaf=1,
    n_estimators=4,
    random_state=42,
    n_jobs=60
)

# Save the data in a pandas dataframe
df = pd.DataFrame({'y': y, 'X1': X.T[0], 'X2': X.T[1], 'X3': X.T[2],
                   'X4': X.T[3], 'X5': X.T[4], 'treatment': treatment})

# Split data into treated and control groups
df_treated = df[df['treatment'] == 1]
df_control = df[df['treatment'] == 0]

# Features and target for treated
features_treated = df_treated.loc[:, ['X1', 'X2', 'X3', 'X4', 'X5']]
y_treated = df_treated['y']

# Features and target for control
features_control = df_control.loc[:, ['X1', 'X2', 'X3', 'X4', 'X5']]
y_control = df_control['y']

# Train Explainable Boosting Classifier for treated
X_train_treated, X_test_treated, y_train_treated, y_test_treated = train_test_split(
    features_treated, y_treated, test_size=0.2, random_state=42)
ebm_treated.fit(X_train_treated, y_train_treated)

It is a simple dataset; however, the EBM model has been running for one day. My computer has 256 GB of RAM, which should be more than enough to run it.

paulbkoch commented 2 months ago

Hi @wanglu2014 -- I have a few questions:

1) Can you post the shape of features_treated and y_treated?
2) What's inside the synthetic_data function?
3) How many unique values does y_treated have?
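(For reference, a minimal sketch of how those diagnostics could be printed; the variable names follow the code posted above, and this snippet is an illustration rather than part of the original question.)

# Shape of the treated-group features and target
print(features_treated.shape, y_treated.shape)

# Distinct target values -- EBM classification treats each distinct value as a class
print(y_treated.nunique(), sorted(y_treated.unique()))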

wanglu2014 commented 2 months ago
    y        X1        X2        X3        X4        X5  treatment
0   1  0.374540  0.950714  0.731994  0.598658  0.156019          1
1   0  0.155995  0.058084  0.866176  0.601115  0.708073          0
2   0  0.020584  0.969910  0.832443  0.212339  0.181825          0
3   0  0.183405  0.304242  0.524756  0.431945  0.291229          0
4   1  0.611853  0.139494  0.292145  0.366362  0.456070          1
5   0  0.785176  0.199674  0.514234  0.592415  0.046450          0
6   1  0.607545  0.170524  0.065052  0.948886  0.965632          0
7   1  0.808397  0.304614  0.097672  0.684233  0.440152          0
8   0  0.122038  0.495177  0.034389  0.909320  0.258780          0
9   0  0.662522  0.311711  0.520068  0.546710  0.184854          1
10  1  0.969585  0.775133  0.939499  0.894827  0.597900          1
11  0  0.921874  0.088493  0.195983  0.045227  0.325330          0
12  0  0.388677  0.271349  0.828738  0.356753  0.280935          1
13  1  0.542696  0.140924  0.802197  0.074551  0.986887          1
14  1  0.772245  0.198716  0.005522  0.815461  0.706857          1
15  1  0.729007  0.771270  0.074045  0.358466  0.115869          1
16  1  0.863103  0.623298  0.330898  0.063558  0.310982          1
17  1  0.325183  0.729606  0.637557  0.887213  0.472215          1
18  0  0.119594  0.713245  0.760785  0.561277  0.770967          1
19  0  0.493796  0.522733  0.427541  0.025419  0.107891          1

paulbkoch commented 2 months ago

I think the underlying problem is that there are 20 samples and 20 classes. With this configuration, none of the boosted decision trees inside the model will allow any splits because of the min_hessian parameter, which disallows a split unless there is at least one sample from each class on both sides of the potential cut. Since no feature can have any splits, the model is simply the intercept. The intercept, however, continues to improve as boosting rounds accumulate, so early stopping never triggers. It's a bit surprising to me that it doesn't hit the max_rounds of 5000 within a reasonable amount of time, but this seems to be the case.
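(For anyone who wants to bound the runtime explicitly, a minimal sketch, assuming the installed interpret version exposes the max_rounds, early_stopping_rounds, and min_hessian parameters as recent releases do; defaults can vary across versions.)

from interpret.glassbox import ExplainableBoostingClassifier

# Cap boosting explicitly so a fit that only moves the intercept cannot run for hours:
# fewer boosting rounds, a tighter early-stopping window, and a smaller hessian
# threshold so that splits are more likely to be allowed on tiny datasets.
ebm = ExplainableBoostingClassifier(
    max_rounds=1000,           # hard upper bound on boosting rounds
    early_stopping_rounds=50,  # stop once the validation metric stops improving
    min_hessian=1e-4,          # lower the hessian threshold that blocks splits
    random_state=42,
)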

wanglu2014 commented 2 months ago

The 0, 1, 2, ... column is just the row index (row names); y is binary (0/1).

wanglu2014 commented 2 months ago

[image]

paulbkoch commented 2 months ago

I'm not quite sure what is going on here. I translated this into something that I could run and it took less than 1 minute to fit the EBM. Here's my version of the code. Can you try running mine and see how long it takes on your system:

I ran it on interpret v0.5.1.
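(To confirm which version is installed locally, a small illustrative check, assuming the package exposes __version__ as recent releases do.)

import interpret
print(interpret.__version__)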

from interpret.glassbox import ExplainableBoostingClassifier
from sklearn.model_selection import train_test_split
import pandas as pd
from io import StringIO
data = """1 0.374540 0.950714 0.731994 0.598658 0.156019 1
0 0.155995 0.058084 0.866176 0.601115 0.708073 0
0 0.020584 0.969910 0.832443 0.212339 0.181825 0
0 0.183405 0.304242 0.524756 0.431945 0.291229 0
1 0.611853 0.139494 0.292145 0.366362 0.456070 1
0 0.785176 0.199674 0.514234 0.592415 0.046450 0
1 0.607545 0.170524 0.065052 0.948886 0.965632 0
1 0.808397 0.304614 0.097672 0.684233 0.440152 0
0 0.122038 0.495177 0.034389 0.909320 0.258780 0
0 0.662522 0.311711 0.520068 0.546710 0.184854 1
1 0.969585 0.775133 0.939499 0.894827 0.597900 1
0 0.921874 0.088493 0.195983 0.045227 0.325330 0
0 0.388677 0.271349 0.828738 0.356753 0.280935 1
1 0.542696 0.140924 0.802197 0.074551 0.986887 1
1 0.772245 0.198716 0.005522 0.815461 0.706857 1
1 0.729007 0.771270 0.074045 0.358466 0.115869 1
1 0.863103 0.623298 0.330898 0.063558 0.310982 1
1 0.325183 0.729606 0.637557 0.887213 0.472215 1
0 0.119594 0.713245 0.760785 0.561277 0.770967 1
0 0.493796 0.522733 0.427541 0.025419 0.107891 1"""
data_io = StringIO(data)
column_names = ['y', 'X1', 'X2', 'X3', 'X4', 'X5', 'treatment']
df = pd.read_csv(data_io, sep=' ', header=None, names=column_names)
df_treated = df[df['treatment'] == 1]
df_control = df[df['treatment'] == 0]
features_treated = df_treated.loc[:, ['X1', 'X2', 'X3', 'X4', 'X5']]
y_treated = df_treated['y']
features_control = df_control.loc[:, ['X1', 'X2', 'X3', 'X4', 'X5']]
y_control = df_control['y']
ebm_treated = ExplainableBoostingClassifier(
learning_rate=0.01,
max_leaves=5,
max_bins=4,
min_samples_leaf=1,
#n_estimators=4,
random_state=42,
n_jobs=60
)
X_train_treated, X_test_treated, y_train_treated, y_test_treated = train_test_split(features_treated, y_treated, test_size=0.2, random_state=42)
ebm_treated.fit(X_train_treated, y_train_treated)
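(A small illustrative way to time the fit when comparing systems; this re-runs the last line above and is not part of the original reply.)

import time

start = time.time()
ebm_treated.fit(X_train_treated, y_train_treated)
print(f"fit finished in {time.time() - start:.1f} seconds")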
wanglu2014 commented 2 months ago

Thank you for your code. I tried it and it succeeded.

paulbkoch commented 2 months ago

Hi @wanglu2014 -- Is this resolved then? My code above was an attempt to duplicate your dataset, but if mine worked and your original one didn't, then I'm really confused. Any idea what the difference between the two could be? They would still be different by minor variations in the floating-point numbers since the text representations of the floats above only have 6 digits.

wanglu2014 commented 2 months ago

It is resolved. Thank you for your timely reply. I will cite your paper when mine gets published.