[Closed] wanglu2014 closed this issue 2 months ago
Hi @wanglu2014 -- I have a few questions:
1) Can you post the shape of features_treated and y_treated? 2) What's inside the synthetic_data function? 3) How many unique values does y_treated have?
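(A quick sketch of how to check these with pandas; the `features_treated` and `y_treated` below are small stand-ins for the real variables, not the actual data:)

```python
import pandas as pd

# Hypothetical stand-ins for the real features_treated / y_treated
features_treated = pd.DataFrame({'X1': [0.1, 0.2, 0.3],
                                 'X2': [0.4, 0.5, 0.6]})
y_treated = pd.Series([0, 1, 1], name='y')

print(features_treated.shape)  # (rows, columns)
print(y_treated.shape)
print(y_treated.nunique())     # number of unique target values
```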
    y        X1        X2        X3        X4        X5  treatment
0   1  0.374540  0.950714  0.731994  0.598658  0.156019          1
1   0  0.155995  0.058084  0.866176  0.601115  0.708073          0
2   0  0.020584  0.969910  0.832443  0.212339  0.181825          0
3   0  0.183405  0.304242  0.524756  0.431945  0.291229          0
4   1  0.611853  0.139494  0.292145  0.366362  0.456070          1
5   0  0.785176  0.199674  0.514234  0.592415  0.046450          0
6   1  0.607545  0.170524  0.065052  0.948886  0.965632          0
7   1  0.808397  0.304614  0.097672  0.684233  0.440152          0
8   0  0.122038  0.495177  0.034389  0.909320  0.258780          0
9   0  0.662522  0.311711  0.520068  0.546710  0.184854          1
10  1  0.969585  0.775133  0.939499  0.894827  0.597900          1
11  0  0.921874  0.088493  0.195983  0.045227  0.325330          0
12  0  0.388677  0.271349  0.828738  0.356753  0.280935          1
13  1  0.542696  0.140924  0.802197  0.074551  0.986887          1
14  1  0.772245  0.198716  0.005522  0.815461  0.706857          1
15  1  0.729007  0.771270  0.074045  0.358466  0.115869          1
16  1  0.863103  0.623298  0.330898  0.063558  0.310982          1
17  1  0.325183  0.729606  0.637557  0.887213  0.472215          1
18  0  0.119594  0.713245  0.760785  0.561277  0.770967          1
19  0  0.493796  0.522733  0.427541  0.025419  0.107891          1
I think the underlying problem is that there are 20 samples and 20 classes. With this configuration, none of the boosted decision trees inside the model can make any splits because of the min_hessian parameter, which disallows a split unless there is at least one sample from each class on both sides of the potential split. Since no feature can be split, the model is just the intercept. The intercept, however, keeps improving as more boosting rounds accumulate, so early stopping never triggers. It's a bit surprising to me that it doesn't hit the max_rounds limit of 5000 within a reasonable amount of time, but this seems to be the case.
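To illustrate the failure mode (a hedged sketch, not the library's internals): if the target handed to a classifier is continuous, every sample is its own class, so no split can keep a sample from each class on both sides:

```python
import numpy as np

rng = np.random.default_rng(42)
y_continuous = rng.normal(size=20)        # continuous outcome, like a regression target
n_classes = np.unique(y_continuous).size  # a classifier would see one class per sample
print(n_classes)                          # 20
```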
0, 1, 2, ... are the row names (the index); y is 0/1.
I'm not quite sure what is going on here. I translated this into something that I could run and it took less than 1 minute to fit the EBM. Here's my version of the code. Can you try running mine and see how long it takes on your system:
I ran it on interpret v0.5.1.
from interpret.glassbox import ExplainableBoostingClassifier
from sklearn.model_selection import train_test_split
import pandas as pd
from io import StringIO
data = """1 0.374540 0.950714 0.731994 0.598658 0.156019 1
0 0.155995 0.058084 0.866176 0.601115 0.708073 0
0 0.020584 0.969910 0.832443 0.212339 0.181825 0
0 0.183405 0.304242 0.524756 0.431945 0.291229 0
1 0.611853 0.139494 0.292145 0.366362 0.456070 1
0 0.785176 0.199674 0.514234 0.592415 0.046450 0
1 0.607545 0.170524 0.065052 0.948886 0.965632 0
1 0.808397 0.304614 0.097672 0.684233 0.440152 0
0 0.122038 0.495177 0.034389 0.909320 0.258780 0
0 0.662522 0.311711 0.520068 0.546710 0.184854 1
1 0.969585 0.775133 0.939499 0.894827 0.597900 1
0 0.921874 0.088493 0.195983 0.045227 0.325330 0
0 0.388677 0.271349 0.828738 0.356753 0.280935 1
1 0.542696 0.140924 0.802197 0.074551 0.986887 1
1 0.772245 0.198716 0.005522 0.815461 0.706857 1
1 0.729007 0.771270 0.074045 0.358466 0.115869 1
1 0.863103 0.623298 0.330898 0.063558 0.310982 1
1 0.325183 0.729606 0.637557 0.887213 0.472215 1
0 0.119594 0.713245 0.760785 0.561277 0.770967 1
0 0.493796 0.522733 0.427541 0.025419 0.107891 1"""
data_io = StringIO(data)
column_names = ['y', 'X1', 'X2', 'X3', 'X4', 'X5', 'treatment']
df = pd.read_csv(data_io, sep=' ', header=None, names=column_names)
df_treated = df[df['treatment'] == 1]
df_control = df[df['treatment'] == 0]
features_treated = df_treated.loc[:, ['X1', 'X2', 'X3', 'X4', 'X5']]
y_treated = df_treated['y']
features_control = df_control.loc[:, ['X1', 'X2', 'X3', 'X4', 'X5']]
y_control = df_control['y']
ebm_treated = ExplainableBoostingClassifier(
learning_rate=0.01,
max_leaves=5,
max_bins=4,
min_samples_leaf=1,
#n_estimators=4,
random_state=42,
n_jobs=60
)
X_train_treated, X_test_treated, y_train_treated, y_test_treated = train_test_split(features_treated, y_treated, test_size=0.2, random_state=42)
ebm_treated.fit(X_train_treated, y_train_treated)
Thank you for your code. I tried it and it succeeded.
Hi @wanglu2014 -- Is this resolved then? My code above was an attempt to duplicate your dataset, but if mine worked and your original one didn't, then I'm really confused. Any idea what the difference between the two could be? The two would still differ by minor variations in the floating-point numbers, since the text representations of the floats above only have 6 digits.
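To illustrate that last point (a minimal sketch): round-tripping a double through a 6-digit text representation generally does not recover the original value:

```python
x = 0.6394267984578837       # a full-precision float
x_text = f"{x:.6f}"          # 6-digit text dump, like the table above
x_roundtrip = float(x_text)  # 0.639427
print(x == x_roundtrip)      # False: the parsed value differs from the original
```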
It is resolved. Thank you for your timely reply. I will cite your paper if mine is published~
# Create a synthetic dataset
y, X, treatment, _, _, _ = synthetic_data(mode=1, n=20, p=5, sigma=1.0)
ebm_treated = ExplainableBoostingClassifier(
    learning_rate=0.01,
    max_leaves=5,
    max_bins=4,
    min_samples_leaf=1,
    n_estimators=4,
)
# Save the data in a pandas dataframe
df = pd.DataFrame({'y': y, 'X1': X.T[0], 'X2': X.T[1], 'X3': X.T[2], 'X4': X.T[3], 'X5': X.T[4], 'treatment': treatment})
# Split data into treated and control groups
df_treated = df[df['treatment'] == 1]
df_control = df[df['treatment'] == 0]
# Features and target for treated
features_treated = df_treated.loc[:, ['X1', 'X2', 'X3', 'X4', 'X5']]
y_treated = df_treated['y']
# Features and target for control
features_control = df_control.loc[:, ['X1', 'X2', 'X3', 'X4', 'X5']]
y_control = df_control['y']
# Train Explainable Boosting Classifier for treated
X_train_treated, X_test_treated, y_train_treated, y_test_treated = train_test_split(features_treated, y_treated, test_size=0.2, random_state=42)
ebm_treated.fit(X_train_treated, y_train_treated)
It is a simple dataset; however, the EBM model has been running for a day. My computer has 256 GB of RAM, which is enough to run it.