Open pseudotensor opened 2 years ago
Thanks for this write-up and reproducible example! I ran this code tonight and can confirm I got the same results you reported.
Experimenting with this, I found that any of the following individual changes result in the predictions being identical:

- `subsample`
- ~~adding `"bagging_seed": 708` to params~~
- `monotone_constraints`

This definitely seems like a bug, but it seems a bit more specific than "passing all 0s for `monotone_constraints` results in a different model".
I think it looks like: providing all 0s for `monotone_constraints` results in a different model than if `monotone_constraints` are not provided, if also using bagging and not setting `bagging_seed`.
I don't find the same result as you. Adding `bagging_seed` doesn't help:
import lightgbm as lgb
import pandas as pd
df = pd.read_csv("df.csv")
y = df['y']
X = df.drop('y', axis=1)
model_class = lgb.sklearn.LGBMRegressor
params = {
'min_child_samples': 1,
'subsample': 0.7,
'subsample_freq': 1,
'random_state': 1234,
'bagging_seed': 708,
'deterministic': True,
'monotone_constraints': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
}
model = model_class(**params)
model.fit(X, y)
preds = model.predict(X)
print(preds[0:20])
params.pop('monotone_constraints', None)
model = model_class(**params)
model.fit(X, y)
preds = model.predict(X)
print(preds[0:20])
gives:
[15.09656531 17.43069474 17.27949377 24.68849732 31.7121405 25.07867232
38.73125164 30.1278061 23.99720783 19.31545673 23.53189337 19.09648957
13.87783463 50.01956777 22.1214959 15.44131316 21.7363739 23.6298187
16.17185313 20.28922842]
[15.08843553 17.4425064 17.2973164 24.66865837 31.6657934 25.0441509
38.66294203 30.1151615 24.02734721 19.33964692 23.52776624 19.10077709
13.87791364 50.02051734 22.13321018 15.45728055 21.69579512 23.65150691
16.16314868 20.26343255]
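Rather than eyeballing the printed arrays, the difference can be checked programmatically. A minimal sketch, using the first few values from the two prediction vectors above (the array names here are illustrative, not from the script):

```python
import numpy as np

# Stand-in slices of the two prediction vectors printed above:
# with all-zero monotone_constraints, and without the parameter.
preds_with = np.array([15.09656531, 17.43069474, 17.27949377])
preds_without = np.array([15.08843553, 17.4425064, 17.2973164])

# If passing all-zero monotone_constraints were truly a no-op,
# the two models would produce identical predictions.
identical = np.allclose(preds_with, preds_without)
max_diff = np.max(np.abs(preds_with - preds_without))
print(identical, max_diff)
```

At `np.allclose`'s default tolerances the vectors clearly differ: the largest gap in this slice is about 0.018, far beyond rounding noise.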
Also note that in making the MRE I originally had `bagging_seed` set to 1236 and determined it didn't affect the outcome, which is why I removed it. But it is not relevant: you can keep it if you wish and the same problem I described still happens.
And again, just because bagging triggers it here doesn't mean it is the only way it can be triggered; that is just a result of reducing my MRE to one specific case that happened to show it.
What version of `lightgbm` are you on, and how did you install it? I tested this on latest master (https://github.com/microsoft/LightGBM/commit/305369ddfd6b977484a961c44f400126e4f69029).
>>> lgb.__version__
'3.2.1.99'
I can try on master.
But are you aware of a specific fix?
> But are you aware of a specific fix?
No, I didn't know this issue existed until this bug report. Just trying to help narrow it down further.
Same result on latest from PyPI on a different machine after only installing:
virtualenv jon
source jon/bin/activate
pip install numpy pandas sklearn lightgbm
and the same script with `bagging_seed` set:
import lightgbm as lgb
import pandas as pd
df = pd.read_csv("df.csv")
y = df['y']
X = df.drop('y', axis=1)
model_class = lgb.sklearn.LGBMRegressor
params = {
'min_child_samples': 1,
'subsample': 0.7,
'subsample_freq': 1,
'random_state': 1234,
'bagging_seed': 708,
'deterministic': True,
'monotone_constraints': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
}
model = model_class(**params)
model.fit(X, y)
preds = model.predict(X)
print(preds[0:20])
params.pop('monotone_constraints', None)
model = model_class(**params)
model.fit(X, y)
preds = model.predict(X)
print(preds[0:20])
gives:
[15.09656531 17.43069474 17.27949377 24.68849732 31.7121405 25.07867232
38.73125164 30.1278061 23.99720783 19.31545673 23.53189337 19.09648957
13.87783463 50.01956777 22.1214959 15.44131316 21.7363739 23.6298187
16.17185313 20.28922842]
[15.08843553 17.4425064 17.2973164 24.66865837 31.6657934 25.0441509
38.66294203 30.1151615 24.02734721 19.33964692 23.52776624 19.10077709
13.87791364 50.02051734 22.13321018 15.45728055 21.69579512 23.65150691
16.16314868 20.26343255]
version:
>>> lgb.__version__
'3.3.2'
3.3.2 is a special release that doesn't include most of the changes currently on master. It's just 3.3.1 + one small patch requested by the CRAN maintainers (see discussion in #4923).
To install from latest master:
git clone --recursive https://github.com/microsoft/LightGBM.git
cd LightGBM/python-package
python setup.py install
Same result:
[LightGBM] [Warning] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000330 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 665
[LightGBM] [Info] Number of data points in the train set: 253, number of used features: 13
[LightGBM] [Info] Start training from score 22.522925
[15.09656531 17.43069474 17.27949377 24.68849732 31.7121405 25.07867232
38.73125164 30.1278061 23.99720783 19.31545673 23.53189337 19.09648957
13.87783463 50.01956777 22.1214959 15.44131316 21.7363739 23.6298187
16.17185313 20.28922842]
[LightGBM] [Warning] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000261 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 665
[LightGBM] [Info] Number of data points in the train set: 253, number of used features: 13
[LightGBM] [Info] Start training from score 22.522925
[15.08843553 17.4425064 17.2973164 24.66865837 31.6657934 25.0441509
38.66294203 30.1151615 24.02734721 19.33964692 23.52776624 19.10077709
13.87791364 50.02051734 22.13321018 15.45728055 21.69579512 23.65150691
16.16314868 20.26343255]
>>> import lightgbm as lgb
>>> print(lgb.__version__)
3.3.2.99
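As an aside on reading these version strings: in this thread the builds from master carry a `.99` suffix (`3.2.1.99`, `3.3.2.99`) while the PyPI releases do not (`3.3.2`). A tiny helper, sketched from only the version strings seen here:

```python
def is_dev_build(version: str) -> bool:
    # Development builds from master append ".99" to the base version
    # (e.g. "3.2.1.99", "3.3.2.99"); PyPI releases (e.g. "3.3.2") do not.
    return version.endswith(".99")

print(is_dev_build("3.3.2"))     # release from PyPI
print(is_dev_build("3.3.2.99"))  # built from master
```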
Ah yeah, you're right, I just tried again and got the same result you did. I crossed out the suggestion about `bagging_seed` in https://github.com/microsoft/LightGBM/issues/4936#issuecomment-1007878334; I must have accidentally changed two things at once when I thought I was testing only that.
Attachment: df.csv
This is an MRE, but more complicated examples also do this and lead to arbitrarily different predictions.
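The pattern behind the MRE can be factored into a reusable check: fit twice with identical params except one supposedly no-op setting, and report the largest prediction difference. A sketch with a stand-in estimator so it runs without LightGBM installed (`MeanRegressor` and `max_prediction_diff` are illustrative names, not LightGBM API; swap in `lgb.sklearn.LGBMRegressor` and the all-zero `monotone_constraints` to exercise the actual bug):

```python
import numpy as np

class MeanRegressor:
    """Stand-in estimator that predicts the training mean and ignores
    extra params. Replace with lgb.sklearn.LGBMRegressor for the real test."""
    def __init__(self, **params):
        self.params = params
    def fit(self, X, y):
        self.mean_ = float(np.mean(y))
        return self
    def predict(self, X):
        return np.full(len(X), self.mean_)

def max_prediction_diff(model_class, X, y, base_params, noop_params):
    """Fit with and without the supposedly no-op params and return the
    largest absolute difference between the two prediction vectors."""
    preds_with = model_class(**base_params, **noop_params).fit(X, y).predict(X)
    preds_without = model_class(**base_params).fit(X, y).predict(X)
    return float(np.max(np.abs(preds_with - preds_without)))

X = np.arange(20).reshape(10, 2)
y = np.arange(10)
diff = max_prediction_diff(
    MeanRegressor, X, y,
    base_params={"random_state": 1234},
    noop_params={"monotone_constraints": [0, 0]},
)
print(diff)  # 0.0 for a well-behaved estimator; nonzero reproduces the bug
```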