microsoft / LightGBM

A fast, distributed, high performance gradient boosting (GBT, GBDT, GBRT, GBM or MART) framework based on decision tree algorithms, used for ranking, classification and many other machine learning tasks.
https://lightgbm.readthedocs.io/en/latest/
MIT License

Monotonicity of 0's different result than no monotonicity set #4936

Open pseudotensor opened 2 years ago

pseudotensor commented 2 years ago

df.csv (attached; also available at https://github.com/microsoft/LightGBM/files/7831557/df.csv)

```python
import lightgbm as lgb
import pandas as pd

df = pd.read_csv("df.csv")
y = df['y']
X = df.drop('y', axis=1)

model_class = lgb.sklearn.LGBMRegressor
params = {
    'min_child_samples': 1,
    'subsample': 0.7,      # bagging enabled
    'subsample_freq': 1,

    'random_state': 1234,
    'deterministic': True,

    # one 0 per feature, i.e. "no constraint" on every column
    'monotone_constraints': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
}

# fit with all-zero (no-op) monotone constraints
model = model_class(**params)
model.fit(X, y)
preds = model.predict(X)
print(preds[0:20])

# fit again with monotone_constraints removed entirely;
# predictions should be identical, but differ
params.pop('monotone_constraints', None)
model = model_class(**params)
model.fit(X, y)
preds = model.predict(X)
print(preds[0:20])
```

This is an MRE, but more complicated examples also show this behavior and lead to arbitrarily different prediction results.

[15.14082331 17.42164852 17.27809565 24.66228158 31.7912921  25.07472208
 38.66674383 30.14650357 24.10321141 19.28946953 23.59302071 19.06902673
 13.8462058  50.01765005 22.09789499 15.48053459 21.67382484 23.59418642
 16.14454625 20.21541095]
[15.11855937 17.458745   17.22814776 24.65581646 31.72203189 25.0968243
 38.65681328 30.14518504 24.03192297 19.31654484 23.60164724 19.06489953
 13.85319787 50.03306803 22.07789009 15.45374806 21.75793041 23.61585482
 16.14733807 20.18282579]
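
To quantify the mismatch rather than eyeballing the printed arrays, a small sketch (the names `preds_zeros` and `preds_none` are stand-ins introduced here for the two prediction arrays from the runs above):

```python
import numpy as np

# preds_zeros: predictions from the fit with all-zero monotone_constraints
# preds_none:  predictions from the fit with no monotone_constraints
# (stand-in names for the two arrays printed above)
diff = np.abs(preds_zeros - preds_none)
print("max abs diff: ", diff.max())   # on the order of 1e-2 for the sample above
print("mean abs diff:", diff.mean())
print("identical:    ", np.allclose(preds_zeros, preds_none))
```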
jameslamb commented 2 years ago

Thanks for this write-up and reproducible example! I ran this code tonight and can confirm I got the same results you reported.

Experimenting with this, I found individual changes that result in the predictions being identical; for example, removing `subsample` (i.e., disabling bagging), as in the test code below:

test code:

```python
import lightgbm as lgb
import pandas as pd
import numpy as np

data_url = "https://github.com/microsoft/LightGBM/files/7831557/df.csv"
df = pd.read_csv(data_url)
y = df['y']
X = df.drop('y', axis=1)

model_class = lgb.sklearn.LGBMRegressor
params = {
    'min_child_samples': 1,
    'subsample_freq': 1,
    'random_state': 1234,
    'deterministic': True,
    'monotone_constraints': [0] * X.shape[1],
}

model = model_class(**params)
model.fit(X, y)
preds1 = model.predict(X)
print(preds1[0:20])

params.pop('monotone_constraints', None)
model = model_class(**params)
model.fit(X, y)
preds2 = model.predict(X)
print(preds2[0:20])

assert np.allclose(preds1, preds2)
```

This definitely seems like a bug, but it seems a bit more specific than "passing all 0s for monotone_constraints results in a different model".

I think it looks like:

Providing all 0s for monotone_constraints results in a different model than if monotone_constraints are not provided, if also using bagging and not setting bagging_seed.

pseudotensor commented 2 years ago

I don't find the same result as you. Adding bagging_seed doesn't help:

```python
import lightgbm as lgb
import pandas as pd

df = pd.read_csv("df.csv")
y = df['y']
X = df.drop('y', axis=1)

model_class = lgb.sklearn.LGBMRegressor
params = {
    'min_child_samples': 1,
    'subsample': 0.7,
    'subsample_freq': 1,

    'random_state': 1234,
    'bagging_seed': 708,   # explicit bagging seed, per the suggestion above
    'deterministic': True,

    'monotone_constraints': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
}

model = model_class(**params)
model.fit(X, y)
preds = model.predict(X)
print(preds[0:20])

params.pop('monotone_constraints', None)
model = model_class(**params)
model.fit(X, y)
preds = model.predict(X)
print(preds[0:20])
```

gives:

[15.09656531 17.43069474 17.27949377 24.68849732 31.7121405  25.07867232
 38.73125164 30.1278061  23.99720783 19.31545673 23.53189337 19.09648957
 13.87783463 50.01956777 22.1214959  15.44131316 21.7363739  23.6298187
 16.17185313 20.28922842]
[15.08843553 17.4425064  17.2973164  24.66865837 31.6657934  25.0441509
 38.66294203 30.1151615  24.02734721 19.33964692 23.52776624 19.10077709
 13.87791364 50.02051734 22.13321018 15.45728055 21.69579512 23.65150691
 16.16314868 20.26343255]

Also note that in making the MRE I originally had `bagging_seed` set to 1236 and determined it didn't affect the outcome, which is why I removed it. But it is not relevant: you can keep it if you wish and the same problem I described still happens.

And again, just because bagging triggers it here doesn't mean it is the only way it can be triggered. That is just an artifact of reducing my MRE to one specific case that happened to show it.

jameslamb commented 2 years ago

What version of lightgbm are you on, and how did you install it? I tested this on the latest master (https://github.com/microsoft/LightGBM/commit/305369ddfd6b977484a961c44f400126e4f69029).

pseudotensor commented 2 years ago
>>> lgb.__version__
'3.2.1.99'

I can try on master.

But are you aware of a specific fix?

jameslamb commented 2 years ago

> But are you aware of a specific fix?

No, I didn't know this issue existed until this bug report. Just trying to help narrow it down further.

pseudotensor commented 2 years ago

Same result with the latest release from PyPI, on a different machine, after installing only:

virtualenv jon
source jon/bin/activate
pip install numpy pandas sklearn lightgbm
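
(Side note: the `sklearn` package on PyPI is only a deprecated alias; the maintained name is `scikit-learn`, so an equivalent environment can be set up with:)

virtualenv jon
source jon/bin/activate
pip install numpy pandas scikit-learn lightgbm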

and running the same script as above, with `bagging_seed` set, gives:

[15.09656531 17.43069474 17.27949377 24.68849732 31.7121405  25.07867232
 38.73125164 30.1278061  23.99720783 19.31545673 23.53189337 19.09648957
 13.87783463 50.01956777 22.1214959  15.44131316 21.7363739  23.6298187
 16.17185313 20.28922842]
[15.08843553 17.4425064  17.2973164  24.66865837 31.6657934  25.0441509
 38.66294203 30.1151615  24.02734721 19.33964692 23.52776624 19.10077709
 13.87791364 50.02051734 22.13321018 15.45728055 21.69579512 23.65150691
 16.16314868 20.26343255]

version:

>>> lgb.__version__
'3.3.2'
jameslamb commented 2 years ago

3.3.2 is a special release that doesn't include most of the changes currently on master. It's just 3.3.1 + one small patch requested by the CRAN maintainers (see discussion in #4923).

To install from latest master:

git clone --recursive https://github.com/microsoft/LightGBM.git
cd LightGBM/python-package
python setup.py install
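
To confirm afterwards which build is actually active, something like:

python -c "import lightgbm; print(lightgbm.__version__)"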
pseudotensor commented 2 years ago

Same result:

[LightGBM] [Warning] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000330 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 665
[LightGBM] [Info] Number of data points in the train set: 253, number of used features: 13
[LightGBM] [Info] Start training from score 22.522925
[15.09656531 17.43069474 17.27949377 24.68849732 31.7121405  25.07867232
 38.73125164 30.1278061  23.99720783 19.31545673 23.53189337 19.09648957
 13.87783463 50.01956777 22.1214959  15.44131316 21.7363739  23.6298187
 16.17185313 20.28922842]
[LightGBM] [Warning] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000261 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 665
[LightGBM] [Info] Number of data points in the train set: 253, number of used features: 13
[LightGBM] [Info] Start training from score 22.522925
[15.08843553 17.4425064  17.2973164  24.66865837 31.6657934  25.0441509
 38.66294203 30.1151615  24.02734721 19.33964692 23.52776624 19.10077709
 13.87791364 50.02051734 22.13321018 15.45728055 21.69579512 23.65150691
 16.16314868 20.26343255]
>>> import lightgbm as lgb
>>> print(lgb.__version__)
3.3.2.99
jameslamb commented 2 years ago

Ah yeah, you're right; I just tried again and got the same result you did. I crossed out the suggestion about bagging_seed in https://github.com/microsoft/LightGBM/issues/4936#issuecomment-1007878334; I must have accidentally changed two things at once when I thought I was testing only that.
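
For anyone picking this up: a minimal sketch (not from the thread, and the seed values are arbitrary) to check that the discrepancy is not tied to any particular seed, looping a few `bagging_seed` values over the attached df.csv and comparing the two fits:

```python
import lightgbm as lgb
import numpy as np
import pandas as pd

df = pd.read_csv("df.csv")  # the file attached to this issue
y = df['y']
X = df.drop('y', axis=1)

base = {'min_child_samples': 1, 'subsample': 0.7, 'subsample_freq': 1,
        'random_state': 1234, 'deterministic': True}

for seed in [1, 708, 1236]:  # arbitrary seeds chosen for illustration
    params = dict(base, bagging_seed=seed)
    with_zeros = lgb.LGBMRegressor(**params, monotone_constraints=[0] * X.shape[1])
    without = lgb.LGBMRegressor(**params)
    p1 = with_zeros.fit(X, y).predict(X)
    p2 = without.fit(X, y).predict(X)
    # expected: True for every seed if all-zero constraints were truly a no-op
    print(seed, np.allclose(p1, p2))
```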