Hi there,

I'm working on a regression problem with LightGBM 2.2.3, and I encountered a weird issue:

If I add a constant to all of the features, or multiply them all by a positive constant, the predicted values also change a lot. This doesn't make sense to me given that objective = regression, boosting_type = gbdt, and no lambda_l1 or lambda_l2.

In my opinion, that operation shouldn't affect the histograms or the split points; and if we plug into the leaf-value formula sum_gradient / (sum_hessian + lambda), the leaf values shouldn't change either.

Any thoughts would be greatly appreciated!
Thanks very much for using LightGBM!
Can you please provide a reproducible example, some self-contained code that maintainers could run to understand exactly what you are doing? Without that, you are asking us to guess what your code is doing (for example, you are asking us to guess what "a lot" means), which delays resolving this issue and pulls maintainers' limited attention away from other parts of the project.
I think your expectation is correct, that scaling your features should generally not be required, but there are many caveats to that and I can't give a more specific answer without seeing your specific code.
Hi @aBeginner666, thanks for using LightGBM. LightGBM discretizes the features into histogram bins before training. Adding a constant to the feature values or multiplying them by a constant can influence that discretization, but it should not cause a big change in the predictions. Could you please provide an example that demonstrates your case? Thanks.
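As a rough illustration (this is a hypothetical sketch of quantile-style binning, not LightGBM's actual bin-finding code), an order-preserving transform moves the bin edges along with the values, so every value should land in the same bin as before:

```python
# Hypothetical sketch of quantile-style binning -- not LightGBM internals.
import numpy as np

rng = np.random.default_rng(0)
x = rng.random(1000)

# Assign each value to one of 4 quantile bins.
edges = np.quantile(x, [0.25, 0.5, 0.75])
bins_original = np.digitize(x, edges)

# Scale and shift the feature; the quantile edges move with the values,
# so every point falls into the same bin as before.
x2 = 10.0 * x + 3.0
edges2 = np.quantile(x2, [0.25, 0.5, 0.75])
bins_scaled = np.digitize(x2, edges2)

print(np.array_equal(bins_original, bins_scaled))  # True
```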
Hi James/Shiyu,
Thanks a lot for the info. Please see the toy code below; the real application has features with much weirder distributions.

It seems that if I get rid of the rounding, the issue disappears; but if I keep the rounding and scale the features down a bit, then even though the MSE doesn't change much, Y_pred changes considerably in percentage terms. I would expect rounding to cause some slight change, e.g., 1% or 5%, but it doesn't make sense to me that Y_pred could change by something like 50%. Thanks!
```python
import numpy as np
import lightgbm as lgb

np.random.seed(1234)

# Pure-noise setup: the target is independent of the features.
Y = (np.random.rand(1000000) - 0.5) / 1000.
X = np.random.rand(1000000, 100)
X_train = X[:800000]
Y_train = Y[:800000]
X_test = X[800000:]
Y_test = Y[800000:]

params = {
    'task': 'train',
    'boosting_type': 'gbdt',
    'objective': 'regression',
    'num_threads': 10,
    'verbose': -1
}

# Case 1: features rounded to 7 decimal places.
lgb_train = lgb.Dataset(np.round(X_train, 7), Y_train)
gbm = lgb.train(params, lgb_train)
Y_pred = gbm.predict(np.round(X_test, 7))
print("MSE: ", np.mean(np.power(Y_pred - Y_test, 2)))

# Case 2: features scaled down by 10, then rounded to 7 decimal places.
lgb_train2 = lgb.Dataset(np.round(X_train / 10., 7), Y_train)
gbm2 = lgb.train(params, lgb_train2)
Y_pred2 = gbm2.predict(np.round(X_test / 10., 7))
print("MSE: ", np.mean(np.power(Y_pred2 - Y_test, 2)))

print("Diff pct: ", np.round((Y_pred - Y_pred2) / Y_pred2 * 100, 2))
```
Thanks very much, taking a look at this right now.
By the way, I've edited your comment to use code formatting from GitHub, so it is easier to read. In case you are not familiar with how to do this on GitHub, please see https://docs.github.com/en/github/writing-on-github/getting-started-with-writing-and-formatting-on-github/basic-writing-and-formatting-syntax.
Ok @aBeginner666, I believe I know what's happening here!

Running a modified version of the code you provided, with the latest version of lightgbm installed from master (as of the most recent commit, https://github.com/microsoft/LightGBM/commit/ec1debcee84e6b02f749282d9a501021370e74a9), I was able to reproduce the behavior you observed, where the predictions differ by a larger amount in percentage terms.
I think that I see the issue. Your use of rounding in the example code is what is causing this difference. When I remove the rounding, the predictions are identical regardless of whether the features are multiplied or divided by a constant.

Said another way, you are losing some information by rounding, and that makes the results not quite comparable to each other: the training data in the two cases are genuinely different, not just scaled versions of the same data.
`round(x / 10, 7)` is not the same as `round(x, 7) / 10`.

```python
x = 5.123456789

round(x / 10, 7)
# 0.5123457

round(x, 7) / 10
# 0.5123456799999999
```
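Note that the difference is not just floating-point display noise: rounding after dividing by 10 keeps one less significant digit than dividing the already-rounded value (0.5123457 vs 0.51234568), so the two results genuinely differ in the 8th decimal place. If you want to inspect the exact doubles involved, Decimal(float) prints a float's exact stored value:

```python
from decimal import Decimal

x = 5.123456789

# Print the exact stored value of each double.
print(Decimal(round(x / 10, 7)))
print(Decimal(round(x, 7) / 10))
```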
These differences matter for totally random data.

To confirm that multiplying all features by a constant does not change the predictions, you could also try the sample code below using scikit-learn's make_regression(), which generates features that are correlated with the target.
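Something along these lines (a minimal sketch; the parameter values are illustrative, not prescriptive):

```python
# Minimal sketch: with no rounding, scaling the features by a constant
# leaves the predictions unchanged. Parameter values are illustrative.
import numpy as np
import lightgbm as lgb
from sklearn.datasets import make_regression

# Features that actually carry signal about the target.
X, y = make_regression(n_samples=10000, n_features=20, noise=0.1, random_state=708)

params = {'objective': 'regression', 'verbose': -1}

gbm_a = lgb.train(params, lgb.Dataset(X, y))
gbm_b = lgb.train(params, lgb.Dataset(X * 10.0, y))  # same data, scaled up by 10

pred_a = gbm_a.predict(X)
pred_b = gbm_b.predict(X * 10.0)

# The histogram bin boundaries scale along with the features,
# so the trees (and therefore the predictions) come out the same.
print(np.allclose(pred_a, pred_b))
```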
Wow, thanks for such a well-written and informative reply! Definitely appreciate it!

After some thought, I totally agree with your analysis above: rounding does lose some information, particularly for randomly generated data. What surprised me is that it can cause such a significant difference in y. Here's my logic:

1. Though rounding loses some information/precision in the features, it seldom changes the ordering of the feature values (since every entry is truncated).
2. So the sorting and the split points shouldn't change much either.
3. So the gradients and hessians shouldn't change much, and thus neither should the leaf values.
4. So the predictions should be close.

I believe there are some flaws in my logic, as well as some LightGBM behavior that I'm not aware of yet.

Any idea what causes this sensitivity, or a good way to handle it? I think the sensitivity is problem-dependent, as I was working on a very noisy project. I tried reducing the learning rate and increasing num_boost_round, which seemed to help a bit but not much (see the sketch below). Kindly let me know if there are other behaviors you think are related (e.g., discretization, sampling, etc.) and I will try them. Otherwise I think that's the best I can do for now. ;) Thanks again!
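For concreteness, a sketch of those tweaks (the values are illustrative guesses, not tuned settings; min_data_in_leaf is an extra knob in the same spirit, not something mentioned above):

```python
# Illustrative values only -- assumes X_train, Y_train and the imports
# from the earlier example. A lower learning rate with more boosting
# rounds averages over more, smaller steps.
params = {
    'objective': 'regression',
    'learning_rate': 0.02,     # lowered from the default of 0.1
    'min_data_in_leaf': 200,   # extra knob: larger leaves resist noise-driven splits
    'verbose': -1,
}
gbm = lgb.train(params, lgb.Dataset(X_train, Y_train), num_boost_round=1000)
```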
I think the unexpected behavior is largely due to there being no useful signal in the features in the example above. In other words, the features contain only noise. In that case, every candidate split has a similar gain, and even a small disturbance can cause significant changes in the trained model.
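To see why such near-ties are fragile, recall the standard second-order split gain used by GBDT implementations (up to a constant factor; $G$ and $H$ are the sums of gradients and hessians on each side of the split, and $\lambda$ is the L2 term, zero here):

$$\text{gain} = \frac{G_L^2}{H_L + \lambda} + \frac{G_R^2}{H_R + \lambda} - \frac{(G_L + G_R)^2}{H_L + H_R + \lambda}$$

With pure-noise features, every candidate split scores nearly the same, so a perturbation as small as a change in rounding can promote a different split to the top, after which the trees, and hence the predictions, diverge.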
In the following example, where the label is changed to the sum of all features, the difference between rounding to 5 and to 7 decimal places is small.
```python
import numpy as np
import lightgbm as lgb

num_data = 1000000
num_train_data = num_data // 5 * 4
num_feature = 10

np.random.seed(1234)
X = np.random.rand(num_data, num_feature)
X_train = X[:num_train_data]
# Signal is present this time: the label is the sum of all features.
Y_train = np.sum(X_train, axis=1)
X_test = X[num_train_data:]
Y_test = np.sum(X_test, axis=1)

params = {
    'task': 'train',
    'boosting_type': 'gbdt',
    'objective': 'regression',
    'num_threads': 10,
    'verbose': -1
}

# Compare rounding to 7 decimal places against rounding to 5.
X_train_7 = np.round(X_train, 7)
X_train_5 = np.round(X_train, 5)
X_test_7 = np.round(X_test, 7)
X_test_5 = np.round(X_test, 5)

lgb_train = lgb.Dataset(X_train_7, Y_train)
gbm = lgb.train(params, lgb_train)
Y_pred = gbm.predict(X_test_7)
print("MSE: ", np.mean(np.power(Y_pred - Y_test, 2)))

lgb_train2 = lgb.Dataset(X_train_5, Y_train)
gbm2 = lgb.train(params, lgb_train2)
Y_pred2 = gbm2.predict(X_test_5)
print("MSE: ", np.mean(np.power(Y_pred2 - Y_test, 2)))

diff_pct = np.round((Y_pred - Y_pred2) / Y_pred2 * 100, 2)
print("Diff pct: ", diff_pct)
print("Diff pct abs max: ", np.max(np.abs(diff_pct)))
print("Diff pct abs mean: ", np.mean(np.abs(diff_pct)))
print("Diff pct abs median: ", np.median(np.abs(diff_pct)))
```
which gives the following output:

```
MSE: 0.00818904045355593
MSE: 0.008461063430555667
Diff pct: [ 1.8 -0.95 -3.08 ... -1.34 -0.51 1.41]
Diff pct abs max: 12.78
Diff pct abs mean: 1.7339807499999997
Diff pct abs median: 1.43
```
This issue has been automatically closed because it has been awaiting a response for too long. When you have time to work with the maintainers to resolve this issue, please post a new comment and it will be re-opened. If the issue has been locked for editing by the time you return to it, please open a new issue and reference this one. Thank you for taking the time to improve LightGBM!
This issue has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this.