Hi there,

I'm working on a regression problem with LightGBM 2.2.3, and I encountered a weird issue:

If I add a constant to all of the features, or multiply them all by a positive constant, the predicted values also change a lot. This doesn't make sense to me given that objective = regression, boosting_type = gbdt, and no lambda_l1 or lambda_l2.

In my opinion, that operation shouldn't affect the histograms or the split points; and if we plug into the leaf-value formula sum_gradient / (sum_hessian + lambda), the leaf values shouldn't change either.

Any thoughts would be greatly appreciated!
Thanks very much for using LightGBM!
Can you please provide a reproducible example, some self-contained code that maintainers could run to understand exactly what you are doing? Without that, you are asking us to guess what your code is doing (for example, you are asking us to guess what "a lot" means), which delays resolving this issue and pulls maintainers' limited attention away from other parts of the project.
I think your expectation is correct, that scaling your features should generally not be required, but there are many caveats to that and I can't give a more specific answer without seeing your specific code.
Hi @aBeginner666, thanks for using LightGBM. LightGBM discretizes the features into histogram bins before training. Adding a constant to the feature values or multiplying them by a constant can influence that discretization, but it should not cause a big change in the predictions. Could you please provide an example that demonstrates your case? Thanks.
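As a rough illustration (this is a hypothetical sketch of quantile-style binning, not LightGBM's actual bin-finding code), an order-preserving transform moves the bin edges along with the values, so every value should land in the same bin as before:

```python
# Hypothetical sketch of quantile-style binning -- not LightGBM internals.
import numpy as np

rng = np.random.default_rng(0)
x = rng.random(1000)

# Assign each value to one of 4 quantile bins.
edges = np.quantile(x, [0.25, 0.5, 0.75])
bins_original = np.digitize(x, edges)

# Scale and shift the feature; the quantile edges move with the values,
# so every point falls into the same bin as before.
x2 = 10.0 * x + 3.0
edges2 = np.quantile(x2, [0.25, 0.5, 0.75])
bins_scaled = np.digitize(x2, edges2)

print(np.array_equal(bins_original, bins_scaled))  # True
```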
Hi James/Shiyu,
Thanks a lot for the info. Please see the toy code below; the real application has features with much weirder distributions.

It seems that if I get rid of the rounding, the issue disappears; but if I keep the rounding and scale the features down a bit, then even though the MSE doesn't change much, Y_pred changes considerably in percentage terms. I would expect rounding to cause some slight change, e.g., 1% or 5%, but it doesn't make sense to me that Y_pred could change by something like 50%. Thanks!
```python
import numpy as np
import lightgbm as lgb

np.random.seed(1234)

# Pure-noise setup: the target is independent of the features.
Y = (np.random.rand(1000000) - 0.5) / 1000.
X = np.random.rand(1000000, 100)
X_train = X[:800000]
Y_train = Y[:800000]
X_test = X[800000:]
Y_test = Y[800000:]

params = {
    'task': 'train',
    'boosting_type': 'gbdt',
    'objective': 'regression',
    'num_threads': 10,
    'verbose': -1
}

# Case 1: features rounded to 7 decimal places.
lgb_train = lgb.Dataset(np.round(X_train, 7), Y_train)
gbm = lgb.train(params, lgb_train)
Y_pred = gbm.predict(np.round(X_test, 7))
print("MSE: ", np.mean(np.power(Y_pred - Y_test, 2)))

# Case 2: features scaled down by 10, then rounded to 7 decimal places.
lgb_train2 = lgb.Dataset(np.round(X_train / 10., 7), Y_train)
gbm2 = lgb.train(params, lgb_train2)
Y_pred2 = gbm2.predict(np.round(X_test / 10., 7))
print("MSE: ", np.mean(np.power(Y_pred2 - Y_test, 2)))

print("Diff pct: ", np.round((Y_pred - Y_pred2) / Y_pred2 * 100, 2))
```
Thanks very much, taking a look at this right now.
By the way, I've edited your comment to use code formatting from GitHub, so it is easier to read. In case you are not familiar with how to do this on GitHub, please see https://docs.github.com/en/github/writing-on-github/getting-started-with-writing-and-formatting-on-github/basic-writing-and-formatting-syntax.
Ok @aBeginner666, I believe I know what's happening here!

Running a modified version of the code you provided, with the latest version of lightgbm installed from master (as of the most recent commit, https://github.com/microsoft/LightGBM/commit/ec1debcee84e6b02f749282d9a501021370e74a9), I was able to reproduce the behavior you observed, where the predictions differ by a larger amount in percentage terms.
I think that I see the issue. Your use of rounding in the example code is what is causing this difference. When I remove the rounding, the predictions are identical regardless of whether the features are multiplied or divided by a constant.

Said another way, you are losing some information by rounding, and that makes the results not quite comparable to each other: the training data in the two cases are genuinely different, not just scaled versions of the same data.
`round(x / 10, 7)` is not the same as `round(x, 7) / 10`.

```python
x = 5.123456789

round(x / 10, 7)
# 0.5123457

round(x, 7) / 10
# 0.5123456799999999
```
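Note that the difference is not just floating-point display noise: rounding after dividing by 10 keeps one less significant digit than dividing the already-rounded value (0.5123457 vs 0.51234568), so the two results genuinely differ in the 8th decimal place. If you want to inspect the exact doubles involved, Decimal(float) prints a float's exact stored value:

```python
from decimal import Decimal

x = 5.123456789

# Print the exact stored value of each double.
print(Decimal(round(x / 10, 7)))
print(Decimal(round(x, 7) / 10))
```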
These differences matter for totally random data.

To confirm that multiplying all features by a constant does not change the predictions, you could also try the sample code below using scikit-learn's make_regression(), which generates features that are correlated with the target.
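Something along these lines (a minimal sketch; the parameter values are illustrative, not prescriptive):

```python
# Minimal sketch: with no rounding, scaling the features by a constant
# leaves the predictions unchanged. Parameter values are illustrative.
import numpy as np
import lightgbm as lgb
from sklearn.datasets import make_regression

# Features that actually carry signal about the target.
X, y = make_regression(n_samples=10000, n_features=20, noise=0.1, random_state=708)

params = {'objective': 'regression', 'verbose': -1}

gbm_a = lgb.train(params, lgb.Dataset(X, y))
gbm_b = lgb.train(params, lgb.Dataset(X * 10.0, y))  # same data, scaled up by 10

pred_a = gbm_a.predict(X)
pred_b = gbm_b.predict(X * 10.0)

# The histogram bin boundaries scale along with the features,
# so the trees (and therefore the predictions) come out the same.
print(np.allclose(pred_a, pred_b))
```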
Wow, thanks for such a well-written and informative reply! Definitely appreciate it!

After some thought, I totally agree with your analysis above: rounding does lose some information, particularly for randomly generated data. What surprised me is that it can cause such a significant difference in y. Here's my logic:

1. Though rounding loses some information/precision in the features, it seldom changes the ordering of the feature values (since every entry is truncated).
2. So the sorting and the split points shouldn't change much either.
3. So the gradients and hessians shouldn't change much, and thus neither should the leaf values.
4. So the predictions should be close.

I believe there are some flaws in my logic, as well as some LightGBM behavior that I'm not aware of yet.

Any idea what causes this sensitivity, or a good way to handle it? I think the sensitivity is problem-dependent, as I was working on a very noisy project. I tried reducing the learning rate and increasing num_boost_round, which seemed to help a bit but not much (see the sketch below). Kindly let me know if there are other behaviors you think are related (e.g., discretization, sampling, etc.) and I will try them. Otherwise I think that's the best I can do for now. ;) Thanks again!
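For concreteness, a sketch of those tweaks (the values are illustrative guesses, not tuned settings; min_data_in_leaf is an extra knob in the same spirit, not something mentioned above):

```python
# Illustrative values only -- assumes X_train, Y_train and the imports
# from the earlier example. A lower learning rate with more boosting
# rounds averages over more, smaller steps.
params = {
    'objective': 'regression',
    'learning_rate': 0.02,     # lowered from the default of 0.1
    'min_data_in_leaf': 200,   # extra knob: larger leaves resist noise-driven splits
    'verbose': -1,
}
gbm = lgb.train(params, lgb.Dataset(X_train, Y_train), num_boost_round=1000)
```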
I think the unexpected behavior is largely due to there being no useful signal in the features in the example above. In other words, the features contain only noise. In that case, every candidate split has a similar gain, and even a small disturbance can cause significant changes in the trained model.
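To see why such near-ties are fragile, recall the standard second-order split gain used by GBDT implementations (up to a constant factor; $G$ and $H$ are the sums of gradients and hessians on each side of the split, and $\lambda$ is the L2 term, zero here):

$$\text{gain} = \frac{G_L^2}{H_L + \lambda} + \frac{G_R^2}{H_R + \lambda} - \frac{(G_L + G_R)^2}{H_L + H_R + \lambda}$$

With pure-noise features, every candidate split scores nearly the same, so a perturbation as small as a change in rounding can promote a different split to the top, after which the trees, and hence the predictions, diverge.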
In the following example, where the label is changed to the sum of all features, the difference between rounding to 5 and to 7 decimal places is small.
```python
import numpy as np
import lightgbm as lgb

num_data = 1000000
num_train_data = num_data // 5 * 4
num_feature = 10

np.random.seed(1234)
X = np.random.rand(num_data, num_feature)
X_train = X[:num_train_data]
# Signal is present this time: the label is the sum of all features.
Y_train = np.sum(X_train, axis=1)
X_test = X[num_train_data:]
Y_test = np.sum(X_test, axis=1)

params = {
    'task': 'train',
    'boosting_type': 'gbdt',
    'objective': 'regression',
    'num_threads': 10,
    'verbose': -1
}

# Compare rounding to 7 decimal places against rounding to 5.
X_train_7 = np.round(X_train, 7)
X_train_5 = np.round(X_train, 5)
X_test_7 = np.round(X_test, 7)
X_test_5 = np.round(X_test, 5)

lgb_train = lgb.Dataset(X_train_7, Y_train)
gbm = lgb.train(params, lgb_train)
Y_pred = gbm.predict(X_test_7)
print("MSE: ", np.mean(np.power(Y_pred - Y_test, 2)))

lgb_train2 = lgb.Dataset(X_train_5, Y_train)
gbm2 = lgb.train(params, lgb_train2)
Y_pred2 = gbm2.predict(X_test_5)
print("MSE: ", np.mean(np.power(Y_pred2 - Y_test, 2)))

diff_pct = np.round((Y_pred - Y_pred2) / Y_pred2 * 100, 2)
print("Diff pct: ", diff_pct)
print("Diff pct abs max: ", np.max(np.abs(diff_pct)))
print("Diff pct abs mean: ", np.mean(np.abs(diff_pct)))
print("Diff pct abs median: ", np.median(np.abs(diff_pct)))
```
which gives the following output:

```
MSE: 0.00818904045355593
MSE: 0.008461063430555667
Diff pct: [ 1.8 -0.95 -3.08 ... -1.34 -0.51 1.41]
Diff pct abs max: 12.78
Diff pct abs mean: 1.7339807499999997
Diff pct abs median: 1.43
```
This issue has been automatically closed because it has been awaiting a response for too long. When you have time to work with the maintainers to resolve this issue, please post a new comment and it will be re-opened. If the issue has been locked for editing by the time you return to it, please open a new issue and reference this one. Thank you for taking the time to improve LightGBM!
This issue has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this.