pseudotensor opened this issue 4 years ago
I think the latest master branch will not produce this error anymore, as `cnt` was removed from the histogram.
But this is still a potential bug in the GPU learner. ping @huanzhang12
On master:
[LightGBM] [Fatal] Check failed: best_split_info.right_count > 0 at /root/repo/LightGBM/src/treelearner/serial_tree_learner.cpp, line 706 .
Traceback (most recent call last):
File "lgbm_histbug.py", line 8, in <module>
model.fit(X, y, sample_weight=sample_weight, init_score=init_score, eval_set=eval_set, eval_names=valid_X_features, eval_sample_weight=eval_sample_weight, eval_init_score=init_score, eval_metric=eval_metric, early_stopping_rounds=early_stopping_rounds, feature_name=X_features, verbose=verbose_fit)
File "/home/sh1ng/dev/.venv/lib/python3.6/site-packages/lightgbm_gpu/sklearn.py", line 829, in fit
callbacks=callbacks, init_model=init_model)
File "/home/sh1ng/dev/.venv/lib/python3.6/site-packages/lightgbm_gpu/sklearn.py", line 614, in fit
callbacks=callbacks, init_model=init_model)
File "/home/sh1ng/dev/.venv/lib/python3.6/site-packages/lightgbm_gpu/engine.py", line 250, in train
booster.update(fobj=fobj)
File "/home/sh1ng/dev/.venv/lib/python3.6/site-packages/lightgbm_gpu/basic.py", line 2145, in update
ctypes.byref(is_finished)))
File "/home/sh1ng/dev/.venv/lib/python3.6/site-packages/lightgbm_gpu/basic.py", line 46, in _safe_call
raise LightGBMError(decode_string(_LIB.LGBM_GetLastError()))
lightgbm.basic.LightGBMError: Check failed: best_split_info.right_count > 0 at /root/repo/LightGBM/src/treelearner/serial_tree_learner.cpp, line 706 .
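For orientation, here is a minimal sketch of the kind of GPU fit that trips this check. The data below is synthetic (the reporter's actual script is lgbm_histbug.py with a pickled dataset), so it may or may not reproduce the failure on a given build:

```python
# Hedged sketch with synthetic stand-in data, not the reporter's script;
# the check failure is data-dependent, so reproduction is not guaranteed.
import numpy as np
import lightgbm as lgb

rng = np.random.RandomState(123)
X = rng.normal(size=(100000, 150))
y = rng.randint(0, 5, size=100000)

model = lgb.LGBMClassifier(device="gpu", n_estimators=200)
# On affected builds this raises:
#   LightGBMError: Check failed: best_split_info.right_count > 0 ...
model.fit(X, y)
```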
It is still a GPU bug. ping @huanzhang12
@guFalcon @huanzhang12 FYI, we are tracking a major accuracy issue with the latest LightGBM compared to earlier versions. This is just a heads-up; perhaps it's related to this issue. We'll post a separate issue once we have a moment to generate an MRE.
Thanks @pseudotensor, does the accuracy issue reproduce on CPU?
BTW, maybe this is related: https://github.com/microsoft/LightGBM/pull/2811
Yes, https://github.com/microsoft/LightGBM/issues/2813 is a CPU run. The same setup on GPU hits this GPU histogram bug, so it can't be run.
But I think the GPU histogram bug occurs more generally than the accuracy issue #2813.
I think this may be fixed by #2811 too.
So in the latest master branch, the CPU version is okay, while the GPU version fails?
@guolinke correct
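A quick way to confirm this CPU-versus-GPU divergence is to hold data and parameters constant and flip only the device. A hedged sketch with synthetic placeholder data:

```python
# Hedged sketch: identical data and params, only device_type flipped,
# to confirm the failure is specific to the GPU histogram path.
import numpy as np
import lightgbm as lgb

X = np.random.RandomState(1).normal(size=(5000, 20))
y = np.random.RandomState(2).randint(0, 5, size=5000)

for device in ("cpu", "gpu"):
    params = {"objective": "multiclass", "num_class": 5, "device_type": device}
    lgb.train(params, lgb.Dataset(X, label=y), num_boost_round=100)
```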
Stack trace of the error:
/home/sh1ng/dev/.venv/lib/python3.6/site-packages/lightgbm_gpu/basic.py:893: UserWarning: categorical_feature keyword has been found in `params` and will be ignored.
Please use categorical_feature argument of the Dataset constructor to pass this parameter.
.format(key))
[LightGBM] [Info] This is the GPU trainer!!
[LightGBM] [Info] Total Bins 22008
[LightGBM] [Info] Number of data points in the train set: 1348045, number of used features: 150
[LightGBM] [Info] Using GPU Device: GeForce MX150, Vendor: NVIDIA Corporation
[LightGBM] [Info] Compiling OpenCL Kernel with 256 bins...
[LightGBM] [Info] GPU programs have been built
[LightGBM] [Info] Size of histogram bin entry: 8
[LightGBM] [Info] 138 dense feature groups (179.98 MB) transferred to GPU in 0.273129 secs. 1 sparse feature groups
[LightGBM] [Info] Start training from score -11.811581
[LightGBM] [Info] Start training from score -7.921803
[LightGBM] [Info] Start training from score -0.432866
[LightGBM] [Info] Start training from score -1.142893
[LightGBM] [Info] Start training from score -3.439298
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Fatal] Check failed: best_split_info.left_count > 0 at /root/repo/LightGBM/src/treelearner/serial_tree_learner.cpp, line 702 .
Traceback (most recent call last):
File "lgb_accuracyissue.py", line 14, in <module>
eval_init_score=init_score, eval_metric=eval_metric, early_stopping_rounds=early_stopping_rounds, feature_name=X_features, verbose=verbose_fit)
File "/home/sh1ng/dev/.venv/lib/python3.6/site-packages/lightgbm_gpu/sklearn.py", line 829, in fit
callbacks=callbacks, init_model=init_model)
File "/home/sh1ng/dev/.venv/lib/python3.6/site-packages/lightgbm_gpu/sklearn.py", line 614, in fit
callbacks=callbacks, init_model=init_model)
File "/home/sh1ng/dev/.venv/lib/python3.6/site-packages/lightgbm_gpu/engine.py", line 250, in train
booster.update(fobj=fobj)
File "/home/sh1ng/dev/.venv/lib/python3.6/site-packages/lightgbm_gpu/basic.py", line 2145, in update
ctypes.byref(is_finished)))
File "/home/sh1ng/dev/.venv/lib/python3.6/site-packages/lightgbm_gpu/basic.py", line 46, in _safe_call
raise LightGBMError(decode_string(_LIB.LGBM_GetLastError()))
lightgbm.basic.LightGBMError: Check failed: best_split_info.left_count > 0 at /root/repo/LightGBM/src/treelearner/serial_tree_learner.cpp, line 702 .
Just letting you know that I'm unable to reproduce the issue with the dataset originally provided, but it's easily reproducible with data from https://github.com/microsoft/LightGBM/issues/2813
@guolinke I'm trying to track down an issue where, after upgrading mmlspark to the latest master branch, I am seeing a similar error. Any recommendations for code/commits I should look into to investigate the root cause?
[LightGBM] [Warning] Set TCP_NODELAY failed
[LightGBM] [Info] Trying to bind port 12422...
[LightGBM] [Info] Binding port 12422 succeeded
[LightGBM] [Warning] Set TCP_NODELAY failed
[LightGBM] [Warning] Connecting to rank 1 failed, waiting for 200 milliseconds
[LightGBM] [Warning] Set TCP_NODELAY failed
[LightGBM] [Info] Trying to bind port 12426...
[LightGBM] [Info] Binding port 12426 succeeded
[LightGBM] [Info] Listening...
[LightGBM] [Info] Listening...
[LightGBM] [Warning] Set TCP_NODELAY failed
[LightGBM] [Warning] Set TCP_NODELAY failed
[LightGBM] [Info] Connected to rank 1
[LightGBM] [Warning] Set TCP_NODELAY failed
[LightGBM] [Info] Local rank: 0, total number of machines: 2
[LightGBM] [Info] Connected to rank 0
[LightGBM] [Info] Local rank: 1, total number of machines: 2
[LightGBM] [Warning] metric is set=, metric= will be ignored. Current value: metric=
[LightGBM] [Warning] metric is set=, metric= will be ignored. Current value: metric=
[LightGBM] [Warning] Starting from the 2.1.2 version, default value for the "boost_from_average" parameter in "binary" objective is true.
This may cause significantly different results comparing to the previous versions of LightGBM.
Try to set boost_from_average=false, if your old models produce bad results
[LightGBM] [Warning] Starting from the 2.1.2 version, default value for the "boost_from_average" parameter in "binary" objective is true.
This may cause significantly different results comparing to the previous versions of LightGBM.
Try to set boost_from_average=false, if your old models produce bad results
[LightGBM] [Info] Number of positive: 610, number of negative: 762
[LightGBM] [Info] Number of positive: 610, number of negative: 762
[LightGBM] [Warning] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000514 seconds.
You can set force_col_wise=true to remove the overhead.
[LightGBM] [Info] Total Bins 916
[LightGBM] [Warning] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000664 seconds.
You can set force_col_wise=true to remove the overhead.
[LightGBM] [Info] Total Bins 916
[LightGBM] [Info] Number of data points in the train set: 686, number of used features: 4
[LightGBM] [Info] Number of data points in the train set: 686, number of used features: 4
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.438776 -> initscore=-0.246133
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.450437 -> initscore=-0.198904
[LightGBM] [Info] Start training from score -0.222518
[LightGBM] [Info] Start training from score -0.222518
[LightGBM] [Info] Finished linking network in 0.003935 seconds
[LightGBM] [Fatal] Check failed: best_split_info.left_count > 0 at /home/ilya/LightGBM/src/treelearner/serial_tree_learner.cpp, line 709 .
20/02/29 00:35:01 WARN LightGBMClassifier: LightGBM reached early termination on one worker, stopping training on worker. This message should rarely occur
Could you run it with only one node?
@guolinke amazing insight! I tried 1 node instead of 2 and almost all of my tests passed (except one test that depends on the number of nodes, which is expected).
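In plain LightGBM parameter terms, the single-node configuration that passed corresponds roughly to the following (a hedged sketch; the mmlspark tests set this up through their own API):

```python
# Hedged sketch: disabling distributed training at the parameter level.
params = {
    "objective": "binary",
    "tree_learner": "serial",  # instead of the "data"/"feature"/"voting" parallel learners
    "num_machines": 1,
}
```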
Here is the output from the same test as above (except it was successful):
[LightGBM] [Warning] metric is set=, metric= will be ignored. Current value: metric=
[LightGBM] [Warning] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000942 seconds.
You can set force_col_wise=true to remove the overhead.
[LightGBM] [Info] Total Bins 327
[LightGBM] [Info] Number of data points in the train set: 106, number of used features: 9
[LightGBM] [Info] Start training from score -1.572397
[LightGBM] [Info] Start training from score -1.618917
[LightGBM] [Info] Start training from score -2.024382
[LightGBM] [Info] Start training from score -1.955389
[LightGBM] [Info] Start training from score -1.890850
[LightGBM] [Info] Start training from score -1.773067
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
(last message repeated 40 times in total)
[LightGBM] [Warning] metric is set=, metric= will be ignored. Current value: metric=
[LightGBM] [Warning] Auto-choosing col-wise multi-threading, the overhead of testing was 0.002017 seconds.
You can set force_col_wise=true to remove the overhead.
[LightGBM] [Info] Total Bins 327
[LightGBM] [Info] Number of data points in the train set: 106, number of used features: 9
[LightGBM] [Info] Start training from score -1.572397
[LightGBM] [Info] Start training from score -1.618917
[LightGBM] [Info] Start training from score -2.024382
[LightGBM] [Info] Start training from score -1.955389
[LightGBM] [Info] Start training from score -1.890850
[LightGBM] [Info] Start training from score -1.773067
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
(last message repeated 60 times in total)
[LightGBM] [Warning] metric is set=, metric= will be ignored. Current value: metric=
[LightGBM] [Warning] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000835 seconds.
You can set force_col_wise=true to remove the overhead.
[LightGBM] [Info] Total Bins 327
[LightGBM] [Info] Number of data points in the train set: 106, number of used features: 9
[LightGBM] [Info] Start training from score -1.572397
[LightGBM] [Info] Start training from score -1.618917
[LightGBM] [Info] Start training from score -2.024382
[LightGBM] [Info] Start training from score -1.955389
[LightGBM] [Info] Start training from score -1.890850
[LightGBM] [Info] Start training from score -1.773067
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
(last message repeated 38 times in total)
[LightGBM] [Warning] metric is set=, metric= will be ignored. Current value: metric=
[LightGBM] [Warning] Auto-choosing col-wise multi-threading, the overhead of testing was 0.001298 seconds.
You can set force_col_wise=true to remove the overhead.
[LightGBM] [Info] Total Bins 327
[LightGBM] [Info] Number of data points in the train set: 106, number of used features: 9
[LightGBM] [Info] Using GOSS
[LightGBM] [Info] Start training from score -1.572397
[LightGBM] [Info] Start training from score -1.618917
[LightGBM] [Info] Start training from score -2.024382
[LightGBM] [Info] Start training from score -1.955389
[LightGBM] [Info] Start training from score -1.890850
[LightGBM] [Info] Start training from score -1.773067
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
(last message repeated 40 times in total)
Note this is from the commit on 2/21 (both failing and successful runs): "Better documentation for Contributing (#2781)". I'm currently working back through older versions/commits of LightGBM to see which commit is causing the tests to fail, but it is a slow process to build and update the jar and rerun the tests. I'm skipping small batches of commits at a time, but I might do a binary search to make this optimal, since it looks like the issue goes back before 2/21; a driver sketch for that follows.
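For the binary search, `git bisect run` with a small driver script is the usual shortcut. A hedged sketch; the build and test commands are placeholders standing in for the actual mmlspark jar rebuild and Scala test invocation:

```python
#!/usr/bin/env python
# Hypothetical `git bisect run` driver; build_mmlspark_jar.sh and
# run_failing_test.sh are placeholder names, not real scripts.
import subprocess
import sys

def run(cmd):
    return subprocess.call(cmd, shell=True)

# Exit code 125 tells `git bisect run` to skip a commit that doesn't build.
if run("./build_mmlspark_jar.sh") != 0:
    sys.exit(125)
# Exit 0 marks the commit good; exit 1 marks it bad (the test failed).
sys.exit(0 if run("./run_failing_test.sh") == 0 else 1)
```

Invoked as `git bisect start <bad-commit> <good-commit>` followed by `git bisect run python bisect_driver.py` (the script name is arbitrary).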
@imatiach-msft you can try the commit (509c2e50c25eded99fc0997afe25ebee1b33285d) and its parent (https://github.com/microsoft/LightGBM/commit/bc7bc4a158d47bd9a12b89de21176e1e67a6e961)
@guolinke you're right, it looks like the issue is with commit 509c2e5. I validated that including that commit causes the error and that removing it fixes the issue.
@imatiach-msft could you share the data (and config) with me for debugging?
@guolinke I'm running the mmlspark Scala tests; maybe I can create an example that you can easily run?
You can find the LightGBM classifier tests here:
https://github.com/Azure/mmlspark/blob/master/src/test/scala/com/microsoft/ml/spark/lightgbm/split1/VerifyLightGBMClassifier.scala
The first test that failed was below, but I tried several others and they failed as well: https://github.com/Azure/mmlspark/blob/master/src/test/scala/com/microsoft/ml/spark/lightgbm/split1/VerifyLightGBMClassifier.scala#L169
The compressed file with most datasets used in mmlspark can be found here: https://mmlspark.blob.core.windows.net/installers/datasets-2020-01-20.tgz
@shiyu1994 can you help investigate this too? You can start from @imatiach-msft's test.
Still happens in version 3.0:
lightgbm.basic.LightGBMError: Check failed: (best_split_info.left_count) > (0) at /root/repo/LightGBM/src/treelearner/serial_tree_learner.cpp, line 630
https://github.com/h2oai/h2o4gpu/blob/master/tests/python/open_data/gbm/test_lightgbm.py#L265-L284
> @shiyu1994 can you help investigate this too? You can start from @imatiach-msft's test.
Ok.
@shiyu1994 @guolinke FYI, my issue was resolved when I upgraded after my fix https://github.com/microsoft/LightGBM/pull/3110, but it sounds like others are still encountering issues similar to what I had.
I have this issue with the CPU learner, not GPU. I got it after upgrading from 2.3.1 to 3.0.0; it makes every test with a tiny testing dataset fail for exactly the same reason:
lightgbm.basic.LightGBMError: Check failed: (best_split_info.left_count) > (0) at /__w/1/s/python-package/compile/src/treelearner/serial_tree_learner.cpp, line 630 .
@diditforlulz273 could you try the latest master branch? If the problem still exists, please create a new issue; it would be better if you can provide a reproducible example.
@guolinke I have just built it from the latest master branch; it still fails. I'll try to isolate a minimal reproducible example and create an issue then.
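A hedged skeleton for such a minimal reproducible example, in the spirit of the "tiny testing dataset" described above (the synthetic data and parameter values are assumptions, not the reporter's setup):

```python
# Hypothetical MRE skeleton: tiny synthetic dataset with a fixed seed;
# the exact shapes/values that trigger the check will differ per report.
import numpy as np
import lightgbm as lgb

rng = np.random.RandomState(0)
X = rng.normal(size=(30, 4))
y = rng.randint(0, 2, size=30)

lgb.train({"objective": "binary", "verbosity": -1}, lgb.Dataset(X, label=y))
```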
+1, this bug makes LightGBM GPU useless. It still happens to me on the latest master.
Hi, I'm using the GPU setting and have the same issue. I tried deterministic = True, but it did not solve the problem. I saw that LightGBM v3.2.0 may fix this defect. I have a few questions; I apologize if they are a bit out of scope. Best regards.
It's unfortunate that a known issue of this severity has been left open for over 1.5 years. The error affects every other attempt to train on GPUs when using the latest 'stable' bits in the Business Division (Dynamics). I can help with a business case from inside Microsoft to push this if necessary. My alias: dakowalc. Thanks.
Thank you @nightflight-dk. Actually, we have rewritten the LightGBM GPU version, and the previous OpenCL and CUDA versions will be deprecated; refer to PR https://github.com/microsoft/LightGBM/pull/4528
Great to hear the GPU acceleration is under further development, @guolinke. I have just tested the code from PR #4528; unfortunately it's affected by the same bug, triggering the same assert error in the serial tree learner (even in data-parallel execution, device=cuda / device=gpu). Please suggest a workaround or an older version that is not affected (if any). Thanks.
cc @shiyu1994 for the above bug.
I will double-check that. But the new CUDA tree learners reuse no training logic from the old serial tree learner or the old CUDA tree learner: only the initialization code in serial_tree_learner.cpp is executed when a new CUDA tree learner is used, and that never touches the check which raises the error in this issue (the errors here come from the old CUDA tree learner and from the training part of the serial tree learner). So I think it is unlikely that the new CUDA version would hit the same bug.
@nightflight-dk Thanks for the testing. It would be really appreciated if you could provide the error log of the new CUDA version. :)
In addition, no distributed training is supported with the new CUDA versions in the PRs so far, so if distributed training is enabled it will fall back to the old CUDA version.
@shiyu1994 @guolinke after disabling distribution (tree_learner: serial), the latest bits from PR #4528 finish training without issues. Moreover, GPU utilization appears dramatically improved (mean up to ca. 50%, from 2%). Well done. Is there an ETA for PR #4528 becoming part of master? It would help our planning. Also, if you plan data-parallel GPU or multi-GPU support, please point out the items for us to track. Happy to help with testing. Please keep up the good work. Thanks a lot. - dakowalc, Business 360 AI team
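For anyone replicating this, the passing configuration amounts to roughly the following (a hedged sketch; as far as I can tell, the #4528 branch selects the new learner via the device parameter):

```python
# Hedged sketch of the configuration reported to finish training above:
# the experimental CUDA learner with distributed training disabled.
params = {
    "device_type": "cuda",     # new CUDA tree learner ("gpu" is the old OpenCL path)
    "tree_learner": "serial",  # distributed training is not yet supported there
}
```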
@nightflight-dk Thanks for giving it a try. Since #4528 is a very large PR, we plan to decompose it into several parts and merge them one by one. We expect to finish the merge process by the end of this month. Multi-GPU and distributed training will be added after #4528 is merged. I will point that out once PRs are open for it.
Since there hasn't been any activity for a year, I would like to bring this topic up again.
I'm on version 3.3.3, Python, training on GPU on Windows.
The issue has been bugging me for the past 2 days. The dataset is 500k rows with 1500 features. There seems to be some correlation with the min_gain_to_split parameter: when the value is 1 I have not yet seen any errors, but at the default value of 0 it seems to crash quite often. Take this with caution, since I have not run enough tests yet (see the sketch at the end of this comment).
It crashed with:
{'learning_rate': 0.43467624523546383, 'max_depth': 8, 'num_leaves': 201, 'feature_fraction': 0.9, 'bagging_fraction': 0.7000000000000001, 'bagging_freq': 8}
{'learning_rate': 0.021403440298427053, 'max_depth': 2, 'num_leaves': 176, 'lambda_l1': 3.8066251775052895, 'lambda_l2': 1.08526150100961e-08, 'feature_fraction': 0.6, 'bagging_fraction': 0.9, 'bagging_freq': 6}
{'learning_rate': 0.3493368922746614, 'max_depth': 6, 'num_leaves': 109, 'lambda_l1': 4.506588272812341e-05, 'lambda_l2': 2.5452579091348995e-07, 'feature_fraction': 0.7000000000000001, 'bagging_fraction': 1.0, 'bagging_freq': 6, 'min_gain_to_split': 0}
{'learning_rate': 0.17840010040986135, 'max_depth': 12, 'num_leaves': 251, 'lambda_l1': 0.004509589012189404, 'lambda_l2': 3.882151732343819e-08, 'feature_fraction': 0.30000000000000004, 'bagging_fraction': 1.0, 'bagging_freq': 8, 'min_gain_to_split': 0}
The code is:
params = {
    'device_type': "gpu",
    'objective': 'multiclass',
    'metric': 'multi_logloss',
    "boosting_type": "gbdt",
    "num_class": 3,
    'random_state': 123,
    'verbosity': -1,  # hides "No further splits with positive gain, best gain: -inf" warnings
    "learning_rate": trial.suggest_float("learning_rate", 0.01, 0.9, log=True),  # default 0.1
    'max_depth': trial.suggest_int('max_depth', 2, 12),
    'num_leaves': trial.suggest_int('num_leaves', 2, 256),  # default 31
    'lambda_l1': trial.suggest_float('lambda_l1', 1e-8, 10.0, log=True),  # default 0
    'lambda_l2': trial.suggest_float('lambda_l2', 1e-8, 10.0, log=True),  # default 0
    'feature_fraction': trial.suggest_float('feature_fraction', 0.1, 1.0, step=0.1),  # default 1
    'bagging_fraction': trial.suggest_float('bagging_fraction', 0.1, 1.0, step=0.1),  # default 1
    'bagging_freq': trial.suggest_int('bagging_freq', 0, 10),  # default 0
    'min_gain_to_split': trial.suggest_int('min_gain_to_split', 0, 5),
}
with a few changes here and there
The exception is:
[LightGBM] [Fatal] Check failed: (best_split_info.left_count) > (0) at D:\a\1\s\python-package\compile\src\treelearner\serial_tree_learner.cpp, line 653 .
[W 2022-11-07 09:49:32,774] Trial 49 failed because of the following error: LightGBMError('Check failed: (best_split_info.left_count) > (0) at D:\\a\\1\\s\\python-package\\compile\\src\\treelearner\\serial_tree_learner.cpp, line 653 .\n')
Traceback (most recent call last):
File "D:\dev\Py_Global_vEnv_2022\lib\site-packages\optuna\study\_optimize.py", line 196, in _run_trial
value_or_values = func(trial)
File "D:\dev\Pycharm2022\LearningCNN\test9_realData4_optuna.py", line 174, in objective
model = lgb.train(
File "D:\dev\Py_Global_vEnv_2022\lib\site-packages\lightgbm\engine.py", line 292, in train
booster.update(fobj=fobj)
File "D:\dev\Py_Global_vEnv_2022\lib\site-packages\lightgbm\basic.py", line 3021, in update
_safe_call(_LIB.LGBM_BoosterUpdateOneIter(
File "D:\dev\Py_Global_vEnv_2022\lib\site-packages\lightgbm\basic.py", line 125, in _safe_call
raise LightGBMError(_LIB.LGBM_GetLastError().decode('utf-8'))
lightgbm.basic.LightGBMError: Check failed: (best_split_info.left_count) > (0) at D:\a\1\s\python-package\compile\src\treelearner\serial_tree_learner.cpp, line 653 .
Traceback (most recent call last):
File "D:\dev\Pycharm2022\LearningCNN\test9_realData4_optuna.py", line 237, in <module>
study.optimize(objective, n_trials=_NUMBER_OF_TRIALS)
File "D:\dev\Py_Global_vEnv_2022\lib\site-packages\optuna\study\study.py", line 419, in optimize
_optimize(
File "D:\dev\Py_Global_vEnv_2022\lib\site-packages\optuna\study\_optimize.py", line 66, in _optimize
_optimize_sequential(
File "D:\dev\Py_Global_vEnv_2022\lib\site-packages\optuna\study\_optimize.py", line 160, in _optimize_sequential
frozen_trial = _run_trial(study, func, catch)
File "D:\dev\Py_Global_vEnv_2022\lib\site-packages\optuna\study\_optimize.py", line 234, in _run_trial
raise func_err
File "D:\dev\Py_Global_vEnv_2022\lib\site-packages\optuna\study\_optimize.py", line 196, in _run_trial
value_or_values = func(trial)
File "D:\dev\Pycharm2022\LearningCNN\test9_realData4_optuna.py", line 174, in objective
model = lgb.train(
File "D:\dev\Py_Global_vEnv_2022\lib\site-packages\lightgbm\engine.py", line 292, in train
booster.update(fobj=fobj)
File "D:\dev\Py_Global_vEnv_2022\lib\site-packages\lightgbm\basic.py", line 3021, in update
_safe_call(_LIB.LGBM_BoosterUpdateOneIter(
File "D:\dev\Py_Global_vEnv_2022\lib\site-packages\lightgbm\basic.py", line 125, in _safe_call
raise LightGBMError(_LIB.LGBM_GetLastError().decode('utf-8'))
lightgbm.basic.LightGBMError: Check failed: (best_split_info.left_count) > (0) at D:\a\1\s\python-package\compile\src\treelearner\serial_tree_learner.cpp, line 653 .
Process finished with exit code 1
I am using Optuna for optimization, so the set of parameters is always different.
I tried different split ratios (0.19/0.20/0.21); that does not seem to fix anything:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.19, random_state=42, shuffle=True)
as well as experimenting with the amount of data (600_000/600_001/200_001). Nothing seems to fix the issue. Can this fix be expected in the next major release? I see that the topic is still active.
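Tying back to the min_gain_to_split observation above, the tentative mitigation amounts to keeping that parameter strictly positive in the search space. A hedged sketch, since the reporter notes the tests are not conclusive; params and trial refer to the Optuna code earlier in this comment:

```python
# Hedged sketch: search min_gain_to_split over strictly positive values only,
# per the (unconfirmed) observation that the default of 0 correlates with crashes.
params["min_gain_to_split"] = trial.suggest_float("min_gain_to_split", 0.5, 5.0)
```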
I built the Docker image with this dockerfile.gpu, and I encounter this issue too:
LightGBMError: Check failed: (best_split_info.left_count) > (0) at /usr/local/src/lightgbm/LightGBM/src/treelearner/serial_tree_learner.cpp, line 653 .
version: 2.3.2
script and pickle file:
lgbm_histbug.zip
@sh1ng we need help checking whether this is fixed in an even later master.