microsoft / LightGBM

A fast, distributed, high performance gradient boosting (GBT, GBDT, GBRT, GBM or MART) framework based on decision tree algorithms, used for ranking, classification and many other machine learning tasks.
https://lightgbm.readthedocs.io/en/latest/
MIT License

lightgbm.basic.LightGBMError: Bug in GPU histogram! split 11937: 12, smaller_leaf: 10245, larger_leaf: 1704 #2793

Open pseudotensor opened 4 years ago

pseudotensor commented 4 years ago

version: 2.3.2

[LightGBM] [Fatal] Bug in GPU histogram! split 11937: 12, smaller_leaf: 10245, larger_leaf: 1704

Traceback (most recent call last):
  File "lgb_prefit_4ff5fa97-86b3-420c-aa87-5f01abcc18c3.py", line 10, in <module>
    model.fit(X, y, sample_weight=sample_weight, init_score=init_score, eval_set=eval_set, eval_names=valid_X_features, eval_sample_weight=eval_sample_weight, eval_init_score=init_score, eval_metric=eval_metric, early_stopping_rounds=early_stopping_rounds, feature_name=X_features, verbose=verbose_fit)
  File "/home/jon/.pyenv/versions/3.6.7/lib/python3.6/site-packages/lightgbm_gpu/sklearn.py", line 818, in fit
    callbacks=callbacks, init_model=init_model)
  File "/home/jon/.pyenv/versions/3.6.7/lib/python3.6/site-packages/lightgbm_gpu/sklearn.py", line 610, in fit
    callbacks=callbacks, init_model=init_model)
  File "/home/jon/.pyenv/versions/3.6.7/lib/python3.6/site-packages/lightgbm_gpu/engine.py", line 250, in train
    booster.update(fobj=fobj)
  File "/home/jon/.pyenv/versions/3.6.7/lib/python3.6/site-packages/lightgbm_gpu/basic.py", line 2106, in update
    ctypes.byref(is_finished)))
  File "/home/jon/.pyenv/versions/3.6.7/lib/python3.6/site-packages/lightgbm_gpu/basic.py", line 46, in _safe_call
    raise LightGBMError(decode_string(_LIB.LGBM_GetLastError()))
lightgbm.basic.LightGBMError: Bug in GPU histogram! split 11937: 12, smaller_leaf: 10245, larger_leaf: 1704

script and pickle file:

lgbm_histbug.zip
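
For context, a minimal sketch of the kind of GPU training call that hits this error (synthetic data and hypothetical parameters for illustration only; the actual script and data are in the zip above):

    import numpy as np
    import lightgbm as lgb

    # Synthetic stand-in for the real data in lgbm_histbug.zip (hypothetical).
    rng = np.random.RandomState(0)
    X = rng.rand(100_000, 150)
    y = rng.randint(0, 5, size=100_000)  # multiclass target, as in the logs below

    # On affected builds this raises:
    #   lightgbm.basic.LightGBMError: Bug in GPU histogram! ...
    model = lgb.LGBMClassifier(device_type="gpu", n_estimators=100)
    model.fit(X, y)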

@sh1ng I need help checking whether this is fixed in an even later master.

guolinke commented 4 years ago

I think the latest master branch will not produce this error anymore, since cnt was removed from the histogram.

But this is still a potential bug in the GPU learner. ping @huanzhang12

sh1ng commented 4 years ago

On master:

[LightGBM] [Fatal] Check failed: best_split_info.right_count > 0 at /root/repo/LightGBM/src/treelearner/serial_tree_learner.cpp, line 706 .

Traceback (most recent call last):
  File "lgbm_histbug.py", line 8, in <module>
    model.fit(X, y, sample_weight=sample_weight, init_score=init_score, eval_set=eval_set, eval_names=valid_X_features, eval_sample_weight=eval_sample_weight, eval_init_score=init_score, eval_metric=eval_metric, early_stopping_rounds=early_stopping_rounds, feature_name=X_features, verbose=verbose_fit)
  File "/home/sh1ng/dev/.venv/lib/python3.6/site-packages/lightgbm_gpu/sklearn.py", line 829, in fit
    callbacks=callbacks, init_model=init_model)
  File "/home/sh1ng/dev/.venv/lib/python3.6/site-packages/lightgbm_gpu/sklearn.py", line 614, in fit
    callbacks=callbacks, init_model=init_model)
  File "/home/sh1ng/dev/.venv/lib/python3.6/site-packages/lightgbm_gpu/engine.py", line 250, in train
    booster.update(fobj=fobj)
  File "/home/sh1ng/dev/.venv/lib/python3.6/site-packages/lightgbm_gpu/basic.py", line 2145, in update
    ctypes.byref(is_finished)))
  File "/home/sh1ng/dev/.venv/lib/python3.6/site-packages/lightgbm_gpu/basic.py", line 46, in _safe_call
    raise LightGBMError(decode_string(_LIB.LGBM_GetLastError()))
lightgbm.basic.LightGBMError: Check failed: best_split_info.right_count > 0 at /root/repo/LightGBM/src/treelearner/serial_tree_learner.cpp, line 706 .
guolinke commented 4 years ago

it is still a GPU bug. ping @huanzhang12

pseudotensor commented 4 years ago

@guFalcon @huanzhang12 FYI, we are tracking a major accuracy issue with the latest lightgbm compared to earlier versions. This is just a heads-up; perhaps it's related to this issue. We'll post a separate issue once we have a moment to generate an MRE.

guolinke commented 4 years ago

Thanks @pseudotensor. Can the accuracy issue be reproduced on CPU?
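
(A hedged sketch of such an A/B check, flipping only the device parameter between otherwise identical runs; the data here is a synthetic placeholder:)

    import numpy as np
    import lightgbm as lgb

    # X, y stand in for the real training data (synthetic here for illustration).
    rng = np.random.RandomState(123)
    X = rng.rand(10_000, 50)
    y = rng.randint(0, 5, size=10_000)

    # Train twice on identical data, changing only the device, then compare.
    for device in ("cpu", "gpu"):
        params = {
            "objective": "multiclass",
            "num_class": 5,
            "device_type": device,
            "seed": 123,
        }
        booster = lgb.train(params, lgb.Dataset(X, y), num_boost_round=100)
        print(device, booster.eval_train())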

guolinke commented 4 years ago

BTW, maybe this is related: https://github.com/microsoft/LightGBM/pull/2811

pseudotensor commented 4 years ago

https://github.com/microsoft/LightGBM/issues/2813 — yes, that is a CPU run. The same setup on GPU hits this GPU histogram bug, so it can't be run.

But I think the GPU histogram bug occurs more generally than the accuracy issue #2813.

guolinke commented 4 years ago

I think this may be fixed by #2811 too.

guolinke commented 4 years ago

So on the latest master branch, the CPU version is okay, while the GPU version fails?

sh1ng commented 4 years ago

@guolinke correct

Stack trace of the error:

/home/sh1ng/dev/.venv/lib/python3.6/site-packages/lightgbm_gpu/basic.py:893: UserWarning: categorical_feature keyword has been found in `params` and will be ignored.
Please use categorical_feature argument of the Dataset constructor to pass this parameter.
  .format(key))
[LightGBM] [Info] This is the GPU trainer!!
[LightGBM] [Info] Total Bins 22008
[LightGBM] [Info] Number of data points in the train set: 1348045, number of used features: 150
[LightGBM] [Info] Using GPU Device: GeForce MX150, Vendor: NVIDIA Corporation
[LightGBM] [Info] Compiling OpenCL Kernel with 256 bins...
[LightGBM] [Info] GPU programs have been built
[LightGBM] [Info] Size of histogram bin entry: 8
[LightGBM] [Info] 138 dense feature groups (179.98 MB) transferred to GPU in 0.273129 secs. 1 sparse feature groups
[LightGBM] [Info] Start training from score -11.811581
[LightGBM] [Info] Start training from score -7.921803
[LightGBM] [Info] Start training from score -0.432866
[LightGBM] [Info] Start training from score -1.142893
[LightGBM] [Info] Start training from score -3.439298
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Fatal] Check failed: best_split_info.left_count > 0 at /root/repo/LightGBM/src/treelearner/serial_tree_learner.cpp, line 702 .

Traceback (most recent call last):
  File "lgb_accuracyissue.py", line 14, in <module>
    eval_init_score=init_score, eval_metric=eval_metric, early_stopping_rounds=early_stopping_rounds, feature_name=X_features, verbose=verbose_fit)
  File "/home/sh1ng/dev/.venv/lib/python3.6/site-packages/lightgbm_gpu/sklearn.py", line 829, in fit
    callbacks=callbacks, init_model=init_model)
  File "/home/sh1ng/dev/.venv/lib/python3.6/site-packages/lightgbm_gpu/sklearn.py", line 614, in fit
    callbacks=callbacks, init_model=init_model)
  File "/home/sh1ng/dev/.venv/lib/python3.6/site-packages/lightgbm_gpu/engine.py", line 250, in train
    booster.update(fobj=fobj)
  File "/home/sh1ng/dev/.venv/lib/python3.6/site-packages/lightgbm_gpu/basic.py", line 2145, in update
    ctypes.byref(is_finished)))
  File "/home/sh1ng/dev/.venv/lib/python3.6/site-packages/lightgbm_gpu/basic.py", line 46, in _safe_call
    raise LightGBMError(decode_string(_LIB.LGBM_GetLastError()))
lightgbm.basic.LightGBMError: Check failed: best_split_info.left_count > 0 at /root/repo/LightGBM/src/treelearner/serial_tree_learner.cpp, line 702 .
sh1ng commented 4 years ago

Just letting you know that I'm unable to reproduce the issue with the dataset originally provided, but it's easily reproducible with the data from https://github.com/microsoft/LightGBM/issues/2813

imatiach-msft commented 4 years ago

@guolinke I'm trying to track down an issue where, after upgrading to the latest master branch in mmlspark, I am seeing a similar error. Do you have any recommendations for code/commits I should look into to find the root cause?

[LightGBM] [Warning] Set TCP_NODELAY failed
[LightGBM] [Info] Trying to bind port 12422...
[LightGBM] [Info] Binding port 12422 succeeded
[LightGBM] [Warning] Set TCP_NODELAY failed
[LightGBM] [Warning] Connecting to rank 1 failed, waiting for 200 milliseconds
[LightGBM] [Warning] Set TCP_NODELAY failed
[LightGBM] [Info] Trying to bind port 12426...
[LightGBM] [Info] Binding port 12426 succeeded
[LightGBM] [Info] Listening...
[LightGBM] [Info] Listening...
[LightGBM] [Warning] Set TCP_NODELAY failed
[LightGBM] [Warning] Set TCP_NODELAY failed
[LightGBM] [Info] Connected to rank 1
[LightGBM] [Warning] Set TCP_NODELAY failed
[LightGBM] [Info] Local rank: 0, total number of machines: 2
[LightGBM] [Info] Connected to rank 0
[LightGBM] [Info] Local rank: 1, total number of machines: 2
[LightGBM] [Warning] metric is set=, metric= will be ignored. Current value: metric=
[LightGBM] [Warning] metric is set=, metric= will be ignored. Current value: metric=
[LightGBM] [Warning] Starting from the 2.1.2 version, default value for the "boost_from_average" parameter in "binary" objective is true. This may cause significantly different results comparing to the previous versions of LightGBM. Try to set boost_from_average=false, if your old models produce bad results
[LightGBM] [Warning] Starting from the 2.1.2 version, default value for the "boost_from_average" parameter in "binary" objective is true. This may cause significantly different results comparing to the previous versions of LightGBM. Try to set boost_from_average=false, if your old models produce bad results
[LightGBM] [Info] Number of positive: 610, number of negative: 762
[LightGBM] [Info] Number of positive: 610, number of negative: 762
[LightGBM] [Warning] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000514 seconds. You can set force_col_wise=true to remove the overhead.
[LightGBM] [Info] Total Bins 916
[LightGBM] [Warning] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000664 seconds. You can set force_col_wise=true to remove the overhead.
[LightGBM] [Info] Total Bins 916
[LightGBM] [Info] Number of data points in the train set: 686, number of used features: 4
[LightGBM] [Info] Number of data points in the train set: 686, number of used features: 4
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.438776 -> initscore=-0.246133
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.450437 -> initscore=-0.198904
[LightGBM] [Info] Start training from score -0.222518
[LightGBM] [Info] Start training from score -0.222518
[LightGBM] [Info] Finished linking network in 0.003935 seconds
[LightGBM] [Fatal] Check failed: best_split_info.left_count > 0 at /home/ilya/LightGBM/src/treelearner/serial_tree_learner.cpp, line 709 .

20/02/29 00:35:01 WARN LightGBMClassifier: LightGBM reached early termination on one worker, stopping training on worker. This message should rarely occur

guolinke commented 4 years ago

Could it run with only one node?
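
(A hedged sketch of one way to force that in mmlspark, assuming the PySpark LightGBMClassifier API and that one LightGBM worker is launched per data partition:)

    from mmlspark.lightgbm import LightGBMClassifier

    # train_df is a placeholder for the actual Spark DataFrame of features/label.
    # Coalescing to one partition should yield a single LightGBM worker,
    # i.e. "total number of machines: 1" in the logs above.
    single_node_df = train_df.coalesce(1)
    model = LightGBMClassifier(labelCol="label", featuresCol="features") \
        .fit(single_node_df)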

imatiach-msft commented 4 years ago

@guolinke amazing insight! I tried 1 node instead of 2 and almost all of my tests passed (except 1 test that depends on the number of nodes, which is expected).

(screenshot: test results, all passing except the one noted above)

Here is the output from the same test as above (except it was successful):

[LightGBM] [Warning] metric is set=, metric= will be ignored. Current value: metric=
[LightGBM] [Warning] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000942 seconds. You can set force_col_wise=true to remove the overhead.
[LightGBM] [Info] Total Bins 327
[LightGBM] [Info] Number of data points in the train set: 106, number of used features: 9
[LightGBM] [Info] Start training from score -1.572397
[LightGBM] [Info] Start training from score -1.618917
[LightGBM] [Info] Start training from score -2.024382
[LightGBM] [Info] Start training from score -1.955389
[LightGBM] [Info] Start training from score -1.890850
[LightGBM] [Info] Start training from score -1.773067
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[... the warning above repeats many times ...]

(Three further training runs in the same test produce the same output, with multi-threading overheads of 0.002017, 0.000835, and 0.001298 seconds; the last run additionally logs "[LightGBM] [Info] Using GOSS". No check failure occurs in any of them.)

imatiach-msft commented 4 years ago

Note this is from this commit on 2/21 (both failing and successful runs): "Better documentation for Contributing (#2781)". I'm currently working back through older versions/commits of lightgbm to see which commit causes the tests to fail, but it is a slow process to build and update the jar and rerun the tests. I'm skipping small batches of commits at a time, but I might switch to a binary search to make this optimal, since it looks like the issue goes back before 2/21.

guolinke commented 4 years ago

@imatiach-msft you can try the commit (509c2e50c25eded99fc0997afe25ebee1b33285d) and its parent (https://github.com/microsoft/LightGBM/commit/bc7bc4a158d47bd9a12b89de21176e1e67a6e961)

imatiach-msft commented 4 years ago

@guolinke you're right, it looks like the issue is with commit (509c2e5). I validated that including that commit causes the error, and removing it fixes the issue.

guolinke commented 4 years ago

@imatiach-msft could you share the data (and config) with me for debugging?

imatiach-msft commented 4 years ago

@guolinke I'm running the mmlspark scala tests, maybe I can try to create an example that you can easily run?
You can find the lightgbm classifier tests here: https://github.com/Azure/mmlspark/blob/master/src/test/scala/com/microsoft/ml/spark/lightgbm/split1/VerifyLightGBMClassifier.scala

The first test that failed was below, but I tried several others and they failed as well: https://github.com/Azure/mmlspark/blob/master/src/test/scala/com/microsoft/ml/spark/lightgbm/split1/VerifyLightGBMClassifier.scala#L169

The compressed file with most datasets used in mmlspark can be found here: https://mmlspark.blob.core.windows.net/installers/datasets-2020-01-20.tgz

guolinke commented 4 years ago

@shiyu1994 can you help investigate this too? You can start from @imatiach-msft's test.

sh1ng commented 4 years ago

Still happens in version 3.0

lightgbm.basic.LightGBMError: Check failed: (best_split_info.left_count) > (0) at /root/repo/LightGBM/src/treelearner/serial_tree_learner.cpp, line 630

https://github.com/h2oai/h2o4gpu/blob/master/tests/python/open_data/gbm/test_lightgbm.py#L265-L284

shiyu1994 commented 4 years ago

@shiyu1994 can you help investigate this too? You can start from @imatiach-msft's test.

Ok.

imatiach-msft commented 4 years ago

@shiyu1994 @guolinke FYI, my issue was resolved when I upgraded after my fix https://github.com/microsoft/LightGBM/pull/3110, but it sounds like others are still encountering issues similar to what I had.

diditforlulz273 commented 4 years ago

I have this issue with the CPU learner, not the GPU one. I got it after upgrading from 2.3.1 to 3.0.0; it makes every test with a tiny testing dataset fail for exactly the same reason:

lightgbm.basic.LightGBMError: Check failed: (best_split_info.left_count) > (0) at /__w/1/s/python-package/compile/src/treelearner/serial_tree_learner.cpp, line 630 .

guolinke commented 4 years ago

@diditforlulz273 could you try the latest master branch? If the problem still exists, please create a new issue; it would be better if you can provide a reproducible example.
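
(A reproducible example could follow a skeleton like this; the synthetic data and parameters here are hypothetical placeholders to be replaced by whatever actually crashes:)

    import numpy as np
    import lightgbm as lgb

    # Hypothetical stand-in: substitute the data and params that trigger
    # "Check failed: (best_split_info.left_count) > (0)".
    rng = np.random.RandomState(42)
    X = rng.rand(200, 10)
    y = rng.randint(0, 2, size=200)

    params = {"objective": "binary", "verbosity": -1}
    booster = lgb.train(params, lgb.Dataset(X, y), num_boost_round=50)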

diditforlulz273 commented 4 years ago

@guolinke I have just built it from the latest master branch; it still fails. I'll try to isolate a minimal reproducible example and create an issue.

grasevski commented 4 years ago

+1, this bug makes LightGBM on GPU unusable. It still happens for me on the latest master.

asimraja77 commented 3 years ago

Hi, I'm using the GPU setting and have the same issue. I tried "deterministic = True" but it did not solve the problem. I saw that LightGBM v3.2.0 may fix this defect. I have a few questions:

  1. In the v3.2.0 release thread, I noticed that this bug #2793 is not in bold. Does this mean it may not be fixed until a later release?
  2. Does a fix exist in a non-release (build-from-source) option? If so, can you please point me to it?
  3. Assuming a fix is part of the v3.2.0 release, is this release about to happen? I noticed that v3.1.1 was released 3 months ago.

I apologize if my questions are a bit out of scope. Best regards

nightflight-dk commented 3 years ago

It's unfortunate that a known issue of this severity has been left open for over 1.5 years. The error affects every other attempt to train on GPUs when using the latest 'stable' bits in the Business Division (Dynamics). I can help with a business case from inside Microsoft to push this if necessary. My alias: dakowalc. Thanks

guolinke commented 3 years ago

Thank you @nightflight-dk. Actually, we have rewritten the LightGBM GPU version, and the previous OpenCL and CUDA versions will be deprecated; refer to PR https://github.com/microsoft/LightGBM/pull/4528

nightflight-dk commented 3 years ago

Great to hear the GPU acceleration is under further development, @guolinke. I have just tested the code from PR #4528; unfortunately, it's affected by the same bug, triggering the same assert error in the serial_tree_learner (even in data-parallel execution, device=cuda / device=gpu). Please suggest a workaround or an older version that is not affected (if any). Thanks

guolinke commented 3 years ago

cc @shiyu1994 for above bug.

shiyu1994 commented 3 years ago

I will double-check that. But the new CUDA tree learner reuses no training logic from the old serial tree learner or the old CUDA tree learner: only the initialization code in serial_tree_learner.cpp is executed when the new CUDA tree learner is used, and that code never reaches the check which raises the error in this issue. Since the errors come from the old CUDA tree learner and the training part of the serial tree learner, I think it is unlikely that the new CUDA version has the same bug.

shiyu1994 commented 3 years ago

@nightflight-dk Thanks for the testing. It would be much appreciated if you could provide the error log from the new CUDA version. :)

shiyu1994 commented 3 years ago

In addition, distributed training is not yet supported by the new CUDA version in the PRs so far, so if distributed training is enabled, it will fall back to the old CUDA version.

nightflight-dk commented 3 years ago

@shiyu1994 @guolinke after disabling distribution (tree_learner: serial), the latest bits from PR #4528 finish training without issues. Moreover, GPU utilization appears dramatically improved (mean up to ca. 50%, from 2%). Well done. Is there an ETA for merging PR #4528 into master? It would help our planning. Also, if you plan data-parallel GPU or multi-GPU support, please point out the items for us to track. Happy to help with testing. Please keep up the good work. Thanks a lot. - dakowalc, Business 360 AI team
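
(For reference, a hedged sketch of the configuration described above; parameter names follow the LightGBM docs, and the cuda device comes from the PR #4528 builds:)

    params = {
        "device_type": "cuda",     # new CUDA learner from PR #4528 ("gpu" selects the old OpenCL one)
        "tree_learner": "serial",  # distributed training is not yet supported by the new learner
        # ... task-specific parameters ...
    }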

shiyu1994 commented 3 years ago

@nightflight-dk Thanks for giving it a try. Since #4528 is a very large PR, we plan to decompose it into several parts and merge them one by one. We expect to finish the merge process by the end of this month. Multi-GPU and distributed training will be added after #4528 is merged. I will point that out once PRs are open for it.

pavlexander commented 2 years ago

Since there hasn't been any activity for a year, I would like to bring this topic up again.

I'm on version 3.3.3, Python, training on GPU on Windows.

The issue has been bugging me for the past 2 days. The dataset is 500k rows with 1500 features. There seems to be some correlation with the min_gain_to_split parameter: when the value is 1 I have not yet seen any errors, but with the value 0 (the default) it crashes quite often. Take this comment with caution, since I have not run enough tests yet.

It crashed with these parameter sets:

{'learning_rate': 0.43467624523546383, 'max_depth': 8, 'num_leaves': 201, 'feature_fraction': 0.9, 'bagging_fraction': 0.7000000000000001, 'bagging_freq': 8}

{'learning_rate': 0.021403440298427053, 'max_depth': 2, 'num_leaves': 176, 'lambda_l1': 3.8066251775052895, 'lambda_l2': 1.08526150100961e-08, 'feature_fraction': 0.6, 'bagging_fraction': 0.9, 'bagging_freq': 6}

{'learning_rate': 0.3493368922746614, 'max_depth': 6, 'num_leaves': 109, 'lambda_l1': 4.506588272812341e-05, 'lambda_l2': 2.5452579091348995e-07, 'feature_fraction': 0.7000000000000001, 'bagging_fraction': 1.0, 'bagging_freq': 6, 'min_gain_to_split': 0}

{'learning_rate': 0.17840010040986135, 'max_depth': 12, 'num_leaves': 251, 'lambda_l1': 0.004509589012189404, 'lambda_l2': 3.882151732343819e-08, 'feature_fraction': 0.30000000000000004, 'bagging_fraction': 1.0, 'bagging_freq': 8, 'min_gain_to_split': 0}

the code is:

    params = {
        'device_type': 'gpu',
        'objective': 'multiclass',
        'metric': 'multi_logloss',
        'boosting_type': 'gbdt',
        'num_class': 3,
        'random_state': 123,
        'verbosity': -1,  # hides "No further splits with positive gain, best gain: -inf" warnings
        'learning_rate': trial.suggest_float('learning_rate', 0.01, 0.9, log=True),  # default 0.1
        'max_depth': trial.suggest_int('max_depth', 2, 12),
        'num_leaves': trial.suggest_int('num_leaves', 2, 256),  # default 31
        'lambda_l1': trial.suggest_float('lambda_l1', 1e-8, 10.0, log=True),  # default 0
        'lambda_l2': trial.suggest_float('lambda_l2', 1e-8, 10.0, log=True),  # default 0
        'feature_fraction': trial.suggest_float('feature_fraction', 0.1, 1.0, step=0.1),  # default 1
        'bagging_fraction': trial.suggest_float('bagging_fraction', 0.1, 1.0, step=0.1),  # default 1
        'bagging_freq': trial.suggest_int('bagging_freq', 0, 10),  # default 0
        'min_gain_to_split': trial.suggest_int('min_gain_to_split', 0, 5),  # default 0
    }

with a few changes here and there

exception is:

[LightGBM] [Fatal] Check failed: (best_split_info.left_count) > (0) at D:\a\1\s\python-package\compile\src\treelearner\serial_tree_learner.cpp, line 653 .

[W 2022-11-07 09:49:32,774] Trial 49 failed because of the following error: LightGBMError('Check failed: (best_split_info.left_count) > (0) at D:\\a\\1\\s\\python-package\\compile\\src\\treelearner\\serial_tree_learner.cpp, line 653 .\n')
Traceback (most recent call last):
  File "D:\dev\Py_Global_vEnv_2022\lib\site-packages\optuna\study\_optimize.py", line 196, in _run_trial
    value_or_values = func(trial)
  File "D:\dev\Pycharm2022\LearningCNN\test9_realData4_optuna.py", line 174, in objective
    model = lgb.train(
  File "D:\dev\Py_Global_vEnv_2022\lib\site-packages\lightgbm\engine.py", line 292, in train
    booster.update(fobj=fobj)
  File "D:\dev\Py_Global_vEnv_2022\lib\site-packages\lightgbm\basic.py", line 3021, in update
    _safe_call(_LIB.LGBM_BoosterUpdateOneIter(
  File "D:\dev\Py_Global_vEnv_2022\lib\site-packages\lightgbm\basic.py", line 125, in _safe_call
    raise LightGBMError(_LIB.LGBM_GetLastError().decode('utf-8'))
lightgbm.basic.LightGBMError: Check failed: (best_split_info.left_count) > (0) at D:\a\1\s\python-package\compile\src\treelearner\serial_tree_learner.cpp, line 653 .

Traceback (most recent call last):
  File "D:\dev\Pycharm2022\LearningCNN\test9_realData4_optuna.py", line 237, in <module>
    study.optimize(objective, n_trials=_NUMBER_OF_TRIALS)
  File "D:\dev\Py_Global_vEnv_2022\lib\site-packages\optuna\study\study.py", line 419, in optimize
    _optimize(
  File "D:\dev\Py_Global_vEnv_2022\lib\site-packages\optuna\study\_optimize.py", line 66, in _optimize
    _optimize_sequential(
  File "D:\dev\Py_Global_vEnv_2022\lib\site-packages\optuna\study\_optimize.py", line 160, in _optimize_sequential
    frozen_trial = _run_trial(study, func, catch)
  File "D:\dev\Py_Global_vEnv_2022\lib\site-packages\optuna\study\_optimize.py", line 234, in _run_trial
    raise func_err
  File "D:\dev\Py_Global_vEnv_2022\lib\site-packages\optuna\study\_optimize.py", line 196, in _run_trial
    value_or_values = func(trial)
  File "D:\dev\Pycharm2022\LearningCNN\test9_realData4_optuna.py", line 174, in objective
    model = lgb.train(
  File "D:\dev\Py_Global_vEnv_2022\lib\site-packages\lightgbm\engine.py", line 292, in train
    booster.update(fobj=fobj)
  File "D:\dev\Py_Global_vEnv_2022\lib\site-packages\lightgbm\basic.py", line 3021, in update
    _safe_call(_LIB.LGBM_BoosterUpdateOneIter(
  File "D:\dev\Py_Global_vEnv_2022\lib\site-packages\lightgbm\basic.py", line 125, in _safe_call
    raise LightGBMError(_LIB.LGBM_GetLastError().decode('utf-8'))
lightgbm.basic.LightGBMError: Check failed: (best_split_info.left_count) > (0) at D:\a\1\s\python-package\compile\src\treelearner\serial_tree_learner.cpp, line 653 .

Process finished with exit code 1

I am using optuna for optimization so the set of parameters is always different.

I tried different split ratios (0.19/0.20/0.21); this does not seem to fix anything:

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.19, random_state=42, shuffle=True)

I also experimented with the amount of data (600_000/600_001/200_001). Nothing seems to fix the issue. Can a fix be expected in the next major release? I see that the topic is still active.
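
(Not a fix, but given the correlation with min_gain_to_split observed above, a hedged workaround sketch is to keep it strictly positive in the Optuna search space, so the crashing value 0 is never sampled:)

    # Workaround sketch (hypothetical): sample min_gain_to_split > 0 only.
    params['min_gain_to_split'] = trial.suggest_float(
        'min_gain_to_split', 1e-3, 5.0, log=True)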

JisongXie commented 1 year ago

I built the docker image with this dockerfile.gpu, and I encountered this issue too.

LightGBMError: Check failed: (best_split_info.left_count) > (0) at /usr/local/src/lightgbm/LightGBM/src/treelearner/serial_tree_learner.cpp, line 653 .