microsoft / LightGBM

A fast, distributed, high performance gradient boosting (GBT, GBDT, GBRT, GBM or MART) framework based on decision tree algorithms, used for ranking, classification and many other machine learning tasks.
https://lightgbm.readthedocs.io/en/latest/
MIT License

bin size 257 cannot run on GPU #4082

Open pseudotensor opened 3 years ago

pseudotensor commented 3 years ago

I know there are a couple of other issues that mention this problem, but they have gotten messy, with suggestions that it's related to the categorical_feature setting and other things. Here is a clean MRE.

d9a96c90cb479cef87047ba20517d97982b563eb

lgb257.pkl.zip

import pickle
model, X, y, kwargs = pickle.load(open("lgb257.pkl", "rb"))
model.fit(X, y, **kwargs)

FYI a model.get_params() shows:

params = {'boosting_type': 'gbdt', 'class_weight': None, 'colsample_bytree': 0.8, 'importance_type': 'gain',
          'learning_rate': 0.5, 'max_depth': 6, 'min_child_samples': 1, 'min_child_weight': 1.0, 'min_split_gain': 0.0,
          'n_estimators': 100, 'n_jobs': 8, 'num_leaves': 64, 'objective': 'binary', 'random_state': 1234,
          'reg_alpha': 0.0, 'reg_lambda': 1.0, 'silent': True, 'subsample': 0.7, 'subsample_for_bin': 200000,
          'subsample_freq': 1, 'pred_gap': None, 'pred_periods': None, 'max_bin': 255, 'scale_pos_weight': 1.0,
          'max_delta_step': 0.0, 'min_data_in_bin': 1, 'seed': 1234, 'early_stopping_limit': None, 'device_type': 'gpu',
          'gpu_device_id': 0, 'gpu_platform_id': 0, 'gpu_use_dp': True, 'feature_fraction_seed': 1235,
          'bagging_seed': 1236, 'num_threads': 8, 'num_class': 1, 'verbose': -1, 'categorical_feature': ''}

and FYI here is kwargs:

[image: screenshot of kwargs]

[LightGBM] [Warning] num_threads is set=8, n_jobs=8 will be ignored. Current value: num_threads=8
[LightGBM] [Warning] seed is set=1234, random_state=1234 will be ignored. Current value: seed=1234
/home/jon/minicondadai/lib/python3.6/site-packages/lightgbm_gpu/basic.py:1586: UserWarning: Using categorical_feature in Dataset.
  warnings.warn('Using categorical_feature in Dataset.')
/home/jon/minicondadai/lib/python3.6/site-packages/lightgbm_gpu/basic.py:1590: UserWarning: categorical_feature in Dataset is overridden.
New categorical_feature is []
  'New categorical_feature is {}'.format(sorted(list(categorical_feature))))
/home/jon/minicondadai/lib/python3.6/site-packages/lightgbm_gpu/basic.py:1108: UserWarning: categorical_feature keyword has been found in `params` and will be ignored.
Please use categorical_feature argument of the Dataset constructor to pass this parameter.
  .format(key))
[LightGBM] [Fatal] bin size 257 cannot run on GPU
Traceback (most recent call last):
  File "/home/jon/h2oai.fullcondatest/h2oaicore/lgb257.py", line 18, in <module>
    model.fit(X, y, **kwargs)
  File "/home/jon/minicondadai/lib/python3.6/site-packages/lightgbm_gpu/sklearn.py", line 867, in fit
    callbacks=callbacks, init_model=init_model)
  File "/home/jon/minicondadai/lib/python3.6/site-packages/lightgbm_gpu/sklearn.py", line 637, in fit
    callbacks=callbacks, init_model=init_model)
  File "/home/jon/minicondadai/lib/python3.6/site-packages/lightgbm_gpu/engine.py", line 230, in train
    booster = Booster(params=params, train_set=train_set)
  File "/home/jon/minicondadai/lib/python3.6/site-packages/lightgbm_gpu/basic.py", line 2104, in __init__
    ctypes.byref(self.handle)))
  File "/home/jon/minicondadai/lib/python3.6/site-packages/lightgbm_gpu/basic.py", line 52, in _safe_call
    raise LightGBMError(_LIB.LGBM_GetLastError().decode('utf-8'))
lightgbm.basic.LightGBMError: bin size 257 cannot run on GPU

Running

model.fit(X, y)

fails the same way, but I'm unsure whether the sklearn API then uses 'auto' for categorical_feature.

pseudotensor commented 3 years ago

Here is more minimal MRE:

import pickle
import lightgbm as lgb
X, y = pickle.load(open("lgb257b.pkl", "rb"))

params = dict(categorical_feature='', device_type='gpu', gpu_device_id=0, gpu_platform_id=0, min_data_in_bin=1, max_bin=255)
model = lgb.LGBMClassifier(**params)
model.fit(X, y, categorical_feature='')

FYI gpu_use_dp=True or False has no effect.

That is, I iterated through all parameters: the key to the failure is (of course) running on GPU, but also min_data_in_bin=1. A value of 2 also fails, but 10 does not. So lgb is not respecting the max_bin of 255 even for numeric values.

lgb257b.pkl.zip

If this is a user error, I recommend listening primarily to max_bin. E.g. when doing a hyperparameter search, fatal failures are not fun to handle. It would be best if lgb did the reasonable thing.
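
One way a hyperparameter search could survive this fatal error is to catch it and retry the same trial on CPU. The sketch below is only an illustration: `fit_with_cpu_fallback` and `make_model` are hypothetical names, not part of LightGBM, and matching on the message text "cannot run on GPU" is an assumption based on the error shown above.

```python
def fit_with_cpu_fallback(make_model, X, y, params, **fit_kwargs):
    """Fit on the requested device; retry on CPU if the GPU bin-size error is raised.

    make_model is any callable that builds an estimator from keyword
    parameters, e.g. lgb.LGBMClassifier (hypothetical helper, not a LightGBM API).
    Returns (fitted_model, device_actually_used).
    """
    model = make_model(**params)
    try:
        model.fit(X, y, **fit_kwargs)
        return model, params.get("device_type", "cpu")
    except Exception as err:  # LightGBMError is a plain Exception subclass
        # Assumption: the failure is recognized by the message text from this issue.
        if "cannot run on GPU" not in str(err):
            raise
    # Rebuild the estimator with the same parameters, switching only the device.
    cpu_params = dict(params, device_type="cpu")
    model = make_model(**cpu_params)
    model.fit(X, y, **fit_kwargs)
    return model, "cpu"
```

With the MRE above this would be called as `fit_with_cpu_fallback(lgb.LGBMClassifier, X, y, params)`; any other exception is re-raised unchanged.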

pseudotensor commented 3 years ago

Hi, any thoughts? Seems like a clear MRE, but it's been 5 days and no response. Thanks.

pseudotensor commented 3 years ago

@guolinke ?

pseudotensor commented 3 years ago

  File "/opt/h2oai/dai/python/lib/python3.8/site-packages/lightgbm_gpu/sklearn.py", line 712, in fit
    self._Booster = train(params, train_set,
  File "/opt/h2oai/dai/python/lib/python3.8/site-packages/lightgbm_gpu/engine.py", line 235, in train
    booster = Booster(params=params, train_set=train_set)
  File "/opt/h2oai/dai/python/lib/python3.8/site-packages/lightgbm_gpu/basic.py", line 2528, in __init__
    _safe_call(_LIB.LGBM_BoosterCreate(
  File "/opt/h2oai/dai/python/lib/python3.8/site-packages/lightgbm_gpu/basic.py", line 125, in _safe_call
    raise LightGBMError(_LIB.LGBM_GetLastError().decode('utf-8'))
lightgbm.basic.LightGBMError: bin size 258 cannot run on GPU

Again, no categorical handling enabled etc.

This is on master as of last night.

arnocandel commented 3 years ago

@guolinke reminder - still the dominant failure mode for LightGBM in Driverless AI

guolinke commented 3 years ago

I think the old GPU/CUDA version will be abandoned. Also cc @shiyu1994 to follow up on this issue.

shiyu1994 commented 3 years ago

@arnocandel We are working on a brand new CUDA version. Please follow #4630 and #4528 for the latest progress.

pseudotensor commented 3 years ago

@shiyu1994 and @guolinke. Hi, looking at those 2 PRs made me realize that perhaps the current CUDA mode (as opposed to OpenCL) is incomplete. E.g. you mention categorical handling as added to the CUDA version in the PR. Is that correct?

More generally, is the CUDA version incomplete in various ways that are documented? Or does it have (or will have) full parity?

If I run with CUDA version with categorical handling it seems to run fine, but maybe it's not doing what I choose even though I pass categorical_feature?

shiyu1994 commented 3 years ago

@pseudotensor The current CUDA version is doing the correct thing; it can handle categorical features normally. The only problem is that the current implementation only does histogram construction on the GPU, so GPU utilization can be low.

Support for categorical features has not been added yet in the first part of our new CUDA version, #4630, but will be added later.

arnocandel commented 2 years ago

Here's another minimal repro, in case it helps

lgb.bin257.pkl.zip

import pickle
import lightgbm as lgb
print(lgb.__version__)

from lightgbm.sklearn import LGBMRegressor
with open("lgb.bin257.pkl", "rb") as f:
    X, y = pickle.load(f)
    model = LGBMRegressor(max_bin=252, device_type='gpu')
    model.fit(X, y)
    print("OK1")

    model = LGBMRegressor(max_bin=253, device_type='gpu')
    model.fit(X, y)
    print("OK2")

first one passes, second one fails, not sure where 257 comes from:

3.2.1.99
OK1
[LightGBM] [Fatal] bin size 257 cannot run on GPU
Traceback (most recent call last):
  File "/nfs4/lgb_prefit_1c95733f-58d6-4a61-969f-b2331e03e895.py", line 13, in <module>
    model.fit(X, y)
  File "/home/arno/minicondadai_py38/lib/python3.8/site-packages/lightgbm/sklearn.py", line 851, in fit
    super().fit(X, y, sample_weight=sample_weight, init_score=init_score,
  File "/home/arno/minicondadai_py38/lib/python3.8/site-packages/lightgbm/sklearn.py", line 714, in fit
    self._Booster = train(params, train_set,
  File "/home/arno/minicondadai_py38/lib/python3.8/site-packages/lightgbm/engine.py", line 260, in train
    booster = Booster(params=params, train_set=train_set)
  File "/home/arno/minicondadai_py38/lib/python3.8/site-packages/lightgbm/basic.py", line 2537, in __init__
    _safe_call(_LIB.LGBM_BoosterCreate(
  File "/home/arno/minicondadai_py38/lib/python3.8/site-packages/lightgbm/basic.py", line 125, in _safe_call
    raise LightGBMError(_LIB.LGBM_GetLastError().decode('utf-8'))
lightgbm.basic.LightGBMError: bin size 257 cannot run on GPU

Process finished with exit code 1
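
To locate the exact boundary between passing and failing max_bin values on other datasets, a binary search could be used. This is a sketch under an assumption the thread does not confirm, namely that failures are monotonic in max_bin (every value above the first failing one also fails). `largest_passing_max_bin` and `fits_ok` are hypothetical names.

```python
def largest_passing_max_bin(fits_ok, lo, hi):
    """Return the largest value in [lo, hi) for which fits_ok(value) is True.

    Assumes fits_ok(lo) is True, fits_ok(hi) is False, and that results are
    monotonic in between (an assumption, not something LightGBM guarantees).
    """
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if fits_ok(mid):
            lo = mid  # mid trains fine, so the boundary is at or above mid
        else:
            hi = mid  # mid fails, so the boundary is below mid
    return lo
```

Here `fits_ok` would wrap `LGBMRegressor(max_bin=m, device_type='gpu').fit(X, y)` in a try/except that returns False when LightGBMError is raised.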

jameslamb commented 2 years ago

Thanks very much @arnocandel !

But are you able to provide a reproducible example starting from raw data in a text-based format, generated from scratch with pandas / numpy / scipy code, or using a widely-distributed dataset like those available in sklearn.datasets?

I personally don't ever load pickle files whose origin I don't know, and I expect others wanting to contribute to fixing this issue might share that hesitation.

From https://docs.python.org/3/library/pickle.html

Warning: The pickle module is not secure. Only unpickle data you trust.

It is possible to construct malicious pickle data which will execute arbitrary code during unpickling. Never unpickle data that could have come from an untrusted source, or that could have been tampered with.
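
A text-based alternative to a pickle file can be as simple as two CSV files. The sketch below uses only the standard library; `write_xy_csv` and the column names `f0, f1, ...` and `target` are hypothetical, chosen only so that each file gets a header row.

```python
import csv


def write_xy_csv(X, y, x_path="X.csv", y_path="y.csv"):
    """Write a feature matrix and a target vector to two CSV files with header rows.

    X is a sequence of equal-length rows, y a sequence of scalars.
    """
    rows = [list(r) for r in X]
    n_features = len(rows[0]) if rows else 0
    with open(x_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow([f"f{i}" for i in range(n_features)])  # header row
        writer.writerows(rows)
    with open(y_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["target"])  # header row
        writer.writerows([v] for v in y)
```

Because each file starts with a header row, it can be read back with `pd.read_csv(...)` using the default settings, which treat the first row as column names.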

arnocandel commented 2 years ago

@jameslamb - ok use this instead: X_y.zip

import pandas as pd
X=pd.read_csv("X.csv").values
y=pd.read_csv("y.csv").values.ravel()

lewis-morris commented 2 years ago

I'm having the same issue over here!

bin size 257 cannot run on GPU

arnocandel commented 2 years ago

@jameslamb - were you able to check with above two .csv files for X and y?

Here the full thing for simplicity: https://github.com/microsoft/LightGBM/files/7817145/X_y.zip

import lightgbm as lgb
print(lgb.__version__)
import pandas as pd
X=pd.read_csv("X.csv").values
y=pd.read_csv("y.csv").values.ravel()

from lightgbm.sklearn import LGBMRegressor
model = LGBMRegressor(max_bin=252, device_type='gpu')
model.fit(X, y)
print("OK1")

model = LGBMRegressor(max_bin=253, device_type='gpu')
model.fit(X, y)
print("OK2")

jameslamb commented 2 years ago

were you able to check with above two .csv files for X and y

I was not. If you're subscribed to this issue, you'll be notified when someone picks this up or has new information to share.

jiluojiluo commented 2 years ago

This is a bug in LightGBM on GPU; when using CPU, it is OK.

ahmedshahriar commented 1 year ago

Any update so far on this issue?

lilianabs commented 1 year ago

I'm having the same issue :(

chixujohnny commented 1 year ago

same issue too :(

holma91 commented 7 months ago

Still have this issue.

matousfamera commented 6 months ago

I have the same issue

"LightGBMError: bin size 1973 cannot run on GPU."

It runs alright using CPU.

shiyu1994 commented 6 months ago

For everyone who encounters this issue with the -DUSE_GPU=ON version of LightGBM, please check our latest GPU version which should be compiled with -DUSE_CUDA=ON. https://lightgbm.readthedocs.io/en/latest/Installation-Guide.html#build-cuda-version. Thanks.

cocoderss commented 3 months ago

For everyone who encounters this issue with the -DUSE_GPU=ON version of LightGBM, please check our latest GPU version which should be compiled with -DUSE_CUDA=ON. https://lightgbm.readthedocs.io/en/latest/Installation-Guide.html#build-cuda-version. Thanks.

I have followed these instructions to install the CUDA version instead of the GPU version, but I still have the same issue: LightGBMError: bin size XXX cannot run on GPU.

For more info, I am running on a Linux server with CUDA 12.1 and an A100. Let me know if more info is needed to fix this issue.

wil70 commented 3 months ago

Same issue with the GPU version on Windows; it works fine on CPU.

[LightGBM] [Fatal] bin size 260 cannot run on GPU

[LightGBM] [Info] Finished loading parameters
[LightGBM] [Info] Load from binary file wil10_8_data_2004_2006_split_train.csv.bin
[LightGBM] [Warning] Parameter two_round works only in case of loading data directly from text file. It will be ignored when loading from binary file.
[LightGBM] [Info] Finished loading data in 286.006354 seconds
[LightGBM] [Info] This is the GPU trainer!!
[LightGBM] [Info] Total Bins 278556290
[LightGBM] [Info] Number of data points in the train set: 30472, number of used features: 2398793
[LightGBM] [Fatal] bin size 260 cannot run on GPU
Met Exceptions:
bin size 260 cannot run on GPU

cocoderss commented 2 months ago

For everyone who encounters this issue with the -DUSE_GPU=ON version of LightGBM, please check our latest GPU version which should be compiled with -DUSE_CUDA=ON. https://lightgbm.readthedocs.io/en/latest/Installation-Guide.html#build-cuda-version. Thanks.

I have realized that after compiling LightGBM with the CUDA option and then running sudo sh ./build-python.sh install --precompile to install it, as described in the documentation, the install defaults to the version from the pip repository. I have not verified this by inspecting the build-python.sh script, but my workaround was to build the pip wheel package myself. This solves the issue, and specifying device_type=cuda then works as expected.

On a side note, the main issue of CUDA memory still persists, and it relates to a categorical feature that has too many unique values (I tested by omitting that feature, and training then works fine on gpu, cuda and cpu). When that feature is included, the gpu version fails with LightGBMError: bin size XXX cannot run on GPU; the CPU version works but takes a very long time; and the cuda version produces the errors below (Optuna study with multiple workers).

[LightGBM] [Fatal] [CUDA] an illegal memory access was encountered /home/user/notebooks/jupiter/tmp/src/io/cuda/cuda_column_data.cpp 67

[LightGBM] [Fatal] [CUDA] an illegal memory access was encountered /home/user/notebooks/jupiter/tmp/src/treelearner/cuda/cuda_best_split_finder.cu 2066

[LightGBM] [Fatal] [CUDA] an illegal memory access was encountered /home/user/notebooks/jupiter/tmp/src/io/cuda/cuda_column_data.cpp 67

[LightGBM] [Warning] [CUDA] an illegal memory access was encountered /home/user/notebooks/jupiter/tmp/src/io/cuda/cuda_column_data.cpp 67

[LightGBM] [Warning] [CUDA] an illegal memory access was encountered /home/user/notebooks/jupiter/tmp/src/io/cuda/cuda_column_data.cpp 67

[LightGBM] [Fatal] [CUDA] an illegal memory access was encountered /home/user/notebooks/jupiter/tmp/src/io/cuda/cuda_column_data.cpp 67

[LightGBM] [Fatal] [CUDA] an illegal memory access was encountered /home/user/notebooks/jupiter/tmp/src/io/cuda/cuda_tree.cpp 37

[LightGBM] [Warning] [CUDA] an illegal memory access was encountered /home/user/notebooks/jupiter/tmp/src/io/cuda/cuda_column_data.cpp 67

[LightGBM] [Fatal] [CUDA] an illegal memory access was encountered /home/user/notebooks/jupiter/tmp/src/io/cuda/cuda_column_data.cpp 67

terminate called after throwing an instance of 'std::runtime_error'
[LightGBM] [Warning] [CUDA] an illegal memory access was encountered /home/user/notebooks/jupiter/tmp/src/io/cuda/cuda_column_data.cpp 67

  what():  [CUDA] an illegal memory access was encountered /home/user/notebooks/jupiter/tmp/src/io/cuda/cuda_tree.cpp 37

[LightGBM] [Fatal] [CUDA] an illegal memory access was encountered /home/user/notebooks/jupiter/tmp/src/io/cuda/cuda_column_data.cpp 67

[LightGBM] [Warning] [CUDA] an illegal memory access was encountered /home/user/notebooks/jupiter/tmp/src/io/cuda/cuda_column_data.cpp 67

So it seems that there is a limitation in the implementation when it comes to categorical features on cuda/gpu, that requires a fix.
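
Before training on GPU, it may help to check which columns have many distinct values. The sketch below is a rough pre-check, not LightGBM's binning logic: `high_cardinality_columns` is a hypothetical helper, and the default limit of 256 is an assumption inferred from the smallest failing bin size reported in this thread (257).

```python
def high_cardinality_columns(rows, limit=256):
    """Return {column_index: distinct_count} for columns with more than `limit` distinct values.

    rows is an iterable of equal-length sequences. The limit of 256 is an
    assumption taken from the error messages in this thread, not a documented constant.
    """
    distinct = []
    for row in rows:
        if not distinct:
            # Create one set per column when the first row is seen.
            distinct = [set() for _ in row]
        for i, value in enumerate(row):
            distinct[i].add(value)
    return {i: len(s) for i, s in enumerate(distinct) if len(s) > limit}
```

Numeric features are normally bucketed into bins, so a high count there is not necessarily a problem; the check is mainly meaningful for columns passed as categorical features.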

yuhorun commented 11 hours ago

I have the same issue

"LightGBMError: bin size 512 cannot run on GPU."