pseudotensor opened 3 years ago
Here is a more minimal MRE:
import pickle
import lightgbm as lgb

X, y = pickle.load(open("lgb257b.pkl", "rb"))
params = dict(categorical_feature='', device_type='gpu', gpu_device_id=0, gpu_platform_id=0, min_data_in_bin=1, max_bin=255)
model = lgb.LGBMClassifier(**params)
model.fit(X, y, categorical_feature='')
FYI gpu_use_dp=True or False has no effect.
That is, I iterated through all the parameters; the key to failure is (of course) running on GPU, but also min_data_in_bin=1. A value of 2 also fails, but 10 does not. So lgb is not respecting the max_bin of 255 even for numeric features.
If this is a user error, I recommend having lgb listen primarily to max_bin. E.g. when doing a hyperparameter search, fatal failures are not fun to handle; it would be best if lgb did the reasonable thing.
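To illustrate the kind of handling callers currently have to do themselves (a minimal sketch, not LightGBM behavior; it reuses X and y from the MRE above and takes the other parameters as keyword arguments): catch the LightGBMError and retry on CPU so a whole search doesn't die on one GPU-incompatible configuration.

import lightgbm as lgb
from lightgbm.basic import LightGBMError

def fit_with_gpu_fallback(X, y, **params):
    # Sketch only: try the GPU first, fall back to CPU if the GPU binning
    # constraint (e.g. "bin size 258 cannot run on GPU") is hit.
    try:
        model = lgb.LGBMClassifier(**{**params, "device_type": "gpu"})
        model.fit(X, y)
    except LightGBMError as e:
        print(f"GPU fit failed ({e}); retrying on CPU")
        model = lgb.LGBMClassifier(**{**params, "device_type": "cpu"})
        model.fit(X, y)
    return model

# e.g. fit_with_gpu_fallback(X, y, min_data_in_bin=1, max_bin=255)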
Hi, any thoughts? Seems like a clear MRE, but it's been 5 days and no response. Thanks.
@guolinke ?
File "/opt/h2oai/dai/python/lib/python3.8/site-packages/lightgbm_gpu/sklearn.py", line 712, in fit
self._Booster = train(params, train_set,
File "/opt/h2oai/dai/python/lib/python3.8/site-packages/lightgbm_gpu/engine.py", line 235, in train
booster = Booster(params=params, train_set=train_set)
File "/opt/h2oai/dai/python/lib/python3.8/site-packages/lightgbm_gpu/basic.py", line 2528, in __init__
_safe_call(_LIB.LGBM_BoosterCreate(
File "/opt/h2oai/dai/python/lib/python3.8/site-packages/lightgbm_gpu/basic.py", line 125, in _safe_call
raise LightGBMError(_LIB.LGBM_GetLastError().decode('utf-8'))
lightgbm.basic.LightGBMError: bin size 258 cannot run on GPU
Again, no categorical handling enabled etc.
This is on master as of last night.
@guolinke reminder - still the dominant failure mode for LightGBM in Driverless AI
I think the old GPU/CUDA version will be abandoned. Also cc @shiyu1994 to follow up on this issue.
@arnocandel We are working on a brand new CUDA version. Please follow #4630 and #4528 for the latest progress.
@shiyu1994 and @guolinke: Hi, looking at those 2 PRs made me realize that perhaps the current CUDA mode (as opposed to OpenCL) is incomplete. E.g. you mention categorical handling being added to the CUDA version in the PR. Is that correct?
More generally, is the CUDA version incomplete in ways that are documented? Or does it have (or will it have) full parity?
If I run the CUDA version with categorical handling it seems to run fine, but maybe it's not doing what I asked even though I pass categorical_feature?
@pseudotensor The current CUDA version is doing the correct thing; it can handle categorical features normally. The only problem is that the current implementation only does histogram construction on the GPU, so GPU utilization can be low.
Support for categorical features has not yet been added in the first part of the new CUDA version (#4630), but it will be added later.
Here's another minimal repro, in case it helps:
import pickle
import lightgbm as lgb
print(lgb.__version__)
from lightgbm.sklearn import LGBMRegressor
with open("lgb.bin257.pkl", "rb") as f:
    X, y = pickle.load(f)
model = LGBMRegressor(max_bin=252, device_type='gpu')
model.fit(X, y)
print("OK1")
model = LGBMRegressor(max_bin=253, device_type='gpu')
model.fit(X, y)
print("OK2")
The first one passes, the second one fails; not sure where 257 comes from:
3.2.1.99
OK1
[LightGBM] [Fatal] bin size 257 cannot run on GPU
Traceback (most recent call last):
File "/nfs4/lgb_prefit_1c95733f-58d6-4a61-969f-b2331e03e895.py", line 13, in <module>
model.fit(X, y)
File "/home/arno/minicondadai_py38/lib/python3.8/site-packages/lightgbm/sklearn.py", line 851, in fit
super().fit(X, y, sample_weight=sample_weight, init_score=init_score,
File "/home/arno/minicondadai_py38/lib/python3.8/site-packages/lightgbm/sklearn.py", line 714, in fit
self._Booster = train(params, train_set,
File "/home/arno/minicondadai_py38/lib/python3.8/site-packages/lightgbm/engine.py", line 260, in train
booster = Booster(params=params, train_set=train_set)
File "/home/arno/minicondadai_py38/lib/python3.8/site-packages/lightgbm/basic.py", line 2537, in __init__
_safe_call(_LIB.LGBM_BoosterCreate(
File "/home/arno/minicondadai_py38/lib/python3.8/site-packages/lightgbm/basic.py", line 125, in _safe_call
raise LightGBMError(_LIB.LGBM_GetLastError().decode('utf-8'))
lightgbm.basic.LightGBMError: bin size 257 cannot run on GPU
Process finished with exit code 1
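In case it helps narrow things down, a small probe (sketch only, reusing the same data and imports as the repro above) can find the largest max_bin this GPU build accepts for this dataset:

from lightgbm.basic import LightGBMError

# Step max_bin down until the GPU build stops raising the bin-size error.
# Purely empirical; this does not explain where the extra bins come from.
for mb in range(255, 240, -1):
    try:
        LGBMRegressor(max_bin=mb, device_type='gpu', n_estimators=1).fit(X, y)
        print(f"max_bin={mb} works on GPU")
        break
    except LightGBMError as e:
        print(f"max_bin={mb} fails: {e}")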
Thanks very much @arnocandel !
But are you able to provide a reproducible example starting from raw data in a text-based format, generated from scratch with pandas / numpy / scipy code, or using a widely-distributed dataset like those available in sklearn.datasets?
I personally don't ever load pickle files whose origin I don't know, and I expect others wanting to contribute to fixing this issue might share that hesitation.
From https://docs.python.org/3/library/pickle.html:
Warning: The pickle module is not secure. Only unpickle data you trust. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling. Never unpickle data that could have come from an untrusted source, or that could have been tampered with.
@jameslamb - ok use this instead: X_y.zip
import pandas as pd

X = pd.read_csv("X.csv").values
y = pd.read_csv("y.csv").values.ravel()
I'm having the same issue over here!
bin size 257 cannot run on GPU
@jameslamb - were you able to check with the above two .csv files for X and y?
Here is the full thing for simplicity: https://github.com/microsoft/LightGBM/files/7817145/X_y.zip
import lightgbm as lgb
print(lgb.__version__)
import pandas as pd
X = pd.read_csv("X.csv").values
y = pd.read_csv("y.csv").values.ravel()
from lightgbm.sklearn import LGBMRegressor
model = LGBMRegressor(max_bin=252, device_type='gpu')
model.fit(X, y)
print("OK1")
model = LGBMRegressor(max_bin=253, device_type='gpu')
model.fit(X, y)
print("OK2")
were you able to check with above two .csv files for X and y
I was not. If you're subscribed to this issue, you'll be notified when someone picks this up or has new information to share.
This is a bug in LightGBM on the GPU; when using the CPU, it is OK.
Any update so far on this issue?
I'm having the same issue :(
same issue too :(
Still have this issue.
I have the same issue
"LightGBMError: bin size 1973 cannot run on GPU."
It runs alright using CPU.
For everyone who encounters this issue with the -DUSE_GPU=ON version of LightGBM, please check our latest GPU version, which should be compiled with -DUSE_CUDA=ON: https://lightgbm.readthedocs.io/en/latest/Installation-Guide.html#build-cuda-version. Thanks.
For everyone who encounters this issue with the -DUSE_GPU=ON version of LightGBM, please check our latest GPU version, which should be compiled with -DUSE_CUDA=ON: https://lightgbm.readthedocs.io/en/latest/Installation-Guide.html#build-cuda-version. Thanks.
I have followed these instructions to install the CUDA version instead of the GPU version, but I still have the same issue:
LightGBMError: bin size XXX cannot run on GPU.
For more info, I am running on a Linux server with CUDA 12.1 and an A100. Let me know if more info is needed to fix this issue.
Same issue with the GPU version on Windows; works fine on CPU:
[LightGBM] [Info] Finished loading parameters
[LightGBM] [Info] Load from binary file wil10_8_data_2004_2006_split_train.csv.bin
[LightGBM] [Warning] Parameter two_round works only in case of loading data directly from text file. It will be ignored when loading from binary file.
[LightGBM] [Info] Finished loading data in 286.006354 seconds
[LightGBM] [Info] This is the GPU trainer!!
[LightGBM] [Info] Total Bins 278556290
[LightGBM] [Info] Number of data points in the train set: 30472, number of used features: 2398793
[LightGBM] [Fatal] bin size 260 cannot run on GPU
Met Exceptions: bin size 260 cannot run on GPU
For everyone who encounters this issue with the -DUSE_GPU=ON version of LightGBM, please check our latest GPU version, which should be compiled with -DUSE_CUDA=ON: https://lightgbm.readthedocs.io/en/latest/Installation-Guide.html#build-cuda-version. Thanks.
I have realized that after compiling LightGBM with the CUDA option and then using the command sudo sh ./build-python.sh install --precompile to install it, as highlighted in the documentation, it defaults to installing the version from the pip repo. I have not verified that by inspecting the build-python.sh script, but my workaround was to build the pip wheel package myself. This solves the issue, and when specifying device_type=cuda it works correctly as expected.
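A quick smoke test for whether the installed wheel really has CUDA support (a sketch with synthetic data; if the wheel was built without CUDA, the fit is expected to raise a LightGBMError saying the CUDA tree learner is not enabled in this build, though the exact message may vary by version):

import numpy as np
import lightgbm as lgb

# Tiny synthetic regression problem, just to exercise device_type='cuda'.
X_check = np.random.rand(1000, 10)
y_check = np.random.rand(1000)
try:
    lgb.LGBMRegressor(device_type='cuda', n_estimators=1).fit(X_check, y_check)
    print("CUDA build looks usable")
except lgb.basic.LightGBMError as e:
    print(f"CUDA not usable in this build: {e}")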
On a side note, the main CUDA memory issue still persists, and it relates to a categorical feature with too many unique values (I tested by omitting that feature, and it works fine on gpu, cuda, and cpu). When including that feature, the gpu version gives LightGBMError: bin size XXX cannot run on GPU; the CPU version works fine but takes a very long time; and the cuda version produces the errors below (Optuna study with multiple workers).
[LightGBM] [Fatal] [CUDA] an illegal memory access was encountered /home/user/notebooks/jupiter/tmp/src/io/cuda/cuda_column_data.cpp 67
[LightGBM] [Fatal] [CUDA] an illegal memory access was encountered /home/user/notebooks/jupiter/tmp/src/treelearner/cuda/cuda_best_split_finder.cu 2066
[LightGBM] [Fatal] [CUDA] an illegal memory access was encountered /home/user/notebooks/jupiter/tmp/src/io/cuda/cuda_column_data.cpp 67
[LightGBM] [Warning] [CUDA] an illegal memory access was encountered /home/user/notebooks/jupiter/tmp/src/io/cuda/cuda_column_data.cpp 67
[LightGBM] [Warning] [CUDA] an illegal memory access was encountered /home/user/notebooks/jupiter/tmp/src/io/cuda/cuda_column_data.cpp 67
[LightGBM] [Fatal] [CUDA] an illegal memory access was encountered /home/user/notebooks/jupiter/tmp/src/io/cuda/cuda_column_data.cpp 67
[LightGBM] [Fatal] [CUDA] an illegal memory access was encountered /home/user/notebooks/jupiter/tmp/src/io/cuda/cuda_tree.cpp 37
[LightGBM] [Warning] [CUDA] an illegal memory access was encountered /home/user/notebooks/jupiter/tmp/src/io/cuda/cuda_column_data.cpp 67
[LightGBM] [Fatal] [CUDA] an illegal memory access was encountered /home/user/notebooks/jupiter/tmp/src/io/cuda/cuda_column_data.cpp 67
terminate called after throwing an instance of 'std::runtime_error'
[LightGBM] [Warning] [CUDA] an illegal memory access was encountered /home/user/notebooks/jupiter/tmp/src/io/cuda/cuda_column_data.cpp 67
what(): [CUDA] an illegal memory access was encountered /home/user/notebooks/jupiter/tmp/src/io/cuda/cuda_tree.cpp 37
[LightGBM] [Fatal] [CUDA] an illegal memory access was encountered /home/user/notebooks/jupiter/tmp/src/io/cuda/cuda_column_data.cpp 67
[LightGBM] [Warning] [CUDA] an illegal memory access was encountered /home/user/notebooks/jupiter/tmp/src/io/cuda/cuda_column_data.cpp 67
So it seems there is a limitation in the implementation when it comes to categorical features on cuda/gpu that requires a fix.
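One workaround sketch (the column name high_card_col is hypothetical, not from this thread): collapse rare categories of the offending high-cardinality feature before training so it needs far fewer bins; this avoids the GPU bin-size limit at the cost of lumping rare categories together.

import pandas as pd

def cap_cardinality(s: pd.Series, top_k: int = 200, other_label: str = "__other__") -> pd.Series:
    # Keep the top_k most frequent categories and map everything else to a
    # single placeholder, so the column needs at most top_k + 1 bins.
    keep = s.value_counts().nlargest(top_k).index
    return s.where(s.isin(keep), other_label)

# Hypothetical usage on a DataFrame df with the offending column:
# df["high_card_col"] = cap_cardinality(df["high_card_col"], top_k=200)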
I have the same issue
"LightGBMError: bin size 512 cannot run on GPU."
I know there are a couple of other issues that mention this problem, but it's gotten messy with suggestions that it's related to the categorical_feature setting and other things. Here is a clean MRE.
d9a96c90cb479cef87047ba20517d97982b563eb
lgb257.pkl.zip
FYI a model.get_params() shows:
and FYI here is kwargs:
Running
fails the same way, but I'm unsure whether the sklearn API is then using 'auto' for categorical_feature.