LZhen0711 opened 3 months ago
I am also encountering similar issues when using a large dataset with CUDA. I have verified this behavior on at least 3 different machines; every time I get similar logs before the Python script or notebook crashes.
In my case, the dataset has 11 million rows and is close to 1 GB. I am unsure whether large bins are the cause, because it crashes even on default settings. Here's my minimal setup:
fixed_params = {
    "objective": "binary",
    "metric": "auc",
    "boosting_type": "gbdt",
    "data_sample_strategy": "bagging",
    "num_iterations": 5000,
    "device_type": "cuda",
    "random_state": 6241,
    "force_row_wise": True,
    "bagging_seed": 113,
    "early_stopping_rounds": 100,
    "verbose": 2,
}
gbm = lightgbm.train(
    fixed_params,  # params is the first positional argument of lightgbm.train
    train_pool,
    valid_sets=[valid_pool],
    valid_names=["valid"],
)
Here's the LGBM log before it crashes
Here's my environment info:
Driver Version: 535.104.05 CUDA Version: 12.2
lightgbm==4.4.0
but I have verified that this behavior is the same in v4.2.0
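For scale, a rough back-of-envelope estimate of the memory the binned dataset and per-leaf histograms need can help check whether the GPU is simply running out of memory. The feature count and per-entry byte sizes below are illustrative assumptions, not values taken from LightGBM's internals:

```python
# Back-of-envelope memory estimate for a binned dataset and per-leaf
# histograms. All constants here are assumptions for illustration.

NUM_ROWS = 11_000_000   # dataset size from the report above
NUM_FEATURES = 100      # hypothetical; adjust to your data
MAX_BIN = 255           # LightGBM's default max_bin

# Binned feature values: with max_bin <= 255 each value typically fits
# in a single byte; larger bin counts need two bytes per value.
bytes_per_value = 1 if MAX_BIN <= 255 else 2
binned_data_bytes = NUM_ROWS * NUM_FEATURES * bytes_per_value

# Per-leaf histogram: one entry per (feature, bin) pair; assume 16 bytes
# per entry (gradient sum + hessian sum as doubles).
histogram_bytes = NUM_FEATURES * MAX_BIN * 16

print(f"binned data: ~{binned_data_bytes / 1e9:.2f} GB")
print(f"one histogram: ~{histogram_bytes / 1e6:.2f} MB")
```

With these assumed numbers the binned data alone is around 1.1 GB, which is consistent with the ~1 GB dataset described above; a much larger max_bin inflates both figures.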
Description
When using the CUDA histogram implementation from the master branch, a simple Python script reports a memory error if a large max_bin size is used.
Reproducible example
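A minimal sketch of the kind of script described, assuming synthetic binary-classification data and a hypothetical max_bin of 1024 on the CUDA device (the data shape and parameter values are my assumptions, not the reporter's exact code):

```python
# Hypothetical minimal reproducer: train on synthetic data with a large
# max_bin on the CUDA device. Shapes and max_bin are illustrative.
params = {
    "objective": "binary",
    "device_type": "cuda",
    "max_bin": 1024,      # large bin count of the kind that provokes the error
    "num_iterations": 10,
    "verbose": 2,
}

try:
    import numpy as np
    import lightgbm as lgb

    rng = np.random.default_rng(0)
    X = rng.random((100_000, 50))
    y = rng.integers(0, 2, size=100_000)

    train_set = lgb.Dataset(X, label=y)
    booster = lgb.train(params, train_set)
except Exception as e:
    # lightgbm's CUDA build (or lightgbm itself) may be unavailable here
    print(f"skipped training: {type(e).__name__}")

print(params["max_bin"])  # -> 1024
```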
Running it reports a memory error:
Environment info
GPU: NVIDIA GeForce RTX 3060
Python: 3.12.4
LightGBM version or commit hash: master branch
Command(s) you used to install LightGBM
Additional Comments