microsoft / LightGBM

A fast, distributed, high performance gradient boosting (GBT, GBDT, GBRT, GBM or MART) framework based on decision tree algorithms, used for ranking, classification and many other machine learning tasks.
https://lightgbm.readthedocs.io/en/latest/
MIT License

[CUDA] GPU training fails with (split_indices_block_size_data_partition) > (0) on Ubuntu 22.04 #6727

Open dluks opened 6 days ago

dluks commented 6 days ago

Description

My issue is very similar to #6705, though I believe I can rule out the compute capability of my GPUs as the cause, since it is above 8 (NVIDIA RTX A6000).

Reproducible example

import lightgbm as lgb
from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=10000, n_features=150, noise=0.1, random_state=42)

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Create and train the LightGBM regressor with GPU support
reg = lgb.LGBMRegressor(objective="regression", device="cuda", verbose=3)

reg.fit(X_train, y_train)

# Predict and evaluate
y_pred = reg.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
print(f"Mean Squared Error: {mse}")

Environment info

LightGBM version or commit hash: Commit: 5151fe85f08e5dccff7d48242dddace51f9c8ede

Command(s) you used to install LightGBM

I followed the instructions here, with a slight modification.

git clone --recursive https://github.com/microsoft/LightGBM
cd LightGBM
cmake -B build -S . -DUSE_CUDA=1
cmake --build build --target _lightgbm -j2
sh build-python.sh install --precompile

Note that the build only worked for me with --target _lightgbm included. Otherwise I encountered the same issue as reported in #5089.

OS: Ubuntu 22.04.5 LTS

Cuda version:

$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2024 NVIDIA Corporation
Built on Thu_Sep_12_02:18:05_PDT_2024
Cuda compilation tools, release 12.6, V12.6.77
Build cuda_12.6.r12.6/compiler.34841621_0

NVIDIA driver version: 535.183.01 GPU: NVIDIA RTX A6000

Python version: 3.11.9

Additional information

Traceback from the example script:

[LightGBM] [Warning] Using sparse features with CUDA is currently not supported.
[LightGBM] [Debug] Dataset::GetMultiBinFromAllFeatures: sparse rate 0.000000
[LightGBM] [Info] Total Bins 38250
[LightGBM] [Info] Number of data points in the train set: 8000, number of used features: 150
[LightGBM] [Fatal] Check failed: (split_indices_block_size_data_partition) > (0) at ~/LightGBM/src/treelearner/cuda/cuda_data_partition.cpp, line 280 .

Traceback (most recent call last):
  File "~/lgb_error.py", line 16, in <module>
    reg.fit(X_train, y_train)
  File "~/lib/python3.11/site-packages/lightgbm/sklearn.py", line 1313, in fit
    super().fit(
  File "~/lib/python3.11/site-packages/lightgbm/sklearn.py", line 1015, in fit
    self._Booster = train(
                    ^^^^^^
  File "~/lib/python3.11/site-packages/lightgbm/engine.py", line 322, in train
    booster.update(fobj=fobj)
  File "~/lib/python3.11/site-packages/lightgbm/basic.py", line 4143, in update
    _safe_call(
  File "~/lib/python3.11/site-packages/lightgbm/basic.py", line 295, in _safe_call
    raise LightGBMError(_LIB.LGBM_GetLastError().decode("utf-8"))
lightgbm.basic.LightGBMError: Check failed: (split_indices_block_size_data_partition) > (0) at ~/LightGBM/src/treelearner/cuda/cuda_data_partition.cpp, line 280 .

sgapple commented 6 days ago

Have you tried using: RUN pip3 install --no-binary lightgbm --config-settings=cmake.define.USE_CUDA=ON lightgbm ? This gave me the least headache when building within Docker with nvidia/cuda:12.4.1-cudnn-devel-ubuntu22.04 as the base image. I updated cmake to 3.28.5 manually because the Ubuntu 22.04 repository only has a lower version.

dluks commented 5 days ago

Interesting suggestion, but I'm not able to get that solution working in a Docker container either... FWIW, my local cmake version is 3.31.0 (installed via snap), and, as I said, I was (I think) able to successfully build LightGBM with CUDA enabled, but it still fails with the

lightgbm.basic.LightGBMError: Check failed: (split_indices_block_size_data_partition) > (0) at ~/LightGBM/src/treelearner/cuda/cuda_data_partition.cpp, line 280 .

error when I actually run the code. This remains true if I install LightGBM inside a Docker container with something like

RUN python3.11 -m pip install --no-binary lightgbm --config-settings=cmake.define.USE_CUDA=ON lightgbm

and then copy the installed package outside of the container into my local environment.

sgapple commented 5 days ago

The build will succeed even if the CUDA versions do not match (driver, LightGBM, etc.), and the mismatch will only cause an error when you attempt to run a model. I reckon that is probably the cause.
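One rough way to look for such a mismatch is to compare the CUDA version that nvidia-smi reports (the highest version the driver supports) with the toolkit release that nvcc reports. The sketch below is only an illustration: the function names are hypothetical, it assumes both tools print their usual "CUDA Version: X.Y" and "release X.Y" strings, and a toolkit newer than the driver is a hint worth checking rather than proof of the cause.

```python
import re
import shutil
import subprocess


def parse_driver_cuda(smi_text):
    """Return (major, minor) from the 'CUDA Version: X.Y' field of nvidia-smi, or None."""
    m = re.search(r"CUDA Version:\s*(\d+)\.(\d+)", smi_text)
    return (int(m.group(1)), int(m.group(2))) if m else None


def parse_toolkit_cuda(nvcc_text):
    """Return (major, minor) from the 'release X.Y' field of nvcc --version, or None."""
    m = re.search(r"release\s+(\d+)\.(\d+)", nvcc_text)
    return (int(m.group(1)), int(m.group(2))) if m else None


def toolkit_newer_than_driver(smi_text, nvcc_text):
    """True if the toolkit is newer than the driver-supported version, None if unknown."""
    driver = parse_driver_cuda(smi_text)
    toolkit = parse_toolkit_cuda(nvcc_text)
    if driver is None or toolkit is None:
        return None
    return toolkit > driver


def run_or_empty(cmd):
    """Run a command and return its stdout, or '' if the tool is missing or fails to start."""
    if shutil.which(cmd[0]) is None:
        return ""
    try:
        done = subprocess.run(cmd, capture_output=True, text=True, timeout=30)
    except (OSError, subprocess.TimeoutExpired):
        return ""
    return done.stdout


if __name__ == "__main__":
    smi_out = run_or_empty(["nvidia-smi"])
    nvcc_out = run_or_empty(["nvcc", "--version"])
    print("driver supports CUDA:", parse_driver_cuda(smi_out))
    print("toolkit (nvcc) CUDA: ", parse_toolkit_cuda(nvcc_out))
    print("toolkit newer than driver:", toolkit_newer_than_driver(smi_out, nvcc_out))
```

Running it once on the host and once inside the container gives two sets of numbers that can be compared side by side.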

For a quick test (5 to 10 minutes of build time), you can try the following Dockerfile (tested and working on 2 hosts, a PC and a laptop). After the build completes, run the following (8888 is the default Jupyter port; it should work unless your host is already using that port):

docker run --gpus all -p 8888:8888 <image-name>

Enter http://localhost:8888/lab in your browser and you can test your code.

Note: the notebook is set up with no password for testing, so you might want to change that if it works. If you are still encountering problems, check nvidia-smi in Docker and on the host to confirm that both report the same CUDA Version. This is from my host: NVIDIA-SMI 550.127.05 Driver Version: 550.127.05 CUDA Version: 12.4

FROM nvidia/cuda:12.4.1-cudnn-devel-ubuntu22.04

#################################################################################################################
#           Global
#################################################################################################################
ENV LANG=C.UTF-8 LC_ALL=C.UTF-8
ARG DEBIAN_FRONTEND=noninteractive

#################################################################################################################
# SYSTEM
#################################################################################################################
RUN apt-get update && \
apt-get install -y --no-install-recommends \
build-essential \
curl \
bzip2 \
ca-certificates \
libglib2.0-0 \
libxext6 \
libsm6 \
libxrender1 \
git \
gnupg \
swig \
vim \
mercurial \
subversion \
python3-dev \
python3-pip \
python3-setuptools \
ocl-icd-opencl-dev \
cmake \
libboost-dev \
libboost-system-dev \
libboost-filesystem-dev \
gcc \
g++ && \
# Remove the apt-installed CMake and install CMake 3.28.5 from the Kitware release tarball
apt-get remove -y cmake && \
curl -fsSL https://github.com/Kitware/CMake/releases/download/v3.28.5/cmake-3.28.5-linux-x86_64.tar.gz | tar -xz -C /usr/local --strip-components=1 && \
# Install Node.js 18.x
curl -fsSL https://deb.nodesource.com/setup_18.x | bash - && \
apt-get update && \
apt-get install -y nodejs=18.20.4*

# Add OpenCL ICD files for LightGBM
RUN mkdir -p /etc/OpenCL/vendors && \
    echo "libnvidia-opencl.so.1" > /etc/OpenCL/vendors/nvidia.icd

#################################################################################################################
#           ML libraries and dependencies
#################################################################################################################
RUN pip3 install --upgrade pip
RUN pip3 install jupyterlab==4.2.5
RUN pip3 install scikit-learn==1.5.2
RUN pip3 install --no-binary lightgbm --config-settings=cmake.define.USE_CUDA=ON lightgbm

RUN apt-get autoremove -y && apt-get clean && \
    rm -rf /var/lib/apt/lists/*

CMD ["jupyter-lab", "--ip=0.0.0.0", "--allow-root", "--NotebookApp.token=''", "--NotebookApp.password=''", "--no-browser"]

Lastly, the lightgbm install command is kept at the end, so you can switch between installation methods without having to rebuild the earlier layers, which makes debugging the lgbm installation much faster. If you are missing any system package, just add it to the SYSTEM section, and Docker will only rebuild the steps after that line. Hopefully this will get a working lgbm on your machine. For completeness, I used the following to test the build:

import numpy as np
from sklearn.model_selection import train_test_split
import lightgbm as lgb

np.random.seed(42)
n_samples = 500 * 10000
n_features = 51

X = np.random.rand(n_samples, n_features).astype(np.float32)
y = np.random.rand(n_samples).astype(np.float32)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

X_train = np.ascontiguousarray(X_train, dtype=np.float32)
y_train = np.ascontiguousarray(y_train, dtype=np.float32)

new_lgb_train = lgb.Dataset(X_train, label=y_train)

cuda_params = {
    'objective': 'regression',
    'boosting_type': 'dart',
    'colsample_bytree': 0.7,
    'learning_rate': 0.01,
    'max_depth': 7,
    'subsample': 0.7,
    'n_jobs': 32,
    'num_leaves': 63,
    'verbose': 1,
    'device': 'cuda',
    'force_row_wise': True
}

gbm_cuda = lgb.train(cuda_params,
                new_lgb_train,
                num_boost_round=100)

gbm_cuda.predict(X_test)

dluks commented 5 days ago

I'm not sure I'm following. Is the purpose of building LightGBM inside a Docker container to solve this supposed mismatch between NVIDIA driver and Cuda version? Or is it to identify such a mismatch in the first place so that I can solve it locally? The project I am working on doesn't use Docker, so even if this works, I still need to be able to use LightGBM with Cuda outside of the container.

This feels like a pretty circuitous workaround for a pretty common OS and GPU combination. It would be great to hear from a current maintainer if the issue I originally posted above is reproducible (and therefore an actual bug), or something particular to my system.

sgapple commented 4 days ago

Using Docker is primarily to avoid conflicts between NVIDIA drivers and CUDA versions, and comparing the nvidia-smi output between Docker and the host will identify a mismatch, which you can then solve locally by installing the correct version. However, the main reason is that I work with multiple ML frameworks and libraries, and Docker helps manage conflicting dependencies without risking issues on my host system, so I suggested what worked for me.

Anyways, hope the current maintainer can provide a solution for you soon. Cheers!

dluks commented 4 days ago

I was able to fix this issue, though I did change a few things all at once so I can't be 100% sure what precisely did it. Here's what I did:

Things I'm pretty sure of:

Downgrading gcc and g++: I followed the lead of this comment:

sudo apt install gcc-10 g++-10
export CC=/usr/bin/gcc-10
export CXX=/usr/bin/g++-10
export CUDA_ROOT=/usr/local/cuda
ln -s /usr/bin/gcc-10 $CUDA_ROOT/bin/gcc
ln -s /usr/bin/g++-10 $CUDA_ROOT/bin/g++

Upgrading cuda-toolkit and NVIDIA drivers: I simply followed the instructions here.

Downgrading cmake: Because the highest version of cmake in the Ubuntu 22.04 repositories is currently 3.22.1, I use snap to access more recent versions. The most recent is 3.31, but I saw accounts of others having success with 3.28, so I figured what the heck, might as well match it.

sudo snap refresh cmake --channel=3.28/stable
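
Since several things changed at once, it may help to record which toolchain versions are actually picked up before rebuilding. The following is a minimal sketch, not part of LightGBM: the helper names are hypothetical, it assumes gcc, g++, cmake and nvcc each accept --version, and it honours the CC and CXX variables exported above.

```python
import os
import re
import shutil
import subprocess


def first_version(text):
    """Return the first dotted version number found in text, or None."""
    m = re.search(r"\d+\.\d+(?:\.\d+)*", text)
    return m.group(0) if m else None


def tool_versions():
    """Map each build tool to the version it reports, or None if it is not found."""
    commands = {
        "gcc": [os.environ.get("CC", "gcc"), "--version"],
        "g++": [os.environ.get("CXX", "g++"), "--version"],
        "cmake": ["cmake", "--version"],
        "nvcc": ["nvcc", "--version"],
    }
    report = {}
    for name, cmd in commands.items():
        if shutil.which(cmd[0]) is None:
            report[name] = None  # tool not on PATH (or CC/CXX points nowhere)
            continue
        try:
            done = subprocess.run(cmd, capture_output=True, text=True, timeout=30)
        except (OSError, subprocess.TimeoutExpired):
            report[name] = None
            continue
        report[name] = first_version(done.stdout)
    return report


if __name__ == "__main__":
    for name, version in tool_versions().items():
        print(f"{name}: {version if version else 'not found'}")
```

Saving this output before and after each change makes it easier to tell later which of the three steps made the difference.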