Open dluks opened 6 days ago
Have you tried using: RUN pip3 install --no-binary lightgbm --config-settings=cmake.define.USE_CUDA=ON lightgbm ? This gave me the least headache when building within Docker with nvidia/cuda:12.4.1-cudnn-devel-ubuntu22.04 as the base image. I updated cmake to 3.28.5 manually because the Ubuntu 22.04 repository only has a lower version.
Interesting suggestion, but I'm not able to get that solution working in a Docker container either... FWIW, my local cmake version is 3.31.0 (installed via snap), and, as I said, I was (I think) able to successfully build LightGBM with CUDA enabled, but it still fails with the
lightgbm.basic.LightGBMError: Check failed: (split_indices_block_size_data_partition) > (0) at ~/LightGBM/src/treelearner/cuda/cuda_data_partition.cpp, line 280 .
error when implemented in code. This remains true if I install LightGBM inside a Docker container with something like
RUN python3.11 -m pip install --no-binary lightgbm --config-settings=cmake.define.USE_CUDA=ON lightgbm
and then copy the installed package outside of the container into my local environment.
The build will succeed even if the CUDA versions do not match (driver, LightGBM build, etc.), and the mismatch will only cause an error when you attempt to run a model. That is probably the cause, I reckon.
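One way to look for such a mismatch is to compare the header that nvidia-smi prints on the host with the one printed inside the container. The following is a minimal sketch, assuming the header keeps its usual "Driver Version: ... CUDA Version: ..." layout; the helper name parse_smi_header is made up for illustration. Note that the CUDA version shown by nvidia-smi is the highest version the driver supports, not the version of the installed toolkit.

```python
import re


def parse_smi_header(text):
    """Return (driver_version, cuda_version) found in nvidia-smi output, or None."""
    match = re.search(
        r"Driver Version:\s*([\d.]+).*?CUDA Version:\s*([\d.]+)", text, re.S
    )
    return (match.group(1), match.group(2)) if match else None


# Example header line; real output has more lines and box-drawing characters.
host = "NVIDIA-SMI 550.127.05 Driver Version: 550.127.05 CUDA Version: 12.4"
print(parse_smi_header(host))
```

The text to parse can be captured on each side with subprocess.run(["nvidia-smi"], capture_output=True, text=True).stdout, and the two resulting tuples compared.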
For a quick test (5 to 10 minutes of build time), you can try the following Dockerfile (tested and working on 2 hosts, a PC and a laptop). After the build completes, run the command below (8888 is the default Jupyter port; unless your host is already using that port, it should work):
docker run --gpus all -p 8888:8888 <image-name>
Then enter http://localhost:8888/lab in your browser and you can test your code.
Note: the notebook is set up with no password for testing, so you might want to change that if it works. If you are still encountering problems, check nvidia-smi both in Docker and on the host to confirm that both report the same CUDA version. This is from my host: NVIDIA-SMI 550.127.05 Driver Version: 550.127.05 CUDA Version: 12.4
FROM nvidia/cuda:12.4.1-cudnn-devel-ubuntu22.04
#################################################################################################################
# Global
#################################################################################################################
ENV LANG=C.UTF-8 LC_ALL=C.UTF-8
ARG DEBIAN_FRONTEND=noninteractive
#################################################################################################################
# SYSTEM
#################################################################################################################
RUN apt-get update && \
    apt-get install -y --no-install-recommends \
        build-essential \
        curl \
        bzip2 \
        ca-certificates \
        libglib2.0-0 \
        libxext6 \
        libsm6 \
        libxrender1 \
        git \
        gnupg \
        swig \
        vim \
        mercurial \
        subversion \
        python3-dev \
        python3-pip \
        python3-setuptools \
        ocl-icd-opencl-dev \
        cmake \
        libboost-dev \
        libboost-system-dev \
        libboost-filesystem-dev \
        gcc \
        g++ && \
    # Remove old CMake and install the latest version
    apt-get remove -y cmake && \
    curl -fsSL https://github.com/Kitware/CMake/releases/download/v3.28.5/cmake-3.28.5-linux-x86_64.tar.gz | tar -xz -C /usr/local --strip-components=1 && \
    # Install Node.js 18.x
    curl -fsSL https://deb.nodesource.com/setup_18.x | bash - && \
    apt-get update && \
    apt-get install -y nodejs=18.20.4*
# Add OpenCL ICD files for LightGBM
RUN mkdir -p /etc/OpenCL/vendors && \
    echo "libnvidia-opencl.so.1" > /etc/OpenCL/vendors/nvidia.icd
#################################################################################################################
# ML libraries and dependencies
#################################################################################################################
RUN pip3 install --upgrade pip
RUN pip3 install jupyterlab==4.2.5
RUN pip3 install scikit-learn==1.5.2
RUN pip3 install --no-binary lightgbm --config-settings=cmake.define.USE_CUDA=ON lightgbm
RUN apt-get autoremove -y && apt-get clean && \
    rm -rf /var/lib/apt/lists/*
CMD ["jupyter-lab", "--ip=0.0.0.0", "--allow-root", "--NotebookApp.token=''", "--NotebookApp.password=''", "--no-browser"]
Lastly, the LightGBM install command is kept at the end so that you can switch between installation methods without rebuilding the earlier layers, which makes debugging the LightGBM installation much faster. If you are missing any system package, just add it in the SYSTEM section; Docker will then rebuild only the layers from that line onward. Hopefully this gets you a working LightGBM on your machine. For completeness, I used the following to test the build:
import numpy as np
from sklearn.model_selection import train_test_split
import lightgbm as lgb
np.random.seed(42)
n_samples = 500 * 10000
n_features = 51
X = np.random.rand(n_samples, n_features).astype(np.float32)
y = np.random.rand(n_samples).astype(np.float32)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
X_train = np.ascontiguousarray(X_train, dtype=np.float32)
y_train = np.ascontiguousarray(y_train, dtype=np.float32)
new_lgb_train = lgb.Dataset(X_train, label=y_train)
cuda_params = {
    'objective': 'regression',
    'boosting_type': 'dart',
    'colsample_bytree': 0.7,
    'learning_rate': 0.01,
    'max_depth': 7,
    'subsample': 0.7,
    'n_jobs': 32,
    'num_leaves': 63,
    'verbose': 1,
    'device': 'cuda',
    'force_row_wise': True
}
gbm_cuda = lgb.train(cuda_params,
                     new_lgb_train,
                     num_boost_round=100)
gbm_cuda.predict(X_test)
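The script above allocates 5,000,000 rows, so each attempt is slow. For a quicker first signal, a much smaller variant can be run on both the CPU and CUDA devices: if CPU training works but CUDA training fails, the problem is specific to the CUDA build rather than to the install as a whole. This is only a sketch; the helper names try_device and train_tiny_model are made up for illustration, and a dataset this small may not trigger every failure the full script does.

```python
def train_tiny_model(params):
    """Train a few boosting rounds on a small random regression dataset."""
    # Imported here so the helpers can be loaded even if LightGBM is missing.
    import numpy as np
    import lightgbm as lgb

    rng = np.random.default_rng(42)
    X = rng.random((10_000, 20), dtype=np.float32)
    y = rng.random(10_000, dtype=np.float32)
    return lgb.train(params, lgb.Dataset(X, label=y), num_boost_round=5)


def try_device(device, train_fn=train_tiny_model):
    """Return (ok, message) for one training attempt on the given device."""
    params = {"objective": "regression", "device": device, "verbose": -1}
    try:
        train_fn(params)
    except Exception as exc:  # report any failure instead of crashing
        return False, f"{type(exc).__name__}: {exc}"
    return True, "ok"


if __name__ == "__main__":
    for device in ("cpu", "cuda"):
        print(device, try_device(device))
```

The train_fn argument exists only so the wrapper can be exercised without a GPU; in normal use the default is enough.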
I'm not sure I'm following. Is the purpose of building LightGBM inside a Docker container to solve this supposed mismatch between NVIDIA driver and Cuda version? Or is it to identify such a mismatch in the first place so that I can solve it locally? The project I am working on doesn't use Docker, so even if this works, I still need to be able to use LightGBM with Cuda outside of the container.
This feels like a pretty circuitous workaround for a pretty common OS and GPU combination. It would be great to hear from a current maintainer if the issue I originally posted above is reproducible (and therefore an actual bug), or something particular to my system.
Using Docker is primarily a way to avoid conflicts between NVIDIA drivers and CUDA versions, and comparing the nvidia-smi output between Docker and the host will identify a mismatch, which you can then solve locally by installing the correct version. However, the main reason is that I work with multiple ML frameworks and libraries, and Docker helps manage conflicting dependencies without risking issues on my host system, so I suggested what worked for my use.
Anyways, hope the current maintainer can provide a solution for you soon. Cheers!
I was able to fix this issue, though I did change a few things all at once so I can't be 100% sure what precisely did it. Here's what I did:
- Downgraded gcc and g++ from version 11 -> 10
- Downgraded cmake from 3.31 to 3.28
Things I'm pretty sure of:
- Downgrading gcc and g++ from 11 -> 10 allowed me to build the CUDA version of LightGBM per the documentation.
- Downgrading cmake probably didn't hurt, but it may not have been necessary.
Downgrading gcc and g++
I followed the lead of this comment:
sudo apt install gcc-10 g++-10
export CC=/usr/bin/gcc-10
export CXX=/usr/bin/g++-10
export CUDA_ROOT=/usr/local/cuda
ln -s /usr/bin/gcc-10 $CUDA_ROOT/bin/gcc
ln -s /usr/bin/g++-10 $CUDA_ROOT/bin/g++
Upgrading cuda-toolkit and NVIDIA drivers
I simply followed the instructions here.
Downgrading cmake
Because the highest version of cmake on Ubuntu 22.04 is currently 3.22.1, I use snap to access more recent versions. The most recent is 3.31, but I saw accounts of others having success with 3.28, so I figured what the heck, might as well match it.
sudo snap refresh cmake --channel=3.28/stable
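Before rebuilding, it can help to confirm which compiler and cmake versions the build will actually pick up. The sketch below is one way to do that, assuming each tool prints its version in response to --version; the helper names and the wanted table are made up for illustration, and the target versions are simply the ones that worked in this thread. It honors the CC and CXX variables exported above and falls back to gcc and g++ on the PATH.

```python
import os
import re
import shutil
import subprocess


def parse_version(text):
    """Extract the first dotted version number in text as a tuple of ints."""
    match = re.search(r"(\d+(?:\.\d+)+)", text)
    return tuple(int(part) for part in match.group(1).split(".")) if match else None


def tool_version(name):
    """Return the version reported by `<name> --version`, or None if not found."""
    if shutil.which(name) is None:
        return None
    result = subprocess.run([name, "--version"], capture_output=True, text=True)
    return parse_version(result.stdout)


# Versions that worked in this thread; a starting point, not a rule.
wanted = {
    os.environ.get("CC", "gcc"): (10,),
    os.environ.get("CXX", "g++"): (10,),
    "cmake": (3, 28),
}

for tool, prefix in wanted.items():
    found = tool_version(tool)
    status = "ok" if found and found[: len(prefix)] == prefix else "differs"
    print(f"{tool}: found {found}, wanted {prefix} -> {status}")
```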
Description
My issue is very similar to #6705, though I believe I can rule out the compute capability of my GPUs as the cause, since it is above 8 (NVIDIA RTX A6000).
Reproducible example
Environment info
LightGBM version or commit hash: Commit: 5151fe85f08e5dccff7d48242dddace51f9c8ede
Command(s) you used to install LightGBM
I followed the instructions here, with a slight modification.
Note that only the inclusion of --target _lightgbm worked for me. Otherwise I encountered the same issue as reported in #5089.
OS: Ubuntu 22.04.5 LTS
Cuda version:
NVIDIA driver version: 535.183.01
GPU: NVIDIA RTX A6000
Python version: 3.11.9
Additional information
Traceback from the example script: