Open aidiss opened 2 years ago
Thanks very much for using LightGBM and for the very thorough repo!
I have a few clarifying questions, and some other observations which might at least help narrow down the problem.
- what type of GPU is available on this machine?
- is the GPU active? e.g., if you have an NVIDIA GPU, what does running nvidia-smi on the host (not in docker) return?
- why are you using nvidia-docker run to start a container and then docker exec-ing into it to run model training? Does the error go away if you just directly nvidia-docker run ... python instead?
The installation broke at some point
Are you saying that exactly the same code you've provided here used to run successfully on this same machine? If so, are you able to provide a LightGBM commit (or at least rough date) that you last observed this working in your set up? That would be helpful in narrowing down what changes have happened which might impact you.
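To make that last question concrete, here is roughly the difference I'm asking about (a sketch only; the image name, training script, and the way the container is kept alive are placeholders, not taken from your setup):
# pattern described in the issue: keep a container running, then exec into it to train
nvidia-docker run -d --name lgbm-gpu-test <your-image> sleep infinity
docker exec -it lgbm-gpu-test python train.py
# suggested comparison: run the training script directly in a fresh container
nvidia-docker run --rm <your-image> python train.py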
Thanks very much for using LightGBM and for the very thorough repo!
I have a few clarifying questions, and some other observations which might at least help narrow down the problem.
- what type of GPU is available on this machine?
1060 6Gb
- is the GPU active?
Yes
- e.g., if you have an NVIDIA GPU, what does running nvidia-smi on the host (not in docker) return?
Will get back later with this one. I am not currently on the machine.
- why are you using nvidia-docker run to start a container and then docker exec-ing into it to run model training? Does the error go away if you just directly nvidia-docker run ... python instead?
Will get back later with this one too.
The installation broke at some point
Are you saying that exactly the same code you've provided here used to run successfully on this same machine? If so, are you able to provide a LightGBM commit (or at least rough date) that you last observed this working in your set up? That would be helpful in narrowing down what changes have happened which might impact you.
I think it was the same commit. The build stopped working after changes on the host machine. I think the install broke after this CUDA update happened on the host machine: https://archlinux.org/packages/community/x86_64/cuda/
Let me know if I can provide anything else. I will get back with a couple of additional points.
nvidia-smi output:
Mon Nov 7 10:05:20 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 520.56.06 Driver Version: 520.56.06 CUDA Version: 11.8 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA GeForce ... Off | 00000000:01:00.0 On | N/A |
| 49% 31C P8 9W / 120W | 353MiB / 6144MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 1438 G /usr/lib/Xorg 195MiB |
| 0 N/A N/A 1554 G alacritty 9MiB |
| 0 N/A N/A 17004 G alacritty 9MiB |
| 0 N/A N/A 23812 G ...veSuggestionsOnlyOnDemand 60MiB |
| 0 N/A N/A 23813 G ...131815054667946746,131072 74MiB |
+-----------------------------------------------------------------------------+
Running python straight through nvidia-docker run did not change anything; the same error is thrown when trying to fit a model that was instantiated with device="gpu".
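For reference, this is roughly the kind of call that fails (a minimal sketch with synthetic data, not my actual training code; the image name is a placeholder):
# run a minimal fit directly in a fresh container; fit() fails with
# "Cannot build GPU program: Build Program Failure"
nvidia-docker run --rm <lightgbm-gpu-image> python -c "
import numpy as np
import lightgbm as lgb

X = np.random.rand(1000, 20)
y = np.random.randint(0, 2, size=1000)

# device='gpu' selects the OpenCL-based GPU build
lgb.LGBMClassifier(device='gpu', n_estimators=10).fit(X, y)
"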
Can I provide any further information to solve this? @jameslamb
It will be a while (on the order of weeks) until I personally will be able to investigate this further. This project is really struggling from a lack of maintainer attention and availability right now, and I'm focusing on other more time-sensitive issues at the moment: https://github.com/microsoft/LightGBM/issues/5153#issuecomment-1319532263.
If you investigate yourself and find any information that might help, please do post it here.
I also stumbled upon this issue. If it helps debugging:
lucas@pop-os (neofetch):
OS: Pop!_OS 20.04 LTS x86_64
Kernel: 5.17.5-76051705-generic
Uptime: 25 mins
Packages: 2624 (dpkg), 115 (nix-user), 46 (nix-default), 6 (f
Shell: bash 5.0.17
Resolution: 1920x1080, 1920x1080
DE: GNOME
WM: Mutter
WM Theme: Pop
Theme: Pop-dark [GTK2/3]
Icons: Pop [GTK2/3]
Terminal: gnome-terminal
CPU: AMD Ryzen 7 3800X (16) @ 3.900GHz
GPU: NVIDIA GeForce RTX 2070 Rev. A
Memory: 5944MiB / 64303MiB
lucas@pop-os:~$ nvidia-smi
Tue Nov 22 20:24:23 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 520.61.05 Driver Version: 520.61.05 CUDA Version: 11.8 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA GeForce ... On | 00000000:09:00.0 On | N/A |
| 0% 53C P8 26W / 185W | 506MiB / 8192MiB | 3% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 2372 G /usr/lib/xorg/Xorg 59MiB |
| 0 N/A N/A 4604 G /usr/lib/xorg/Xorg 182MiB |
| 0 N/A N/A 4866 G /usr/bin/gnome-shell 28MiB |
| 0 N/A N/A 7253 G ...veSuggestionsOnlyOnDemand 65MiB |
| 0 N/A N/A 9600 G ...RendererForSitePerProcess 37MiB |
| 0 N/A N/A 10378 G /usr/lib/firefox/firefox 121MiB |
+-----------------------------------------------------------------------------+
lucas@pop-os:~$ sudo docker run -it --gpus all nvidia/cuda:11.8.0-cudnn8-devel-ubuntu22.04 nvidia-smi
Tue Nov 22 23:24:38 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 520.61.05 Driver Version: 520.61.05 CUDA Version: 11.8 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA GeForce ... On | 00000000:09:00.0 On | N/A |
| 0% 55C P5 32W / 185W | 506MiB / 8192MiB | 38% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
+-----------------------------------------------------------------------------+
One thing to note is that I was able to get CatBoost to train correctly on the GPU, but the same docker image is not able to run LightGBM.
Downgrading drivers fixed it for me.
I was able to get it to run by compiling LightGBM 3.3.1 with driver 515 and the latest CUDA:
FROM nvidia/cuda:11.8.0-cudnn8-devel-ubuntu22.04
....
RUN cd /usr/local/src && mkdir lightgbm && cd lightgbm && \
git clone --recursive --branch v3.3.1 --depth 1 https://github.com/microsoft/LightGBM && \
cd LightGBM && mkdir build && cd build && \
cmake -DUSE_GPU=1 -DOpenCL_LIBRARY=/usr/local/cuda/lib64/libOpenCL.so -DOpenCL_INCLUDE_DIR=/usr/local/cuda/include/ .. && \
make -j8 OPENCL_HEADERS=/usr/local/cuda-11.8.0/targets/x86_64-linux/include LIBOPENCL=/usr/local/cuda-11.8.0/targets/x86_64-linux/lib
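For completeness, this is roughly how I build and exercise that image (the tag and script names are placeholders; installing the Python package happens in the elided part of the Dockerfile):
# build the image from the Dockerfile above, then run training inside it
sudo docker build -t lgbm-gpu-test .
sudo docker run --rm --gpus all lgbm-gpu-test python train.py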
sudo docker run -it --gpus all nvidia/cuda:11.8.0-cudnn8-devel-ubuntu22.04 nvidia-smi
Wed Nov 23 15:22:40 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.48.07 Driver Version: 515.48.07 CUDA Version: 11.8 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA GeForce ... Off | 00000000:09:00.0 On | N/A |
| 26% 55C P0 49W / 185W | 600MiB / 8192MiB | 4% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
+-----------------------------------------------------------------------------+
Downgrading nvidia drivers to 515 solved it for me too!
@lucasavila00 Thank you.
I wonder if nvidia is aware that there are problems with the new drivers.
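In case it helps anyone else, this is roughly what the downgrade looks like on an Ubuntu-based host (the package names here are an assumption and may differ per distro; on Arch you would instead downgrade the nvidia/cuda packages, e.g. from the pacman cache):
# Ubuntu-style example only; exact nvidia-driver-* package names are assumed
sudo apt-get remove --purge '^nvidia-driver-520.*'
sudo apt-get install nvidia-driver-515
sudo reboot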
@lucasavila00 @aidiss Thank you for sharing. I also downgraded the driver from 520 to 515 and that solved the problem.
Although my configuration is a bit different, it seems the driver version is the root cause of the issue.
FROM nvidia/cuda:11.7.0-devel-ubuntu20.04
...
RUN cd /usr/local/src && mkdir lightgbm && cd lightgbm && \
git clone --recursive --branch stable --depth 1 https://github.com/microsoft/LightGBM && \
cd LightGBM && mkdir build && cd build && \
cmake -DUSE_GPU=1 -DOpenCL_LIBRARY=/usr/local/cuda/lib64/libOpenCL.so -DOpenCL_INCLUDE_DIR=/usr/local/cuda/include/ .. && \
make OPENCL_HEADERS=/usr/local/cuda-11.7.0/targets/x86_64-linux/include LIBOPENCL=/usr/local/cuda-11.7.0/targets/x86_64-linux/lib
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.86.01 Driver Version: 515.86.01 CUDA Version: 11.7 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA GeForce ... On | 00000000:01:00.0 Off | Off |
| 0% 28C P8 26W / 450W | 1MiB / 24564MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
However, I don't think we should conclude that this issue is caused by the CUDA driver. There is still a possibility that it is an issue in LightGBM, isn't there?
Description
Receiving "Cannot build GPU program: Build Program Failure" when running dockerized gpu version of lightgbm.
Reproducible example
Environment info
LightGBM version or commit hash: 3.3.2
Command(s) you used to install LightGBM
Installation was done by following the docs at https://github.com/microsoft/LightGBM/tree/master/docker/gpu
That is:
The host machine is Arch Linux. The installation broke at some point, maybe when the OpenCL/CUDA version was changed.
Additional Comments
Here is the complete command-line output, including all installation steps and the run of the code, with errors.