Open aidiss opened 2 years ago
Thanks very much for using LightGBM and for the very thorough repo!
I have a few clarifying questions, and some other observations which might at least help narrow down the problem.
- what type of GPU is available on this machine?
- is the GPU active? e.g., if you have an NVIDIA GPU, what does running nvidia-smi on the host (not in docker) return?
- why are you using nvidia-docker run to start a container and then docker exec-ing into it to run model training? Does the error go away if you just directly nvidia-docker run ... python instead?
The installation broke at some point
Are you saying that exactly the same code you've provided here used to run successfully on this same machine? If so, are you able to provide a LightGBM commit (or at least rough date) that you last observed this working in your set up? That would be helpful in narrowing down what changes have happened which might impact you.
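To make that last question concrete, here is roughly the difference I'm asking about (a sketch only; the image name, training script, and the way the container is kept alive are placeholders, not taken from your setup):
# pattern described in the issue: keep a container running, then exec into it to train
nvidia-docker run -d --name lgbm-gpu-test <your-image> sleep infinity
docker exec -it lgbm-gpu-test python train.py
# suggested comparison: run the training script directly in a fresh container
nvidia-docker run --rm <your-image> python train.py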
Thanks very much for using LightGBM and for the very thorough repo!
I have a few clarifying questions, and some other observations which might at least help narrow down the problem.
- what type of GPU is available on this machine?
1060 6Gb
- is the GPU active?
Yes
- e.g., if you have an NVIDIA GPU, what does running nvidia-smi on the host (not in docker) return?
Will get back later with this one. I am not currently on the machine.
- why are you using nvidia-docker run to start a container and then docker exec-ing into it to run model training? Does the error go away if you just directly nvidia-docker run ... python instead?
Will get back later with this one too.
The installation broke at some point
Are you saying that exactly the same code you've provided here used to run successfully on this same machine? If so, are you able to provide a LightGBM commit (or at least rough date) that you last observed this working in your set up? That would be helpful in narrowing down what changes have happened which might impact you.
I think it was the same commit. The build stopped working after changes on the host machine. I think the install broke after this CUDA update happened on the host machine: https://archlinux.org/packages/community/x86_64/cuda/
Let me know if I can provide anything else. I will get back with a couple of additional points.
nvidia-smi output:
Mon Nov 7 10:05:20 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 520.56.06 Driver Version: 520.56.06 CUDA Version: 11.8 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA GeForce ... Off | 00000000:01:00.0 On | N/A |
| 49% 31C P8 9W / 120W | 353MiB / 6144MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 1438 G /usr/lib/Xorg 195MiB |
| 0 N/A N/A 1554 G alacritty 9MiB |
| 0 N/A N/A 17004 G alacritty 9MiB |
| 0 N/A N/A 23812 G ...veSuggestionsOnlyOnDemand 60MiB |
| 0 N/A N/A 23813 G ...131815054667946746,131072 74MiB |
+-----------------------------------------------------------------------------+
Running python straight through nvidia-docker run did not change anything; the same error is thrown when trying to fit a model that was instantiated with device="gpu".
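For reference, this is roughly the kind of call that fails (a minimal sketch with synthetic data, not my actual training code; the image name is a placeholder):
# run a minimal fit directly in a fresh container; fit() fails with
# "Cannot build GPU program: Build Program Failure"
nvidia-docker run --rm <lightgbm-gpu-image> python -c "
import numpy as np
import lightgbm as lgb

X = np.random.rand(1000, 20)
y = np.random.randint(0, 2, size=1000)

# device='gpu' selects the OpenCL-based GPU build
lgb.LGBMClassifier(device='gpu', n_estimators=10).fit(X, y)
"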
Can I provide any further information to solve this? @jameslamb
It will be a while (on the order of weeks) until I personally will be able to investigate this further. This project is really struggling from a lack of maintainer attention and availability right now, and I'm focusing on other more time-sensitive issues at the moment: https://github.com/microsoft/LightGBM/issues/5153#issuecomment-1319532263.
If you investigate yourself and find any information that might help, please do post it here.
I also stumbled upon this issue. If it helps debugging:
lucas@pop-os (neofetch):
OS: Pop!_OS 20.04 LTS x86_64
Kernel: 5.17.5-76051705-generic
Uptime: 25 mins
Packages: 2624 (dpkg), 115 (nix-user), 46 (nix-default), 6 (f
Shell: bash 5.0.17
Resolution: 1920x1080, 1920x1080
DE: GNOME
WM: Mutter
WM Theme: Pop
Theme: Pop-dark [GTK2/3]
Icons: Pop [GTK2/3]
Terminal: gnome-terminal
CPU: AMD Ryzen 7 3800X (16) @ 3.900GHz
GPU: NVIDIA GeForce RTX 2070 Rev. A
Memory: 5944MiB / 64303MiB
lucas@pop-os:~$ nvidia-smi
Tue Nov 22 20:24:23 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 520.61.05 Driver Version: 520.61.05 CUDA Version: 11.8 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA GeForce ... On | 00000000:09:00.0 On | N/A |
| 0% 53C P8 26W / 185W | 506MiB / 8192MiB | 3% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 2372 G /usr/lib/xorg/Xorg 59MiB |
| 0 N/A N/A 4604 G /usr/lib/xorg/Xorg 182MiB |
| 0 N/A N/A 4866 G /usr/bin/gnome-shell 28MiB |
| 0 N/A N/A 7253 G ...veSuggestionsOnlyOnDemand 65MiB |
| 0 N/A N/A 9600 G ...RendererForSitePerProcess 37MiB |
| 0 N/A N/A 10378 G /usr/lib/firefox/firefox 121MiB |
+-----------------------------------------------------------------------------+
lucas@pop-os:~$ sudo docker run -it --gpus all nvidia/cuda:11.8.0-cudnn8-devel-ubuntu22.04 nvidia-smi
Tue Nov 22 23:24:38 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 520.61.05 Driver Version: 520.61.05 CUDA Version: 11.8 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA GeForce ... On | 00000000:09:00.0 On | N/A |
| 0% 55C P5 32W / 185W | 506MiB / 8192MiB | 38% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
+-----------------------------------------------------------------------------+
One thing to note is that I was able to get CatBoost to train correctly on the GPU, but the same docker image is not able to run LightGBM.
Downgrading drivers fixed it for me.
I was able to get it to run by compiling LightGBM 3.3.1 with driver 515 and the latest CUDA:
FROM nvidia/cuda:11.8.0-cudnn8-devel-ubuntu22.04
....
RUN cd /usr/local/src && mkdir lightgbm && cd lightgbm && \
git clone --recursive --branch v3.3.1 --depth 1 https://github.com/microsoft/LightGBM && \
cd LightGBM && mkdir build && cd build && \
cmake -DUSE_GPU=1 -DOpenCL_LIBRARY=/usr/local/cuda/lib64/libOpenCL.so -DOpenCL_INCLUDE_DIR=/usr/local/cuda/include/ .. && \
make -j8 OPENCL_HEADERS=/usr/local/cuda-11.8.0/targets/x86_64-linux/include LIBOPENCL=/usr/local/cuda-11.8.0/targets/x86_64-linux/lib
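For completeness, this is roughly how I build and exercise that image (the tag and script names are placeholders; installing the Python package happens in the elided part of the Dockerfile):
# build the image from the Dockerfile above, then run training inside it
sudo docker build -t lgbm-gpu-test .
sudo docker run --rm --gpus all lgbm-gpu-test python train.py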
sudo docker run -it --gpus all nvidia/cuda:11.8.0-cudnn8-devel-ubuntu22.04 nvidia-smi
Wed Nov 23 15:22:40 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.48.07 Driver Version: 515.48.07 CUDA Version: 11.8 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA GeForce ... Off | 00000000:09:00.0 On | N/A |
| 26% 55C P0 49W / 185W | 600MiB / 8192MiB | 4% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
+-----------------------------------------------------------------------------+
Downgrading nvidia drivers to 515 solved it for me too!
@lucasavila00 Thank you.
I wonder if nvidia is aware that there are problems with the new drivers.
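In case it helps anyone else, this is roughly what the downgrade looks like on an Ubuntu-based host (the package names here are an assumption and may differ per distro; on Arch you would instead downgrade the nvidia/cuda packages, e.g. from the pacman cache):
# Ubuntu-style example only; exact nvidia-driver-* package names are assumed
sudo apt-get remove --purge '^nvidia-driver-520.*'
sudo apt-get install nvidia-driver-515
sudo reboot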
@lucasavila00 @aidiss Thank you for sharing. I also downgraded the driver from 520 to 515 and that solved the problem.
Although my configuration is a bit different, it seems the driver version is the root cause of the issue.
FROM nvidia/cuda:11.7.0-devel-ubuntu20.04
...
RUN cd /usr/local/src && mkdir lightgbm && cd lightgbm && \
git clone --recursive --branch stable --depth 1 https://github.com/microsoft/LightGBM && \
cd LightGBM && mkdir build && cd build && \
cmake -DUSE_GPU=1 -DOpenCL_LIBRARY=/usr/local/cuda/lib64/libOpenCL.so -DOpenCL_INCLUDE_DIR=/usr/local/cuda/include/ .. && \
make OPENCL_HEADERS=/usr/local/cuda-11.7.0/targets/x86_64-linux/include LIBOPENCL=/usr/local/cuda-11.7.0/targets/x86_64-linux/lib
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.86.01 Driver Version: 515.86.01 CUDA Version: 11.7 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA GeForce ... On | 00000000:01:00.0 Off | Off |
| 0% 28C P8 26W / 450W | 1MiB / 24564MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
However, I don't think we should conclude that this issue is caused by the CUDA driver. There is still a possibility that it is an issue in LightGBM, isn't there?
Description
Receiving "Cannot build GPU program: Build Program Failure" when running dockerized gpu version of lightgbm.
Reproducible example
Environment info
LightGBM version or commit hash: 3.3.2
Command(s) you used to install LightGBM
Installation was done by following the docs at https://github.com/microsoft/LightGBM/tree/master/docker/gpu
That is:
The host machine is Arch Linux. The installation broke at some point, maybe when the OpenCL/CUDA version was changed.
Additional Comments
Here is the complete command-line output, including all installation steps and the run of the code, with errors.