pop-os / tensorman

Utility for easy management of Tensorflow containers
GNU General Public License v3.0

GPU Support not working #28

Closed alexcpn closed 2 years ago

alexcpn commented 3 years ago

The NVIDIA container runtime is installed:

alexpunnen@pop-os:~$ sudo apt install nvidia-container-runtime
Reading package lists... Done
Building dependency tree       
Reading state information... Done
nvidia-container-runtime is already the newest version (3.4.0-1pop1~1601325114~20.04~2880fc6).
0 upgraded, 0 newly installed, 0 to remove and 29 not upgraded.

Error

alexpunnen@pop-os:~$ tensorman run --gpu python -- ./script.py
"docker" "run" "-u" "1000:1000" "--gpus=all" "-e" "HOME=/project" "-it" "--rm" "-v" "/home/alexpunnen:/project" "-w" "/project" "tensorflow/tensorflow:latest-gpu" "python" "./script.py"
docker: Error response from daemon: could not select device driver "" with capabilities: [[gpu]].
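
As a sanity check, you can also ask the Docker daemon which runtimes it knows about; if nvidia is missing from the list, the package is installed but was never registered with Docker:

docker info | grep -i runtimes
# expect "nvidia" to appear alongside "runc"; if not, check /etc/docker/daemon.json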

The driver is installed:

alexpunnen@pop-os:~$ nvidia-smi
Mon May 17 19:46:54 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.73.01    Driver Version: 460.73.01    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce MX250       Off  | 00000000:02:00.0 Off |                  N/A |
| N/A   55C    P0    N/A /  N/A |    260MiB /  2002MiB |      7%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A       900      G   /usr/lib/xorg/Xorg                 45MiB |
|    0   N/A  N/A     14737      G   /usr/lib/xorg/Xorg                141MiB |
|    0   N/A  N/A     14867      G   /usr/bin/gnome-shell               24MiB |
|    0   N/A  N/A     18812      G   ...AAAAAAAAA= --shared-files       40MiB |
+-----------------------------------------------------------------------------+

Should we install nvidia-docker2? I was not able to install it:

alexpunnen@pop-os:~$ curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | \
>   sudo tee /etc/apt/sources.list.d/nvidia-docker.list
# Unsupported distribution!
# Check https://nvidia.github.io/nvidia-docker
alexpunnen@pop-os:~$ sudo apt-get install nvidia-docker2
Reading package lists... Done
Building dependency tree       
Reading state information... Done
E: Unable to locate package nvidia-docker2
alexpunnen@pop-os:~$ cat /etc/os-release
NAME="Pop!_OS"
VERSION="20.04 LTS"
ID=pop
ID_LIKE="ubuntu debian"
PRETTY_NAME="Pop!_OS 20.04 LTS"
VERSION_ID="20.04"
HOME_URL="https://pop.system76.com"
SUPPORT_URL="https://support.system76.com"
BUG_REPORT_URL="https://github.com/pop-os/pop/issues"
PRIVACY_POLICY_URL="https://system76.com/privacy"
VERSION_CODENAME=focal
UBUNTU_CODENAME=focal
LOGO=distributor-logo-pop-os
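
For reference, NVIDIA's install instructions derive $distribution from /etc/os-release, roughly like this:

distribution=$(. /etc/os-release; echo $ID$VERSION_ID)
echo $distribution
# On Pop!_OS this gives "pop20.04" (ID=pop above), which the nvidia-docker repo
# does not host -- hence the "Unsupported distribution!" message.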
alexcpn commented 3 years ago

I set distribution=ubuntu20.04 and tried to install, but:

distribution=ubuntu20.04
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | \
>   sudo tee /etc/apt/sources.list.d/nvidia-docker.list

sudo apt-get install nvidia-docker2
... Done
Building dependency tree       
Reading state information... Done
Some packages could not be installed. This may mean that you have
requested an impossible situation or if you are using the unstable
distribution that some required packages have not yet been created
or been moved out of Incoming.
The following information may help to resolve the situation:

The following packages have unmet dependencies:
 nvidia-docker2 : Depends: nvidia-container-runtime (>= 3.5.0) but 3.4.0-1pop1~1601325114~20.04~2880fc6 is to be installed
E: Unable to correct problems, you have held broken packages.

This looks like https://github.com/NVIDIA/nvidia-docker/issues/1388#issuecomment-698850593. I followed the workaround given here --> https://github.com/NVIDIA/nvidia-docker/issues/1388#issuecomment-698989404 and was able to install nvidia-docker2, and at least CUDA in Docker is working. But tensorman still gives the same error:

sudo docker run --rm --runtime=nvidia -ti nvidia/cuda:11.3.0-base-ubuntu20.04
Unable to find image 'nvidia/cuda:11.3.0-base-ubuntu20.04' locally
11.3.0-base-ubuntu20.04: Pulling from nvidia/cuda
a70d879fa598: Pull complete 
c4394a92d1f8: Pull complete 
10e6159c56c0: Pull complete 
f1ff119ac131: Pull complete 
3e2dbc551fee: Pull complete 
4f57fe919a49: Pull complete 
216bbbf373ef: Pull complete 
Digest: sha256:7939995fc912a21e62be16c866b62e14d383ef16ed288f1d17268ba0b7226574
Status: Downloaded newer image for nvidia/cuda:11.3.0-base-ubuntu20.04
root@159db2b73ded:/# nvidia-smi
Mon May 17 17:07:24 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.73.01    Driver Version: 460.73.01    CUDA Version: 11.3     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce MX250       Off  | 00000000:02:00.0 Off |                  N/A |
| N/A   54C    P0    N/A /  N/A |    292MiB /  2002MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+
alexpunnen@pop-os:~$ sudo tensorman run --gpu bash
"docker" "run" "-u" "0:0" "--gpus=all" "-e" "HOME=/project" "-it" "--rm" "-v" "/home/alexpunnen:/project" "-w" "/project" "tensorflow/tensorflow:latest-gpu" "bash"
docker: Error response from daemon: could not select device driver "" with capabilities: [[gpu]].
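
If it helps anyone hitting the same split behaviour: my understanding (worth double-checking) is that --runtime=nvidia only needs the runtime registered in /etc/docker/daemon.json, while --gpus=all goes through the nvidia-container-toolkit prestart hook, so it is worth checking that the hook binary is on the daemon's PATH:

which nvidia-container-runtime-hook
# if the hook is missing, --gpus fails with:
#   could not select device driver "" with capabilities: [[gpu]]
cat /etc/docker/daemon.json
# should contain a "runtimes": { "nvidia": { ... } } entry for --runtime=nvidia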
Aarivvk commented 2 years ago

This issue is keeping me from upgrading to 21.10. Kindly fix it or give us a workaround until then.

Aarivvk commented 2 years ago

https://github.com/NVIDIA/nvidia-docker/issues/1447#issuecomment-760059332 looks like it could solve the issue. I'll try it and let you know.

Found your comment there as well :+1: @alexcpn https://github.com/NVIDIA/nvidia-docker/issues/1447#issuecomment-999385528

Aarivvk commented 2 years ago

The solution below works for me:

sudo gedit /etc/default/grub

Append systemd.unified_cgroup_hierarchy=0 to the end of GRUB_CMDLINE_LINUX_DEFAULT, something like:

GRUB_CMDLINE_LINUX_DEFAULT="quiet splash systemd.unified_cgroup_hierarchy=0"

Save the file and run the grub update:

sudo update-grub

Finally, reboot, and your Docker should work with GPUs.
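After the reboot, you can confirm the parameter took effect (a quick check of my own):

cat /proc/cmdline
# should now include systemd.unified_cgroup_hierarchy=0
stat -fc %T /sys/fs/cgroup
# prints "tmpfs" under cgroups v1 (hybrid), "cgroup2fs" under v2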

NOTE1: Changes to cgroups should be handled with care; apps that depend on cgroups may stop working. More info here: https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/6/html/resource_management_guide/ch01
NOTE2: There are other ways to fix this. In the links above, editing the Docker config and sharing a few devices with the container also solves the issue; the grub change is just the simpler fix.
NOTE3: I would personally prefer the approach in NOTE2.
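
For completeness, a sketch of the NOTE2 approach as I read it in the linked issue (double-check the exact device list against that comment): disable cgroup handling in the NVIDIA runtime config, then pass the device nodes to Docker yourself:

# in /etc/nvidia-container-runtime/config.toml, set:
#   no-cgroups = true
# then restart docker and pass the devices explicitly:
sudo systemctl restart docker
docker run --rm --gpus all \
  --device /dev/nvidia0 --device /dev/nvidiactl \
  --device /dev/nvidia-uvm --device /dev/nvidia-uvm-tools \
  nvidia/cuda:11.3.0-base-ubuntu20.04 nvidia-smi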

mmstick commented 2 years ago

I'm working on it

mmstick commented 2 years ago

Fixed, but it will still require systemd.unified_cgroup_hierarchy=0 to be added as a kernel option, at least until NVIDIA releases v1.8.0 of their container runtime tools. On EFI systems: sudo kernelstub -a systemd.unified_cgroup_hierarchy=0
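
To inspect or undo the kernelstub change later (flags per kernelstub's own help; worth verifying on your system):

sudo kernelstub -p
# prints the current configuration, including kernel boot options
sudo kernelstub -d systemd.unified_cgroup_hierarchy=0
# removes the option again once it is no longer needed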

Updates will be available on Impish soon, with a new nvidia-docker2 package that replaces nvidia-container-runtime.