I set distribution=ubuntu20.04 and tried to install, but it failed:
distribution=ubuntu20.04
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | \
  sudo tee /etc/apt/sources.list.d/nvidia-docker.list
sudo apt-get install nvidia-docker2
Reading package lists... Done
Building dependency tree
Reading state information... Done
Some packages could not be installed. This may mean that you have
requested an impossible situation or if you are using the unstable
distribution that some required packages have not yet been created
or been moved out of Incoming.
The following information may help to resolve the situation:
The following packages have unmet dependencies:
nvidia-docker2 : Depends: nvidia-container-runtime (>= 3.5.0) but 3.4.0-1pop1~1601325114~20.04~2880fc6 is to be installed
E: Unable to correct problems, you have held broken packages.
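For reference, the snippet above skips the GPG-key import and the apt-get update step from NVIDIA's install instructions, and the 3.4.0-1pop1~... version string suggests Pop!_OS's own repository is shadowing NVIDIA's newer nvidia-container-runtime. A sketch of the fuller setup, with a hypothetical apt pin to prefer NVIDIA's repo (the pin file and its name are my assumption, not necessarily the workaround from the linked comment):

distribution=$(. /etc/os-release; echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | \
  sudo tee /etc/apt/sources.list.d/nvidia-docker.list

# Hypothetical pin so NVIDIA's packages win over the pop1 builds:
sudo tee /etc/apt/preferences.d/99-nvidia-docker <<'EOF'
Package: nvidia-container-runtime nvidia-container-toolkit libnvidia-container*
Pin: origin nvidia.github.io
Pin-Priority: 1001
EOF

sudo apt-get update
sudo apt-get install -y nvidia-docker2
sudo systemctl restart docker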
This looks like https://github.com/NVIDIA/nvidia-docker/issues/1388#issuecomment-698850593. I followed the workaround given at https://github.com/NVIDIA/nvidia-docker/issues/1388#issuecomment-698989404 and was able to install nvidia-docker2, and at least CUDA in Docker is working (see below). But tensorman still gives the same error:
sudo docker run --rm --runtime=nvidia -ti nvidia/cuda:11.3.0-base-ubuntu20.04
Unable to find image 'nvidia/cuda:11.3.0-base-ubuntu20.04' locally
11.3.0-base-ubuntu20.04: Pulling from nvidia/cuda
a70d879fa598: Pull complete
c4394a92d1f8: Pull complete
10e6159c56c0: Pull complete
f1ff119ac131: Pull complete
3e2dbc551fee: Pull complete
4f57fe919a49: Pull complete
216bbbf373ef: Pull complete
Digest: sha256:7939995fc912a21e62be16c866b62e14d383ef16ed288f1d17268ba0b7226574
Status: Downloaded newer image for nvidia/cuda:11.3.0-base-ubuntu20.04
root@159db2b73ded:/# nvidia-smi
Mon May 17 17:07:24 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.73.01    Driver Version: 460.73.01    CUDA Version: 11.3     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce MX250       Off  | 00000000:02:00.0 Off |                  N/A |
| N/A   54C    P0    N/A /  N/A |    292MiB /  2002MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+
alexpunnen@pop-os:~$ sudo tensorman run --gpu bash
"docker" "run" "-u" "0:0" "--gpus=all" "-e" "HOME=/project" "-it" "--rm" "-v" "/home/alexpunnen:/project" "-w" "/project" "tensorflow/tensorflow:latest-gpu" "bash"
docker: Error response from daemon: could not select device driver "" with capabilities: [[gpu]].
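Worth noting: the working run above used --runtime=nvidia, while tensorman passes --gpus=all; the --gpus flag additionally needs the nvidia-container-toolkit hook that nvidia-docker2 pulls in, plus the runtime registration in /etc/docker/daemon.json. The file nvidia-docker2 normally installs looks like this (a reference sketch, not captured from this machine):

cat /etc/docker/daemon.json
{
    "runtimes": {
        "nvidia": {
            "path": "nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}

If that file is present, restart Docker with sudo systemctl restart docker and retry; if the error persists, the cgroup issue discussed below is the likely cause.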
This issue is keeping me from upgrading to 21.10. Kindly fix the issue or give us a workaround until then.
https://github.com/NVIDIA/nvidia-docker/issues/1447#issuecomment-760059332 looks like it could solve the issue. I'll try it and let you know.
Found your comment there as well :+1: @alexcpn https://github.com/NVIDIA/nvidia-docker/issues/1447#issuecomment-999385528
The solution below works for me:
sudo gedit /etc/default/grub
Append systemd.unified_cgroup_hierarchy=0 at the end of GRUB_CMDLINE_LINUX_DEFAULT, something like below:
GRUB_CMDLINE_LINUX_DEFAULT="quiet splash systemd.unified_cgroup_hierarchy=0"
Save the file and run the grub update:
sudo update-grub
Finally, reboot and Docker should work with GPUs.
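A quick way to confirm the flag took effect after rebooting (generic checks, not from the original comment):

cat /proc/cmdline | grep -o systemd.unified_cgroup_hierarchy=0
stat -fc %T /sys/fs/cgroup    # prints 'tmpfs' on cgroup v1, 'cgroup2fs' on v2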
NOTE 1: Changes to cgroup settings should be handled with care; apps that depend on cgroups may stop working. More info here: https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/6/html/resource_management_guide/ch01
NOTE 2: There are other ways to fix this. In the links provided above there is a fix that edits the Docker config and shares a few devices with the container (a sketch follows below); the grub change is just the simpler fix.
NOTE 3: I would personally prefer the approach in NOTE 2.
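For reference, the alternative in NOTE 2 is roughly the following, based on the linked nvidia-docker threads (the config path is the stock one for nvidia-container-runtime; check the commented-out default in the file before running the sed):

# /etc/nvidia-container-runtime/config.toml: stop the runtime managing cgroups
sudo sed -i 's/^#no-cgroups = false/no-cgroups = true/' /etc/nvidia-container-runtime/config.toml
sudo systemctl restart docker
# Then pass the NVIDIA device nodes to the container explicitly:
docker run --rm --gpus all \
  --device /dev/nvidia0 --device /dev/nvidiactl \
  --device /dev/nvidia-uvm --device /dev/nvidia-uvm-tools \
  nvidia/cuda:11.3.0-base-ubuntu20.04 nvidia-smi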
I'm working on it
Fixed, but it will still require systemd.unified_cgroup_hierarchy=0 to be added as a kernel option. On EFI systems: sudo kernelstub -a systemd.unified_cgroup_hierarchy=0. At least until NVIDIA releases v1.8.0 of their container runtime tools. Updates will be available on Impish soon, with a new nvidia-docker2 package that replaces nvidia-container-runtime.
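For EFI installs, a short sketch of applying and checking the kernelstub change (kernelstub is Pop!_OS's boot-entry manager):

sudo kernelstub -a systemd.unified_cgroup_hierarchy=0
sudo kernelstub -p    # print the current configuration to verify the option
# reboot afterwards; non-EFI installs can use the GRUB method above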
The NVIDIA container runtime is installed and the driver is installed, but I still get the error. Should we install nvidia-docker2? I was not able to install it.
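A couple of generic checks that may help answer that (standard dpkg and docker commands, nothing specific to this thread):

dpkg -l | grep -E 'nvidia-(docker|container)'    # which runtime packages are installed
docker info 2>/dev/null | grep -i runtimes       # whether Docker registered the nvidia runtime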