netbrain / zwift

Easily zwift on linux
The Unlicense

docker can't find libnvidia-ml.so.1 #33

Closed gregh3285 closed 11 months ago

gregh3285 commented 11 months ago

It looks like when I invoke zwift.sh, nvidia-container-cli can't find libnvidia-ml.so.1. Transcript below.

gregh@Iago:~$ zwift.sh
+ ZWIFT_HOME=/home/gregh/.config/zwift/gregh
+ mkdir -p /home/gregh/.config/zwift/gregh
+ IMAGE=docker.io/netbrain/zwift
+ VERSION=latest
+ mkdir -p /home/gregh/.config/zwift/gregh
+ [[ ! -n '' ]]
++ command -v podman
+ [[ -x '' ]]
+ CONTAINER_TOOL=docker
+ [[ ! -n '' ]]
+ docker pull docker.io/netbrain/zwift:latest
latest: Pulling from netbrain/zwift
Digest: sha256:f17fe247e55c70c0d8726a920cec45418a8e1a40190d967c8638b03cdd6b3444
Status: Image is up to date for netbrain/zwift:latest
docker.io/netbrain/zwift:latest
+ [[ -f /proc/driver/nvidia/version ]]
+ VGA_DEVICE_FLAG='--gpus all'
++ docker run -d --rm --privileged -e DISPLAY=:1 -v /tmp/.X11-unix:/tmp/.X11-unix -v /run/user/1000/pulse:/run/user/1000/pulse -v /home/gregh/.config/zwift/gregh:/home/user/Zwift --gpus all docker.io/netbrain/zwift:latest
docker: Error response from daemon: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy'
nvidia-container-cli: initialization error: load library failed: libnvidia-ml.so.1: cannot open shared object file: no such file or directory: unknown.
+ CONTAINER=cc6105c56ab22531595581ee6efeb74aee627afb231b85d959fac4f5cf55fee6
+ [[ -z '' ]]
++ docker inspect '--format={{ .Config.Hostname  }}' cc6105c56ab22531595581ee6efeb74aee627afb231b85d959fac4f5cf55fee6
Error: No such object: cc6105c56ab22531595581ee6efeb74aee627afb231b85d959fac4f5cf55fee6
+ xhost +local:
non-network local connections being added to access control list

Ubuntu reports the following version of nvidia:

gregh@Iago:~$ cat /proc/driver/nvidia/version
NVRM version: NVIDIA UNIX x86_64 Kernel Module  535.113.01  Tue Sep 12 19:41:24 UTC 2023
GCC version: 

Looking at nvidia-container-cli, it seems to know which specific libraries it needs:

gregh@Iago:~$ nvidia-container-cli list
/dev/nvidiactl
/dev/nvidia-uvm
/dev/nvidia-uvm-tools
/dev/nvidia-modeset
/dev/nvidia0
/usr/bin/nvidia-smi
/usr/bin/nvidia-debugdump
/usr/bin/nvidia-persistenced
/usr/bin/nvidia-cuda-mps-control
/usr/bin/nvidia-cuda-mps-server
/usr/lib/x86_64-linux-gnu/libnvidia-ml.so.535.113.01
/usr/lib/x86_64-linux-gnu/libnvidia-cfg.so.535.113.01
/usr/lib/x86_64-linux-gnu/libcuda.so.535.113.01
/usr/lib/x86_64-linux-gnu/libcudadebugger.so.535.113.01
/usr/lib/x86_64-linux-gnu/libnvidia-opencl.so.535.113.01
/usr/lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.535.113.01
/usr/lib/x86_64-linux-gnu/libnvidia-allocator.so.535.113.01
/usr/lib/x86_64-linux-gnu/libnvidia-pkcs11-openssl3.so.535.113.01
/usr/lib/x86_64-linux-gnu/libnvidia-nvvm.so.535.113.01
/usr/lib/x86_64-linux-gnu/libnvidia-ngx.so.535.113.01
/usr/lib/x86_64-linux-gnu/libnvidia-encode.so.535.113.01
/usr/lib/x86_64-linux-gnu/libnvidia-opticalflow.so.535.113.01
/usr/lib/x86_64-linux-gnu/libnvcuvid.so.535.113.01
/usr/lib/x86_64-linux-gnu/libnvidia-eglcore.so.535.113.01
/usr/lib/x86_64-linux-gnu/libnvidia-glcore.so.535.113.01
/usr/lib/x86_64-linux-gnu/libnvidia-tls.so.535.113.01
/usr/lib/x86_64-linux-gnu/libnvidia-glsi.so.535.113.01
/usr/lib/x86_64-linux-gnu/libnvidia-fbc.so.535.113.01
/usr/lib/x86_64-linux-gnu/libnvidia-rtcore.so.535.113.01
/usr/lib/x86_64-linux-gnu/libnvoptix.so.535.113.01
/usr/lib/x86_64-linux-gnu/libGLX_nvidia.so.535.113.01
/usr/lib/x86_64-linux-gnu/libEGL_nvidia.so.535.113.01
/usr/lib/x86_64-linux-gnu/libGLESv2_nvidia.so.535.113.01
/usr/lib/x86_64-linux-gnu/libGLESv1_CM_nvidia.so.535.113.01
/usr/lib/x86_64-linux-gnu/libnvidia-glvkspirv.so.535.113.01
/run/nvidia-persistenced/socket
/lib/firmware/nvidia/535.113.01/gsp_ga10x.bin
/lib/firmware/nvidia/535.113.01/gsp_tu10x.bin

I can see the library exists both as the base .1 symlink and as the driver-specific file it points to:

gregh@Iago:~$ ls -al /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.1
lrwxrwxrwx 1 root root 26 Sep 25 04:32 /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.1 -> libnvidia-ml.so.535.113.01
gregh@Iago:~$ ls -al /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.535.113.01
-rw-r--r-- 1 root root 1819968 Sep 25 04:32 /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.535.113.01

I'm not sure if this is an issue with this docker image or with my configuration of nvidia-docker. Looking to see if anyone else has a clue.
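A further host-side check, sketched here under assumptions: the ls output above shows the file exists, but not whether the dynamic linker cache lists it. The helper name lib_in_cache is hypothetical, and the assumption that nvidia-container-cli resolves libnvidia-ml.so.1 through that cache is my understanding rather than something confirmed in this thread.

```shell
#!/bin/sh
# Hypothetical helper (not part of zwift.sh or the NVIDIA toolkit):
# reads `ldconfig -p`-style output on stdin and succeeds if the given
# soname is listed. Lines in that output look like:
#   <tab>libnvidia-ml.so.1 (libc6,x86-64) => /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.1
lib_in_cache() {
  # -F: match a fixed string, -q: no output, only an exit status.
  # The trailing space avoids matching longer names such as libfoo.so.1.2.
  grep -qF -- "$1 "
}

# Only query the real cache when ldconfig is available on this machine.
if command -v ldconfig >/dev/null 2>&1; then
  if ldconfig -p | lib_in_cache libnvidia-ml.so.1; then
    echo "libnvidia-ml.so.1 is in the linker cache"
  else
    echo "libnvidia-ml.so.1 is NOT in the linker cache"
  fi
fi
```

If the library is reported as present on the host, as it appears to be here, the failure is more likely in how the container runtime sees the host than in the driver install itself.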

netbrain commented 11 months ago

https://github.com/NVIDIA/nvidia-docker/issues/1163

gregh3285 commented 11 months ago

So, further debugging clearly shows this has nothing to do with the zwift docker. Something, unrelated, is hosed on my end. The nvidia-docker link above wasn't helpful, unfortunately. Closing this issue.

oldnapalm commented 11 months ago

I had this issue with Docker-desktop, are you using it? It works using plain Docker though.

netbrain commented 11 months ago

Wait, you used Docker desktop? For Darwin/mac?

On Tue, Oct 17, 2023, 13:52 oldnapalm @.***> wrote:

I had this issue with Docker-desktop, are you using it? It works using plain Docker though.


oldnapalm commented 11 months ago

Wait, you used Docker desktop? For Darwin/mac?

I used Docker Desktop for Ubuntu Linux (https://docs.docker.com/desktop/install/ubuntu/) and got the same error as the OP. After reading https://github.com/NVIDIA/nvidia-container-toolkit/issues/219, I tried without Docker Desktop and it worked.
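For anyone unsure which daemon their docker CLI is talking to, here is a minimal sketch. It assumes that Docker Desktop reports the string "Docker Desktop" in the OperatingSystem field of docker info, while a plain Docker Engine reports the host distribution. The helper name is_docker_desktop is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: succeeds if the OperatingSystem string from
# `docker info --format '{{.OperatingSystem}}'` names Docker Desktop.
is_docker_desktop() {
  case "$1" in
    *"Docker Desktop"*) return 0 ;;
    *) return 1 ;;
  esac
}

# Only query the daemon when the docker CLI is installed.
if command -v docker >/dev/null 2>&1; then
  os="$(docker info --format '{{.OperatingSystem}}' 2>/dev/null)"
  if is_docker_desktop "$os"; then
    # Docker Desktop runs its daemon inside a VM, so the host's NVIDIA
    # libraries may not be visible to it (a likely cause, not confirmed here).
    echo "daemon is Docker Desktop: $os"
  else
    echo "daemon reports: ${os:-unknown}"
  fi
fi
```

Running docker context ls alongside this shows which endpoint is currently selected, which helps when both Docker Desktop and a plain engine are installed.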

gregh3285 commented 11 months ago

I'm running Ubuntu 23.04. I have the following packages related to docker and container installed:

gregh@Iago:~$ apt list --installed | grep docker

WARNING: apt does not have a stable CLI interface. Use with caution in scripts.

docker.io/lunar-updates,now 24.0.5-0ubuntu1~23.04.1 amd64 [installed,automatic]
docker/lunar,lunar,now 1.5-2 all [installed]
nvidia-docker2/unknown,now 2.13.0-1 all [installed]
wmdocker/lunar,now 1.5-2 amd64 [installed,automatic]
gregh@Iago:~$ apt list --installed | grep container

WARNING: apt does not have a stable CLI interface. Use with caution in scripts.

containerd/lunar-updates,now 1.7.2-0ubuntu1~23.04.1 amd64 [installed,automatic]
libnvidia-container-tools/unknown,now 1.14.3-1 amd64 [installed,automatic]
libnvidia-container1/unknown,now 1.14.3-1 amd64 [installed,automatic]
nvidia-container-toolkit-base/unknown,now 1.14.3-1 amd64 [installed,automatic]
nvidia-container-toolkit/unknown,now 1.14.3-1 amd64 [installed]

jordimassaguerpla commented 10 months ago

I had a similar issue and fixed it by installing nvidia-computeG05.