Closed joshuacox closed 7 months ago
solved:
sudo ln -s /usr/sbin/nvidia-container-cli /usr/sbin/nvidia-container-cli.real
on the nodes with GPUs
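For context, the hook stderr later in this thread shows the runtime hook calling nvidia-container-cli.real, so the symlink simply makes that name resolve to the installed binary. A minimal sketch of why a symlink satisfies the hook (a generic demonstration using a temp dir and a stub script, not the actual /usr/sbin paths):

```shell
# Demonstration that invoking a symlinked name executes the target binary,
# which is all the workaround above relies on. A stub script stands in for
# the real nvidia-container-cli.
tmp=$(mktemp -d)
printf '#!/bin/sh\necho "cli invoked"\n' > "$tmp/nvidia-container-cli"
chmod +x "$tmp/nvidia-container-cli"
ln -s "$tmp/nvidia-container-cli" "$tmp/nvidia-container-cli.real"
"$tmp/nvidia-container-cli.real"   # runs the stub via the .real name
rm -rf "$tmp"
```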
Spoke too soon; it's back in CrashLoopBackOff.
digging through the containerd logs reveals a bit more:
Feb 19 12:52:46 eskimo containerd[195428]: time="2024-02-19T12:52:46.731005862-06:00" level=info msg="StartContainer for \"f4cde3d4e6bbb5493ef7adb19657353a30bacd5c1271c82252dcf2a7155e703a\""
Feb 19 12:52:46 eskimo containerd[195428]: time="2024-02-19T12:52:46.919871873-06:00" level=info msg="shim disconnected" id=f4cde3d4e6bbb5493ef7adb19657353a30bacd5c1271c82252dcf2a7155e703a namespace=k8s.io
Feb 19 12:52:46 eskimo containerd[195428]: time="2024-02-19T12:52:46.919898950-06:00" level=warning msg="cleaning up after shim disconnected" id=f4cde3d4e6bbb5493ef7adb19657353a30bacd5c1271c82252dcf2a7155e703a namespace=k8s.io
Feb 19 12:52:46 eskimo containerd[195428]: time="2024-02-19T12:52:46.919903648-06:00" level=info msg="cleaning up dead shim" namespace=k8s.io
Feb 19 12:52:46 eskimo containerd[195428]: time="2024-02-19T12:52:46.929076405-06:00" level=warning msg="cleanup warnings time=\"2024-02-19T12:52:46-06:00\" level=warning msg=\"failed to read init pid file\" error=\"open /run/containerd/io.containerd.runtime.v2.task/k8s.io/f4cde3d4e6bbb5493ef7adb19657353a30bacd5c1271c82252dcf2a7155e703a/init.pid: no such file or directory\" runtime=io.containerd.runc.v2\n" namespace=k8s.io
Feb 19 12:52:46 eskimo containerd[195428]: time="2024-02-19T12:52:46.929607863-06:00" level=error msg="copy shim log" error="read /proc/self/fd/235: file already closed" namespace=k8s.io
Feb 19 12:52:46 eskimo containerd[195428]: time="2024-02-19T12:52:46.930021650-06:00" level=error msg="Failed to pipe stdout of container \"f4cde3d4e6bbb5493ef7adb19657353a30bacd5c1271c82252dcf2a7155e703a\"" error="reading from a closed fifo"
Feb 19 12:52:46 eskimo containerd[195428]: time="2024-02-19T12:52:46.930030008-06:00" level=error msg="Failed to pipe stderr of container \"f4cde3d4e6bbb5493ef7adb19657353a30bacd5c1271c82252dcf2a7155e703a\"" error="reading from a closed fifo"
Feb 19 12:52:46 eskimo containerd[195428]: time="2024-02-19T12:52:46.931101566-06:00" level=error msg="StartContainer for \"f4cde3d4e6bbb5493ef7adb19657353a30bacd5c1271c82252dcf2a7155e703a\" failed" error="failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: nvidia-container-cli.real: requirement error: invalid expression: unknown"
Running the image directly with Docker gives similar results:
docker run -it quay.io/go-skynet/local-ai:latest
docker: Error response from daemon: failed to create task for container: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #1: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy'
nvidia-container-cli: requirement error: invalid expression: unknown.
ERRO[0000] error waiting for container: context canceled
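The "requirement error: invalid expression" message usually points at the NVIDIA_REQUIRE_* environment variables baked into the image, which an older nvidia-container-cli on the host cannot parse. A possible workaround (an assumption on my part, not confirmed for this image) is to disable requirement checking via NVIDIA_DISABLE_REQUIRE:

```shell
# Sketch of a workaround, assuming the failure comes from NVIDIA_REQUIRE_*
# expressions in the image that the host's nvidia-container-cli can't parse.
# NVIDIA_DISABLE_REQUIRE tells the NVIDIA runtime hook to skip those checks.
docker run -it -e NVIDIA_DISABLE_REQUIRE=true quay.io/go-skynet/local-ai:latest
```

If the container then starts, the mismatch is between the image's CUDA requirement expressions and the host's NVIDIA container toolkit version, and upgrading the toolkit on the host would be the cleaner fix.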
docker run -it quay.io/go-skynet/local-ai:master-cublas-cuda12-ffmpeg
Using that tag works, but latest does not; possibly the latest image does not work when the default runtime is nvidia?
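For reference, the default runtime being discussed is usually set in /etc/docker/daemon.json; a minimal sketch (the runtime path is the standard one from NVIDIA's docs, so verify it against your host) looks like:

```json
{
  "default-runtime": "nvidia",
  "runtimes": {
    "nvidia": {
      "path": "nvidia-container-runtime",
      "runtimeArgs": []
    }
  }
}
```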
Anyhow, I'm closing this, as I now see it running inside k8s as well.
When installing LocalAI via the Helm chart I get this error:
I have the gpu-operator installed and running.
And I can run this GPU pod successfully:
Why won't the LocalAI pod run?
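For comparison, a typical gpu-operator smoke-test pod looks roughly like the sketch below (illustrative only; the image tag and pod name here are assumptions, not the exact spec used in this report):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-test
spec:
  restartPolicy: OnFailure
  containers:
  - name: cuda-vectoradd
    image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda11.7.1
    resources:
      limits:
        nvidia.com/gpu: 1
```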