go-skynet / helm-charts

go-skynet helm chart repository

failed to create containerd task #39

Closed joshuacox closed 7 months ago

joshuacox commented 7 months ago

When installing localai via the helm chart I get this error:

 Warning  Failed            49m (x5 over 51m)     kubelet            Error: failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: nvidia-container-cli.real: requirement error: invalid expression: unknown

I have the gpu-operator installed and running.

And I can run this gpu-pod successfully:

apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  restartPolicy: Never
  containers:
    - name: cuda-container
      image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda10.2
      resources:
        limits:
          nvidia.com/gpu: 1 # requesting 1 GPU
  tolerations:
  - key: nvidia.com/gpu
    operator: Exists
    effect: NoSchedule

k logs gpu-pod
[Vector addition of 50000 elements]
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
Copy output data from the CUDA device to the host memory
Test PASSED
Done

Why won't the localAI pod run?

joshuacox commented 7 months ago

solved:

sudo ln -s /usr/sbin/nvidia-container-cli /usr/sbin/nvidia-container-cli.real

on the nodes with gpu

joshuacox commented 7 months ago

Spoke too soon, it's back in CrashLoopBackOff.

joshuacox commented 7 months ago

digging through the containerd logs reveals a bit more:

Feb 19 12:52:46 eskimo containerd[195428]: time="2024-02-19T12:52:46.731005862-06:00" level=info msg="StartContainer for \"f4cde3d4e6bbb5493ef7adb19657353a30bacd5c1271c82252dcf2a7155e703a\""
Feb 19 12:52:46 eskimo containerd[195428]: time="2024-02-19T12:52:46.919871873-06:00" level=info msg="shim disconnected" id=f4cde3d4e6bbb5493ef7adb19657353a30bacd5c1271c82252dcf2a7155e703a namespace=k8s.io
Feb 19 12:52:46 eskimo containerd[195428]: time="2024-02-19T12:52:46.919898950-06:00" level=warning msg="cleaning up after shim disconnected" id=f4cde3d4e6bbb5493ef7adb19657353a30bacd5c1271c82252dcf2a7155e703a namespace=k8s.io
Feb 19 12:52:46 eskimo containerd[195428]: time="2024-02-19T12:52:46.919903648-06:00" level=info msg="cleaning up dead shim" namespace=k8s.io
Feb 19 12:52:46 eskimo containerd[195428]: time="2024-02-19T12:52:46.929076405-06:00" level=warning msg="cleanup warnings time=\"2024-02-19T12:52:46-06:00\" level=warning msg=\"failed to read init pid file\" error=\"open /run/containerd/io.containerd.runtime.v2.task/k8s.io/f4cde3d4e6bbb5493ef7adb19657353a30bacd5c1271c82252dcf2a7155e703a/init.pid: no such file or directory\" runtime=io.containerd.runc.v2\n" namespace=k8s.io
Feb 19 12:52:46 eskimo containerd[195428]: time="2024-02-19T12:52:46.929607863-06:00" level=error msg="copy shim log" error="read /proc/self/fd/235: file already closed" namespace=k8s.io
Feb 19 12:52:46 eskimo containerd[195428]: time="2024-02-19T12:52:46.930021650-06:00" level=error msg="Failed to pipe stdout of container \"f4cde3d4e6bbb5493ef7adb19657353a30bacd5c1271c82252dcf2a7155e703a\"" error="reading from a closed fifo"
Feb 19 12:52:46 eskimo containerd[195428]: time="2024-02-19T12:52:46.930030008-06:00" level=error msg="Failed to pipe stderr of container \"f4cde3d4e6bbb5493ef7adb19657353a30bacd5c1271c82252dcf2a7155e703a\"" error="reading from a closed fifo"
Feb 19 12:52:46 eskimo containerd[195428]: time="2024-02-19T12:52:46.931101566-06:00" level=error msg="StartContainer for \"f4cde3d4e6bbb5493ef7adb19657353a30bacd5c1271c82252dcf2a7155e703a\" failed" error="failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: nvidia-container-cli.real: requirement error: invalid expression: unknown"

joshuacox commented 7 months ago

running the image directly with docker has similar results:

docker run -it quay.io/go-skynet/local-ai:latest
docker: Error response from daemon: failed to create task for container: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #1: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy'
nvidia-container-cli: requirement error: invalid expression: unknown.
ERRO[0000] error waiting for container: context canceled 

joshuacox commented 7 months ago

docker run -it quay.io/go-skynet/local-ai:master-cublas-cuda12-ffmpeg

That tag works but latest does not, possibly because the latest image fails when the default container runtime is nvidia?
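One way to test that guess is to compare the requirement-related environment variables baked into the two images. The NVIDIA container toolkit documents NVIDIA_REQUIRE_*, NVIDIA_DISABLE_REQUIRE and the legacy CUDA_VERSION as the variables that drive its requirement checks. What follows is only a minimal sketch, assuming Docker is available on the GPU node and both images have been pulled. The helper name nvidia_require_vars is hypothetical and not part of any NVIDIA tooling, and nothing here confirms that these variables are the actual cause of the error.

```shell
#!/bin/sh
# Hypothetical helper: keep only the variables the NVIDIA container toolkit
# documents as requirement-related. Reads KEY=VALUE lines on stdin, which is
# the shape of an image's .Config.Env entries.
nvidia_require_vars() {
    # "|| true" keeps the exit status at 0 when no line matches.
    grep -E '^(NVIDIA_REQUIRE_[A-Za-z0-9_]*|NVIDIA_DISABLE_REQUIRE|CUDA_VERSION)=' || true
}

# Assumed usage (needs a Docker daemon, so it is left commented out):
#   docker inspect --format '{{range .Config.Env}}{{println .}}{{end}}' \
#       quay.io/go-skynet/local-ai:latest | nvidia_require_vars
#   docker inspect --format '{{range .Config.Env}}{{println .}}{{end}}' \
#       quay.io/go-skynet/local-ai:master-cublas-cuda12-ffmpeg | nvidia_require_vars
```

If the two images print different NVIDIA_REQUIRE_* values, that difference would be the first place to look. Running the latest image with --runtime=runc would also show whether it starts once the NVIDIA hook is out of the picture.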

Anyhow, I'm closing as I now see it running inside of k8s as well.