iterative / cml

♾️ CML - Continuous Machine Learning | CI/CD for ML
http://cml.dev

Error when trying to use the `latest-gpu` container inside a GitHub Actions workflow #1428

Open. daavoo opened this issue 11 months ago

daavoo commented 11 months ago

Workflow is here:

https://github.com/iterative/example-get-started-experiments/blob/main/.github/workflows/dvc-studio.yml

Example failure is here:

https://github.com/iterative/example-get-started-experiments/actions/runs/6310277365/job/17131981606

  Status: Downloaded newer image for iterativeai/cml:latest-gpu
  docker.io/iterativeai/cml:latest-gpu
  /usr/bin/docker create --name d36559e92e4847fcb5d0a04521f541f1_iterativeaicmllatestgpu_f5f4d0 --label 70c3d0 --workdir /__w/example-get-started-experiments/example-get-started-experiments --network github_network_5dc1361cfcd641c69071c53a762bc452 --gpus all --ipc host -e "HOME=/github/home" -e GITHUB_ACTIONS=true -e CI=true -v "/var/run/docker.sock":"/var/run/docker.sock" -v "/tmp/tmp.cjcLZAuplu/.cml/cml-r5thu1mwal-1c6kgrd3-2k2cbdhp/_work":"/__w" -v "/tmp/tmp.cjcLZAuplu/.cml/cml-r5thu1mwal-1c6kgrd3-2k2cbdhp/externals":"/__e":ro -v "/tmp/tmp.cjcLZAuplu/.cml/cml-r5thu1mwal-1c6kgrd3-2k2cbdhp/_work/_temp":"/__w/_temp" -v "/tmp/tmp.cjcLZAuplu/.cml/cml-r5thu1mwal-1c6kgrd3-2k2cbdhp/_work/_actions":"/__w/_actions" -v "/tmp/tmp.cjcLZAuplu/.cml/cml-r5thu1mwal-1c6kgrd3-2k2cbdhp/_work/_tool":"/__w/_tool" -v "/tmp/tmp.cjcLZAuplu/.cml/cml-r5thu1mwal-1c6kgrd3-2k2cbdhp/_work/_temp/_github_home":"/github/home" -v "/tmp/tmp.cjcLZAuplu/.cml/cml-r5thu1mwal-1c6kgrd3-2k2cbdhp/_work/_temp/_github_workflow":"/github/workflow" --entrypoint "tail" iterativeai/cml:latest-gpu "-f" "/dev/null"
  e45972c2305532a031a11328f784587dd0ec9b98581fdfab529d350955e6a2ba
  /usr/bin/docker start e45972c2305532a031a11328f784587dd0ec9b98581fdfab529d350955e6a2ba
  Error response from daemon: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy'
  nvidia-container-cli: initialization error: nvml error: driver not loaded: unknown
  Error: failed to start containers: e45972c2305532a031a11328f784587dd0ec9b98581fdfab529d350955e6a2ba
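
The `nvidia-container-cli` line in the log above reports that NVML could not initialize because the driver was not loaded. That suggests, though the log alone does not confirm it, that the failure is on the runner host and not inside the image. Below is a minimal sketch for reproducing the check on the host, assuming Python is available there and that `nvidia-smi` is installed alongside the driver. The helper names `is_driver_not_loaded` and `host_gpu_ready` are hypothetical and not part of CML.

```python
import shutil
import subprocess

# Substring copied from the failure log in this issue.
NVML_DRIVER_ERROR = "nvml error: driver not loaded"


def is_driver_not_loaded(daemon_output: str) -> bool:
    """Return True if Docker's error output contains the NVML 'driver not loaded' message."""
    return NVML_DRIVER_ERROR in daemon_output.lower()


def host_gpu_ready() -> bool:
    """Best-effort host check (an assumption, not CML behavior):
    nvidia-smi is on PATH and exits with status 0."""
    exe = shutil.which("nvidia-smi")
    if exe is None:
        return False
    try:
        result = subprocess.run([exe], capture_output=True, text=True, timeout=30)
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0


if __name__ == "__main__":
    print("host GPU driver ready:", host_gpu_ready())
```

Running this on the runner before the job starts the `--gpus all` container would show whether the driver is visible on the host at that point. It does not explain why the driver would be missing.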

omesser commented 11 months ago

cc @iterative/cml