pytorch / test-infra

This repository hosts code that supports the testing infrastructure for the PyTorch organization. For example, this repo hosts the logic to track disabled tests and slow tests, as well as our continuous integration jobs HUD/dashboard.
https://hud.pytorch.org/

Pin nvidia-container-toolkit to version 1.16.2 #5852

Closed ZainRizvi closed 1 month ago

ZainRizvi commented 1 month ago

Yesterday's nvidia-container-toolkit v1.17.0 release seems to have broken some of our domain images, causing `docker run --gpus all [image]` to fail with the error:

$ docker run --gpus all [IMAGE]
docker: Error response from daemon: failed to create task for container: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy'
nvidia-container-cli: error parsing IMEX info: unsupported IMEX channel value: all: unknown.
ERRO[0000] error waiting for container: context canceled 

Pinning the toolkit to the previous version (1.16.2) to mitigate the failure for now.
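
For readers who want to see roughly what such a pin could look like, here is a minimal sketch. It is not the change made in this PR. It assumes a yum-based host with NVIDIA's package repository already configured, and the helper names `pinned_package_spec` and `version_lt` are hypothetical, introduced only for illustration.

```shell
#!/usr/bin/env bash
# Sketch only: assumes a yum-based host with NVIDIA's package repo configured.
# Helper names below are hypothetical and do not come from the PR.

PINNED_VERSION="1.16.2"   # the version this PR pins to

# Build the name-version spec that yum accepts for a specific release.
pinned_package_spec() {
  printf 'nvidia-container-toolkit-%s\n' "$1"
}

# Succeed when version $1 is strictly older than version $2.
# Relies on GNU sort's -V (version sort) option.
version_lt() {
  [ "$1" != "$2" ] &&
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$1" ]
}

# Example use (left commented out: needs root and network access):
# sudo yum install -y "$(pinned_package_spec "$PINNED_VERSION")"
```

`version_lt` could be used to confirm that an installed version is older than 1.17.0 before re-running the failing `docker run --gpus all [IMAGE]` command. How the installed version is obtained depends on the host and is left out here.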

Testing:

huydhn commented 1 month ago

The change log https://github.com/NVIDIA/nvidia-container-toolkit/releases/tag/v1.17.0 also mentions that it adds:

Add disable-imex-channel-creation feature flag

and

Add no-create-imex-channels command line option

So that might work too, although I have no idea what IMEX is.