I'll add more details tomorrow, but in short: following the RAPIDS EKS walk-through without modification, I saw the nvidia-driver-daemonset pods from the gpu-operator Helm chart get stuck in ImagePullBackOff, with an error like this:
Failed to pull image "nvcr.io/nvidia/driver:550.90.07-amzn2": rpc error: code = NotFound desc = failed to pull and unpack image "nvcr.io/nvidia/driver:550.90.07-amzn2": failed to resolve reference "nvcr.io/nvidia/driver:550.90.07-amzn2": nvcr.io/nvidia/driver:550.90.07-amzn2: not found
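To see exactly which image the kubelet tried to pull, the reference can be extracted from an event message like the one above and split into its parts. This is only a minimal sketch, not something from the walk-through or the gpu-operator: the function name parse_failed_image is hypothetical, the message is abbreviated, and reading the tag as "<driver version>-<OS suffix>" is an assumption based on the tag shown in the error.

```python
import re

# Abbreviated copy of the event message from the failing pod.
MESSAGE = (
    'Failed to pull image "nvcr.io/nvidia/driver:550.90.07-amzn2": '
    "nvcr.io/nvidia/driver:550.90.07-amzn2: not found"
)


def parse_failed_image(message):
    """Return the parts of the image reference in the message, or None.

    Hypothetical helper; assumes the reference carries a tag, as it does
    in the message above.
    """
    match = re.search(r'Failed to pull image "([^"]+)"', message)
    if match is None:
        return None
    image = match.group(1)
    # Split off the tag at the last colon, then the registry at the first slash.
    name, _, tag = image.rpartition(":")
    registry, _, repository = name.partition("/")
    # Assumption: the tag reads as "<driver version>-<OS suffix>".
    driver_version, _, os_suffix = tag.partition("-")
    return {
        "image": image,
        "registry": registry,
        "repository": repository,
        "tag": tag,
        "driver_version": driver_version,
        "os_suffix": os_suffix,
    }


if __name__ == "__main__":
    for key, value in parse_failed_image(MESSAGE).items():
        print(f"{key}: {value}")
```

For the message above this gives a driver_version of 550.90.07 and an os_suffix of amzn2, which may make it easier to compare against the driver tags that are actually published.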
Notes
(placeholder: will add more soon)
References
Some relevant references I consulted while debugging this
Description
I believe the walk-through at https://docs.rapids.ai/deployment/stable/cloud/aws/eks/ requires some modifications.