rapidsai / deployment

RAPIDS Deployment Documentation
https://docs.rapids.ai/deployment/stable/

EKS example does not work by default #409

Open jameslamb opened 1 month ago

Description

I believe the walk-through at https://docs.rapids.ai/deployment/stable/cloud/aws/eks/ requires some modifications.

I'll add more details tomorrow, but in short: following that example without modification, I saw the nvidia-driver-daemonset pods from the gpu-operator helm chart get stuck in ImagePullBackOff, with an error like this:

Failed to pull image "nvcr.io/nvidia/driver:550.90.07-amzn2": rpc error: code = NotFound desc = failed to pull and unpack image "nvcr.io/nvidia/driver:550.90.07-amzn2": failed to resolve reference "nvcr.io/nvidia/driver:550.90.07-amzn2": nvcr.io/nvidia/driver:550.90.07-amzn2: not found
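
To make the failing reference easier to reason about, here is a minimal Python sketch that pulls the image name out of a message like the one above and splits the tag into what look like its two parts. The helper name `parse_failed_image` is hypothetical, and reading the tag as "driver version, then an OS suffix" is my assumption based on the shape of this one tag, not something confirmed here.

```python
import re

# Abbreviated copy of the kubelet event text reported in this issue.
MESSAGE = (
    'Failed to pull image "nvcr.io/nvidia/driver:550.90.07-amzn2": '
    "nvcr.io/nvidia/driver:550.90.07-amzn2: not found"
)


def parse_failed_image(message):
    """Hypothetical helper: extract the image reference from a pull failure.

    Returns None when the message does not contain a quoted image reference.
    """
    match = re.search(r'Failed to pull image "([^"]+)"', message)
    if match is None:
        return None
    image = match.group(1)
    # Split on the last colon so a registry host with a port would stay intact.
    repository, _, tag = image.rpartition(":")
    # Assumption: the tag has the form <driver version>-<OS suffix>.
    driver_version, _, os_suffix = tag.partition("-")
    return {
        "repository": repository,
        "tag": tag,
        "driver_version": driver_version,
        "os_suffix": os_suffix,
    }


info = parse_failed_image(MESSAGE)
print(info)
```

If that reading of the tag is right, the question to chase is whether nvcr.io publishes a driver image for this particular version and OS combination at all, or whether the walk-through needs to pin a different one.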

Notes

(placeholder: will add more soon)

References

Some relevant references I consulted while debugging this