skypilot-org / skypilot

SkyPilot: Run AI and batch jobs on any infra (Kubernetes or 12+ clouds). Get unified execution, cost savings, and high GPU availability via a simple interface.
https://skypilot.readthedocs.io
Apache License 2.0

[k8s] Fix check pod privileges #4270

Closed romilbhardwaj closed 2 weeks ago

romilbhardwaj commented 2 weeks ago

#4240 introduced a bug when provisioning multi-node clusters with a custom image that does not have sudo installed:

sky launch -c test --num-nodes 4 --cloud kubernetes --image-id nvcr.io/nvidia/nemo:24.05.01 -- echo hi
This command would fail with a sudo not found error.

As an optimization, #4240 ran the privilege check only on the head node, but the check must run in every pod to make sure the sudo alias is set up correctly. This PR fixes that.
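
To illustrate the difference between checking only the head node and checking every pod, here is a minimal sketch. It is not SkyPilot's actual code: `check_pod_privileges`, `run_in_pod`, and the shell command in `PRIVILEGE_CHECK_CMD` are hypothetical names and contents chosen for illustration, and the real check may look quite different.

```python
from typing import Callable, Dict, List

# Hypothetical check (an assumption, not SkyPilot's real command): when sudo
# is absent and the user is root, alias sudo to nothing so that later
# "sudo ..." commands still work.
PRIVILEGE_CHECK_CMD = (
    'if ! command -v sudo >/dev/null 2>&1 && [ "$(id -u)" -eq 0 ]; then '
    "echo \"alias sudo=''\" >> ~/.bashrc; fi"
)


def check_pod_privileges(
    pod_names: List[str],
    run_in_pod: Callable[[str, str], int],
) -> Dict[str, int]:
    """Run the privilege check in every pod and return exit codes by pod name.

    `run_in_pod(pod_name, command)` is a hypothetical callable that executes
    a shell command inside the named pod and returns its exit code.
    """
    results: Dict[str, int] = {}
    # Iterate over all pods, not just pod_names[0] (the head), so that the
    # sudo alias is set up on the workers as well.
    for pod in pod_names:
        results[pod] = run_in_pod(pod, PRIVILEGE_CHECK_CMD)

    failed = [pod for pod, code in results.items() if code != 0]
    if failed:
        raise RuntimeError(f"Privilege check failed in pods: {failed}")
    return results
```

Passing the executor in as a callable keeps the sketch independent of any particular Kubernetes client; in practice it could wrap something like `kubectl exec`.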

Tested with sky launch -c test --num-nodes 4 --cloud kubernetes --image-id nvcr.io/nvidia/nemo:24.05.01 -- echo hi

romilbhardwaj commented 2 weeks ago

Closing in favor of #4297.