skypilot-org / skypilot

SkyPilot: Run AI and batch jobs on any infra (Kubernetes or 12+ clouds). Get unified execution, cost savings, and high GPU availability via a simple interface.
https://skypilot.readthedocs.io
Apache License 2.0

[Core][Environments] `/sbin` not in the system path for some of the GCP VMs #3203

Open cblmemo opened 8 months ago

cblmemo commented 8 months ago

/sbin is not in the system path on certain GCP VMs. Because of this, the vLLM + Mixtral example has to update the system path itself: https://github.com/skypilot-org/skypilot/blob/f4541059718a4469177079300e57370c1a4a052c/llm/mixtral/serve.yaml#L38

To reproduce:

$ sky launch --cloud gcp -c bug-env
$ sky exec bug-env 'echo $PATH'
(sky-cmd, pid=7266) /opt/conda/bin:/opt/conda/condabin:/usr/local/bin:/usr/bin:/bin:/usr/local/games:/usr/games
$ sky exec bug-env 'ldconfig'
(sky-cmd, pid=7267) bash: ldconfig: command not found
ERROR: Job 1 failed with return code list: [127]
$ sky exec bug-env 'PATH=$PATH:/sbin ldconfig'
(sky-cmd, pid=8136) ldconfig: Can't create temporary cache file /etc/ld.so.cache~: Permission denied

Note that Azure's default image does include /sbin in the system path:

$ sky launch --cloud azure -c bug-env-azure
$ sky exec bug-env-azure 'echo $PATH'
(sky-cmd, pid=9183) /home/azureuser/miniconda3/bin:/home/azureuser/miniconda3/condabin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin:/usr/local/cuda/bin:/usr/local/cuda/bin
$ sky exec bug-env-azure 'ldconfig'
(sky-cmd, pid=10248) /sbin/ldconfig.real: Can't link /lib/x86_64-linux-gnu/libnvperf_host.so to libnvperf_dcgm_host.so
(sky-cmd, pid=10248) /sbin/ldconfig.real: Can't create temporary cache file /etc/ld.so.cache~: Permission denied

We should address this inconsistency between clouds.
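
Until the paths are made consistent, a task can guard against the missing entries itself. The following is only a minimal sketch of such a workaround, for example at the top of a task's setup or run commands. `ensure_in_path` is a hypothetical helper name, not part of SkyPilot, and the assumption is that `/sbin` and `/usr/sbin` are the only directories missing on the affected GCP images.

```shell
# Hypothetical helper (not a SkyPilot API): append a directory to PATH
# only if it is not already listed, so repeated calls do not duplicate it.
ensure_in_path() {
  case ":${PATH}:" in
    *":$1:"*) ;;                # already present, nothing to do
    *) PATH="${PATH}:$1" ;;     # append at the end, keeping existing priority
  esac
}

# Assumption: these are the directories missing on the affected GCP VMs.
ensure_in_path /usr/sbin
ensure_in_path /sbin
export PATH

# Check whether ldconfig now resolves; report instead of failing if it does not.
command -v ldconfig || echo "ldconfig still not found on PATH"
```

This only fixes the lookup. As the output above shows, running `ldconfig` as a non-root user still fails with `Permission denied` when it tries to rewrite `/etc/ld.so.cache`, on both GCP and Azure.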

github-actions[bot] commented 4 months ago

This issue is stale because it has been open 120 days with no activity. Remove stale label or comment or this will be closed in 10 days.
