skypilot-org / skypilot

SkyPilot: Run AI and batch jobs on any infra (Kubernetes or 12+ clouds). Get unified execution, cost savings, and high GPU availability via a simple interface.
https://skypilot.readthedocs.io
Apache License 2.0
6.82k stars 512 forks

Update K8s docker image build and the source artifact registry #4224

Closed yika-luo closed 3 weeks ago

yika-luo commented 3 weeks ago

This PR depends on https://github.com/skypilot-org/skypilot-catalog/pull/98

Impact of this change:

  1. Make setup.py the single source of truth for SkyPilot dependencies.
  2. Host images in SkyPilot's own GCP projects.
  3. Performance gain: ~10s speedup in SkyPilot cluster creation on K8s.

| VM Type 💻 | Old Provision 🕐 | New Provision 🕐 | % Speedup ✅ |
|---|---|---|---|
| K8 CPU | 58s | 34s | 40% (2x) |
| K8 GPU | 42s | 34s | 20% (1.5x) |
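
For reference, the percentage and multiplier figures can be recomputed from the raw provisioning times. This is a minimal sketch; the helper name `speedup` is made up for illustration. By this arithmetic the percentages match the table after rounding, while the multipliers come out nearer 1.7x and 1.2x than the rounded 2x and 1.5x.

```python
def speedup(old_s: float, new_s: float) -> tuple[float, float]:
    """Return (percent reduction in time, speedup factor) for a timing pair."""
    reduction_pct = (old_s - new_s) / old_s * 100
    factor = old_s / new_s
    return reduction_pct, factor


# Provisioning times taken from the table in this PR description, in seconds.
timings = {"K8 CPU": (58, 34), "K8 GPU": (42, 34)}

for vm_type, (old_s, new_s) in timings.items():
    pct, factor = speedup(old_s, new_s)
    print(f"{vm_type}: {old_s}s -> {new_s}s, {pct:.0f}% less time, {factor:.2f}x faster")
# K8 CPU: 58s -> 34s, 41% less time, 1.71x faster
# K8 GPU: 42s -> 34s, 19% less time, 1.24x faster
```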

Tested (run the relevant ones):

yika-luo commented 3 weeks ago

Thanks for adding this @yika-luo! Could we test the speedup with --system-site-packages removed from our setup script in #4168? I suspect the speedup would be more significant.

Also, it is a bit surprising that provisioning takes longer for CPU than for GPU. Is it because the CPU image is larger than the GPU image? If so, maybe we should move to an Ubuntu base image instead.

Oh, I ran the CPU test on my MacBook. I ran it again on the same GPU instance and refreshed the results. They look comparable now. I also tested removing --system-site-packages and it only gained 1s, but at least there's no regression :)
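
For readers unfamiliar with the flag: --system-site-packages is an option of Python's standard venv module that lets a virtual environment see packages installed in the base interpreter. The setup script from #4168 is not shown in this thread, so the sketch below only illustrates the flag through the stdlib venv API. The `make_venv` helper and the directory names are hypothetical, and this is not SkyPilot's actual setup code.

```python
import tempfile
import venv
from pathlib import Path


def make_venv(path: Path, system_site_packages: bool) -> dict[str, str]:
    """Create a venv (without pip, to keep it fast) and return its pyvenv.cfg as a dict."""
    builder = venv.EnvBuilder(system_site_packages=system_site_packages, with_pip=False)
    builder.create(path)
    cfg = {}
    for line in (path / "pyvenv.cfg").read_text().splitlines():
        key, sep, value = line.partition("=")
        if sep:
            cfg[key.strip()] = value.strip()
    return cfg


with tempfile.TemporaryDirectory() as tmp:
    # Equivalent to `python -m venv <dir>`: isolated from the base interpreter's packages.
    isolated = make_venv(Path(tmp) / "isolated", system_site_packages=False)
    # Equivalent to `python -m venv --system-site-packages <dir>`.
    shared = make_venv(Path(tmp) / "shared", system_site_packages=True)

print(isolated["include-system-site-packages"])  # false
print(shared["include-system-site-packages"])    # true
```

The setting is recorded in pyvenv.cfg, which makes it easy to check which mode an existing environment was created with.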