skypilot-org / skypilot

SkyPilot: Run AI and batch jobs on any infra (Kubernetes or 12+ clouds). Get unified execution, cost savings, and high GPU availability via a simple interface.
https://skypilot.readthedocs.io
Apache License 2.0

[Jobs] Speed up the time for managed jobs to be scheduled #4294

Closed Michaelvll closed 6 days ago

Michaelvll commented 2 weeks ago

When many managed jobs are submitted at once, they are scheduled more slowly than they are submitted. This was observed on a jobs controller with 16 CPU cores.

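import subprocess
from multiprocessing.pool import ThreadPool

def run_task(task):
    # Submit one managed job through the CLI; each job prints its index and sleeps for an hour.
    print(f'Running task {task}')
    subprocess.run(
        f'sky jobs launch -n job-{task} -dy --fast --cloud aws -t t3.medium --use-spot "echo hi {task}; sleep 3600"',
        shell=True
    )

# Submit 1000 jobs, with up to 8 submissions in flight at a time.
with ThreadPool(8) as pool:
    pool.map(run_task, range(1000))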

The gap between consecutive jobs being scheduled can exceed 20 seconds.
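
To put a number on the gap, one option is to collect the time at which each job was scheduled and compute the difference between consecutive entries. The snippet below is only a sketch: `scheduling_gaps` is a hypothetical helper, the timestamps are made-up placeholders rather than data from this report, and how the timestamps are gathered (for example, copied by hand from the jobs queue output) is left open.

```python
from datetime import datetime


def scheduling_gaps(timestamps):
    """Return the gaps, in seconds, between consecutive scheduling times.

    `timestamps` is a list of ISO-format strings ('YYYY-MM-DD HH:MM:SS'),
    one per job, in any order.
    """
    ordered = sorted(datetime.fromisoformat(t) for t in timestamps)
    return [(later - earlier).total_seconds()
            for earlier, later in zip(ordered, ordered[1:])]


# Placeholder timestamps for illustration only (not measured values).
example = [
    '2024-01-01 10:00:00',
    '2024-01-01 10:00:05',
    '2024-01-01 10:00:30',
]

gaps = scheduling_gaps(example)
# Keep only the gaps longer than the 20-second threshold mentioned above.
slow = [gap for gap in gaps if gap > 20]
print(f'gaps={gaps}, over 20s={slow}')
```

With real timestamps in place of the placeholders, the `slow` list shows how often the scheduler falls behind by more than 20 seconds.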


Version & Commit info: