skypilot-org / skypilot

SkyPilot: Run AI and batch jobs on any infra (Kubernetes or 12+ clouds). Get unified execution, cost savings, and high GPU availability via a simple interface.
https://skypilot.readthedocs.io
Apache License 2.0
6.67k stars 493 forks

[Core] Failover through different instance types with same GPU #3193

Open Michaelvll opened 8 months ago

Michaelvll commented 8 months ago

This issue was discovered in the Fluidstack implementation (#3086). On Fluidstack, the instance types offering a given GPU differ across regions, for example:

region          acc       acc_count  vCPUs  instance_type
norway_4_eu     RTXA4000  1          6      rec3pUyh6pNkIjCaL
illinois_1_usa  RTXA4000  1          8      custom:0:6B224766C0EF48A9A7E5E342DD771D26
new_york_1_usa  RTXA4000  1          8      custom:0:6B224766C0EF48A9A7E5E342DD771D26

When a user runs sky launch --gpus RTXA4000, our failover only tries the instance in norway_4_eu and does not fail over to the other regions that offer the same GPU, because the instance type is different in those regions.
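
To make the behavior concrete, here is a minimal sketch in plain Python. It is not SkyPilot's optimizer code: the `Offering` class and both helper functions are hypothetical, and picking the first matching row is only a stand-in for the optimizer's "cheapest" choice, since the table carries no prices. It contrasts failover pinned to one instance type with failover keyed on the GPU alone.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Offering:
    """One catalog row (hypothetical structure, for illustration only)."""
    region: str
    acc: str
    acc_count: int
    vcpus: int
    instance_type: str


# Rows copied from the table in this issue.
CATALOG = [
    Offering("norway_4_eu", "RTXA4000", 1, 6, "rec3pUyh6pNkIjCaL"),
    Offering("illinois_1_usa", "RTXA4000", 1, 8,
             "custom:0:6B224766C0EF48A9A7E5E342DD771D26"),
    Offering("new_york_1_usa", "RTXA4000", 1, 8,
             "custom:0:6B224766C0EF48A9A7E5E342DD771D26"),
]


def failover_regions_pinned(catalog, acc, acc_count):
    """Regions tried when failover is pinned to a single instance type."""
    matches = [o for o in catalog
               if o.acc == acc and o.acc_count == acc_count]
    if not matches:
        return []
    # Assumption: the first match stands in for the "cheapest" instance type.
    chosen = matches[0].instance_type
    return [o.region for o in matches if o.instance_type == chosen]


def failover_regions_by_gpu(catalog, acc, acc_count):
    """Regions tried when any instance type with the requested GPU qualifies."""
    return [o.region for o in catalog
            if o.acc == acc and o.acc_count == acc_count]


pinned = failover_regions_pinned(CATALOG, "RTXA4000", 1)
by_gpu = failover_regions_by_gpu(CATALOG, "RTXA4000", 1)
print(pinned)
print(by_gpu)
```

With the rows above, the pinned variant reaches only norway_4_eu, while the GPU-keyed variant reaches all three regions.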

There are two possible options for solving this:

  1. Allow failover across different instance types that have the same GPU but slightly different CPU counts. (This needs care for T4 GPUs on some other clouds, where the CPU counts of those instances can vary a lot.)
  2. Show a better hint when multiple candidates are available. The current message is the log output below; it could instead tell the user how the candidate instances differ and how to allow failover across all of them, e.g. using any_of as in the resources snippet after the log:
    I 02-19 20:35:26 optimizer.py:925] Multiple Fluidstack instances satisfy RTXA4000:1. The cheapest Fluidstack(rec3pUyh6pNkIjCaL, {'RTXA4000': 1}) is considered among:
    I 02-19 20:35:26 optimizer.py:925] ['rec3pUyh6pNkIjCaL', 'custom:0:36F6353DC62E4E2397950DE5EC40BD26'].
    resources:
      any_of:
        - cpus: 6+
          accelerators: RTXA4000
        - cpus: 8+
          accelerators: RTXA4000
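
Both options can be sketched roughly as follows. This is an illustration under stated assumptions, not a proposed patch: `failover_candidates`, `any_of_hint`, and the `max_cpu_ratio` threshold are hypothetical names and values, and the rows are the RTXA4000 entries from the table above.

```python
# (region, vCPUs, instance_type) rows for RTXA4000:1, copied from the table.
ROWS = [
    ("norway_4_eu", 6, "rec3pUyh6pNkIjCaL"),
    ("illinois_1_usa", 8, "custom:0:6B224766C0EF48A9A7E5E342DD771D26"),
    ("new_york_1_usa", 8, "custom:0:6B224766C0EF48A9A7E5E342DD771D26"),
]


def failover_candidates(rows, chosen_vcpus, max_cpu_ratio=1.5):
    """Option 1: keep rows whose vCPU count is close to the chosen one.

    max_cpu_ratio is a hypothetical knob; a tight value guards against
    cases like T4 instances whose CPU counts vary a lot.
    """
    lo = chosen_vcpus / max_cpu_ratio
    hi = chosen_vcpus * max_cpu_ratio
    return [r for r in rows if lo <= r[1] <= hi]


def any_of_hint(rows, accelerator):
    """Option 2: render an any_of snippet, one entry per distinct vCPU count."""
    lines = ["resources:", "  any_of:"]
    for vcpus in sorted({r[1] for r in rows}):
        lines.append(f"    - cpus: {vcpus}+")
        lines.append(f"      accelerators: {accelerator}")
    return "\n".join(lines)


candidates = failover_candidates(ROWS, chosen_vcpus=6)
hint = any_of_hint(ROWS, "RTXA4000")
print([region for region, _, _ in candidates])
print(hint)
```

With a ratio of 1.5 around 6 vCPUs, the 8-vCPU instances still qualify, so all three regions become failover candidates; a stricter ratio such as 1.2 would keep only norway_4_eu. The hint function prints the same any_of structure as the snippet above, which could be appended to the optimizer's "Multiple ... instances satisfy" message.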
github-actions[bot] commented 3 months ago

This issue is stale because it has been open 120 days with no activity. Remove stale label or comment or this will be closed in 10 days.

github-actions[bot] commented 3 months ago

This issue was closed because it has been stalled for 10 days with no activity.