[Enhancement][Sky Serve] Support a mix of spot and on-demand instances

skypilot-org / skypilot

SkyPilot: Run AI and batch jobs on any infra (Kubernetes or 12+ clouds). Get unified execution, cost savings, and high GPU availability via a simple interface.

https://skypilot.readthedocs.io

Apache License 2.0


[Enhancement][Sky Serve] Support a mix of spot and on-demand instances #2529

Closed steve-marmalade closed 3 months ago

steve-marmalade commented 1 year ago

Hey there, I recently saw #2528 and thought of an enhancement that I would personally find very valuable, so I figured I'd share it here.

The feature in that PR is to allow spot instances to serve inference requests. This is indeed a very useful feature for managing costs, but it's not something I could depend on in production, because the availability of spot instances can dry up suddenly if large compute jobs are launched.

What would be amazing is a feature that says "if the number of instances is below a threshold, then spin up regular instances to maintain high availability". Once demand subsides, these would be replaced with spot nodes.

In short: availability is the highest priority, followed by saving cost as a secondary concern.

infwinston commented 1 year ago

Hey @steve-marmalade, thanks! The mixed use of on-demand + spot is definitely one of the next steps we're planning. We'd love to learn more about your use case. For example, would the option of 1 on-demand + 1 spot be preferred over 2 on-demand in your case? The former gives you ~30% saving and some guarantee that the service is still UP.
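The ~30% figure is consistent with simple arithmetic if one assumes that a spot instance costs roughly 40% of the on-demand price. That ratio is an assumption for illustration only and is not stated in the thread; `mix_saving` and `spot_price_ratio` are hypothetical names. A minimal sketch:

```python
def mix_saving(n_ondemand: int, n_spot: int, spot_price_ratio: float) -> float:
    """Fractional saving of a mixed fleet versus an all-on-demand fleet of the same size.

    spot_price_ratio is (spot price / on-demand price); an assumed input,
    not a number taken from the thread.
    """
    total = n_ondemand + n_spot
    if total == 0:
        raise ValueError("fleet must contain at least one instance")
    # Cost of the mixed fleet, measured in units of one on-demand instance.
    mixed_cost = n_ondemand + n_spot * spot_price_ratio
    return 1.0 - mixed_cost / total


# 1 on-demand + 1 spot, with spot assumed to cost 40% of on-demand.
print(round(mix_saving(1, 1, 0.4), 2))  # prints 0.3
```

Under that assumption, a cheaper spot price raises the saving and a pricier one lowers it, while the single on-demand replica stays unaffected by preemption.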

steve-marmalade commented 1 year ago

Hi @infwinston, that's awesome to hear. Yes, I think the option of 1 on-demand + 1 spot would be preferred, for the reason that you mention.

To keep high availability and minimize cost, I imagine the algorithm could get pretty complex? For example, we could have only spot instances running (while they're available) and once we notice that they're being pre-empted quickly (or we are dropping below a threshold), we start bringing up on-demand instances to compensate.

But building on your example, maybe a simpler version to start is to always have some number of on-demand nodes, and then use spot instances for cheap scalability. This ensures the service would never go down completely.

abhimasand commented 10 months ago

@infwinston The option of mixed on-demand and spot instances would be highly beneficial. Has there been any work on this? If not, I am open to contributing and helping in any way I can.

infwinston commented 10 months ago

Yes, @MaoZiming is leading this development! Would you mind sharing your use case and requirements?

abhimasand commented 10 months ago

Hi @infwinston, the primary use case is to self-host 7B-34B models and give access to them via API endpoints (with some kind of authentication). Some good things to have would be the ability to reduce the cold start time to ~10-20s, as done by Baseten and llm-engine, and the ability to scale to zero while ensuring snappy autoscaling and high availability.

abhimasand commented 8 months ago

Hi @infwinston and @MaoZiming, just checking in to see if there have been any updates on this issue?

MaoZiming commented 8 months ago

Hi @abhimasand, thanks for the interest and sorry for the delay. We are planning to add two additional fields in the SkyServe yaml to better support spot instances:

- base_ondemand_fallback_replicas specifies a base number of on-demand instances in addition to spot instances.
- dynamic_ondemand_fallback: true allows SkyServe to replenish preempted spot instances with on-demand instances, and to downscale on-demand instances when spot availability is back.

Let us know if you have any feedback : )
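One way to read how the two fields could interact is sketched below. This is an illustrative model only, not SkyServe's actual autoscaler: `plan_replicas`, `ReplicaPlan`, `target_spot`, and `ready_spot` are hypothetical names, and the sketch assumes the base on-demand replicas are counted on top of the spot target.

```python
from dataclasses import dataclass


@dataclass
class ReplicaPlan:
    spot: int      # spot replicas to keep requesting
    ondemand: int  # on-demand replicas to keep running


def plan_replicas(target_spot: int,
                  ready_spot: int,
                  base_ondemand_fallback_replicas: int = 0,
                  dynamic_ondemand_fallback: bool = False) -> ReplicaPlan:
    """Hypothetical planner combining the two proposed fields."""
    # Base on-demand replicas always run, regardless of spot availability.
    ondemand = base_ondemand_fallback_replicas
    if dynamic_ondemand_fallback:
        # Cover every missing spot replica with an on-demand one; the extra
        # count drops back to zero once spot capacity returns.
        ondemand += max(0, target_spot - ready_spot)
    return ReplicaPlan(spot=target_spot, ondemand=ondemand)


# Two of three spot replicas preempted: 1 base + 2 fallback on-demand replicas.
print(plan_replicas(3, 1, base_ondemand_fallback_replicas=1,
                    dynamic_ondemand_fallback=True))
```

In this model, once all three spot replicas are ready again the plan returns to a single on-demand replica, which matches the "downscale on-demand instances when spot availability is back" behavior described above.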

github-actions[bot] commented 4 months ago

This issue is stale because it has been open 120 days with no activity. Remove stale label or comment or this will be closed in 10 days.

Michaelvll commented 3 months ago

This has been fixed by #3194