ray-project / ray

Ray is a unified framework for scaling AI and Python applications. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

[aws][autoscaler] AWS: When using spot instances, a single availability zone is always selected #24310

Open iirekm opened 2 years ago

iirekm commented 2 years ago

What happened + What you expected to happen

With the config region: us-east-1, the last AZ (us-east-1f) is always selected for the AWS spot request.
When I list all AZs manually (availability_zone: us-east-1d,us-east-1e,us-east-1f,us-east-1a,us-east-1b,us-east-1c), the first one is always selected!

The expected behavior is to consider ALL availability zones. This may matter less for on-demand instances, but spot instances are often unavailable in one zone while still available in others (especially when it comes to GPUs).

Versions / Dependencies

recent

Reproduction script

-

Issue Severity

Medium: It is a significant difficulty but I can work around it.

iirekm commented 2 years ago

I found a workaround: go to https://aws.amazon.com/ec2/spot/instance-advisor/ and find the instance type that is least likely to be interrupted. Still, multi-AZ support would be useful, because single AZs sometimes fail.

mdagost commented 2 years ago

I'll add a heavy ➕ to this ticket.

wuisawesome commented 1 year ago

I suspect the issue here is related to the explainability of the AWS node provider's actions. When a request fails, it implicitly falls back to other AZs, then reports only the last error, which makes it seem like it tries only a single AZ.
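
To illustrate the reporting problem described above, here is a minimal sketch of a fallback loop that keeps the error from every AZ instead of only the last one. It is not the node provider's actual code: `launch_with_az_fallback` and the `launch_in_az` callable are hypothetical names used only for illustration.

```python
from typing import Callable, Dict, List, Tuple


def launch_with_az_fallback(
    zones: List[str],
    launch_in_az: Callable[[str], str],
) -> Tuple[str, Dict[str, str]]:
    """Try each zone in order and return (instance_id, errors_by_zone).

    `launch_in_az` is a hypothetical callable that launches one instance
    in the given zone and raises an exception on failure.
    """
    errors: Dict[str, str] = {}
    for zone in zones:
        try:
            instance_id = launch_in_az(zone)
        except Exception as exc:
            # Record the failure per zone so it is not overwritten
            # by the next attempt.
            errors[zone] = str(exc)
            continue
        return instance_id, errors

    # Every zone failed: report all errors, not just the last one.
    summary = "; ".join(f"{zone}: {msg}" for zone, msg in errors.items())
    raise RuntimeError(f"launch failed in every zone: {summary}")
```

With this shape, a final error message names each AZ that was tried, which would make it visible that more than one zone was attempted.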

@iirekm @mdagost if either of you are still running into this issue, could you try to check if spot instances are actually available in the other AZs? I suspect they won't be for the reason above.
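
One rough way to start that check is to list which AZs offer the instance type at all. The sketch below assumes `ec2_client` is a boto3 EC2 client (for example `boto3.client("ec2", region_name="us-east-1")`) and uses its `describe_instance_type_offerings` call; the helper name `zones_offering_instance_type` is made up for this example. Note that this shows only where the type is offered, not whether spot capacity is currently available there.

```python
from typing import Any, Dict, List


def zones_offering_instance_type(ec2_client: Any, instance_type: str) -> List[str]:
    """Return the sorted AZ names in the client's region that offer `instance_type`.

    `ec2_client` is assumed to be a boto3 EC2 client; it is passed in so the
    function itself needs no AWS credentials to import.
    """
    kwargs: Dict[str, Any] = {
        "LocationType": "availability-zone",
        "Filters": [{"Name": "instance-type", "Values": [instance_type]}],
    }
    zones: List[str] = []
    while True:
        page = ec2_client.describe_instance_type_offerings(**kwargs)
        zones.extend(o["Location"] for o in page.get("InstanceTypeOfferings", []))
        token = page.get("NextToken")
        if not token:
            break
        # Follow pagination until there are no more pages.
        kwargs["NextToken"] = token
    return sorted(zones)


# Example usage (requires boto3 and AWS credentials; the instance type is
# only an illustration):
#   import boto3
#   client = boto3.client("ec2", region_name="us-east-1")
#   print(zones_offering_instance_type(client, "p3.2xlarge"))
```

If the type is offered in several AZs but launches still land in only one, that would point back at the selection logic rather than at availability.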

Note that this is still a usability issue that we should try to fix, just trying to understand the exact issue.