ray-project / ray

Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

[Autoscaler] Multiple available_node_types documentation #39788

Open mjrlee opened 1 year ago

mjrlee commented 1 year ago

Description

The Ray cluster config allows us to specify multiple available_node_types.

It is not clear from the documentation what happens if you specify multiple node types that are interchangeable in terms of CPU/RAM, or when several nodes of one instance type could together provide the resources of another.
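For concreteness, a config along these lines (a minimal sketch assuming the AWS provider; instance types, resource counts, and node type names are illustrative) declares two worker node types that both advertise 8 CPUs, and it is exactly this situation the documentation should explain:

```yaml
# Sketch only: AWS provider assumed, instance types and counts are illustrative.
cluster_name: multi-node-type-example
max_workers: 10

provider:
  type: aws
  region: us-west-2

available_node_types:
  head.default:
    node_config:
      InstanceType: m5.large
    resources: {"CPU": 2}
  worker.m5:
    node_config:
      InstanceType: m5.2xlarge
    min_workers: 0
    max_workers: 10
    resources: {"CPU": 8}
  worker.c5:
    # Interchangeable with worker.m5 in terms of CPU count.
    node_config:
      InstanceType: c5.2xlarge
    min_workers: 0
    max_workers: 10
    resources: {"CPU": 8}

head_node_type: head.default
```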

Link

No response

scottsun94 commented 1 year ago

It seems that the ask is to clearly state how autoscaler determines which node types to start and the underlying priority.

anyscalesam commented 1 year ago

@mjrlee responded in https://github.com/ray-project/ray/issues/39789#issuecomment-1734649494 - can you advise?

mjrlee commented 1 year ago

@anyscalesam I don't think that answers this question; it's still not clear how the Ray autoscaler decides which node type to start.

rickyyx commented 1 year ago

What are the use cases for multiple available_node_types here? Maybe just some high-level examples would be really helpful!

mjrlee commented 1 year ago

I'd like to specify multiple spot instance types so that if one request fails because of a lack of capacity, the autoscaler tries the next.

In general it's just not clear what happens if the user specifies multiple node types with the same resources. From glancing at the code it looks like it will just use the first one that can satisfy the requirements, but I'd like to be sure.
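Something like the following (a sketch only; the AWS-style InstanceMarketOptions field and instance types are assumptions about the provider config), where the hope is that the autoscaler falls back from spot.m5 to spot.c5 when a spot request is rejected for lack of capacity:

```yaml
# Sketch: two spot worker types with equivalent CPU resources.
available_node_types:
  spot.m5:
    node_config:
      InstanceType: m5.2xlarge
      InstanceMarketOptions:
        MarketType: spot
    max_workers: 10
    resources: {"CPU": 8}
  spot.c5:
    node_config:
      InstanceType: c5.2xlarge
      InstanceMarketOptions:
        MarketType: spot
    max_workers: 10
    resources: {"CPU": 8}
```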

mjrlee commented 1 year ago

Another potential use case: Specify one spot node type and the same node type as on-demand. If the spot request fails, then start an on-demand node in its place.
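That is, something along these lines (sketch only; AWS-style fields and names assumed), where worker.m5.ondemand would only be launched if the spot request for worker.m5.spot fails:

```yaml
available_node_types:
  worker.m5.spot:
    node_config:
      InstanceType: m5.2xlarge
      InstanceMarketOptions:
        MarketType: spot
    max_workers: 10
    resources: {"CPU": 8}
  worker.m5.ondemand:
    node_config:
      InstanceType: m5.2xlarge
    max_workers: 10
    resources: {"CPU": 8}
```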

rickyyx commented 1 year ago

I'd like to specify multiple spot instance types so that if one request fails because of a lack of capacity, the autoscaler tries the next. In general it's just not clear what happens if the user specifies multiple node types with the same resources. From glancing at the code it looks like it will just use the first one that can satisfy the requirements, but I'd like to be sure.

Yeah, I think there was some pending work to take node availability into account when choosing which node type to launch, but as of now the autoscaler is naive in that it's not aware of this.

It has some heuristics for choosing which is the "best" node type here: https://github.com/ray-project/ray/blob/5a6d78ce47ab84ee681d267c0b34c3c5c2bf7b7b/python/ray/autoscaler/_private/resource_demand_scheduler.py#L808-L813
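A much-simplified illustration of that kind of "pick the best-fitting feasible node type" selection (not the actual code behind the link; the node type names and scoring rule are illustrative):

```python
# Much-simplified illustration only; the real heuristic in
# resource_demand_scheduler.py also handles bin-packing, min/max workers, etc.
from typing import Dict, Optional


def pick_node_type(
    available_node_types: Dict[str, dict], demand: Dict[str, float]
) -> Optional[str]:
    """Return the node type whose resources best fit a single demand bundle."""
    best_name, best_score = None, -1.0
    for name, spec in available_node_types.items():
        resources = spec.get("resources", {})
        # Skip node types that cannot satisfy the demand at all.
        if any(resources.get(key, 0) < amt for key, amt in demand.items()):
            continue
        # Prefer the node type that the demand utilizes most fully (least waste).
        score = min(amt / resources[key] for key, amt in demand.items())
        if score > best_score:
            best_name, best_score = name, score
    return best_name


# Both worker types can satisfy {"CPU": 8}; ties like this are exactly where
# the documented tie-breaking behaviour matters.
print(pick_node_type(
    {"worker.m5": {"resources": {"CPU": 8}},
     "worker.c5": {"resources": {"CPU": 8}}},
    {"CPU": 8},
))
```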

rickyyx commented 1 year ago

Another potential use case: Specify one spot node type and the same node type as on-demand. If the spot request fails, then start an on-demand node in its place.

This is definitely a possible extension. We are actively looking into this and will update once we have an API for review.

zakajd commented 1 year ago

Just want to drop my +1 on better documentation of autoscaling behaviour, as well as on options for providing same-resource nodes with different launch types (spot / on-demand). The current behaviour is to retry the same node type indefinitely, which leads to errors if capacity is not available.