Open mjrlee opened 1 year ago
It seems the ask is to clearly document how the autoscaler decides which node types to start, and the priority order it uses.
@mjrlee responded in https://github.com/ray-project/ray/issues/39789#issuecomment-1734649494 < can you advise?
@anyscalesam I don't think that answers this question; it's still not clear how the Ray autoscaler decides which node type to start.
What are the use cases for multiple available_node_types here? Maybe just some high-level examples would be really helpful!
I’d like to specify multiple spot instance types so that, if one request fails for lack of capacity, the autoscaler tries the next.
In general it’s just not clear what happens if the user specifies multiple node types with the same resources. From glancing at the code it will just use the first one that can satisfy the requirements, but I’d like to be sure.
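For illustration, the "first one that can satisfy the requirements" reading can be sketched as a plain first-fit search. This is only a model of the behaviour guessed at above, not the autoscaler's actual code (a later reply in this thread points to scoring heuristics instead); the function name, node type names, and resource numbers are all hypothetical.

```python
from typing import Dict, Optional

Resources = Dict[str, float]


def first_fit_node_type(
    node_types: Dict[str, Resources], demand: Resources
) -> Optional[str]:
    """Return the first node type, in config order, whose resources cover the demand."""
    for name, resources in node_types.items():
        if all(resources.get(key, 0) >= amount for key, amount in demand.items()):
            return name
    return None  # no single node type can hold this demand


# Hypothetical node types; the first two are interchangeable in terms of CPU.
node_types = {
    "spot_m5": {"CPU": 8},
    "spot_m5a": {"CPU": 8},
    "spot_c5": {"CPU": 16},
}

chosen = first_fit_node_type(node_types, {"CPU": 8})
print(chosen)
```

Under this model the second interchangeable type is never chosen, which is why the ordering and fallback rules matter for the spot use case.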
On Thu, 28 Sep 2023, at 01:28, Ricky Xu wrote:
What are the use cases for multiple available_node_types here? Maybe just some high-level examples would be really helpful!
Another potential use case: Specify one spot node type and the same node type as on-demand. If the spot request fails, then start an on-demand node in its place.
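A minimal sketch of the requested fallback, assuming a launcher that raises an error when the provider has no capacity. Nothing here is a Ray API: `CapacityError`, `launch_with_fallback`, `fake_launch`, and the node type names are hypothetical and only illustrate the idea of trying node types in a stated order.

```python
from typing import Callable, Sequence


class CapacityError(Exception):
    """Hypothetical error raised when the provider cannot supply a node."""


def launch_with_fallback(
    node_types: Sequence[str], launch: Callable[[str], None]
) -> str:
    """Try each node type in order and return the name of the first that launches."""
    failed = []
    for name in node_types:
        try:
            launch(name)
            return name
        except CapacityError:
            failed.append(name)  # remember the failure and try the next type
    raise CapacityError(f"no capacity for any of: {failed}")


def fake_launch(name: str) -> None:
    """Stand-in launcher: pretend that spot capacity is exhausted."""
    if name.endswith("_spot"):
        raise CapacityError(name)


launched = launch_with_fallback(["cpu_spot", "cpu_on_demand"], fake_launch)
print(launched)
```

The same loop covers the earlier use case of several spot instance types: list them in preference order and put the on-demand type last.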
Yeah, I think there was some pending work to take node availability into account when choosing which node type to launch, but as of now the autoscaler is naive: it is not aware of availability.
It uses some heuristics to choose the "best" node type; see: https://github.com/ray-project/ray/blob/5a6d78ce47ab84ee681d267c0b34c3c5c2bf7b7b/python/ray/autoscaler/_private/resource_demand_scheduler.py#L808-L813
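As a rough illustration of what a "best node type" heuristic can look like, here is a simplified utilization-based scorer. It is not a copy of the linked code; the score tuple, the function names, and the node types are assumptions made for this sketch.

```python
from typing import Dict, Optional, Tuple

Resources = Dict[str, float]


def utilization_score(
    node: Resources, demand: Resources
) -> Optional[Tuple[int, float, float]]:
    """Score how well a demand fills a node; None means the demand does not fit."""
    if any(node.get(key, 0) < amount for key, amount in demand.items()):
        return None
    # Fraction of each node resource the demand would use (0 if unused).
    utils = [demand.get(key, 0) / cap for key, cap in node.items() if cap > 0]
    if not utils:
        return (0, 0.0, 0.0)
    matching = sum(1 for key, cap in node.items() if cap > 0 and key in demand)
    return (matching, min(utils), sum(utils) / len(utils))


def best_node_type(node_types: Dict[str, Resources], demand: Resources) -> Optional[str]:
    """Pick the highest-scoring node type; ties go to the first one in config order."""
    scored = []
    for name, resources in node_types.items():
        score = utilization_score(resources, demand)
        if score is not None:
            scored.append((score, name))
    if not scored:
        return None
    # max() returns the first maximal item, so equal scores keep config order.
    return max(scored, key=lambda item: item[0])[1]


node_types = {"cpu_small": {"CPU": 4}, "cpu_large": {"CPU": 16}, "gpu": {"CPU": 8, "GPU": 1}}
print(best_node_type(node_types, {"CPU": 4}))
```

Note that two interchangeable node types get identical scores under a scheme like this, so the outcome depends entirely on the tie-break, which is the gap this issue is asking the documentation to close.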
Another potential use case: Specify one spot node type and the same node type as on-demand. If the spot request fails, then start an on-demand node in its place.
This is definitely a possible extension. We are actively looking into this and will update once we have an API for review.
Just want to drop my +1 on better documentation of autoscaling behaviour, as well as on options for providing nodes with the same resources but different launch types (spot / on-demand). The current behaviour is to retry the same node type indefinitely, which leads to errors if capacity is not available.
Description
The ray config allows us to specify multiple available_node_types.
It is not clear from the documentation what happens if you specify multiple node types that are interchangeable in terms of CPU/RAM, or when multiples of one instance type could provide the resources of another instance type.
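To make the two ambiguous cases concrete, here is a small sketch with the config written as a Python dict that mirrors the YAML layout. The node type names, instance types, and resource figures are illustrative assumptions, and `nodes_needed` is a hypothetical helper, not part of Ray.

```python
import math
from typing import Dict, Optional

GiB = 1024**3

# Hypothetical fragment of a cluster config; names and numbers are illustrative.
available_node_types = {
    "worker_a": {
        "resources": {"CPU": 8, "memory": 32 * GiB},
        "node_config": {"InstanceType": "m5.2xlarge"},
        "min_workers": 0,
        "max_workers": 4,
    },
    "worker_b": {  # interchangeable with worker_a in terms of CPU/RAM
        "resources": {"CPU": 8, "memory": 32 * GiB},
        "node_config": {"InstanceType": "m5a.2xlarge"},
        "min_workers": 0,
        "max_workers": 4,
    },
    "worker_small": {  # two of these add up to one worker_a
        "resources": {"CPU": 4, "memory": 16 * GiB},
        "node_config": {"InstanceType": "m5.xlarge"},
        "min_workers": 0,
        "max_workers": 8,
    },
}


def nodes_needed(node: Dict[str, float], demand: Dict[str, float]) -> Optional[int]:
    """Nodes of one type needed to cover an aggregate demand, or None if impossible.

    This ignores bin packing: a single task that needs 8 CPUs cannot be
    split across two 4-CPU nodes.
    """
    counts = []
    for key, amount in demand.items():
        capacity = node.get(key, 0)
        if capacity <= 0:
            return None
        counts.append(math.ceil(amount / capacity))
    return max(counts, default=0)


demand = {"CPU": 8, "memory": 32 * GiB}
options = {
    name: nodes_needed(cfg["resources"], demand)
    for name, cfg in available_node_types.items()
}
print(options)
```

All three node types can cover this demand, with one or two nodes each, and the documentation does not say which the autoscaler would pick.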
Link
No response