wjzhou-ep opened this issue 1 year ago (status: Open)
@wjzhou-ep @scv119
We seem to be seeing this behaviour too, maybe slightly different, but essentially the autoscaler is starting more nodes than necessary.
Initial request for 24x:
{'CPU': 32.0, 'GPU': 1.0}: 24+ pending tasks/actors
2023-07-04 22:44:01,163 INFO autoscaler.py:1366 -- StandardAutoscaler: Queue 24 new nodes for launch
2023-07-04 22:44:01,164 INFO node_launcher.py:166 -- BaseNodeLauncher: Got 24 nodes to launch.
2023-07-04 22:44:01,164 INFO node_launcher.py:166 -- BaseNodeLauncher: Launching 24 nodes, type workergroup.
2023-07-04 22:44:01,164 INFO node_provider.py:286 -- Autoscaler is submitting the following patch to RayCluster coder-xavierholt-ray-1688535753 in namespace annalise-coder-prod.
2023-07-04 22:44:01,164 INFO node_provider.py:290 -- [{'op': 'replace', 'path': '/spec/workerGroupSpecs/0/replicas', 'value': 24}]
2023-07-04 22:44:01,193 INFO autoscaler.py:462 -- The autoscaler took 0.075 seconds to complete the update iteration.
2023-07-04 22:44:01,193 INFO monitor.py:428 -- :event_summary:Adding 24 node(s) of type workergroup.
2023-07-04 22:44:06,283 INFO node_provider.py:257 -- Fetched pod data at resource version 210040046.
2023-07-04 22:44:06,283 INFO autoscaler.py:143 -- The autoscaler took 0.056 seconds to fetch the list of non-terminated nodes.
2023-07-04 22:44:06,284 INFO autoscaler.py:419 --
A few loops later it starts queuing unnecessary nodes.
{'CPU': 32.0, 'GPU': 1.0}: 24+ pending tasks/actors
2023-07-04 22:44:16,577 INFO autoscaler.py:1366 -- StandardAutoscaler: Queue 2 new nodes for launch
2023-07-04 22:44:16,577 INFO node_launcher.py:166 -- BaseNodeLauncher: Got 2 nodes to launch.
2023-07-04 22:44:16,577 INFO node_launcher.py:166 -- BaseNodeLauncher: Launching 2 nodes, type workergroup.
2023-07-04 22:44:16,577 INFO node_provider.py:286 -- Autoscaler is submitting the following patch to RayCluster coder-xavierholt-ray-1688535753 in namespace annalise-coder-prod.
2023-07-04 22:44:16,577 INFO node_provider.py:290 -- [{'op': 'replace', 'path': '/spec/workerGroupSpecs/0/replicas', 'value': 26}]
2023-07-04 22:44:16,608 INFO autoscaler.py:462 -- The autoscaler took 0.097 seconds to complete the update iteration.
2023-07-04 22:44:16,609 INFO monitor.py:428 -- :event_summary:Resized to 64 CPUs, 2 GPUs.
2023-07-04 22:44:16,609 INFO monitor.py:428 -- :event_summary:Adding 2 node(s) of type workergroup.
2023-07-04 22:44:21,724 INFO node_provider.py:257 -- Fetched pod data at resource version 210040402.
2023-07-04 22:44:21,724 INFO autoscaler.py:143 -- The autoscaler took 0.065 seconds to fetch the list of non-terminated nodes.
2023-07-04 22:44:21,725 INFO autoscaler.py:419 --
And then some more
{'CPU': 32.0, 'GPU': 1.0}: 16+ pending tasks/actors
2023-07-04 22:44:32,167 INFO autoscaler.py:1366 -- StandardAutoscaler: Queue 6 new nodes for launch
2023-07-04 22:44:32,167 INFO node_launcher.py:166 -- BaseNodeLauncher: Got 6 nodes to launch.
2023-07-04 22:44:32,167 INFO node_launcher.py:166 -- BaseNodeLauncher: Launching 6 nodes, type workergroup.
2023-07-04 22:44:32,168 INFO node_provider.py:286 -- Autoscaler is submitting the following patch to RayCluster coder-senorchang-ray-1688535753 in namespace anon-coder-prod.
2023-07-04 22:44:32,168 INFO node_provider.py:290 -- [{'op': 'replace', 'path': '/spec/workerGroupSpecs/0/replicas', 'value': 32}]
2023-07-04 22:44:32,203 INFO autoscaler.py:462 -- The autoscaler took 0.176 seconds to complete the update iteration.
2023-07-04 22:44:32,204 INFO monitor.py:428 -- :event_summary:Resized to 512 CPUs, 16 GPUs.
2023-07-04 22:44:32,204 INFO monitor.py:428 -- :event_summary:Adding 6 node(s) of type workergroup.
2023-07-04 22:44:37,303 INFO node_provider.py:257 -- Fetched pod data at resource version 210040797.
2023-07-04 22:44:37,304 INFO autoscaler.py:143 -- The autoscaler took 0.067 seconds to fetch the list of non-terminated nodes.
2023-07-04 22:44:37,305 INFO autoscaler.py:419 --
We don't understand why. The issue for us is that these pods pend forever due to scheduling constraints around the pods, and they count against resource quotas even though they aren't valid.
Attached logs: autoscaler.txt
Seeing the same as well.
No jobs running...
======== Autoscaler status: 2023-07-18 02:18:09.914151 ========
Node status
---------------------------------------------------------------
Healthy:
4 worker_node
1 head_node
Pending:
(no pending nodes)
Recent failures:
(no failures)
Resources
---------------------------------------------------------------
Usage:
0.0/24.0 CPU
0B/29.45GiB memory
0B/12.47GiB object_store_memory
Demands:
(no resource demands)
2023-07-18 02:18:09,916 INFO autoscaler.py:470 -- The autoscaler took 0.065 seconds to complete the update iteration.
2023-07-18 02:18:14,984 INFO autoscaler.py:147 -- The autoscaler took 0.053 seconds to fetch the list of non-terminated nodes.
2023-07-18 02:18:14,985 INFO autoscaler.py:427 --
Then five 4-CPU jobs started with 24 CPUs already available, and 4 more instances were launched unnecessarily (a sketch of the workload follows the log below).
======== Autoscaler status: 2023-07-18 02:18:14.985085 ========
Node status
---------------------------------------------------------------
Healthy:
4 worker_node
1 head_node
Pending:
(no pending nodes)
Recent failures:
(no failures)
Resources
---------------------------------------------------------------
Usage:
20.0/24.0 CPU
0B/29.45GiB memory
0B/12.47GiB object_store_memory
Demands:
{'CPU': 4.0}: 5+ pending tasks/actors
2023-07-18 02:18:14,987 INFO autoscaler.py:1374 -- StandardAutoscaler: Queue 4 new nodes for launch
2023-07-18 02:18:14,987 INFO autoscaler.py:470 -- The autoscaler took 0.056 seconds to complete the update iteration.
2023-07-18 02:18:14,988 INFO node_launcher.py:166 -- NodeLauncher0: Got 4 nodes to launch.
2023-07-18 02:18:16,976 INFO node_launcher.py:166 -- NodeLauncher0: Launching 4 nodes, type worker_node.
2023-07-18 02:18:20,149 INFO autoscaler.py:147 -- The autoscaler took 0.141 seconds to fetch the list of non-terminated nodes.
2023-07-18 02:18:20,150 INFO autoscaler.py:427 --
======== Autoscaler status: 2023-07-18 02:18:20.150147 ========
Node status
---------------------------------------------------------------
Healthy:
4 worker_node
1 head_node
Pending:
172.31.3.147: worker_node, uninitialized
172.31.14.20: worker_node, uninitialized
172.31.5.8: worker_node, uninitialized
172.31.13.158: worker_node, uninitialized
Recent failures:
(no failures)
Resources
---------------------------------------------------------------
Usage:
20.0/24.0 CPU
0B/29.45GiB memory
74.59MiB/12.47GiB object_store_memory
Demands:
(no resource demands)
2023-07-18 02:18:20,152 INFO autoscaler.py:1322 -- Creating new (spawn_updater) updater thread for node i-0231fa18511c6f83f.
2023-07-18 02:18:20,152 INFO autoscaler.py:1322 -- Creating new (spawn_updater) updater thread for node i-0970cd9533d7fb8fe.
2023-07-18 02:18:20,153 INFO autoscaler.py:1322 -- Creating new (spawn_updater) updater thread for node i-0c5c4df3cac069b0e.
2023-07-18 02:18:20,154 INFO autoscaler.py:1322 -- Creating new (spawn_updater) updater thread for node i-0db0afd004f36a21a.
2023-07-18 02:18:20,155 INFO autoscaler.py:470 -- The autoscaler took 0.147 seconds to complete the update iteration.
2023-07-18 02:18:20,156 INFO monitor.py:423 -- :event_summary:Adding 4 node(s) of type worker_node.
2023-07-18 02:18:25,313 INFO autoscaler.py:147 -- The autoscaler took 0.136 seconds to fetch the list of non-terminated nodes.
2023-07-18 02:18:25,314 INFO autoscaler.py:427 --
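For reference, the five pending jobs correspond to something like the sketch below (illustrative only; the real jobs do actual work):

```python
import ray

ray.init()

# Five tasks that each reserve 4 CPUs, matching the
# "{'CPU': 4.0}: 5+ pending tasks/actors" demand above.
# All five fit within the 24 CPUs already available, so no
# scale-up should have been needed, yet 4 nodes were launched.
@ray.remote(num_cpus=4)
def job():
    ...  # placeholder for the real work

ray.get([job.remote() for _ in range(5)])
```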
I'm also seeing this in a cluster I started on AWS EC2 instances with `ray up`. There's an additional wrinkle in my case where the "extra" worker is also oversized for the work that's actually in the queue.
The config (I've put the whole thing at the end of this comment) includes multiple node types, two of which are `ray.worker.ray-dev-r6a.2xlarge` and `ray.worker.ray-dev-r6a.4xlarge`, configured in the autoscaler with 47.8 GiB and 89.6 GiB of memory respectively. I start with just the head node running, which has a worker too small for the workload I'm testing. Then I start a workload with a requirement of 35 GiB of memory. At first a `2xlarge` worker node starts as expected, but right as it finishes starting, the autoscaler also starts a `4xlarge` node, which is both unnecessary and too large. The job ends up scheduling on the `2xlarge` as expected and the `4xlarge` node just ends up shutting down after being idle.
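For reference, the demand line in the log below ({'CPU': 4.0, 'memory': 37580963840.0}) corresponds to a task declared roughly like this; a minimal sketch rather than my actual job code (37580963840 bytes = 35 GiB):

```python
import ray

ray.init()

# Reserve 4 CPUs and 35 GiB of memory per task; this appears in the
# autoscaler's demand summary as {'CPU': 4.0, 'memory': 37580963840.0}.
@ray.remote(num_cpus=4, memory=35 * 1024**3)
def workload():
    ...  # placeholder for the real work

ray.get(workload.remote())
```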
Autoscaler log:
======== Autoscaler status: 2023-07-20 07:29:54.064934 ========
Node status
---------------------------------------------------------------
Healthy:
1 ray.head.ray-dev
Pending:
10.128.3.31: ray.worker.ray-dev-r6a.2xlarge, setting-up
Recent failures:
(no failures)
Resources
---------------------------------------------------------------
Usage:
0.0/2.0 CPU
0B/4.36GiB memory
0B/2.18GiB object_store_memory
Demands:
{'CPU': 4.0, 'memory': 37580963840.0}: 1+ pending tasks/actors
2023-07-20 07:29:54,066 INFO autoscaler.py:470 -- The autoscaler took 0.12 seconds to complete the update iteration.
2023-07-20 07:29:59,145 INFO autoscaler.py:147 -- The autoscaler took 0.051 seconds to fetch the list of non-terminated nodes.
2023-07-20 07:29:59,146 INFO autoscaler.py:427 --
======== Autoscaler status: 2023-07-20 07:29:59.146404 ========
Node status
---------------------------------------------------------------
Healthy:
1 ray.head.ray-dev
1 ray.worker.ray-dev-r6a.2xlarge
Pending:
(no pending nodes)
Recent failures:
(no failures)
Resources
---------------------------------------------------------------
Usage:
4.0/10.0 CPU
35.00GiB/49.16GiB memory
0B/20.60GiB object_store_memory
Demands:
{'CPU': 4.0, 'memory': 37580963840.0}: 1+ pending tasks/actors
2023-07-20 07:29:59,147 INFO autoscaler.py:1374 -- StandardAutoscaler: Queue 1 new nodes for launch
2023-07-20 07:29:59,147 INFO autoscaler.py:470 -- The autoscaler took 0.053 seconds to complete the update iteration.
2023-07-20 07:29:59,148 INFO node_launcher.py:166 -- NodeLauncher1: Got 1 nodes to launch.
2023-07-20 07:29:59,148 INFO monitor.py:423 -- :event_summary:Resized to 10 CPUs.
2023-07-20 07:30:00,403 INFO node_launcher.py:166 -- NodeLauncher1: Launching 1 nodes, type ray.worker.ray-dev-r6a.4xlarge.
2023-07-20 07:30:04,291 INFO autoscaler.py:147 -- The autoscaler took 0.117 seconds to fetch the list of non-terminated nodes.
2023-07-20 07:30:04,291 INFO autoscaler.py:427 --
Full cluster config:
auth: {ssh_user: ubuntu}
available_node_types:
  ray.head.ray-dev:
    node_config:
      BlockDeviceMappings:
      - DeviceName: /dev/sda1
        Ebs: {VolumeSize: 140, VolumeType: gp3}
      ImageId: ami-0387d929287ab193e
      InstanceType: m5.large
    resources: {}
  ray.worker.ray-dev-r6a.12xlarge:
    max_workers: 5
    min_workers: 0
    node_config:
      IamInstanceProfile: {Arn: 'arn:aws:iam::997204316583:instance-profile/ray-autoscaler-v1'}
      ImageId: ami-0387d929287ab193e
      InstanceMarketOptions: {MarketType: spot}
      InstanceType: r6a.12xlarge
  ray.worker.ray-dev-r6a.16xlarge:
    max_workers: 5
    min_workers: 0
    node_config:
      IamInstanceProfile: {Arn: 'arn:aws:iam::997204316583:instance-profile/ray-autoscaler-v1'}
      ImageId: ami-0387d929287ab193e
      InstanceMarketOptions: {MarketType: spot}
      InstanceType: r6a.16xlarge
  ray.worker.ray-dev-r6a.24xlarge:
    max_workers: 5
    min_workers: 0
    node_config:
      IamInstanceProfile: {Arn: 'arn:aws:iam::997204316583:instance-profile/ray-autoscaler-v1'}
      ImageId: ami-0387d929287ab193e
      InstanceMarketOptions: {MarketType: spot}
      InstanceType: r6a.24xlarge
  ray.worker.ray-dev-r6a.2xlarge:
    max_workers: 5
    min_workers: 0
    node_config:
      IamInstanceProfile: {Arn: 'arn:aws:iam::997204316583:instance-profile/ray-autoscaler-v1'}
      ImageId: ami-0387d929287ab193e
      InstanceMarketOptions: {MarketType: spot}
      InstanceType: r6a.2xlarge
  ray.worker.ray-dev-r6a.32xlarge:
    max_workers: 5
    min_workers: 0
    node_config:
      IamInstanceProfile: {Arn: 'arn:aws:iam::997204316583:instance-profile/ray-autoscaler-v1'}
      ImageId: ami-0387d929287ab193e
      InstanceMarketOptions: {MarketType: spot}
      InstanceType: r6a.32xlarge
  ray.worker.ray-dev-r6a.4xlarge:
    max_workers: 5
    min_workers: 0
    node_config:
      IamInstanceProfile: {Arn: 'arn:aws:iam::997204316583:instance-profile/ray-autoscaler-v1'}
      ImageId: ami-0387d929287ab193e
      InstanceMarketOptions: {MarketType: spot}
      InstanceType: r6a.4xlarge
  ray.worker.ray-dev-r6a.8xlarge:
    max_workers: 5
    min_workers: 0
    node_config:
      IamInstanceProfile: {Arn: 'arn:aws:iam::997204316583:instance-profile/ray-autoscaler-v1'}
      ImageId: ami-0387d929287ab193e
      InstanceMarketOptions: {MarketType: spot}
      InstanceType: r6a.8xlarge
  ray.worker.ray-dev-r6a.large:
    max_workers: 5
    min_workers: 0
    node_config:
      IamInstanceProfile: {Arn: 'arn:aws:iam::997204316583:instance-profile/ray-autoscaler-v1'}
      ImageId: ami-0387d929287ab193e
      InstanceMarketOptions: {MarketType: spot}
      InstanceType: r6a.large
  ray.worker.ray-dev-r6a.xlarge:
    max_workers: 5
    min_workers: 0
    node_config:
      IamInstanceProfile: {Arn: 'arn:aws:iam::997204316583:instance-profile/ray-autoscaler-v1'}
      ImageId: ami-0387d929287ab193e
      InstanceMarketOptions: {MarketType: spot}
      InstanceType: r6a.xlarge
cluster_name: default
cluster_synced_files: []
docker:
  container_name: ray_container
  image: rayproject/ray:2.5.1-py39-cpu
  pull_before_run: true
  run_options: ['--ulimit nofile=65536:65536']
file_mounts: {}
file_mounts_sync_continuously: false
head_node_type: ray.head.ray-dev
head_setup_commands: []
head_start_ray_commands: [ray stop, ray start --head --port=6379 --object-manager-port=8076
    --autoscaling-config=~/ray_bootstrap_config.yaml --dashboard-host=0.0.0.0]
idle_timeout_minutes: 5
initialization_commands: []
max_workers: 5
provider: {availability_zone: 'us-west-2a,us-west-2b,us-west-2c,us-west-2d', cache_stopped_nodes: true,
  region: us-west-2, type: aws}
rsync_exclude: ['**/.git', '**/.git/**']
rsync_filter: [.gitignore]
setup_commands: []
upscaling_speed: 1.0
worker_setup_commands: []
worker_start_ray_commands: [ray stop, 'ray start --address=$RAY_HEAD_IP:6379 --object-manager-port=8076']
Just wondering if anyone has figured out what's happening here? It's still happening to us. As we constrain nodes, it just leaves pending pods littered throughout the different clusters.
As I said in my original post, I believe there is a race condition in the reading of pending tasks and `Usage`. The `Usage` figure reflects the running tasks (after a node starts), but the pending-task count is still the old value, so the cluster scales up extra nodes for `Usage` + `Pending` (the running tasks are counted twice: once in `Usage` and once in the pending count).
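To make the double count concrete, here is a toy model of the suspected race, using the numbers from the original report (a sketch of my understanding, not the actual autoscaler code):

```python
import math

def nodes_to_launch(pending, running, slots_per_node, nodes):
    """Toy model: extra nodes queued for a given snapshot of
    pending tasks and current usage."""
    free_slots = nodes * slots_per_node - running
    shortfall = max(0, pending - free_slots)
    return math.ceil(shortfall / slots_per_node)

# Consistent snapshot: 48 pending, none running yet, 12 workers
# with 4 slots each already requested -> nothing extra to launch.
print(nodes_to_launch(48, 0, 4, 12))   # 0

# Stale snapshot: 10 tasks have started (visible in Usage) but the
# pending count still says 48, so those 10 are counted twice.
print(nodes_to_launch(48, 10, 4, 12))  # 3 extra workers
```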
Do we know when this is likely to come in as a fix?
cc @rickyyx can you try to repro and see whether this is fixed in the v2 OSS autoscaler since the Ray 2.7 release?
2.7 should have patches that mitigate this - but this is essentially the same issue as https://github.com/ray-project/ray/issues/38189.
Current plan is to fix in 2.8.
https://github.com/ray-project/ray/pull/40254 should close this.
@vitsai > chasing the breadcrumbs > which is the most promising GH issue / PR that we think will resolve this issue?
The fix for autoscaler v2 is in 2.8; the linked PR is for autoscaler v1.
I will downgrade the priority, since the v1 fix is lower priority. Does that sound okay?
Reviewed with @rkooo567 @rynewang @vitsai > let's decide whether we should fix this in autoscaler v1.
This is fixed in v2, but we have an interim state in Ray 2.9 where the default may still be autoscaler v1, in which case this issue will still be present.
Next steps: let's decide whether we just skip this and push to autoscaler v2 in Ray 2.10, or fix this regression.
@vitsai @anyscalesam @rickyyx Could you please share more about how to use autoscaler v2 in Ray 2.9? I didn't find a related doc. Is it just a matter of enabling `RAY_enable_autoscaler_v2=1`?
> @vitsai @anyscalesam @rickyyx Could you please share more about how to use autoscaler v2 in Ray 2.9? I didn't find a related doc.
Hey @llidev, we are still working on the fix for autoscaler v1; @vitsai has a PR here: https://github.com/ray-project/ray/pull/40488. While we do so, we are also working on the v2 autoscaler. That work has been delayed due to other priorities, so it's not available in Ray 2.9 yet.
Thank you @rickyyx. Do you think autoscaler_v2 is something the end user can try right now, or is it still recommended to wait?
> Thank you @rickyyx. Do you think autoscaler_v2 is something the end user can try right now, or is it still recommended to wait?
It's still under active development, so it's not ready yet.
We will close this after autoscaler v2 is enabled.
Any progress on this one?
@DmitriGekhtman - not yet; v2 is still optional and not the default scaler for now.
Hmm, looks like we're in a state where autoscaler v1 functionality is gradually degrading but autoscaler v2 development is suspended (last feature commit was in March). (Totally understandable, I'm sure the maintainers have a lot on their hands.) Might proceed cautiously with adopting Ray autoscaling and try to collaborate on stability fixes where possible.
What happened + What you expected to happen
The autoscaler started extra workers when they were not needed.
From the following log, I believe there may be a race condition in how pending tasks and usage are read.
Our tasks use 1 CPU and 30 GB of memory each; a worker has 120 GB of memory.
At the beginning, the autoscaler correctly started 12 workers (48 tasks, 4 tasks per worker).
Then, the moment the workers started, there seems to be a race condition between usage and pending tasks: e.g. `Usage: 10.0/176.0 CPU` means 10 tasks are running, so there should be 38 (48 - 10) tasks pending.
However, the autoscaler thinks there are still 48 tasks pending and starts 3 extra workers for them (at 4 tasks per worker, the 10 double-counted tasks account for exactly 3 extra workers).
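For reference, the expected worker count follows directly from memory packing; a quick sanity check, assuming memory (not CPU) is the binding resource:

```python
tasks = 48           # total tasks submitted
task_mem_gb = 30     # memory requested per task
worker_mem_gb = 120  # memory per worker node

tasks_per_worker = worker_mem_gb // task_mem_gb  # 4
workers_needed = -(-tasks // tasks_per_worker)   # ceil(48 / 4) = 12
print(tasks_per_worker, workers_needed)          # 4 12
```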
Versions / Dependencies
ray 2.5.1
Reproduction script
normal setup
Issue Severity
Low: It annoys or frustrates me.