ray-project / ray

Ray is a unified framework for scaling AI and Python applications. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

[Autoscaler] starts more workers than necessary #36926

Open wjzhou-ep opened 1 year ago

wjzhou-ep commented 1 year ago

What happened + What you expected to happen

The autoscaler started extra workers even though they were not needed.

From the following log, I believe there may be a race condition when reading the usage and the pending-task count.

Our tasks each use 1 CPU and 30 GB of memory; each worker has 120 GB of memory.

At the beginning, the autoscaler correctly started 12 workers (48 tasks in total, 4 tasks per worker).

Then, the moment the workers started, there appears to be a race condition between the usage and the pending tasks. For example, Usage: 10.0/176.0 CPU means 10 tasks are running, so there should be 38 (48 - 10) tasks pending.

However, the autoscaler thinks there are still 48 tasks pending and starts 3 extra workers for them.

======== Autoscaler status: 2023-06-28 14:34:29.072047 ========
Node status
---------------------------------------------------------------
Healthy:
 11 large-group
 1 head-group
Pending:
 192.168.26.101: large-group, waiting
Recent failures:
 (no failures)

Resources
---------------------------------------------------------------
Usage:
 10.0/176.0 CPU
 0.0/1100.0 large-group
 279.40GiB/1.22TiB memory
 0.0/1100.0 no_gpu
 9.52KiB/181.94GiB object_store_memory

Demands:
 {'CPU': 1.0, 'memory': 30000000000.0}: 48+ pending tasks/actors
2023-06-28 14:34:29,073 INFO autoscaler.py:1374 -- StandardAutoscaler: Queue 3 new nodes for launch
2023-06-28 14:34:29,073 INFO node_launcher.py:166 -- BaseNodeLauncher: Got 3 nodes to launch.
2023-06-28 14:34:29,073 INFO node_launcher.py:166 -- BaseNodeLauncher: Launching 3 nodes, type large-group.
2023-06-28 14:34:29,074 INFO node_provider.py:287 -- Autoscaler is submitting the following patch to RayCluster ray in namespace pm-wuy-research.
2023-06-28 14:34:29,074 INFO node_provider.py:290 -- [{'op': 'replace', 'path': '/spec/workerGroupSpecs/0/replicas', 'value': 3}]
2023-06-28 14:34:29,119 INFO autoscaler.py:471 -- The autoscaler took 0.112 seconds to complete the update iteration.
2023-06-28 14:34:29,120 INFO monitor.py:425 -- :event_summary:Resized to 176 CPUs.
2023-06-28 14:34:29,120 INFO monitor.py:425 -- :event_summary:Adding 3 node(s) of type large-group.
2023-06-28 14:34:34,184 INFO node_provider.py:240 -- Listing pods for RayCluster ray in namespace pm-wuy-research at pods resource version >= 42434271.
2023-06-28 14:34:34,210 INFO node_provider.py:258 -- Fetched pod data at resource version 42434271.
2023-06-28 14:34:34,210 INFO autoscaler.py:148 -- The autoscaler took 0.065 seconds to fetch the list of non-terminated nodes.
2023-06-28 14:34:34,211 INFO autoscaler.py:427 -- 

Versions / Dependencies

ray 2.5.1

Reproduction script

normal setup
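
For concreteness, a minimal sketch of the workload shape described above (the function name, sleep duration, and use of exactly 48 tasks are assumptions based on the description, not an attached script):

import ray
import time

ray.init(address="auto")

# Each task asks for 1 CPU and 30 GB of memory, matching the demand line
# {'CPU': 1.0, 'memory': 30000000000.0} in the status dump above.
@ray.remote(num_cpus=1, memory=30 * 10**9)
def work(i):
    time.sleep(600)  # hold the resources long enough to watch the autoscaler
    return i

# 48 such tasks should fit on 12 of the 120 GB workers (4 tasks per worker),
# yet 3 extra workers were launched.
ray.get([work.remote(i) for i in range(48)])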

Issue Severity

Low: It annoys or frustrates me.

dcarrion87 commented 1 year ago

@wjzhou-ep @scv119

We seem to be seeing this behaviour too, perhaps in a slightly different form, but essentially the autoscaler is starting more nodes than necessary.

Initial request for 24x:

 {'CPU': 32.0, 'GPU': 1.0}: 24+ pending tasks/actors
2023-07-04 22:44:01,163 INFO autoscaler.py:1366 -- StandardAutoscaler: Queue 24 new nodes for launch
2023-07-04 22:44:01,164 INFO node_launcher.py:166 -- BaseNodeLauncher: Got 24 nodes to launch.
2023-07-04 22:44:01,164 INFO node_launcher.py:166 -- BaseNodeLauncher: Launching 24 nodes, type workergroup.
2023-07-04 22:44:01,164 INFO node_provider.py:286 -- Autoscaler is submitting the following patch to RayCluster coder-xavierholt-ray-1688535753 in namespace annalise-coder-prod.
2023-07-04 22:44:01,164 INFO node_provider.py:290 -- [{'op': 'replace', 'path': '/spec/workerGroupSpecs/0/replicas', 'value': 24}]
2023-07-04 22:44:01,193 INFO autoscaler.py:462 -- The autoscaler took 0.075 seconds to complete the update iteration.
2023-07-04 22:44:01,193 INFO monitor.py:428 -- :event_summary:Adding 24 node(s) of type workergroup.
2023-07-04 22:44:06,283 INFO node_provider.py:257 -- Fetched pod data at resource version 210040046.
2023-07-04 22:44:06,283 INFO autoscaler.py:143 -- The autoscaler took 0.056 seconds to fetch the list of non-terminated nodes.
2023-07-04 22:44:06,284 INFO autoscaler.py:419 -- 

A few loops later it starts queuing unnecessary nodes.

 {'CPU': 32.0, 'GPU': 1.0}: 24+ pending tasks/actors
2023-07-04 22:44:16,577 INFO autoscaler.py:1366 -- StandardAutoscaler: Queue 2 new nodes for launch
2023-07-04 22:44:16,577 INFO node_launcher.py:166 -- BaseNodeLauncher: Got 2 nodes to launch.
2023-07-04 22:44:16,577 INFO node_launcher.py:166 -- BaseNodeLauncher: Launching 2 nodes, type workergroup.
2023-07-04 22:44:16,577 INFO node_provider.py:286 -- Autoscaler is submitting the following patch to RayCluster coder-xavierholt-ray-1688535753 in namespace annalise-coder-prod.
2023-07-04 22:44:16,577 INFO node_provider.py:290 -- [{'op': 'replace', 'path': '/spec/workerGroupSpecs/0/replicas', 'value': 26}]
2023-07-04 22:44:16,608 INFO autoscaler.py:462 -- The autoscaler took 0.097 seconds to complete the update iteration.
2023-07-04 22:44:16,609 INFO monitor.py:428 -- :event_summary:Resized to 64 CPUs, 2 GPUs.
2023-07-04 22:44:16,609 INFO monitor.py:428 -- :event_summary:Adding 2 node(s) of type workergroup.
2023-07-04 22:44:21,724 INFO node_provider.py:257 -- Fetched pod data at resource version 210040402.
2023-07-04 22:44:21,724 INFO autoscaler.py:143 -- The autoscaler took 0.065 seconds to fetch the list of non-terminated nodes.
2023-07-04 22:44:21,725 INFO autoscaler.py:419 -- 

And then some more:

 {'CPU': 32.0, 'GPU': 1.0}: 16+ pending tasks/actors
2023-07-04 22:44:32,167 INFO autoscaler.py:1366 -- StandardAutoscaler: Queue 6 new nodes for launch
2023-07-04 22:44:32,167 INFO node_launcher.py:166 -- BaseNodeLauncher: Got 6 nodes to launch.
2023-07-04 22:44:32,167 INFO node_launcher.py:166 -- BaseNodeLauncher: Launching 6 nodes, type workergroup.
2023-07-04 22:44:32,168 INFO node_provider.py:286 -- Autoscaler is submitting the following patch to RayCluster coder-senorchang-ray-1688535753 in namespace anon-coder-prod.
2023-07-04 22:44:32,168 INFO node_provider.py:290 -- [{'op': 'replace', 'path': '/spec/workerGroupSpecs/0/replicas', 'value': 32}]
2023-07-04 22:44:32,203 INFO autoscaler.py:462 -- The autoscaler took 0.176 seconds to complete the update iteration.
2023-07-04 22:44:32,204 INFO monitor.py:428 -- :event_summary:Resized to 512 CPUs, 16 GPUs.
2023-07-04 22:44:32,204 INFO monitor.py:428 -- :event_summary:Adding 6 node(s) of type workergroup.
2023-07-04 22:44:37,303 INFO node_provider.py:257 -- Fetched pod data at resource version 210040797.
2023-07-04 22:44:37,304 INFO autoscaler.py:143 -- The autoscaler took 0.067 seconds to fetch the list of non-terminated nodes.
2023-07-04 22:44:37,305 INFO autoscaler.py:419 -- 

We don't understand why.

The issue for us is that these extra pods pend forever due to constraints around the pods, and they add to resource quota counts that aren't valid.

Attached logs: autoscaler.txt

Naton1 commented 1 year ago

Seeing the same as well.

No jobs running...

======== Autoscaler status: 2023-07-18 02:18:09.914151 ========
Node status
---------------------------------------------------------------
Healthy:
 4 worker_node
 1 head_node
Pending:
 (no pending nodes)
Recent failures:
 (no failures)

Resources
---------------------------------------------------------------
Usage:
 0.0/24.0 CPU
 0B/29.45GiB memory
 0B/12.47GiB object_store_memory

Demands:
 (no resource demands)
2023-07-18 02:18:09,916 INFO autoscaler.py:470 -- The autoscaler took 0.065 seconds to complete the update iteration.
2023-07-18 02:18:14,984 INFO autoscaler.py:147 -- The autoscaler took 0.053 seconds to fetch the list of non-terminated nodes.
2023-07-18 02:18:14,985 INFO autoscaler.py:427 -- 

Then 5 jobs requiring 4 CPUs each were started with 24 CPUs already available, and 4 more instances were launched unnecessarily:

======== Autoscaler status: 2023-07-18 02:18:14.985085 ========
Node status
---------------------------------------------------------------
Healthy:
 4 worker_node
 1 head_node
Pending:
 (no pending nodes)
Recent failures:
 (no failures)

Resources
---------------------------------------------------------------
Usage:
 20.0/24.0 CPU
 0B/29.45GiB memory
 0B/12.47GiB object_store_memory

Demands:
 {'CPU': 4.0}: 5+ pending tasks/actors
2023-07-18 02:18:14,987 INFO autoscaler.py:1374 -- StandardAutoscaler: Queue 4 new nodes for launch
2023-07-18 02:18:14,987 INFO autoscaler.py:470 -- The autoscaler took 0.056 seconds to complete the update iteration.
2023-07-18 02:18:14,988 INFO node_launcher.py:166 -- NodeLauncher0: Got 4 nodes to launch.
2023-07-18 02:18:16,976 INFO node_launcher.py:166 -- NodeLauncher0: Launching 4 nodes, type worker_node.
2023-07-18 02:18:20,149 INFO autoscaler.py:147 -- The autoscaler took 0.141 seconds to fetch the list of non-terminated nodes.
2023-07-18 02:18:20,150 INFO autoscaler.py:427 -- 
======== Autoscaler status: 2023-07-18 02:18:20.150147 ========
Node status
---------------------------------------------------------------
Healthy:
 4 worker_node
 1 head_node
Pending:
 172.31.3.147: worker_node, uninitialized
 172.31.14.20: worker_node, uninitialized
 172.31.5.8: worker_node, uninitialized
 172.31.13.158: worker_node, uninitialized
Recent failures:
 (no failures)

Resources
---------------------------------------------------------------
Usage:
 20.0/24.0 CPU
 0B/29.45GiB memory
 74.59MiB/12.47GiB object_store_memory

Demands:
 (no resource demands)
2023-07-18 02:18:20,152 INFO autoscaler.py:1322 -- Creating new (spawn_updater) updater thread for node i-0231fa18511c6f83f.
2023-07-18 02:18:20,152 INFO autoscaler.py:1322 -- Creating new (spawn_updater) updater thread for node i-0970cd9533d7fb8fe.
2023-07-18 02:18:20,153 INFO autoscaler.py:1322 -- Creating new (spawn_updater) updater thread for node i-0c5c4df3cac069b0e.
2023-07-18 02:18:20,154 INFO autoscaler.py:1322 -- Creating new (spawn_updater) updater thread for node i-0db0afd004f36a21a.
2023-07-18 02:18:20,155 INFO autoscaler.py:470 -- The autoscaler took 0.147 seconds to complete the update iteration.
2023-07-18 02:18:20,156 INFO monitor.py:423 -- :event_summary:Adding 4 node(s) of type worker_node.
2023-07-18 02:18:25,313 INFO autoscaler.py:147 -- The autoscaler took 0.136 seconds to fetch the list of non-terminated nodes.
2023-07-18 02:18:25,314 INFO autoscaler.py:427 -- 
meastham commented 1 year ago

I'm also seeing this in a cluster I started on AWS EC2 instances with ray up. There's an additional wrinkle in my case: the "extra" worker is also oversized for the work that's actually in the queue.

The config (I've put the whole thing at the end of this comment) includes multiple node types, two of which are ray.worker.ray-dev-r6a.2xlarge and ray.worker.ray-dev-r6a.4xlarge, configured in the autoscaler with 47.8 GiB and 89.6 GiB of memory respectively. I start with just the head node running, which has a worker that is too small for the workload I'm testing. Then I start a workload that requires 35 GiB of memory. At first a 2xlarge worker node starts as expected, but right as it finishes starting, the autoscaler also launches a 4xlarge node, which is both unnecessary and too large. The job ends up scheduling on the 2xlarge as expected, and the 4xlarge node just ends up shutting down after being idle.
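
The workload is roughly of this shape (a simplified sketch, not the exact script; the function name is illustrative, and the 35 GiB figure matches the 37580963840-byte demand in the log below):

import ray

ray.init(address="auto")

# A single task needing 4 CPUs and 35 GiB of memory (35 * 1024**3 = 37580963840
# bytes). It fits on the r6a.2xlarge worker (47.8 GiB), so the additional
# r6a.4xlarge should not be needed.
@ray.remote(num_cpus=4, memory=35 * 1024**3)
def big_job():
    ...

ray.get(big_job.remote())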

Autoscaler log:

======== Autoscaler status: 2023-07-20 07:29:54.064934 ========
Node status
---------------------------------------------------------------
Healthy:
 1 ray.head.ray-dev
Pending:
 10.128.3.31: ray.worker.ray-dev-r6a.2xlarge, setting-up
Recent failures:
 (no failures)

Resources
---------------------------------------------------------------
Usage:
 0.0/2.0 CPU
 0B/4.36GiB memory
 0B/2.18GiB object_store_memory

Demands:
 {'CPU': 4.0, 'memory': 37580963840.0}: 1+ pending tasks/actors
2023-07-20 07:29:54,066 INFO autoscaler.py:470 -- The autoscaler took 0.12 seconds to complete the update iteration.
2023-07-20 07:29:59,145 INFO autoscaler.py:147 -- The autoscaler took 0.051 seconds to fetch the list of non-terminated nodes.
2023-07-20 07:29:59,146 INFO autoscaler.py:427 --
======== Autoscaler status: 2023-07-20 07:29:59.146404 ========
Node status
---------------------------------------------------------------
Healthy:
 1 ray.head.ray-dev
 1 ray.worker.ray-dev-r6a.2xlarge
Pending:
 (no pending nodes)
Recent failures:
 (no failures)

Resources
---------------------------------------------------------------
Usage:
 4.0/10.0 CPU
 35.00GiB/49.16GiB memory
 0B/20.60GiB object_store_memory

Demands:
 {'CPU': 4.0, 'memory': 37580963840.0}: 1+ pending tasks/actors
2023-07-20 07:29:59,147 INFO autoscaler.py:1374 -- StandardAutoscaler: Queue 1 new nodes for launch
2023-07-20 07:29:59,147 INFO autoscaler.py:470 -- The autoscaler took 0.053 seconds to complete the update iteration.
2023-07-20 07:29:59,148 INFO node_launcher.py:166 -- NodeLauncher1: Got 1 nodes to launch.
2023-07-20 07:29:59,148 INFO monitor.py:423 -- :event_summary:Resized to 10 CPUs.
2023-07-20 07:30:00,403 INFO node_launcher.py:166 -- NodeLauncher1: Launching 1 nodes, type ray.worker.ray-dev-r6a.4xlarge.
2023-07-20 07:30:04,291 INFO autoscaler.py:147 -- The autoscaler took 0.117 seconds to fetch the list of non-terminated nodes.
2023-07-20 07:30:04,291 INFO autoscaler.py:427 --

Full cluster config:

auth: {ssh_user: ubuntu}
available_node_types:
  ray.head.ray-dev:
    node_config:
      BlockDeviceMappings:
      - DeviceName: /dev/sda1
        Ebs: {VolumeSize: 140, VolumeType: gp3}
      ImageId: ami-0387d929287ab193e
      InstanceType: m5.large
    resources: {}
  ray.worker.ray-dev-r6a.12xlarge:
    max_workers: 5
    min_workers: 0
    node_config:
      IamInstanceProfile: {Arn: 'arn:aws:iam::997204316583:instance-profile/ray-autoscaler-v1'}
      ImageId: ami-0387d929287ab193e
      InstanceMarketOptions: {MarketType: spot}
      InstanceType: r6a.12xlarge
  ray.worker.ray-dev-r6a.16xlarge:
    max_workers: 5
    min_workers: 0
    node_config:
      IamInstanceProfile: {Arn: 'arn:aws:iam::997204316583:instance-profile/ray-autoscaler-v1'}
      ImageId: ami-0387d929287ab193e
      InstanceMarketOptions: {MarketType: spot}
      InstanceType: r6a.16xlarge
  ray.worker.ray-dev-r6a.24xlarge:
    max_workers: 5
    min_workers: 0
    node_config:
      IamInstanceProfile: {Arn: 'arn:aws:iam::997204316583:instance-profile/ray-autoscaler-v1'}
      ImageId: ami-0387d929287ab193e
      InstanceMarketOptions: {MarketType: spot}
      InstanceType: r6a.24xlarge
  ray.worker.ray-dev-r6a.2xlarge:
    max_workers: 5
    min_workers: 0
    node_config:
      IamInstanceProfile: {Arn: 'arn:aws:iam::997204316583:instance-profile/ray-autoscaler-v1'}
      ImageId: ami-0387d929287ab193e
      InstanceMarketOptions: {MarketType: spot}
      InstanceType: r6a.2xlarge
  ray.worker.ray-dev-r6a.32xlarge:
    max_workers: 5
    min_workers: 0
    node_config:
      IamInstanceProfile: {Arn: 'arn:aws:iam::997204316583:instance-profile/ray-autoscaler-v1'}
      ImageId: ami-0387d929287ab193e
      InstanceMarketOptions: {MarketType: spot}
      InstanceType: r6a.32xlarge
  ray.worker.ray-dev-r6a.4xlarge:
    max_workers: 5
    min_workers: 0
    node_config:
      IamInstanceProfile: {Arn: 'arn:aws:iam::997204316583:instance-profile/ray-autoscaler-v1'}
      ImageId: ami-0387d929287ab193e
      InstanceMarketOptions: {MarketType: spot}
      InstanceType: r6a.4xlarge
  ray.worker.ray-dev-r6a.8xlarge:
    max_workers: 5
    min_workers: 0
    node_config:
      IamInstanceProfile: {Arn: 'arn:aws:iam::997204316583:instance-profile/ray-autoscaler-v1'}
      ImageId: ami-0387d929287ab193e
      InstanceMarketOptions: {MarketType: spot}
      InstanceType: r6a.8xlarge
  ray.worker.ray-dev-r6a.large:
    max_workers: 5
    min_workers: 0
    node_config:
      IamInstanceProfile: {Arn: 'arn:aws:iam::997204316583:instance-profile/ray-autoscaler-v1'}
      ImageId: ami-0387d929287ab193e
      InstanceMarketOptions: {MarketType: spot}
      InstanceType: r6a.large
  ray.worker.ray-dev-r6a.xlarge:
    max_workers: 5
    min_workers: 0
    node_config:
      IamInstanceProfile: {Arn: 'arn:aws:iam::997204316583:instance-profile/ray-autoscaler-v1'}
      ImageId: ami-0387d929287ab193e
      InstanceMarketOptions: {MarketType: spot}
      InstanceType: r6a.xlarge
cluster_name: default
cluster_synced_files: []
docker:
  container_name: ray_container
  image: rayproject/ray:2.5.1-py39-cpu
  pull_before_run: true
  run_options: ['--ulimit nofile=65536:65536']
file_mounts: {}
file_mounts_sync_continuously: false
head_node_type: ray.head.ray-dev
head_setup_commands: []
head_start_ray_commands: [ray stop, ray start --head --port=6379 --object-manager-port=8076
    --autoscaling-config=~/ray_bootstrap_config.yaml --dashboard-host=0.0.0.0]
idle_timeout_minutes: 5
initialization_commands: []
max_workers: 5
provider: {availability_zone: 'us-west-2a,us-west-2b,us-west-2c,us-west-2d', cache_stopped_nodes: true,
  region: us-west-2, type: aws}
rsync_exclude: ['**/.git', '**/.git/**']
rsync_filter: [.gitignore]
setup_commands: []
upscaling_speed: 1.0
worker_setup_commands: []
worker_start_ray_commands: [ray stop, 'ray start --address=$RAY_HEAD_IP:6379 --object-manager-port=8076']
dcarrion87 commented 1 year ago

Just wondering if anyone has figured out what's happening here? It's still happening to us. Because we constrain nodes, it just leaves pending pods littered throughout the different clusters.

wjzhou-ep commented 1 year ago

As I said in my original post, I believe there is a race condition between the reading of the pending tasks and the usage:

The usage reflects the tasks that are already running (after a node starts), but the pending-task count is still the old value, so the cluster scales up extra nodes for usage + pending (the running tasks are counted twice: once in the usage and once in the pending count).
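
Concretely, with the numbers from the status dump above (a rough reconstruction of the arithmetic, assuming 4 tasks fit per 120 GB worker; this is not the actual autoscaler code):

# Suspected double counting, using the numbers from the original report.
tasks_total = 48              # Demands: 48+ pending tasks/actors
tasks_running = 10            # Usage: 10.0/176.0 CPU
tasks_per_worker = 120 // 30  # 120 GB worker / 30 GB per task = 4

def workers_for(n):
    return -(-n // tasks_per_worker)  # ceiling division

print(workers_for(tasks_total))                  # 12 -> the correct target
print(workers_for(tasks_total + tasks_running))  # 15 -> 3 extra workers queued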

dcarrion87 commented 1 year ago

Do we know when this is likely to come in as a fix?

anyscalesam commented 1 year ago

cc @rickyyx can you try to repro this and see whether it is fixed in the v2 OSS autoscaler since the Ray 2.7 release?

rickyyx commented 1 year ago

2.7 should have patches that mitigate this, but this is essentially the same issue as https://github.com/ray-project/ray/issues/38189.

Current plan is to fix in 2.8.

rickyyx commented 11 months ago

https://github.com/ray-project/ray/pull/40254 should close this.

anyscalesam commented 10 months ago

@vitsai, chasing the breadcrumbs here: which is the most promising GH issue / PR that we think will resolve this issue?

vitsai commented 10 months ago

https://github.com/ray-project/ray/pull/40488

vitsai commented 10 months ago

The fix for autoscaler v2 is in 2.8, the linked PR is for autoscaler v1.

rkooo567 commented 10 months ago

I will downgrade the priority, as the v1 fix is less prioritized. Does that sound okay?

anyscalesam commented 10 months ago

Reviewed with @rkooo567 @rynewang @vitsai; let's decide whether we should fix this in autoscaler v1.

This is fixed in v2, but we have an interim state in Ray 2.9 where the default may still be autoscaler v1, in which case this issue will still be there.

Next step: let's decide whether we just skip this and push to autoscaler v2 in Ray 2.10, or fix this regression.

llidev commented 8 months ago

@vitsai @anyscalesam @rickyyx Could you please share more about how to use autoscaler v2 in Ray 2.9? I couldn't find any related docs.

llidev commented 8 months ago

Is it just enabling RAY_enable_autoscaler_v2=1?

rickyyx commented 8 months ago

> @vitsai @anyscalesam @rickyyx Could you please share more about how to use autoscaler v2 in Ray 2.9? I couldn't find any related docs.

Hey @llidev, we are still working on the fix for autoscaler v1; @vitsai has a PR here: https://github.com/ray-project/ray/pull/40488. While we do that, we are also working on the v2 autoscaler. It has been delayed due to other priorities, so it is not available in Ray 2.9 yet.

llidev commented 8 months ago

Thank you @rickyyx. Do you think autoscaler v2 is something end users can try right now, or is it still recommended to wait?

rickyyx commented 8 months ago

> Thank you @rickyyx. Do you think autoscaler v2 is something end users can try right now, or is it still recommended to wait?

It's still under active development, so it's not ready yet.

rkooo567 commented 8 months ago

We will close this after autoscaler v2 is enabled.

DmitriGekhtman commented 3 weeks ago

Any progress on this one?

anyscalesam commented 3 weeks ago

@DmitriGekhtman - not yet; v2 is still optional and not the default scaler for now.

DmitriGekhtman commented 3 weeks ago

Hmm, looks like we're in a state where autoscaler v1 functionality is gradually degrading but autoscaler v2 development is suspended (last feature commit was in March). (Totally understandable, I'm sure the maintainers have a lot on their hands.) Might proceed cautiously with adopting Ray autoscaling and try to collaborate on stability fixes where possible.