Closed amholler closed 2 years ago
Asking Anne to help repro it on a single-Tune-session setup.
My plan is to run 2 Ray Tune sessions serially, rather than in parallel, and check that idle workers are reclaimed.
And note that if the problem doesn't reproduce in the serial scenario, we could switch the blog to using serial Ray Tune sessions while we figure out what the problem is with scale-down under parallel Ray Tune sessions.
I can reproduce the problem with a single Ray Tune job running by itself on an empty Ray Cluster.
Idle workers are not spun down until the active Ray Tune job completes, even though they were idle
for more than 30 minutes. Details below.
Setup: Ray 1.9.1 cluster with a CPU head node and 0-9 GPU workers, 0 workers running initially, idleTimeoutMinutes: 3
Experiment: Start AutoML job on ludwig tf-legacy using this script:
https://github.com/ludwig-ai/experiments/blob/main/automl/validation/forest_cover/run_auto_train_1hr_nodeless.py
which specifies max_concurrent 3. AutoML runs 10 trials.
Selected timeline highlighting the problem:
07:46 forest_cover dataset fetch starts
07:47 forest_cover Ray Tune train starts, with async hyperband to terminate unpromising trials and 1 hr time budget.
07:53 Ray Autoscaler Ray start commands succeeded for example-cluster-ray-worker-w8jw7 [10.0.1.221]
07:59 Ray Autoscaler Ray start commands succeeded for example-cluster-ray-worker-l7rzf [10.0.6.169]
08:04 Ray Autoscaler Ray start commands succeeded for example-cluster-ray-worker-vpckp [10.0.3.241]
08:05 3 trials running, 1 terminated
Busy nodes: 10.0.1.221, 10.0.6.169, 10.0.3.241
Idle nodes: none
08:13 2 trials running, 8 terminated; 10/10 -> no more trials will be started
Busy nodes: 10.0.1.221, 10.0.6.169
Idle nodes: 10.0.3.241
08:14 1 trial running, 9 terminated; 10/10 -> no more trials will be started
Busy nodes: 10.0.1.221
Idle nodes: 10.0.3.241, 10.0.6.169
08:47 forest_cover Ray Tune Job ended
08:53 Ray Autoscaler logged worker termination for:
10.0.6.169 example-cluster-ray-worker-l7rzf
10.0.3.241 example-cluster-ray-worker-vpckp
10.0.1.221 example-cluster-ray-worker-w8jw7
So 10.0.3.241 and 10.0.6.169 were idle from 8:14 until the end of the run at 8:47, but weren't spun down until 8:53,
i.e., after the active Ray Tune job ended.
BTW: Note that this means running the Ray Tune jobs serially will not work around this problem, and fixing it is
critical to getting better idle-node reclamation from autoscaling.
Thanks Anne for the experiment.
Could you maybe print placement_group_table in the gap period (i.e., while 2/1 trials are running and 8/9 are terminated)? That would give us an idea of the placement groups requested.
Also cc @krfricke
Here is an example ray status output from right after 8:14. Interestingly, it turns out an additional worker got added to the Ray cluster that was never used by the Ray Tune job.
======== Autoscaler status: 2022-01-26 08:14:08.351179 ========
Node status
---------------------------------------------------------------
Healthy:
1 head-node
4 worker-node
Pending:
(no pending nodes)
Recent failures:
(no failures)
Resources
---------------------------------------------------------------
Usage:
4.0/35.0 CPU (4.0 used of 4.0 reserved in placement groups)
1.0/4.0 GPU (1.0 used of 1.0 reserved in placement groups)
0.0/4.0 accelerator_type:T4
0.0/4.0 example-resource-a
0.0/4.0 example-resource-b
0.00/113.400 GiB memory
0.13/48.350 GiB object_store_memory
Right after that, in the autoscaler logs we see:
example-cluster,default:2022-01-26 08:14:08,434 DEBUG gcs_utils.py:243 -- internal_kv_put b'__autoscaling_status_legacy' b"Cluster status: 4 nodes\n - MostDelayedHeartbeats: {'10.0.1.221': 0.3186984062194824, '10.0.3.143': 0.3186213970184326, '10.0.3.241': 0.3185453414916992, '10.0.2.231': 0.31847691535949707, '10.0.6.169': 0.3184034824371338}\n - NodeIdleSeconds: Min=0 Mean=10 Max=51\n - ResourceUsage: 4.0/35.0 CPU, 4.0/4.0 CPU_group_0_c9c5bfa85076b90014e73f72213fa4c4, 4.0/4.0 CPU_group_c9c5bfa85076b90014e73f72213fa4c4, 1.0/4.0 GPU, 1.0/1.0 GPU_group_0_c9c5bfa85076b90014e73f72213fa4c4, 1.0/1.0 GPU_group_c9c5bfa85076b90014e73f72213fa4c4, 0.0/4.0 accelerator_type:T4, 0.0/1000.0 bundle_group_0_c9c5bfa85076b90014e73f72213fa4c4, 0.0/1000.0 bundle_group_c9c5bfa85076b90014e73f72213fa4c4, 0.0/4.0 example-resource-a, 0.0/4.0 example-resource-b, 0.0 GiB/113.4 GiB memory, 0.13 GiB/48.35 GiB object_store_memory\n - TimeSinceLastHeartbeat: Min=0 Mean=0 Max=0\nWorker node types:\n - worker-node: 4" True None
example-cluster,default:2022-01-26 08:14:08,435 DEBUG legacy_info_string.py:24 -- Cluster status: 4 nodes
- MostDelayedHeartbeats: {'10.0.1.221': 0.3186984062194824, '10.0.3.143': 0.3186213970184326, '10.0.3.241': 0.3185453414916992, '10.0.2.231': 0.31847691535949707, '10.0.6.169': 0.3184034824371338}
- NodeIdleSeconds: Min=0 Mean=10 Max=51
- ResourceUsage: 4.0/35.0 CPU, 4.0/4.0 CPU_group_0_c9c5bfa85076b90014e73f72213fa4c4, 4.0/4.0 CPU_group_c9c5bfa85076b90014e73f72213fa4c4, 1.0/4.0 GPU, 1.0/1.0 GPU_group_0_c9c5bfa85076b90014e73f72213fa4c4, 1.0/1.0 GPU_group_c9c5bfa85076b90014e73f72213fa4c4, 0.0/4.0 accelerator_type:T4, 0.0/1000.0 bundle_group_0_c9c5bfa85076b90014e73f72213fa4c4, 0.0/1000.0 bundle_group_c9c5bfa85076b90014e73f72213fa4c4, 0.0/4.0 example-resource-a, 0.0/4.0 example-resource-b, 0.0 GiB/113.4 GiB memory, 0.13 GiB/48.35 GiB object_store_memory
- TimeSinceLastHeartbeat: Min=0 Mean=0 Max=0
Hey, I see something that could be interesting: an unexpected (to me) reset of NodeIdleSeconds. Could this be associated with the problem?
At 2022-01-26 08:13:23,598 we see this, which seems reasonable:
- NodeIdleSeconds: Min=0 Mean=1 Max=6
The max idle time goes up every 6 seconds as you would expect:
- NodeIdleSeconds: Min=0 Mean=2 Max=12
- NodeIdleSeconds: Min=0 Mean=4 Max=19
- NodeIdleSeconds: Min=0 Mean=5 Max=25
- NodeIdleSeconds: Min=0 Mean=6 Max=31
- NodeIdleSeconds: Min=0 Mean=7 Max=38
- NodeIdleSeconds: Min=0 Mean=9 Max=44
- NodeIdleSeconds: Min=0 Mean=10 Max=51
- NodeIdleSeconds: Min=0 Mean=11 Max=57
- NodeIdleSeconds: Min=0 Mean=13 Max=64
- NodeIdleSeconds: Min=0 Mean=14 Max=70
- NodeIdleSeconds: Min=0 Mean=15 Max=76
- NodeIdleSeconds: Min=0 Mean=16 Max=83
- NodeIdleSeconds: Min=0 Mean=18 Max=89
- NodeIdleSeconds: Min=0 Mean=19 Max=95
- NodeIdleSeconds: Min=0 Mean=20 Max=101
- NodeIdleSeconds: Min=0 Mean=21 Max=108
- NodeIdleSeconds: Min=0 Mean=23 Max=114
- NodeIdleSeconds: Min=0 Mean=24 Max=121
- NodeIdleSeconds: Min=0 Mean=25 Max=127
- NodeIdleSeconds: Min=0 Mean=27 Max=133
- NodeIdleSeconds: Min=0 Mean=28 Max=140
- NodeIdleSeconds: Min=0 Mean=29 Max=146
- NodeIdleSeconds: Min=0 Mean=30 Max=152
- NodeIdleSeconds: Min=0 Mean=32 Max=159
- NodeIdleSeconds: Min=0 Mean=33 Max=165
- NodeIdleSeconds: Min=0 Mean=34 Max=171
- NodeIdleSeconds: Min=0 Mean=35 Max=178
- NodeIdleSeconds: Min=0 Mean=37 Max=184
Then the extra worker that was brought up but never used by Ray Tune is removed from the cluster.
And then we see this at 2022-01-26 08:16:27,886:
- NodeIdleSeconds: Min=0 Mean=0 Max=0
and these values stay stuck at all zeros for the rest of the Ray Tune run. It is as if removing
the unused node wiped clean the idle history of all the workers that participated in the
Ray Tune job.
At 2022-01-26 08:50:23,208, after the job ends, the values climb up again and the
associated nodes are removed.
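For reference, the NodeIdleSeconds line summarizes per-node idle durations derived from "last used" timestamps the autoscaler tracks per IP. A minimal stdlib sketch of that summary (my own illustration of the reported Min/Mean/Max line, not the actual Ray LoadMetrics code):

```python
def idle_summary(last_used_time_by_ip, now):
    """Return (min, mean, max) idle seconds across all tracked nodes,
    where a node's idle time is now minus its last-used timestamp."""
    idle = [now - t for t in last_used_time_by_ip.values()]
    return (int(min(idle)), int(sum(idle) / len(idle)), int(max(idle)))

# Example: two nodes busy "now", one worker idle for 184 s,
# matching the last healthy line above (Min=0 Mean=61 Max=184 for 3 nodes).
last_used = {"10.0.3.143": 1000.0, "10.0.1.221": 1000.0, "10.0.2.231": 816.0}
print(idle_summary(last_used, now=1000.0))  # -> (0, 61, 184)
```

So a sudden Min=0 Mean=0 Max=0 means every node's last-used timestamp was just refreshed, which is what makes the reset below look suspicious.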
Could you maybe print placement_group_table in the gap period (i.e., while 2/1 trials are running and 8/9 are terminated)? That would give us an idea of the placement groups requested.
Here is an example of what I think you want, from the period when only 1 trial is running:
example-cluster,default:2022-01-26 08:15:56,569 DEBUG resource_demand_scheduler.py:201 -- Placement group demands: []
example-cluster,default:2022-01-26 08:15:56,569 DEBUG resource_demand_scheduler.py:247 -- Resource demands: [{'bundle_group_c6b7c78b2813e3467c825ee335e91f59': 0.001}, {'bundle_group_bf6d839e903a2bcef6e25bd54b65fc50': 0.001}, {'bundle_group_3116b209b405aecc460bd44c166069fc': 0.001}, {'bundle_group_d67d3fcd68c89cea58085c5b1b191070': 0.001}, {'bundle_group_f2fd2a403e6cf780454857a52f5f504e': 0.001}, {'bundle_group_7d793a59e156378b15521f0b7c7b04bb': 0.001}]
example-cluster,default:2022-01-26 08:15:56,569 DEBUG resource_demand_scheduler.py:248 -- Unfulfilled demands: [{'bundle_group_f2fd2a403e6cf780454857a52f5f504e': 0.001}, {'bundle_group_d67d3fcd68c89cea58085c5b1b191070': 0.001}, {'bundle_group_c6b7c78b2813e3467c825ee335e91f59': 0.001}, {'bundle_group_bf6d839e903a2bcef6e25bd54b65fc50': 0.001}, {'bundle_group_7d793a59e156378b15521f0b7c7b04bb': 0.001}, {'bundle_group_3116b209b405aecc460bd44c166069fc': 0.001}]
example-cluster,default:2022-01-26 08:15:56,569 WARNING resource_demand_scheduler.py:746 -- The autoscaler could not find a node type to satisfy the request: [{'bundle_group_f2fd2a403e6cf780454857a52f5f504e': 0.001}, {'bundle_group_d67d3fcd68c89cea58085c5b1b191070': 0.001}, {'bundle_group_c6b7c78b2813e3467c825ee335e91f59': 0.001}, {'bundle_group_bf6d839e903a2bcef6e25bd54b65fc50': 0.001}, {'bundle_group_7d793a59e156378b15521f0b7c7b04bb': 0.001}, {'bundle_group_3116b209b405aecc460bd44c166069fc': 0.001}]. If this request is related to placement groups the resource request will resolve itself, otherwise please specify a node type with the necessary resource https://docs.ray.io/en/master/cluster/autoscaling.html#multiple-node-type-autoscaling.
example-cluster,default:2022-01-26 08:15:56,570 DEBUG resource_demand_scheduler.py:252 -- Final unfulfilled: [{'bundle_group_f2fd2a403e6cf780454857a52f5f504e': 0.001}, {'bundle_group_d67d3fcd68c89cea58085c5b1b191070': 0.001}, {'bundle_group_c6b7c78b2813e3467c825ee335e91f59': 0.001}, {'bundle_group_bf6d839e903a2bcef6e25bd54b65fc50': 0.001}, {'bundle_group_7d793a59e156378b15521f0b7c7b04bb': 0.001}, {'bundle_group_3116b209b405aecc460bd44c166069fc': 0.001}]
example-cluster,default:2022-01-26 08:15:56,687 DEBUG resource_demand_scheduler.py:271 -- Node requests: {}
example-cluster,default:2022-01-26 08:15:56,802 DEBUG gcs_utils.py:243 -- internal_kv_put b'__autoscaling_status' b'{"load_metrics_report": {"usage": {"node:10.0.3.143": [0.0, 1.0], "memory": [0.0, 121762322840.0], "CPU": [4.0, 35.0], "object_store_memory": [144143388.0, 51915278743.0], "bundle_group_0_c9c5bfa85076b90014e73f72213fa4c4": [0.0, 1000.0], "example-resource-b": [0.0, 4.0], "CPU_group_c9c5bfa85076b90014e73f72213fa4c4": [4.0, 4.0], "bundle_group_c9c5bfa85076b90014e73f72213fa4c4": [0.0, 1000.0], "GPU_group_c9c5bfa85076b90014e73f72213fa4c4": [1.0, 1.0], "GPU_group_0_c9c5bfa85076b90014e73f72213fa4c4": [1.0, 1.0], "accelerator_type:T4": [0.0, 4.0], "node:10.0.1.221": [0.0, 1.0], "GPU": [1.0, 4.0], "example-resource-a": [0.0, 4.0], "CPU_group_0_c9c5bfa85076b90014e73f72213fa4c4": [4.0, 4.0], "node:10.0.6.169": [0.0, 1.0], "node:10.0.3.241": [0.0, 1.0], "node:10.0.2.231": [0.0, 1.0]}, "resource_demand": [[{"bundle_group_c6b7c78b2813e3467c825ee335e91f59": 0.001}, 1], [{"bundle_group_bf6d839e903a2bcef6e25bd54b65fc50": 0.001}, 1], [{"bundle_group_3116b209b405aecc460bd44c166069fc": 0.001}, 1], [{"bundle_group_d67d3fcd68c89cea58085c5b1b191070": 0.001}, 1], [{"bundle_group_f2fd2a403e6cf780454857a52f5f504e": 0.001}, 1], [{"bundle_group_7d793a59e156378b15521f0b7c7b04bb": 0.001}, 1]], "pg_demand": [], "request_demand": [], "node_types": [[{"object_store_memory": 16082992742.0, "CPU": 7.0, "memory": 37580963840.0, "node:10.0.3.143": 1.0}, 1], [{"bundle_group_c9c5bfa85076b90014e73f72213fa4c4": 1000.0, "node:10.0.1.221": 1.0, "CPU_group_0_c9c5bfa85076b90014e73f72213fa4c4": 4.0, "CPU": 7.0, "GPU_group_0_c9c5bfa85076b90014e73f72213fa4c4": 1.0, "CPU_group_c9c5bfa85076b90014e73f72213fa4c4": 4.0, "memory": 21045339750.0, "accelerator_type:T4": 1.0, "object_store_memory": 8957606707.0, "example-resource-a": 1.0, "GPU_group_c9c5bfa85076b90014e73f72213fa4c4": 1.0, "bundle_group_0_c9c5bfa85076b90014e73f72213fa4c4": 1000.0, "GPU": 1.0, "example-resource-b": 1.0}, 1], 
[{"object_store_memory": 8958260428.0, "CPU": 7.0, "memory": 21045339750.0, "node:10.0.6.169": 1.0, "accelerator_type:T4": 1.0, "example-resource-a": 1.0, "GPU": 1.0, "example-resource-b": 1.0}, 1], [{"object_store_memory": 8958321868.0, "CPU": 7.0, "memory": 21045339750.0, "example-resource-a": 1.0, "node:10.0.3.241": 1.0, "accelerator_type:T4": 1.0, "GPU": 1.0, "example-resource-b": 1.0}, 1], [{"CPU": 7.0, "object_store_memory": 8958096998.0, "memory": 21045339750.0, "node:10.0.2.231": 1.0, "example-resource-a": 1.0, "accelerator_type:T4": 1.0, "GPU": 1.0, "example-resource-b": 1.0}, 1]], "head_ip": null}, "time": 1643213755.5013282, "monitor_pid": 7054, "autoscaler_report": {"active_nodes": {"head-node": 1, "worker-node": 4}, "pending_nodes": [], "pending_launches": {}, "failed_nodes": []}}' True None
@wuisawesome Can you take a look as Dataplane oncall? I suspect this is not a Tune issue (asking Anne to print out placement_group_table to verify) but an autoscaler issue (see Anne's comments above). Can you advise on how to proceed?
@amholler Just saw your last comment. I am not very familiar with this output. Pinging @rkooo567 to see if there is anything suspicious (especially the resource demand section; I see a lot of 0.001 placeholders there. @rkooo567, is that concerning?). JSON-beautified result:
{
"load_metrics_report": {
"usage": {
"node:10.0.3.143": [
0,
1
],
"memory": [
0,
121762322840
],
"CPU": [
4,
35
],
"object_store_memory": [
144143388,
51915278743
],
"bundle_group_0_c9c5bfa85076b90014e73f72213fa4c4": [
0,
1000
],
"example-resource-b": [
0,
4
],
"CPU_group_c9c5bfa85076b90014e73f72213fa4c4": [
4,
4
],
"bundle_group_c9c5bfa85076b90014e73f72213fa4c4": [
0,
1000
],
"GPU_group_c9c5bfa85076b90014e73f72213fa4c4": [
1,
1
],
"GPU_group_0_c9c5bfa85076b90014e73f72213fa4c4": [
1,
1
],
"accelerator_type:T4": [
0,
4
],
"node:10.0.1.221": [
0,
1
],
"GPU": [
1,
4
],
"example-resource-a": [
0,
4
],
"CPU_group_0_c9c5bfa85076b90014e73f72213fa4c4": [
4,
4
],
"node:10.0.6.169": [
0,
1
],
"node:10.0.3.241": [
0,
1
],
"node:10.0.2.231": [
0,
1
]
},
"resource_demand": [
[
{
"bundle_group_c6b7c78b2813e3467c825ee335e91f59": 0.001
},
1
],
[
{
"bundle_group_bf6d839e903a2bcef6e25bd54b65fc50": 0.001
},
1
],
[
{
"bundle_group_3116b209b405aecc460bd44c166069fc": 0.001
},
1
],
[
{
"bundle_group_d67d3fcd68c89cea58085c5b1b191070": 0.001
},
1
],
[
{
"bundle_group_f2fd2a403e6cf780454857a52f5f504e": 0.001
},
1
],
[
{
"bundle_group_7d793a59e156378b15521f0b7c7b04bb": 0.001
},
1
]
],
"pg_demand": [],
"request_demand": [],
"node_types": [
[
{
"object_store_memory": 16082992742,
"CPU": 7,
"memory": 37580963840,
"node:10.0.3.143": 1
},
1
],
[
{
"bundle_group_c9c5bfa85076b90014e73f72213fa4c4": 1000,
"node:10.0.1.221": 1,
"CPU_group_0_c9c5bfa85076b90014e73f72213fa4c4": 4,
"CPU": 7,
"GPU_group_0_c9c5bfa85076b90014e73f72213fa4c4": 1,
"CPU_group_c9c5bfa85076b90014e73f72213fa4c4": 4,
"memory": 21045339750,
"accelerator_type:T4": 1,
"object_store_memory": 8957606707,
"example-resource-a": 1,
"GPU_group_c9c5bfa85076b90014e73f72213fa4c4": 1,
"bundle_group_0_c9c5bfa85076b90014e73f72213fa4c4": 1000,
"GPU": 1,
"example-resource-b": 1
},
1
],
[
{
"object_store_memory": 8958260428,
"CPU": 7,
"memory": 21045339750,
"node:10.0.6.169": 1,
"accelerator_type:T4": 1,
"example-resource-a": 1,
"GPU": 1,
"example-resource-b": 1
},
1
],
[
{
"object_store_memory": 8958321868,
"CPU": 7,
"memory": 21045339750,
"example-resource-a": 1,
"node:10.0.3.241": 1,
"accelerator_type:T4": 1,
"GPU": 1,
"example-resource-b": 1
},
1
],
[
{
"CPU": 7,
"object_store_memory": 8958096998,
"memory": 21045339750,
"node:10.0.2.231": 1,
"example-resource-a": 1,
"accelerator_type:T4": 1,
"GPU": 1,
"example-resource-b": 1
},
1
]
],
"head_ip": null
},
"time": 1643213755.5013282,
"monitor_pid": 7054,
"autoscaler_report": {
"active_nodes": {
"head-node": 1,
"worker-node": 4
},
"pending_nodes": [],
"pending_launches": {},
"failed_nodes": []
}
}
@amholler I was wondering if you could use ray.util.placement_group_table()
to print out the placement groups at the moment this is happening. Sample output looks like:
{'23e6d966b7b3748eacf8d0dc19fd54a5': {'bundles': {0: {'CPU': 1.0},
1: {'CPU': 4.0}},
'name': '__tune_83e72cbd__a565778b',
'placement_group_id': '23e6d966b7b3748eacf8d0dc19fd54a5',
'state': 'REMOVED',
'stats': {'end_to_end_creation_latency_ms': 0.0,
'highest_retry_delay_ms': 1000.0,
'scheduling_attempt': 236,
'scheduling_latency_ms': 0.0,
'scheduling_state': 'REMOVED'},
'strategy': 'PACK'},
'23f1bf68e9a140e0e8326ba9bf60c573': {'bundles': {0: {'CPU': 1.0},
1: {'CPU': 4.0}},
'name': '__tune_3f706cb3__e2694e22',
'placement_group_id': '23f1bf68e9a140e0e8326ba9bf60c573',
'state': 'PENDING',
'stats': {'end_to_end_creation_latency_ms': 0.0,
'highest_retry_delay_ms': 1000.0,
'scheduling_attempt': 365,
'scheduling_latency_ms': 0.0,
'scheduling_state': 'NO_RESOURCES'},
'strategy': 'PACK'}}
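Assuming the dict shape shown in the sample above, the interesting entries for the gap period can be picked out with plain Python, e.g. listing groups that are still PENDING (the keys and values here are copied from the sample output, trimmed to the fields used):

```python
# A placement_group_table()-style dict, trimmed to the fields we filter on.
pg_table = {
    "23e6d966b7b3748eacf8d0dc19fd54a5": {"name": "__tune_83e72cbd__a565778b",
                                         "state": "REMOVED"},
    "23f1bf68e9a140e0e8326ba9bf60c573": {"name": "__tune_3f706cb3__e2694e22",
                                         "state": "PENDING"},
}

# Keep only groups that have not been scheduled yet.
pending = {pg_id: pg["name"] for pg_id, pg in pg_table.items()
           if pg["state"] == "PENDING"}
print(pending)  # -> {'23f1bf68e9a140e0e8326ba9bf60c573': '__tune_3f706cb3__e2694e22'}
```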
A key clue at this point seems to be the odd reset-and-stuck-at-zero behavior of NodeIdleSeconds that I described earlier in this ticket, so I would prefer not to do additional runs until we know whether they would be worthwhile given this information.
Thoughts @ericl ?
@DmitriGekhtman can you take a look here? It seems to be an autoscaling/scheduling bug.
I think the key part of the autoscaler log is below, where NodeIdleSeconds goes to 0 after the idle node is removed:
example-cluster,default:2022-01-26 08:16:21,639 DEBUG legacy_info_string.py:24 -- Cluster status: 4 nodes
- MostDelayedHeartbeats: {'10.0.1.221': 0.3429567813873291, '10.0.3.143': 0.3428807258605957, '10.0.3.241': 0.3428037166595459, '10.0.2.231': 0.34258198738098145, '10.0.6.169': 0.34249281883239746}
- NodeIdleSeconds: Min=0 Mean=37 Max=184
- ResourceUsage: 4.0/35.0 CPU, 4.0/4.0 CPU_group_0_c9c5bfa85076b90014e73f72213fa4c4, 4.0/4.0 CPU_group_c9c5bfa85076b90014e73f72213fa4c4, 1.0/4.0 GPU, 1.0/1.0 GPU_group_0_c9c5bfa85076b90014e73f72213fa4c4, 1.0/1.0 GPU_group_c9c5bfa85076b90014e73f72213fa4c4, 0.0/4.0 accelerator_type:T4, 0.0/1000.0 bundle_group_0_c9c5bfa85076b90014e73f72213fa4c4, 0.0/1000.0 bundle_group_c9c5bfa85076b90014e73f72213fa4c4, 0.0/4.0 example-resource-a, 0.0/4.0 example-resource-b, 0.0 GiB/113.4 GiB memory, 0.13 GiB/48.35 GiB object_store_memory
- TimeSinceLastHeartbeat: Min=0 Mean=0 Max=0
Worker node types:
- worker-node: 4
example-cluster,default:2022-01-26 08:16:21,856 INFO autoscaler.py:437 -- StandardAutoscaler: Terminating the node with id example-cluster-ray-worker-5c86z and ip 10.0.2.231. (idle)
example-cluster,default:2022-01-26 08:16:21,885 INFO autoscaler.py:500 -- Draining 1 raylet(s).
example-cluster,default:2022-01-26 08:16:21,886 INFO node_provider.py:171 -- KubernetesNodeProvider: calling delete_namespaced_pod
example-cluster,default:2022-01-26 08:16:21,950 DEBUG autoscaler.py:1134 -- example-cluster-ray-worker-l7rzf is not being updated and passes config check (can_update=True).
example-cluster,default:2022-01-26 08:16:21,988 DEBUG autoscaler.py:1134 -- example-cluster-ray-worker-vpckp is not being updated and passes config check (can_update=True).
example-cluster,default:2022-01-26 08:16:22,036 DEBUG autoscaler.py:1134 -- example-cluster-ray-worker-w8jw7 is not being updated and passes config check (can_update=True).
example-cluster,default:2022-01-26 08:16:22,091 DEBUG autoscaler.py:1134 -- example-cluster-ray-worker-l7rzf is not being updated and passes config check (can_update=True).
example-cluster,default:2022-01-26 08:16:22,174 DEBUG autoscaler.py:1134 -- example-cluster-ray-worker-vpckp is not being updated and passes config check (can_update=True).
example-cluster,default:2022-01-26 08:16:22,206 DEBUG autoscaler.py:1134 -- example-cluster-ray-worker-w8jw7 is not being updated and passes config check (can_update=True).
example-cluster,default:2022-01-26 08:16:22,383 DEBUG resource_demand_scheduler.py:189 -- Cluster resources: [{'memory': 37580963840.0, 'node:10.0.3.143': 1.0, 'CPU': 7.0, 'object_store_memory': 16046956895.0}, {'object_store_memory': 8922224581.0, 'CPU': 7.0, 'memory': 21045339750.0, 'GPU': 1.0, 'accelerator_type:T4': 1.0, 'example-resource-b': 1.0, 'example-resource-a': 1.0, 'node:10.0.6.169': 1.0}, {'CPU': 7.0, 'GPU': 1.0, 'accelerator_type:T4': 1.0, 'memory': 21045339750.0, 'example-resource-a': 1.0, 'node:10.0.3.241': 1.0, 'example-resource-b': 1.0, 'object_store_memory': 8922286021.0}, {'memory': 21045339750.0, 'example-resource-a': 1.0, 'bundle_group_0_c9c5bfa85076b90014e73f72213fa4c4': 1000.0, 'example-resource-b': 1.0, 'node:10.0.1.221': 1.0, 'accelerator_type:T4': 1.0, 'CPU': 3.0, 'bundle_group_c9c5bfa85076b90014e73f72213fa4c4': 1000.0, 'object_store_memory': 8921570860.0, 'GPU': 0.0, 'GPU_group_c9c5bfa85076b90014e73f72213fa4c4': 0.0, 'CPU_group_c9c5bfa85076b90014e73f72213fa4c4': 0.0, 'GPU_group_0_c9c5bfa85076b90014e73f72213fa4c4': 0.0, 'CPU_group_0_c9c5bfa85076b90014e73f72213fa4c4': 0.0}]
example-cluster,default:2022-01-26 08:16:22,383 DEBUG resource_demand_scheduler.py:190 -- Node counts: defaultdict(<class 'int'>, {'head-node': 1, 'worker-node': 3})
example-cluster,default:2022-01-26 08:16:22,383 DEBUG resource_demand_scheduler.py:201 -- Placement group demands: []
example-cluster,default:2022-01-26 08:16:22,384 DEBUG resource_demand_scheduler.py:247 -- Resource demands: [{'bundle_group_f2fd2a403e6cf780454857a52f5f504e': 0.001}, {'bundle_group_3116b209b405aecc460bd44c166069fc': 0.001}, {'bundle_group_d67d3fcd68c89cea58085c5b1b191070': 0.001}, {'bundle_group_c6b7c78b2813e3467c825ee335e91f59': 0.001}, {'bundle_group_bf6d839e903a2bcef6e25bd54b65fc50': 0.001}, {'bundle_group_7d793a59e156378b15521f0b7c7b04bb': 0.001}]
example-cluster,default:2022-01-26 08:16:22,384 DEBUG resource_demand_scheduler.py:248 -- Unfulfilled demands: [{'bundle_group_f2fd2a403e6cf780454857a52f5f504e': 0.001}, {'bundle_group_d67d3fcd68c89cea58085c5b1b191070': 0.001}, {'bundle_group_c6b7c78b2813e3467c825ee335e91f59': 0.001}, {'bundle_group_bf6d839e903a2bcef6e25bd54b65fc50': 0.001}, {'bundle_group_7d793a59e156378b15521f0b7c7b04bb': 0.001}, {'bundle_group_3116b209b405aecc460bd44c166069fc': 0.001}]
example-cluster,default:2022-01-26 08:16:22,384 WARNING resource_demand_scheduler.py:746 -- The autoscaler could not find a node type to satisfy the request: [{'bundle_group_f2fd2a403e6cf780454857a52f5f504e': 0.001}, {'bundle_group_d67d3fcd68c89cea58085c5b1b191070': 0.001}, {'bundle_group_c6b7c78b2813e3467c825ee335e91f59': 0.001}, {'bundle_group_bf6d839e903a2bcef6e25bd54b65fc50': 0.001}, {'bundle_group_7d793a59e156378b15521f0b7c7b04bb': 0.001}, {'bundle_group_3116b209b405aecc460bd44c166069fc': 0.001}]. If this request is related to placement groups the resource request will resolve itself, otherwise please specify a node type with the necessary resource https://docs.ray.io/en/master/cluster/autoscaling.html#multiple-node-type-autoscaling.
example-cluster,default:2022-01-26 08:16:22,384 DEBUG resource_demand_scheduler.py:252 -- Final unfulfilled: [{'bundle_group_f2fd2a403e6cf780454857a52f5f504e': 0.001}, {'bundle_group_d67d3fcd68c89cea58085c5b1b191070': 0.001}, {'bundle_group_c6b7c78b2813e3467c825ee335e91f59': 0.001}, {'bundle_group_bf6d839e903a2bcef6e25bd54b65fc50': 0.001}, {'bundle_group_7d793a59e156378b15521f0b7c7b04bb': 0.001}, {'bundle_group_3116b209b405aecc460bd44c166069fc': 0.001}]
example-cluster,default:2022-01-26 08:16:22,460 DEBUG resource_demand_scheduler.py:271 -- Node requests: {}
example-cluster,default:2022-01-26 08:16:22,542 INFO monitor.py:331 -- :event_summary:Removing 1 nodes of type worker-node (idle).
example-cluster,default:2022-01-26 08:16:22,542 DEBUG gcs_utils.py:243 -- internal_kv_put b'__autoscaling_status' b'{"load_metrics_report": {"usage": {"object_store_memory": [144143388.0, 51915278743.0], "node:10.0.3.143": [0.0, 1.0], "memory": [0.0, 121762322840.0], "CPU": [4.0, 35.0], "GPU": [1.0, 4.0], "bundle_group_0_c9c5bfa85076b90014e73f72213fa4c4": [0.0, 1000.0], "GPU_group_c9c5bfa85076b90014e73f72213fa4c4": [1.0, 1.0], "example-resource-b": [0.0, 4.0], "accelerator_type:T4": [0.0, 4.0], "bundle_group_c9c5bfa85076b90014e73f72213fa4c4": [0.0, 1000.0], "example-resource-a": [0.0, 4.0], "node:10.0.1.221": [0.0, 1.0], "CPU_group_c9c5bfa85076b90014e73f72213fa4c4": [4.0, 4.0], "GPU_group_0_c9c5bfa85076b90014e73f72213fa4c4": [1.0, 1.0], "CPU_group_0_c9c5bfa85076b90014e73f72213fa4c4": [4.0, 4.0], "node:10.0.6.169": [0.0, 1.0], "node:10.0.3.241": [0.0, 1.0], "node:10.0.2.231": [0.0, 1.0]}, "resource_demand": [[{"bundle_group_f2fd2a403e6cf780454857a52f5f504e": 0.001}, 1], [{"bundle_group_3116b209b405aecc460bd44c166069fc": 0.001}, 1], [{"bundle_group_d67d3fcd68c89cea58085c5b1b191070": 0.001}, 1], [{"bundle_group_c6b7c78b2813e3467c825ee335e91f59": 0.001}, 1], [{"bundle_group_bf6d839e903a2bcef6e25bd54b65fc50": 0.001}, 1], [{"bundle_group_7d793a59e156378b15521f0b7c7b04bb": 0.001}, 1]], "pg_demand": [], "request_demand": [], "node_types": [[{"object_store_memory": 16082992742.0, "CPU": 7.0, "memory": 37580963840.0, "node:10.0.3.143": 1.0}, 1], [{"bundle_group_c9c5bfa85076b90014e73f72213fa4c4": 1000.0, "node:10.0.1.221": 1.0, "CPU_group_0_c9c5bfa85076b90014e73f72213fa4c4": 4.0, "CPU": 7.0, "example-resource-b": 1.0, "GPU_group_0_c9c5bfa85076b90014e73f72213fa4c4": 1.0, "CPU_group_c9c5bfa85076b90014e73f72213fa4c4": 4.0, "memory": 21045339750.0, "accelerator_type:T4": 1.0, "object_store_memory": 8957606707.0, "example-resource-a": 1.0, "GPU_group_c9c5bfa85076b90014e73f72213fa4c4": 1.0, "GPU": 1.0, "bundle_group_0_c9c5bfa85076b90014e73f72213fa4c4": 1000.0}, 1], 
[{"object_store_memory": 8958260428.0, "CPU": 7.0, "memory": 21045339750.0, "node:10.0.6.169": 1.0, "accelerator_type:T4": 1.0, "example-resource-a": 1.0, "GPU": 1.0, "example-resource-b": 1.0}, 1], [{"object_store_memory": 8958321868.0, "CPU": 7.0, "memory": 21045339750.0, "accelerator_type:T4": 1.0, "example-resource-a": 1.0, "node:10.0.3.241": 1.0, "GPU": 1.0, "example-resource-b": 1.0}, 1], [{"CPU": 7.0, "object_store_memory": 8958096998.0, "memory": 21045339750.0, "node:10.0.2.231": 1.0, "accelerator_type:T4": 1.0, "example-resource-a": 1.0, "GPU": 1.0, "example-resource-b": 1.0}, 1]], "head_ip": null}, "time": 1643213781.1970084, "monitor_pid": 7054, "autoscaler_report": {"active_nodes": {"head-node": 1, "worker-node": 3}, "pending_nodes": [], "pending_launches": {}, "failed_nodes": []}}' True None
example-cluster,default:2022-01-26 08:16:27,545 DEBUG gcs_utils.py:228 -- internal_kv_get b'autoscaler_resource_request' None
example-cluster,default:2022-01-26 08:16:27,717 INFO load_metrics.py:171 -- LoadMetrics: Removed ip: 10.0.2.231.
example-cluster,default:2022-01-26 08:16:27,718 INFO load_metrics.py:177 -- LoadMetrics: Removed 1 stale ip mappings: {'10.0.2.231'} not in {'10.0.6.169', '10.0.3.241', '10.0.1.221', '10.0.3.143'}
example-cluster,default:2022-01-26 08:16:27,820 INFO autoscaler.py:303 --
======== Autoscaler status: 2022-01-26 08:16:27.820417 ========
Node status
---------------------------------------------------------------
Healthy:
1 head-node
3 worker-node
Pending:
(no pending nodes)
Recent failures:
(no failures)
Resources
---------------------------------------------------------------
Usage:
4.0/28.0 CPU (4.0 used of 4.0 reserved in placement groups)
1.0/3.0 GPU (1.0 used of 1.0 reserved in placement groups)
0.0/3.0 accelerator_type:T4
0.0/3.0 example-resource-a
0.0/3.0 example-resource-b
0.00/93.800 GiB memory
0.13/40.007 GiB object_store_memory
Demands:
(no resource demands)
example-cluster,default:2022-01-26 08:16:27,886 DEBUG gcs_utils.py:243 -- internal_kv_put b'__autoscaling_status_legacy' b"Cluster status: 3 nodes\n - MostDelayedHeartbeats: {'10.0.1.221': 0.2756786346435547, '10.0.3.143': 0.2756021022796631, '10.0.3.241': 0.2755157947540283, '10.0.6.169': 0.27544498443603516}\n - NodeIdleSeconds: Min=0 Mean=0 Max=0\n - ResourceUsage: 4.0/28.0 CPU, 4.0/4.0 CPU_group_0_c9c5bfa85076b90014e73f72213fa4c4, 4.0/4.0 CPU_group_c9c5bfa85076b90014e73f72213fa4c4, 1.0/3.0 GPU, 1.0/1.0 GPU_group_0_c9c5bfa85076b90014e73f72213fa4c4, 1.0/1.0 GPU_group_c9c5bfa85076b90014e73f72213fa4c4, 0.0/3.0 accelerator_type:T4, 0.0/1000.0 bundle_group_0_c9c5bfa85076b90014e73f72213fa4c4, 0.0/1000.0 bundle_group_c9c5bfa85076b90014e73f72213fa4c4, 0.0/3.0 example-resource-a, 0.0/3.0 example-resource-b, 0.0 GiB/93.8 GiB memory, 0.13 GiB/40.01 GiB object_store_memory\n - TimeSinceLastHeartbeat: Min=0 Mean=0 Max=0\nWorker node types:\n - worker-node: 3" True None
example-cluster,default:2022-01-26 08:16:27,887 DEBUG legacy_info_string.py:24 -- Cluster status: 3 nodes
- MostDelayedHeartbeats: {'10.0.1.221': 0.2756786346435547, '10.0.3.143': 0.2756021022796631, '10.0.3.241': 0.2755157947540283, '10.0.6.169': 0.27544498443603516}
- NodeIdleSeconds: Min=0 Mean=0 Max=0
- ResourceUsage: 4.0/28.0 CPU, 4.0/4.0 CPU_group_0_c9c5bfa85076b90014e73f72213fa4c4, 4.0/4.0 CPU_group_c9c5bfa85076b90014e73f72213fa4c4, 1.0/3.0 GPU, 1.0/1.0 GPU_group_0_c9c5bfa85076b90014e73f72213fa4c4, 1.0/1.0 GPU_group_c9c5bfa85076b90014e73f72213fa4c4, 0.0/3.0 accelerator_type:T4, 0.0/1000.0 bundle_group_0_c9c5bfa85076b90014e73f72213fa4c4, 0.0/1000.0 bundle_group_c9c5bfa85076b90014e73f72213fa4c4, 0.0/3.0 example-resource-a, 0.0/3.0 example-resource-b, 0.0 GiB/93.8 GiB memory, 0.13 GiB/40.01 GiB object_store_memory
- TimeSinceLastHeartbeat: Min=0 Mean=0 Max=0
Worker node types:
- worker-node: 3
Maybe somehow, after the removal of the idle IP, this if condition gets stuck triggering on updates for the remaining IPs until the Ray Tune job ends:
if ip not in self.last_used_time_by_ip or \
        self.static_resources_by_ip[ip] != \
        self.dynamic_resources_by_ip[ip]:
    self.last_used_time_by_ip[ip] = now
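To illustrate the hypothesis with a toy model (my own sketch with hypothetical names, not the real LoadMetrics class): if the static and dynamic resource views for a node disagree on every heartbeat, the condition above refreshes the node's last-used time each time, so its idle clock is pinned near zero even though no work is running on it:

```python
class ToyLoadMetrics:
    """Toy model of the suspected reset: a node whose static and dynamic
    resource views never match gets its last-used time refreshed on every
    update, so it never accumulates idle time."""

    def __init__(self):
        self.last_used_time_by_ip = {}
        self.static_resources_by_ip = {}
        self.dynamic_resources_by_ip = {}

    def update(self, ip, static, dynamic, now):
        self.static_resources_by_ip[ip] = static
        self.dynamic_resources_by_ip[ip] = dynamic
        # Same condition as the snippet quoted above.
        if ip not in self.last_used_time_by_ip or \
                self.static_resources_by_ip[ip] != \
                self.dynamic_resources_by_ip[ip]:
            self.last_used_time_by_ip[ip] = now

    def idle_seconds(self, ip, now):
        return now - self.last_used_time_by_ip[ip]

lm = ToyLoadMetrics()
for t in range(0, 60, 5):
    # This node's views always disagree (e.g. leftover placement-group
    # keys), so its last-used time is reset on every heartbeat.
    lm.update("10.0.6.169", static={"CPU": 7.0}, dynamic={"CPU": 6.999}, now=t)
    # This node's views agree, so idle time accumulates normally.
    lm.update("10.0.3.241", static={"CPU": 7.0}, dynamic={"CPU": 7.0}, now=t)

print(lm.idle_seconds("10.0.6.169", now=60))  # -> 5 (stuck near zero)
print(lm.idle_seconds("10.0.3.241", now=60))  # -> 60 (accumulates)
```

If something about removing the unused node perturbs the static/dynamic comparison for the surviving workers, this would match the all-zeros NodeIdleSeconds observed above.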
One additional detail about my setup just occurredred to me; it is described below.
The Ray cluster deployed on the Kubernetes cluster runs Ray 1.9.1, which comes from the docker image. However, I am now reflecting on the fact that on Kubernetes, the Ray operator contains the Ray Autoscaler, and the operator I am running in the K8s cluster was deployed from master 51 days ago.
Whenever I want to deploy a new ray cluster, I undeploy/redeploy the existing ray cluster, but I leave the ray operator running as is.
I'm not sure how/whether the behavior I am seeing in this bug could be caused by this, but I wanted to mention it. LMK if you think I should change something and rerun.
Updating the operator image is a good idea. Could you share the configuration you are using to deploy?
diff --git a/deploy/components/example_cluster.yaml b/deploy/components/example_cluster.yaml
index 1513e8fde..d4359fb77 100644
--- a/deploy/components/example_cluster.yaml
+++ b/deploy/components/example_cluster.yaml
@@ -4,14 +4,14 @@ metadata:
name: example-cluster
spec:
# The maximum number of workers nodes to launch in addition to the head node.
- maxWorkers: 3
+ maxWorkers: 9
# The autoscaler will scale up the cluster faster with higher upscaling speed.
# E.g., if the task requires adding more nodes then autoscaler will gradually
# scale up the cluster in chunks of upscaling_speed*currently_running_nodes.
# This number should be > 0.
upscalingSpeed: 1.0
# If a node is idle for this many minutes, it will be removed.
- idleTimeoutMinutes: 5
+ idleTimeoutMinutes: 3
# Specify the pod type for the ray head node (as configured below).
headPodType: head-node
# Optionally, configure ports for the Ray head service.
@@ -35,6 +35,8 @@ spec:
metadata:
# The operator automatically prepends the cluster name to this field.
generateName: ray-head-
+ annotations:
+ node.elotl.co/instance-family-exclusions: "g3,g3s"
spec:
restartPolicy: Never
@@ -48,11 +50,25 @@ spec:
containers:
- name: ray-node
imagePullPolicy: Always
- image: rayproject/ray:latest
+ # image: rayproject/ray:latest
+ image: ludwigai/ludwig-ray-gpu:tf-legacy
# Do not change this command - it keeps the pod alive until it is
# explicitly killed.
command: ["/bin/bash", "-c", "--"]
args: ["trap : TERM INT; touch /tmp/raylogs; tail -f /tmp/raylogs; sleep infinity & wait;"]
+ env:
+ - name: AWS_ACCESS_KEY_ID
+ value: "<censored>"
+ - name: AWS_SECRET_ACCESS_KEY
+ value: "<censored>"
+ - name: AWS_DEFAULT_REGION
+ value: "us-west-2"
+ - name: TUNE_TRIAL_STARTUP_GRACE_PERIOD
+ value: "120.0"
+ - name: TUNE_TRIAL_RESULT_WAIT_TIME_S
+ value: "120"
+ - name: TUNE_STATE_REFRESH_PERIOD
+ value: "5"
ports:
- containerPort: 6379 # Redis port
- containerPort: 10001 # Used by Ray Client
@@ -67,9 +83,9 @@ spec:
name: dshm
resources:
requests:
- cpu: 1000m
- memory: 512Mi
- ephemeral-storage: 1Gi
+ cpu: 7
+ memory: 50Gi
+ ephemeral-storage: 64Gi
limits:
# The maximum memory that this pod is allowed to use. The
# limit will be detected by ray and split to use 10% for
@@ -78,21 +94,25 @@ spec:
# the object store size is not set manually, ray will
# allocate a very large object store in each pod that may
# cause problems for other pods.
- memory: 512Mi
+ memory: 50Gi
- name: worker-node
# Minimum number of Ray workers of this Pod type.
- minWorkers: 2
+ minWorkers: 0
# Maximum number of Ray workers of this Pod type. Takes precedence over minWorkers.
- maxWorkers: 3
+ maxWorkers: 9
# User-specified custom resources for use by Ray.
# (Ray detects CPU and GPU from pod spec resource requests and limits, so no need to fill those here.)
rayResources: {"example-resource-a": 1, "example-resource-b": 1}
+ setupCommands:
+ - pip install boto3==1.17.106 awscli==1.19.106 botocore==1.20.106 s3fs==2021.10.0 aiobotocore==1.4.2 s3transfer==0.4.0 fsspec==2021.10.0
podConfig:
apiVersion: v1
kind: Pod
metadata:
# The operator automatically prepends the cluster name to this field.
generateName: ray-worker-
+ annotations:
+ node.elotl.co/instance-family-exclusions: "g3,g3s"
spec:
restartPolicy: Never
volumes:
@@ -102,9 +122,23 @@ spec:
containers:
- name: ray-node
imagePullPolicy: Always
- image: rayproject/ray:latest
+ # image: rayproject/ray:latest
+ image: ludwigai/ludwig-ray-gpu:tf-legacy
command: ["/bin/bash", "-c", "--"]
args: ["trap : TERM INT; touch /tmp/raylogs; tail -f /tmp/raylogs; sleep infinity & wait;"]
+ env:
+ - name: AWS_ACCESS_KEY_ID
+ value: "<censored>"
+ - name: AWS_SECRET_ACCESS_KEY
+ value: "<censored>"
+ - name: AWS_DEFAULT_REGION
+ value: "us-west-2"
+ - name: TUNE_TRIAL_STARTUP_GRACE_PERIOD
+ value: "120.0"
+ - name: TUNE_TRIAL_RESULT_WAIT_TIME_S
+ value: "120"
+ - name: TUNE_STATE_REFRESH_PERIOD
+ value: "5"
# This volume allocates shared memory for Ray to use for its plasma
# object store. If you do not provide this, Ray will fall back to
# /tmp which cause slowdowns if is not a shared memory volume.
@@ -113,9 +147,9 @@ spec:
name: dshm
resources:
requests:
- cpu: 1000m
- memory: 512Mi
- ephemeral-storage: 1Gi
+ cpu: 7
+ memory: 28Gi
+ ephemeral-storage: 64Gi
limits:
# The maximum memory that this pod is allowed to use. The
# limit will be detected by ray and split to use 10% for
@@ -124,7 +158,8 @@ spec:
# the object store size is not set manually, ray will
# allocate a very large object store in each pod that may
# cause problems for other pods.
- memory: 512Mi
+ memory: 28Gi
+ nvidia.com/gpu: 1 # requesting 1 GPU
# Commands to start Ray on the head node. You don't need to change this.
# Note dashboard-host is set to 0.0.0.0 so that Kubernetes can port forward.
headStartRayCommands:
https://github.com/ray-project/ray/issues/21870#issuecomment-1023342183 The idle time of the unused node is 184 (just over the timeout of 3 minutes). There are 5 nodes in the cluster (4 workers plus 1 head). The reported mean of 37 is roughly 184 / 5, suggesting that all of the nodes that were ever involved in the workload were considered active. So at least we know there was no surprising interaction between the scale-down of a worker and the registered idle times of the rest of the workers.
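As a quick sanity check on that arithmetic (values taken from the report above):

```python
# One unused node accrued ~184s of idle time; the cluster had 5 nodes
# (4 workers plus the head). If every other node was considered active
# (idle time 0), the cluster-wide mean idle time works out to:
idle_time_s = 184
num_nodes = 5
print(idle_time_s / num_nodes)  # 36.8, consistent with the reported mean of ~37
```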
Maybe somehow, after the removal of the idle ip, this if condition keeps triggering on updates for the remaining ips until the Ray Tune job ends:

if ip not in self.last_used_time_by_ip or \
        self.static_resources_by_ip[ip] != \
        self.dynamic_resources_by_ip[ip]:
    self.last_used_time_by_ip[ip] = now
This is where the last used time is determined.
Explanation:
"static resources" means "total available node resources" "dynamic resources" mean "currently available resources" (resources not currently in use by a Ray workload)
So the second condition in the "or" checks if there are resources that are currently in use. The first condition just marks the node as in use if we've never seen it before. That happens once per node lifetime.
The second condition in the "or" is likely the source of our problems. In other words, the Global Control Store (where the autoscaler polls info from) thinks there are resources in use when we don't expect resources to be in use.
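Putting those two rules together, the idle-detection bookkeeping can be sketched as follows (a simplified reconstruction for illustration, not the actual load_metrics.py code; the field names mirror the snippet above):

```python
import time

class IdleTracker:
    """Sketch of LoadMetrics-style idle bookkeeping (illustrative only)."""

    def __init__(self):
        self.static_resources_by_ip = {}   # total resources per node
        self.dynamic_resources_by_ip = {}  # currently-free resources per node
        self.last_used_time_by_ip = {}

    def update(self, ip, static, dynamic, now=None):
        now = time.time() if now is None else now
        self.static_resources_by_ip[ip] = static
        self.dynamic_resources_by_ip[ip] = dynamic
        # A node is "in use" whenever free != total; its last-used
        # timestamp is refreshed on every poll while that holds.
        if ip not in self.last_used_time_by_ip or \
                self.static_resources_by_ip[ip] != \
                self.dynamic_resources_by_ip[ip]:
            self.last_used_time_by_ip[ip] = now

    def is_idle(self, ip, idle_timeout_s, now=None):
        now = time.time() if now is None else now
        return now - self.last_used_time_by_ip[ip] > idle_timeout_s

# A node whose only "usage" is a stale secondary object-store copy keeps
# reporting dynamic != static, so its last-used time keeps refreshing and
# the idle timeout never fires.
tracker = IdleTracker()
tracker.update("10.0.3.161", {"CPU": 7.0, "object_store_memory": 1e9},
               {"CPU": 7.0, "object_store_memory": 1e9 - 36035847.0}, now=0.0)
tracker.update("10.0.3.161", {"CPU": 7.0, "object_store_memory": 1e9},
               {"CPU": 7.0, "object_store_memory": 1e9 - 36035847.0}, now=300.0)
print(tracker.is_idle("10.0.3.161", idle_timeout_s=180, now=300.0))  # False
```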
What we actually need to debug this is a periodically emitted node-by-node breakdown of used and available resources.
I don't think Ray has a public API for info at that level of granularity. I'll suggest a patch to get that info in the operator/autoscaler logs.
This is very interesting, thank you @DmitriGekhtman ! Am happy to apply a patch and rerun.
Here's some code that can be added to monitor.py and load_metrics.py in the operator image to get more detailed info in the autoscaler logs https://github.com/ray-project/ray/pull/21978/files
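In spirit, the extra logging diffs each node's static (total) resources against its dynamic (free) resources and reports every resource still in use (a minimal sketch, not the linked patch itself):

```python
def in_use_resources(static_resources_by_ip, dynamic_resources_by_ip):
    """Per-node breakdown of resources in use (static total minus dynamic free).

    Sketch only: the real patch emits this information from the
    autoscaler's load metrics.
    """
    usage = {}
    for ip, static in static_resources_by_ip.items():
        dynamic = dynamic_resources_by_ip.get(ip, {})
        used = {name: total - dynamic.get(name, total)
                for name, total in static.items()
                if total - dynamic.get(name, total) > 0}
        if used:
            usage[ip] = used
    return usage

# Produces the kind of log lines seen further down, e.g.
# "Node at ip 10.0.3.161 is using 36035847.0 object_store_memory."
usage = in_use_resources(
    {"10.0.3.161": {"CPU": 7.0, "object_store_memory": 1e9}},
    {"10.0.3.161": {"CPU": 7.0, "object_store_memory": 1e9 - 36035847.0}})
for ip, used in usage.items():
    print(f"Node with ip {ip} is in use.")
    for name, amount in used.items():
        print(f"Node at ip {ip} is using {amount} {name}.")
```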
There are many ways to build an operator image with patched Python code. I usually use setup-dev.py as described here.
Thanks, Dmitri! I reran the single Ray Tune workload with the Ray Operator Autoscaler modifications to get more detailed logging.
Below is a selected time-line with info from the Ray Tune job logging marked "Hyperp" and from the Ray Operator Autoscaler marked "Autosc".
The issue seems to be that the 2 idle nodes that previously ran workers in the Ray Tune job are considered to be in use because they hold copies of objects in the object store. But it seems like those copies should be considered uninteresting at this point.
[I can provide the full logs if desired]
********Selected time-line**********
Initial state: CPU head node [10.0.2.144] running
16:18:30 Hyperp started forest_cover 1hr gpu auto train, max_samples=10, max_concurrency=3
16:18:49 Hyperp pending: 1
16:24:50 Hyperp pending: 1, running: 1 [10.0.1.161]
16:30:22 Hyperp pending: 1, running: 2 [10.0.1.161, 10.0.1.107]
16:36:00 Hyperp pending: 0, running: 3 [10.0.1.161, 10.0.1.107, 10.0.3.161], terminated: 1
16:42:40 Autosc adding: 10.0.1.106
16:44:55 Hyperp pending: 0, running: 2 [10.0.1.161, 10.0.3.161], terminated: 8
16:45:06 Hyperp pending: 0, running: 1 [10.0.1.161], terminated: 9
16:46:22 Autosc removed: 10.0.1.106 [never used]
2022-01-30 16:50:06,070 DEBUG load_metrics.py:142 -- Node with ip 10.0.2.144 is in use.
2022-01-30 16:50:06,070 DEBUG load_metrics.py:151 -- Node at ip 10.0.2.144 is using 36035847.0 object_store_memory.
2022-01-30 16:50:06,071 DEBUG load_metrics.py:142 -- Node with ip 10.0.1.161 is in use.
2022-01-30 16:50:06,071 DEBUG load_metrics.py:151 -- Node at ip 10.0.1.161 is using 1.0 GPU.
2022-01-30 16:50:06,071 DEBUG load_metrics.py:151 -- Node at ip 10.0.1.161 is using 4.0 CPU_group_9ed9ca343c8c9b3f7e011adedde1a85a.
2022-01-30 16:50:06,071 DEBUG load_metrics.py:151 -- Node at ip 10.0.1.161 is using 1.0 GPU_group_0_9ed9ca343c8c9b3f7e011adedde1a85a.
2022-01-30 16:50:06,071 DEBUG load_metrics.py:151 -- Node at ip 10.0.1.161 is using 4.0 CPU.
2022-01-30 16:50:06,071 DEBUG load_metrics.py:151 -- Node at ip 10.0.1.161 is using 1.0 GPU_group_9ed9ca343c8c9b3f7e011adedde1a85a.
2022-01-30 16:50:06,071 DEBUG load_metrics.py:151 -- Node at ip 10.0.1.161 is using 4.0 CPU_group_0_9ed9ca343c8c9b3f7e011adedde1a85a.
2022-01-30 16:50:06,071 DEBUG load_metrics.py:151 -- Node at ip 10.0.1.161 is using 36035847.0 object_store_memory.
2022-01-30 16:50:06,071 DEBUG load_metrics.py:142 -- Node with ip 10.0.1.107 is in use.
2022-01-30 16:50:06,071 DEBUG load_metrics.py:151 -- Node at ip 10.0.1.107 is using 36035847.0 object_store_memory.
2022-01-30 16:50:06,071 DEBUG load_metrics.py:142 -- Node with ip 10.0.3.161 is in use.
2022-01-30 16:50:06,071 DEBUG load_metrics.py:151 -- Node at ip 10.0.3.161 is using 36035847.0 object_store_memory.
17:18:50 Hyperp pending: 0, running: 0, terminated: 10
17:21 Hyperp stopped
17:24:22 Autosc removed: 10.0.1.161, 10.0.1.107, 10.0.3.161
Thanks, Anne! Looks like we're getting closer to figuring this out.
Some of the Ray worker nodes are unexpectedly kept from scaling down despite not being involved in active Ray workloads. The reason is that they appear to still be using object store memory.
Polling experts on the Ray internals,
@rkooo567 @wuisawesome @ericl
What are some potential mitigations or directions for further investigation?
Hmmm seems like there are a few options here (I'm assuming that these nodes aren't storing anything useful in their object store).
cc. @scv119 as well
I'm assuming that these nodes aren't storing anything useful in their object store.
Correct.
Hmm do we know what created those objects and what is referencing them? "ray memory" can show you more information on this.
Logging shows the Ray head was using that amount of object_store_memory just after the Ray Tune job started:
example-cluster,default:2022-01-30 16:18:53,019 DEBUG load_metrics.py:151 -- Node at ip 10.0.2.144 is using 36035847.0 object_store_memory.
When each new worker was added and started running a trial, it started showing as using the same amount of object store memory:
example-cluster,default:2022-01-30 16:24:53,335 DEBUG load_metrics.py:151 -- Node at ip 10.0.1.161 is using 36035847.0 object_store_memory.
example-cluster,default:2022-01-30 16:30:23,386 DEBUG load_metrics.py:151 -- Node at ip 10.0.1.107 is using 36035847.0 object_store_memory.
example-cluster,default:2022-01-30 16:36:02,283 DEBUG load_metrics.py:151 -- Node at ip 10.0.3.161 is using 36035847.0 object_store_memory.
After the Ray Tune job ended at 17:21, the head and the 3 worker nodes stopped using any object_store_memory.
At that point, 2 of the 3 workers had actually been idle since around 16:45.
@richardliaw @xwjiang2010
Looking again at the summary here.
At 16:45:06, the nodes at ips 10.0.1.161 and 10.0.3.161 have finished their trials, yet they seem to continue to be using object store until 17:21 when the whole Tune run stops. That prevents the nodes from timely downscaling.
The amount of object store memory on the stuck workers 10.0.3.161 and 10.0.1.107 is the same (36035847.0) as on the head node 10.0.2.144, suggesting that the objects preventing scale-down could be copies of the same Tune-related artifacts.
Do you guys know if there's anything in Tune that could cause this?
I think it could be an artifact of reference counting in Ray. It sounds like all the trials are referencing a shared copy of an object (this 36MB object). However, Ray doesn't evict secondary copies of an object until all references fall out of scope. So the nodes are reporting memory in use, when the object could be evicted safely in this case.
There are some potential fixes:
Don't report memory usage of secondary objects for the purposes of autoscaler downscaling.
I vote for this approach.
^ Sounds like the correct solution. Is it straightforward to implement? Would it be difficult to distinguish secondary objects?
hah, prompt answer in <20 lines of code
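For illustration, the shape of that fix might look like this (a hedged sketch, not the actual Ray change; resource names follow the logs above): exclude object store usage when deciding whether a node is busy, since it may reflect only evictable secondary copies.

```python
def is_node_in_use(static_resources, dynamic_resources,
                   ignored=("object_store_memory",)):
    """Return True if any non-ignored resource is partially consumed.

    Sketch of the proposed approach: secondary object-store copies keep
    object_store_memory "in use" even on otherwise-idle nodes, so that
    resource is excluded from the downscaling decision.
    """
    for name, total in static_resources.items():
        if name in ignored:
            continue
        if dynamic_resources.get(name, total) != total:
            return True
    return False

# A node holding only a secondary object copy is now considered idle.
print(is_node_in_use(
    {"CPU": 7.0, "object_store_memory": 1e9},
    {"CPU": 7.0, "object_store_memory": 1e9 - 36035847.0}))  # False
```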
Thanks for this fix! I am trying to test the fix in my setup, which is deployment of Ray onto a K8s cluster running ludwig automl. The image I deploy for the head and workers is ludwigai/ludwig-ray-gpu:tf-legacy. @tgaddair mentioned that I could add
- pip install -U https://s3-us-west-2.amazonaws.com/ray-wheels/latest/ray-2.0.0.dev0-cp37-cp37m-manylinux2014_x86_64.whl
to the worker setupCommands to load in the latest wheels but when I do that, the workers don't come up; they report
(base) MacBook-Air-2:trips anne$ kubectl logs -f example-cluster-ray-worker-zrv4q
2022-02-02 09:49:07,438 INFO scripts.py:862 -- Local node IP: 10.0.1.248
Traceback (most recent call last):
File "/home/ray/anaconda3/bin/ray", line 8, in <module>
sys.exit(main())
File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/scripts/scripts.py", line 2214, in main
return cli()
File "/home/ray/anaconda3/lib/python3.7/site-packages/click/core.py", line 1137, in __call__
return self.main(*args, **kwargs)
File "/home/ray/anaconda3/lib/python3.7/site-packages/click/core.py", line 1062, in main
rv = self.invoke(ctx)
File "/home/ray/anaconda3/lib/python3.7/site-packages/click/core.py", line 1668, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/home/ray/anaconda3/lib/python3.7/site-packages/click/core.py", line 1404, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/home/ray/anaconda3/lib/python3.7/site-packages/click/core.py", line 763, in invoke
return __callback(*args, **kwargs)
File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/autoscaler/_private/cli_logger.py", line 791, in wrapper
return f(*args, **kwargs)
File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/scripts/scripts.py", line 865, in start
ray_params, head=False, shutdown_at_exit=block, spawn_reaper=block
File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/node.py", line 189, in __init__
"session_name", ray_constants.KV_NAMESPACE_SESSION
File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/node.py", line 1540, in _internal_kv_get_with_retry
f"Could not read '{key}' from GCS (redis). "
RuntimeError: Could not read 'b'session_name'' from GCS (redis). If using Redis, did Redis start successfully?
Does anyone know what might be going wrong and/or other ways I can run with this great change?
Did you also add it to the head setup commands? If that still doesn't work, it might be due to the Ray operator / image being out of date, cc @mwtian on possible redis-removal related compatibility issues with upgrading to nightly.
Head setup commands are not supported for Ray deployment on K8s, but I did run the head pip setup command manually and saw the same problem on the head itself, before any workers were even added for scaling.
In fact, after I saw that the head didn't work if I used that pip install, I stopped the ray cluster, redeployed it without running the pip command on the head, and only ran the setup command on the workers and they had the same problem that I'd already seen on the head.
What autoscaler version is running? It is possible that an earlier version (1.9?) of autoscaler is not launching the worker correctly by setting the correct flags. Are you able to see the flags that worker nodes are launched with?
The ray-operator node in my k8s cluster (which contains autoscaler) is deployed with image: rayproject/ray:nightly so I assume it is very up to date since I deployed it about 4 hours ago.
What do you mean by "flags that worker nodes are launched with"? Is it in the autoscaler log? Since the problem can be reproduced on the head itself, we could focus there? The head starts up fine and then immediately fails once the update pip install is executed on the head.
I can crank up a cluster with minWorkers=0, and the head comes up fine. Then on the head, if I execute the pip install for the updated Ray, the head starts complaining with the error message
Here are the tasks that were running on the head:
(base) ray@example-cluster-ray-head-gnh2g:/ludwig$ ps auxw
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
ray 1 0.0 0.0 20132 3476 ? Ss 09:29 0:00 /bin/bash -c -- trap : TERM INT; touch /tmp/raylogs; tail -f /tmp/raylogs; sleep infinity & wait;
ray 8 0.0 0.0 6324 832 ? S 09:29 0:00 tail -f /tmp/raylogs
ray 153 0.1 0.0 74944 7424 ? Sl 09:29 0:00 /home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/thirdparty/redis/src/redis-server *:6379
ray 158 0.0 0.0 74944 7156 ? Sl 09:29 0:00 /home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/thirdparty/redis/src/redis-server *:58932
ray 163 0.9 0.0 1029800 27588 ? Sl 09:29 0:01 /home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/gcs/gcs_server --redis_address=10.0.6.166 --redis_po
ray 180 1.1 0.1 766424 85360 ? Sl 09:29 0:01 /home/ray/anaconda3/bin/python -m ray.util.client.server --redis-address=10.0.6.166:6379 --host=0.0.0.0 --port=10001
ray 189 1.3 0.2 4390068 141256 ? Sl 09:29 0:01 /home/ray/anaconda3/bin/python -u /home/ray/anaconda3/lib/python3.7/site-packages/ray/dashboard/dashboard.py --host=0
ray 223 0.8 0.0 17337936 22428 ? Sl 09:29 0:00 /home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/raylet/raylet --raylet_socket_name=/tmp/ray/session_
ray 230 0.8 0.1 310732 81592 ? Sl 09:29 0:00 /home/ray/anaconda3/bin/python -u /home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/log_monitor.py --redis
ray 261 1.5 0.2 3862168 138876 ? Sl 09:29 0:01 /home/ray/anaconda3/bin/python -u /home/ray/anaconda3/lib/python3.7/site-packages/ray/dashboard/agent.py --node-ip-ad
ray 409 0.1 0.0 20396 3836 pts/0 Ss 09:31 0:00 bash
ray 443 0.0 0.0 36164 3336 pts/0 R+ 09:31 0:00 ps auxw
I just noticed this in the ray log:
Ray client servers support --address flag instead of --redis-address.
Does that mean the ray head client command is no longer working since it uses --redis-address?
I'd recommend if possible running head and worker images with the changes built-in rather than running pip install at runtime or in setup commands.
Here is autoscaler logging for a worker startup
2022-02-02 09:50:28,732 INFO log_timer.py:30 -- NodeUpdater: example-cluster-ray-worker-hks5g: Got remote shell [LogTimer=31832ms]
2022-02-02 09:50:28,750 INFO updater.py:374 -- Updating cluster configuration.^[[0m^[[2m [hash=18639bb6a3ba134e2e4110e55ab38ba97a8dcd39]^[[22m^[[0m
2022-02-02 09:50:28,794 INFO updater.py:380 -- ^[[37mNew status^[[39m: ^[[1msyncing-files^[[22m
2022-02-02 09:50:28,794 INFO updater.py:238 -- ^[[2m[2/7]^[[22m ^[[36mProcessing file mounts^[[39m
2022-02-02 09:50:28,794 INFO updater.py:256 -- ^[[2m[3/7]^[[22m No worker file mounts to sync
2022-02-02 09:50:28,840 INFO updater.py:391 -- ^[[37mNew status^[[39m: ^[[1msetting-up^[[22m
2022-02-02 09:50:28,840 INFO updater.py:434 -- ^[[2m[4/7]^[[22m No initialization commands to run.
2022-02-02 09:50:28,840 INFO updater.py:439 -- ^[[2m[5/7]^[[22m ^[[36mInitalizing command runner^[[39m
2022-02-02 09:50:28,841 INFO updater.py:450 -- ^[[2m[6/7]^[[22m ^[[36mRunning setup commands^[[39m
2022-02-02 09:50:28,841 INFO updater.py:470 -- ^[[2m(0/2)^[[22m ^[[1mexport RAY_HEAD_IP=10.0.6.166; pip install boto3==1.17.106 awscli==1.19.106 botocore==1.20.106 s3fs==2021.10.0 aiobotocore==1.4.2 s3transfer==0.4.0 fsspec==2021.10.0^[[22m^[[26m
2022-02-02 09:50:38,921 INFO updater.py:470 -- ^[[2m(1/2)^[[22m ^[[1mexport RAY_HEAD_IP=10.0.6.166; pip install -U https://s3-us-west-2.amazonaws.com/ray-wheels/latest/ray-2.0.0.dev0-cp37-cp37m-manylinux2014_x86_64.whl^[[22m^[[26m
2022-02-02 09:50:46,593 INFO log_timer.py:30 -- NodeUpdater: example-cluster-ray-worker-hks5g: Setup commands succeeded [LogTimer=17752ms]
2022-02-02 09:50:46,594 INFO updater.py:489 -- ^[[2m[7/7]^[[22m ^[[36mStarting the Ray runtime^[[39m
2022-02-02 09:51:29,755 INFO log_timer.py:30 -- NodeUpdater: example-cluster-ray-worker-hks5g: Ray start commands failed [LogTimer=43161ms]
2022-02-02 09:51:29,755 INFO log_timer.py:30 -- NodeUpdater: example-cluster-ray-worker-hks5g: Applied config 18639bb6a3ba134e2e4110e55ab38ba97a8dcd39 [LogTimer=92912ms]
2022-02-02 09:51:29,811 ERR updater.py:157 -- ^[[31mNew status: ^[[1mupdate-failed^[[22m^[[26m^[[39m
2022-02-02 09:51:29,811 ERR updater.py:159 -- ^[[31m!!!^[[39m
2022-02-02 09:51:29,811 ERR updater.py:164 -- ^[[31mSetup command `^[[1mkubectl -n default exec -it example-cluster-ray-worker-hks5g -- bash --login -c -i 'true && source ~/.bashrc && export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && (export RAY_OVERRIDE_RESOURCES='"'"'{"CPU":7,"GPU":1,"example-resource-a":1,"example-resource-b":1,"memory":21045339750}'"'"';export RAY_HEAD_IP=10.0.6.166; ulimit -n 65536; ray start --address=$RAY_HEAD_IP:6379 &> /tmp/raylogs)'^[[22m^[[26m` failed with exit code 1. stderr:^[[39m
2022-02-02 09:51:29,811 ERR updater.py:171 -- ^[[31m!!!^[[39m
2022-02-02 09:51:38,319 INFO updater.py:323 -- ^[[37mNew status^[[39m: ^[[1mwaiting-for-ssh^[[22m
2022-02-02 09:51:38,320 INFO updater.py:261 -- ^[[2m[1/7]^[[22m ^[[36mWaiting for SSH to become available^[[39m
2022-02-02 09:51:38,320 INFO updater.py:265 -- Running `^[[1muptime^[[22m^[[26m` as a test.
Installing collected packages: ray
Attempting uninstall: ray
Found existing installation: ray 1.9.2
Uninstalling ray-1.9.2:
Successfully uninstalled ray-1.9.2
Successfully installed ray-2.0.0.dev0
example-cluster,default:2022-02-02 09:50:46,594 INFO command_runner.py:179 -- NodeUpdater: example-cluster-ray-worker-hks5g: Running kubectl -n default exec -it example-cluster-ray-worker-hks5g -- bash --login -c -i 'true && source ~/.bashrc && export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && (export RAY_OVERRIDE_RESOURCES='"'"'{"CPU":7,"GPU":1,"example-resource-a":1,"example-resource-b":1,"memory":21045339750}'"'"';export RAY_HEAD_IP=10.0.6.166; ray stop)'
Unable to use a TTY - input is not a terminal or the right kind of file
bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
Thanks for the detailed info. Looks like the failure is indeed from autoscaler trying to start workers.
I can crank up a cluster with minWorkers=0, and the head comes up fine. Then on the head, if I execute the pip install for the updated Ray, the head starts complaining with the error message
When executing pip install for the updated Ray, is the head node still running? Is the same error message generated, e.g. RuntimeError: Could not read 'b'session_name'' from GCS (redis). If using Redis, did Redis start successfully? I wonder why the autoscaler is trying to start a worker in this case.
Does that mean the ray head client command is no longer working since it uses --redis-address?
In the latest Ray, the Ray client server accepts --address instead of --redis-address. Other components in Ray have been updated to use --address as well. The "Ray client servers support ..." message does not seem to be in the latest Ray repo. I wonder where the error log comes from.
Here is a more methodical set of steps, which also pulls the autoscaler out of the picture.
*) I started up a fresh 1.9.2 Ray cluster on a k8s cluster with a cpu-only
head and gpu workers, as in the previous case but with minWorkers=1
*) The head and 1 worker came up fine. I logged into each and verified
"ray --version" was 1.9.2 and "ray status" looked good.
*) Then I ran the following update on the head and on the worker:
"pip install -U https://s3-us-west-2.amazonaws.com/ray-wheels/latest/ray-2.0.0.dev0-cp37-cp37m-manylinux2014_x86_64.whl"
*) Then I checked "ray --version" was "2.0.0.dev0" and "ray status" was good on head & worker.
*) Then I started a Ludwig AutoML job, which uses Ray Tune, with max_concurrency 1:
nohup python run_auto_train_1hr_nodeless.py >run.1hr 2>&1 &
*) And the output produced was:
(base) ray@example-cluster-ray-head-57vww:~/experiments/automl/validation/forest_cover$ cat run.1hr
nohup: ignoring input
covtype.data.gz: 10.7MB [00:00, 18.8MB/s]
Traceback (most recent call last):
File "run_auto_train_1hr_nodeless.py", line 15, in <module>
user_config={'hyperopt': {'executor': {'max_concurrent_trials': 1, 'gpu_resources_per_trial': 1, 'cpu_resources_per_trial': 4}}},
File "/home/ray/anaconda3/lib/python3.7/site-packages/ludwig/automl/automl.py", line 108, in auto_train
dataset, target, time_limit_s, tune_for_memory, user_config, random_seed, use_reference_config, **kwargs)
File "/home/ray/anaconda3/lib/python3.7/site-packages/ludwig/automl/automl.py", line 150, in create_auto_config
default_configs = _create_default_config(dataset, target, time_limit_s, random_seed)
File "/home/ray/anaconda3/lib/python3.7/site-packages/ludwig/automl/base_config.py", line 135, in _create_default_config
_ray_init()
File "/home/ray/anaconda3/lib/python3.7/site-packages/ludwig/automl/utils.py", line 88, in _ray_init
ray.init('auto', ignore_reinit_error=True)
File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
return func(*args, **kwargs)
File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/worker.py", line 1107, in init
connect_only=True,
File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/node.py", line 190, in __init__
"session_name", ray_constants.KV_NAMESPACE_SESSION
File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/node.py", line 1542, in _internal_kv_get_with_retry
f"Could not read '{key}' from GCS (redis). "
RuntimeError: Could not read 'b'session_name'' from GCS (redis). If using Redis, did Redis start successfully?
*) Here is the ludwig program I ran:
(base) ray@example-cluster-ray-head-57vww:~/experiments/automl/validation/forest_cover$ cat run_auto_train_1hr_nodeless.py
import logging
import pprint
from ludwig.datasets import forest_cover
from ludwig.automl import auto_train
forest_cover_df = forest_cover.load()
auto_train_results = auto_train(
dataset=forest_cover_df,
target='Cover_Type',
time_limit_s=3600,
tune_for_memory=False,
output_directory='s3://predibase-runs/nodeless/forest_cover/hours1/',
user_config={'hyperopt': {'executor': {'max_concurrent_trials': 1, 'gpu_resources_per_trial': 1, 'cpu_resources_per_trial': 4}}},
)
pprint.pprint(auto_train_results)
Here's what's running on the head
(base) ray@example-cluster-ray-head-57vww:~/experiments/automl/validation/forest_cover$ ps auxfww
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
ray 408 0.0 0.0 20396 3892 pts/0 Ss 14:32 0:00 bash
ray 533 0.0 0.0 36164 3332 pts/0 R+ 15:02 0:00 \_ ps auxfww
ray 1 0.0 0.0 20132 3552 ? Ss 13:33 0:00 /bin/bash -c -- trap : TERM INT; touch /tmp/raylogs; tail -f /tmp/raylogs; sleep infinity & wait;
ray 8 0.0 0.0 6324 868 ? S 13:33 0:00 tail -f /tmp/raylogs
ray 152 0.1 0.0 74944 7516 ? Sl 13:33 0:09 /home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/thirdparty/redis/src/redis-server *:6379
ray 157 0.0 0.0 81088 7200 ? Sl 13:33 0:03 /home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/thirdparty/redis/src/redis-server *:49684
ray 162 1.3 0.0 1029800 28052 ? Sl 13:33 1:11 /home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/gcs/gcs_server --redis_address=10.0.5.197 --redis_port=6379 --log_dir=/tmp/ray/session_2022-02-02_13-33-22_063004_148/logs --config_list=eyJvYmplY3Rfc3BpbGxpbmdfY29uZmlnIjogIntcInR5cGVcIjogXCJmaWxlc3lzdGVtXCIsIFwicGFyYW1zXCI6IHtcImRpcmVjdG9yeV9wYXRoXCI6IFwiL3RtcC9yYXkvc2Vzc2lvbl8yMDIyLTAyLTAyXzEzLTMzLTIyXzA2MzAwNF8xNDhcIn19IiwgImlzX2V4dGVybmFsX3N0b3JhZ2VfdHlwZV9mcyI6IHRydWV9 --gcs_server_port=0 --metrics-agent-port=52535 --node-ip-address=10.0.5.197 --redis_password=5241590000000000
ray 179 0.0 0.1 766424 85352 ? Sl 13:33 0:04 /home/ray/anaconda3/bin/python -m ray.util.client.server --redis-address=10.0.5.197:6379 --host=0.0.0.0 --port=10001 --mode=proxy --redis-password=5241590000000000 --metrics-agent-port=52535 --worker-shim-pid=179
ray 188 0.5 0.2 4390200 141664 ? Sl 13:33 0:29 /home/ray/anaconda3/bin/python -u /home/ray/anaconda3/lib/python3.7/site-packages/ray/dashboard/dashboard.py --host=0.0.0.0 --port=8265 --port-retries=0 --redis-address=10.0.5.197:6379 --temp-dir=/tmp/ray --log-dir=/tmp/ray/session_2022-02-02_13-33-22_063004_148/logs --logging-rotate-bytes=536870912 --logging-rotate-backup-count=5 --redis-password 5241590000000000
ray 222 0.8 0.0 17338196 22816 ? Sl 13:33 0:44 /home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/raylet/raylet --raylet_socket_name=/tmp/ray/session_2022-02-02_13-33-22_063004_148/sockets/raylet --store_socket_name=/tmp/ray/session_2022-02-02_13-33-22_063004_148/sockets/plasma_store --object_manager_port=0 --min_worker_port=10002 --max_worker_port=19999 --node_manager_port=0 --node_ip_address=10.0.5.197 --redis_address=10.0.5.197 --redis_port=6379 --maximum_startup_concurrency=7 --static_resource_list=node:10.0.5.197,1.0,CPU,7,memory,37580963840,object_store_memory,16083075072 --python_worker_command=/home/ray/anaconda3/bin/python /home/ray/anaconda3/lib/python3.7/site-packages/ray/workers/setup_worker.py /home/ray/anaconda3/lib/python3.7/site-packages/ray/workers/default_worker.py --node-ip-address=10.0.5.197 --node-manager-port=RAY_NODE_MANAGER_PORT_PLACEHOLDER --object-store-name=/tmp/ray/session_2022-02-02_13-33-22_063004_148/sockets/plasma_store --raylet-name=/tmp/ray/session_2022-02-02_13-33-22_063004_148/sockets/raylet --redis-address=10.0.5.197:6379 --temp-dir=/tmp/ray --metrics-agent-port=52535 --logging-rotate-bytes=536870912 --logging-rotate-backup-count=5 RAY_WORKER_DYNAMIC_OPTION_PLACEHOLDER --redis-password=5241590000000000 --java_worker_command= --cpp_worker_command= --native_library_path=/home/ray/anaconda3/lib/python3.7/site-packages/ray/cpp/lib --redis_password=5241590000000000 --temp_dir=/tmp/ray --session_dir=/tmp/ray/session_2022-02-02_13-33-22_063004_148 --log_dir=/tmp/ray/session_2022-02-02_13-33-22_063004_148/logs --resource_dir=/tmp/ray/session_2022-02-02_13-33-22_063004_148/runtime_resources --metrics-agent-port=52535 --metrics_export_port=62075 --object_store_memory=16083075072 --plasma_directory=/dev/shm --ray-debugger-external=0 --agent_command=/home/ray/anaconda3/bin/python -u /home/ray/anaconda3/lib/python3.7/site-packages/ray/dashboard/agent.py --node-ip-address=10.0.5.197 --redis-address=10.0.5.197:6379 
--metrics-export-port=62075 --dashboard-agent-port=52535 --listen-port=0 --node-manager-port=RAY_NODE_MANAGER_PORT_PLACEHOLDER --object-store-name=/tmp/ray/session_2022-02-02_13-33-22_063004_148/sockets/plasma_store --raylet-name=/tmp/ray/session_2022-02-02_13-33-22_063004_148/sockets/raylet --temp-dir=/tmp/ray --session-dir=/tmp/ray/session_2022-02-02_13-33-22_063004_148 --runtime-env-dir=/tmp/ray/session_2022-02-02_13-33-22_063004_148/runtime_resources --log-dir=/tmp/ray/session_2022-02-02_13-33-22_063004_148/logs --logging-rotate-bytes=536870912 --logging-rotate-backup-count=5 --redis-password=5241590000000000
ray 260 0.6 0.2 3862572 138620 ? Sl 13:33 0:32 \_ /home/ray/anaconda3/bin/python -u /home/ray/anaconda3/lib/python3.7/site-packages/ray/dashboard/agent.py --node-ip-address=10.0.5.197 --redis-address=10.0.5.197:6379 --metrics-export-port=62075 --dashboard-agent-port=52535 --listen-port=0 --node-manager-port=43305 --object-store-name=/tmp/ray/session_2022-02-02_13-33-22_063004_148/sockets/plasma_store --raylet-name=/tmp/ray/session_2022-02-02_13-33-22_063004_148/sockets/raylet --temp-dir=/tmp/ray --session-dir=/tmp/ray/session_2022-02-02_13-33-22_063004_148 --runtime-env-dir=/tmp/ray/session_2022-02-02_13-33-22_063004_148/runtime_resources --log-dir=/tmp/ray/session_2022-02-02_13-33-22_063004_148/logs --logging-rotate-bytes=536870912 --logging-rotate-backup-count=5 --redis-password=5241590000000000
ray 229 0.3 0.1 310732 81616 ? Sl 13:33 0:17 /home/ray/anaconda3/bin/python -u /home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/log_monitor.py --redis-address=10.0.5.197:6379 --logs-dir=/tmp/ray/session_2022-02-02_13-33-22_063004_148/logs --logging-rotate-bytes=536870912 --logging-rotate-backup-count=5 --redis-password 5241590000000000
Search before asking
Ray Component
Ray Tune
What happened + What you expected to happen
Versions / Dependencies
Ray 1.9.1
Reproduction script
https://github.com/ludwig-ai/experiments/blob/main/automl/validation/run_nodeless.sh run with Ray deployed on a K8s cluster. Can provide the Ray deployment script if desired.
Anything else
This problem is highly reproducible for me.
Are you willing to submit a PR?