ray-project / ray

Ray is a unified framework for scaling AI and Python applications. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

[Bug] Ray Autoscaler is not spinning down idle nodes due to secondary object copies #21870

Closed. amholler closed this issue 2 years ago.

amholler commented 2 years ago

Search before asking

Ray Component

Ray Tune

What happened + What you expected to happen

Ray Autoscaler is not spinning down idle nodes if they ever ran a trial for the active Ray Tune job

The issue is seen with Ray Tune on a Ray 1.9.1 cluster with a CPU head and GPU workers (min=0, max=9).
The Higgs Ray Tune job is set up to run up to 10 trials using async hyperband for 1 hr with
max_concurrency of 3. I see at most 3 trials running (each requiring 1 GPU and 4 CPUs).
Except in the first log output at job startup, no PENDING trials are reported.

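For reference, the job has roughly this shape (a simplified sketch, not the actual reproduction script linked below; the trainable is a placeholder and the metric direction is assumed):

from ray import tune
from ray.tune.schedulers import AsyncHyperBandScheduler

def train_higgs(config):
    # Placeholder trainable; the real job trains a Ludwig model on the Higgs dataset.
    for step in range(100):
        tune.report(metric_score=0.5)

analysis = tune.run(
    train_higgs,
    num_samples=10,                            # up to 10 trials
    scheduler=AsyncHyperBandScheduler(),       # async hyperband early stopping
    time_budget_s=3600,                        # 1 hr budget
    resources_per_trial={"cpu": 4, "gpu": 1},  # each trial needs 4 CPUs + 1 GPU
    metric="metric_score",
    mode="min",                                # direction assumed for illustration
)
# The real script additionally caps concurrency at 3 via its search algorithm.
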
At the time the Ray Tune job is stopped for the 1hr time limit at 12:43:48, the console log (see below) shows:
*) 3 nodes running Higgs trials (10.0.4.71, 10.0.6.5, 10.0.4.38)
*) 2 nodes that previously ran Higgs trials but are not doing so now (10.0.2.129, 10.0.3.245).
The latter 2 nodes last reported running trials at 12:21:30, so they should have been spun down.

Note that, in this run, multiple Ray Tune jobs were running in the same Ray cluster with some overlap:
 MushroomEdibility Ray Tune 1hr job ran from 11:20-12:20
 ForestCover       Ray Tune 1hr job ran from 11:22-12:22
 Higgs             Ray Tune 1hr job ran from 11:45-12:45
After 12:22, there was no overlap of jobs, and hence no use of the 2 idle workers that remained up other than their earlier use by Higgs.
Two other nodes that became idle when MushroomEdibility and ForestCover completed were spun down at that point, leaving the other 2 idle nodes that Higgs had used still running.
In the same kind of scenario later in the run, I observed that after the Higgs job completed, all Higgs trial workers were spun down.

Current time: 2022-01-24 12:43:48 (running for 00:59:30.27) ...
Number of trials: 9/10 (3 RUNNING, 6 TERMINATED)

+----------------+------------+-----------------+------------------------+--------------------------+----------------------+------------------------+------------------------------+-----------------+---------------------+-----------------------+-----------------------+------------------------+--------------------------+--------+------------------+----------------+
| Trial name     | status     | loc             |   combiner.bn_momentum |   combiner.bn_virtual_bs |   combiner.num_steps |   combiner.output_size |   combiner.relaxation_factor |   combiner.size |   combiner.sparsity |   training.batch_size |   training.decay_rate |   training.decay_steps |   training.learning_rate |   iter |   total time (s) |   metric_score |
|----------------+------------+-----------------+------------------------+--------------------------+----------------------+------------------------+------------------------------+-----------------+---------------------+-----------------------+-----------------------+------------------------+--------------------------+--------+------------------+----------------|
| trial_04bb0f22 | RUNNING    | 10.0.4.71:1938  |                   0.7  |                     2048 |                    4 |                      8 |                          1   |              32 |              0.0001 |                  8192 |                  0.95 |                    500 |                    0.025 |     18 |         3566.91  |       0.489641 |
| trial_3787a9c4 | RUNNING    | 10.0.6.5:17263  |                   0.9  |                     4096 |                    7 |                     24 |                          1.5 |              64 |              0      |                   256 |                  0.9  |                  10000 |                    0.01  |        |                  |                |
| trial_39a0ad6e | RUNNING    | 10.0.4.38:8657  |                   0.8  |                      256 |                    3 |                     16 |                          1.2 |              64 |              0.001  |                  4096 |                  0.95 |                   2000 |                    0.005 |      4 |         1268.2   |       0.50659  |
| trial_05396980 | TERMINATED | 10.0.2.129:2985 |                   0.8  |                      256 |                    9 |                    128 |                          1   |              32 |              0      |                  2048 |                  0.95 |                  10000 |                    0.005 |      1 |          913.295 |       0.53046  |
| trial_059befa6 | TERMINATED | 10.0.3.245:282  |                   0.98 |                     1024 |                    3 |                      8 |                          1   |               8 |              1e-06  |                  1024 |                  0.8  |                    500 |                    0.005 |      1 |          316.455 |       0.573849 |
| trial_c433a60c | TERMINATED | 10.0.3.245:281  |                   0.8  |                     1024 |                    7 |                     24 |                          2   |               8 |              0.001  |                   256 |                  0.95 |                  20000 |                    0.01  |      1 |         1450.99  |       0.568653 |
| trial_277d1a8a | TERMINATED | 10.0.4.38:8658  |                   0.9  |                      256 |                    5 |                     64 |                          1.5 |              64 |              0.0001 |                   512 |                  0.95 |                  20000 |                    0.005 |      1 |          861.914 |       0.56506  |
| trial_26f6b0b0 | TERMINATED | 10.0.2.129:3079 |                   0.6  |                      256 |                    3 |                     16 |                          1.2 |              16 |              0.01   |                  1024 |                  0.9  |                   8000 |                    0.005 |      1 |          457.482 |       0.56582  |
| trial_2acddc5e | TERMINATED | 10.0.3.245:504  |                   0.6  |                      512 |                    5 |                     32 |                          2   |               8 |              0      |                  2048 |                  0.95 |                  10000 |                    0.025 |      1 |          447.483 |       0.594953 |
+----------------+------------+-----------------+------------------------+--------------------------+----------------------+------------------------+------------------------------+-----------------+---------------------+-----------------------+-----------------------+------------------------+--------------------------+--------+------------------+----------------+

Versions / Dependencies

Ray 1.9.1

Reproduction script

https://github.com/ludwig-ai/experiments/blob/main/automl/validation/run_nodeless.sh run with Ray deployed on a K8s cluster. Can provide the Ray deployment script if desired.

Anything else

This problem is highly reproducible for me.

Are you willing to submit a PR?

xwjiang2010 commented 2 years ago

Asking Anne to help repro it with a single Tune session setup.

amholler commented 2 years ago

My plan is to run 2 Ray Tune sessions serially, rather than in parallel, and check that idle workers are reclaimed.

amholler commented 2 years ago

And note that if the problem doesn't reproduce in the serial scenario, we could switch the blog to using serial Ray Tune sessions while we figure out what the problem is with scale-down under parallel Ray Tune sessions.

amholler commented 2 years ago
I can reproduce the problem with a single Ray Tune job running by itself on an empty Ray Cluster.
Idle workers are not spun down until the active Ray Tune job completes, even though they were idle
for more than 30 minutes.  Details below.

Setup: Ray 1.9.1 cluster w/CPU-head & 0-9 GPU workers, 0 workers running initially, idleTimeoutMinutes: 3

Experiment: Start AutoML job on ludwig tf-legacy using this script:
 https://github.com/ludwig-ai/experiments/blob/main/automl/validation/forest_cover/run_auto_train_1hr_nodeless.py
which specifies max_concurrent 3.  AutoML runs 10 trials.

Selected Time-line highlighting problem:
07:46 forest_cover dataset fetch starts
07:47 forest_cover Ray Tune train starts, with async hyperband to terminate unpromising trials and 1 hr time budget.
07:53 Ray Autoscaler Ray start commands succeeded for example-cluster-ray-worker-w8jw7 [10.0.1.221]
07:59 Ray Autoscaler Ray start commands succeeded for example-cluster-ray-worker-l7rzf [10.0.6.169]
08:04 Ray Autoscaler Ray start commands succeeded for example-cluster-ray-worker-vpckp [10.0.3.241]
08:05 3 trials running, 1 terminated
 Busy nodes: 10.0.1.221, 10.0.6.169, 10.0.3.241
 Idle nodes: none
08:13 2 trials running, 8 terminated; 10/10 -> no more trials will be started
 Busy nodes: 10.0.1.221, 10.0.6.169
 Idle nodes: 10.0.3.241
08:14 1 trial running, 9 terminated; 10/10 -> no more trials will be started
 Busy nodes: 10.0.1.221
 Idle nodes: 10.0.3.241, 10.0.6.169
08:47 forest_cover Ray Tune Job ended
08:53 Ray Autoscaler logged worker termination for:
 10.0.6.169 example-cluster-ray-worker-l7rzf
 10.0.3.241 example-cluster-ray-worker-vpckp
 10.0.1.221 example-cluster-ray-worker-w8jw7

So 10.0.3.241 and 10.0.6.169 were idle from 8:14 until the end of the run at 8:47, but weren't spun down until 8:53,
i.e., after the active Ray Tune job ended.

BTW: Note that this means that running the Ray Tune jobs serially will not work around this problem, and fixing this is
critical to seeing better idle-node reclamation from autoscaling.

xwjiang2010 commented 2 years ago

Thanks, Anne, for the experiment. Could you maybe print placement_group_table in the gap period (i.e., during the time when 2/1 trials are running and 8/9 are terminated)? This would give us an idea of the placement groups requested. Also @krfricke

amholler commented 2 years ago

Here is an example ray status from right after 8:14. Interestingly, it turns out an additional worker got added to the Ray cluster that never got used by the Ray Tune job.

======== Autoscaler status: 2022-01-26 08:14:08.351179 ========
Node status
---------------------------------------------------------------
Healthy:
 1 head-node
 4 worker-node
Pending:
 (no pending nodes)
Recent failures:
 (no failures)

Resources
---------------------------------------------------------------
Usage:
 4.0/35.0 CPU (4.0 used of 4.0 reserved in placement groups)
 1.0/4.0 GPU (1.0 used of 1.0 reserved in placement groups)
 0.0/4.0 accelerator_type:T4
 0.0/4.0 example-resource-a
 0.0/4.0 example-resource-b
 0.00/113.400 GiB memory
 0.13/48.350 GiB object_store_memory
amholler commented 2 years ago

Right after that, in the autoscaler logs we see:

example-cluster,default:2022-01-26 08:14:08,434 DEBUG gcs_utils.py:243 -- internal_kv_put b'__autoscaling_status_legacy' b"Cluster status: 4 nodes\n - MostDelayedHeartbeats: {'10.0.1.221': 0.3186984062194824, '10.0.3.143': 0.3186213970184326, '10.0.3.241': 0.3185453414916992, '10.0.2.231': 0.31847691535949707, '10.0.6.169': 0.3184034824371338}\n - NodeIdleSeconds: Min=0 Mean=10 Max=51\n - ResourceUsage: 4.0/35.0 CPU, 4.0/4.0 CPU_group_0_c9c5bfa85076b90014e73f72213fa4c4, 4.0/4.0 CPU_group_c9c5bfa85076b90014e73f72213fa4c4, 1.0/4.0 GPU, 1.0/1.0 GPU_group_0_c9c5bfa85076b90014e73f72213fa4c4, 1.0/1.0 GPU_group_c9c5bfa85076b90014e73f72213fa4c4, 0.0/4.0 accelerator_type:T4, 0.0/1000.0 bundle_group_0_c9c5bfa85076b90014e73f72213fa4c4, 0.0/1000.0 bundle_group_c9c5bfa85076b90014e73f72213fa4c4, 0.0/4.0 example-resource-a, 0.0/4.0 example-resource-b, 0.0 GiB/113.4 GiB memory, 0.13 GiB/48.35 GiB object_store_memory\n - TimeSinceLastHeartbeat: Min=0 Mean=0 Max=0\nWorker node types:\n - worker-node: 4" True None
example-cluster,default:2022-01-26 08:14:08,435 DEBUG legacy_info_string.py:24 -- Cluster status: 4 nodes
 - MostDelayedHeartbeats: {'10.0.1.221': 0.3186984062194824, '10.0.3.143': 0.3186213970184326, '10.0.3.241': 0.3185453414916992, '10.0.2.231': 0.31847691535949707, '10.0.6.169': 0.3184034824371338}
 - NodeIdleSeconds: Min=0 Mean=10 Max=51
 - ResourceUsage: 4.0/35.0 CPU, 4.0/4.0 CPU_group_0_c9c5bfa85076b90014e73f72213fa4c4, 4.0/4.0 CPU_group_c9c5bfa85076b90014e73f72213fa4c4, 1.0/4.0 GPU, 1.0/1.0 GPU_group_0_c9c5bfa85076b90014e73f72213fa4c4, 1.0/1.0 GPU_group_c9c5bfa85076b90014e73f72213fa4c4, 0.0/4.0 accelerator_type:T4, 0.0/1000.0 bundle_group_0_c9c5bfa85076b90014e73f72213fa4c4, 0.0/1000.0 bundle_group_c9c5bfa85076b90014e73f72213fa4c4, 0.0/4.0 example-resource-a, 0.0/4.0 example-resource-b, 0.0 GiB/113.4 GiB memory, 0.13 GiB/48.35 GiB object_store_memory
 - TimeSinceLastHeartbeat: Min=0 Mean=0 Max=0
amholler commented 2 years ago

Hey, I see something that could be interesting; an unexpected (to me) reset of NodeIdleSeconds. Could this be associated with the problem?

At 2022-01-26 08:13:23,598 we see this which seems reasonable:
 - NodeIdleSeconds: Min=0 Mean=1 Max=6

The max idle time goes up every 6 seconds as you would expect:
 - NodeIdleSeconds: Min=0 Mean=2 Max=12
 - NodeIdleSeconds: Min=0 Mean=4 Max=19
 - NodeIdleSeconds: Min=0 Mean=5 Max=25
 - NodeIdleSeconds: Min=0 Mean=6 Max=31
 - NodeIdleSeconds: Min=0 Mean=7 Max=38
 - NodeIdleSeconds: Min=0 Mean=9 Max=44
 - NodeIdleSeconds: Min=0 Mean=10 Max=51
 - NodeIdleSeconds: Min=0 Mean=11 Max=57
 - NodeIdleSeconds: Min=0 Mean=13 Max=64
 - NodeIdleSeconds: Min=0 Mean=14 Max=70
 - NodeIdleSeconds: Min=0 Mean=15 Max=76
 - NodeIdleSeconds: Min=0 Mean=16 Max=83
 - NodeIdleSeconds: Min=0 Mean=18 Max=89
 - NodeIdleSeconds: Min=0 Mean=19 Max=95
 - NodeIdleSeconds: Min=0 Mean=20 Max=101
 - NodeIdleSeconds: Min=0 Mean=21 Max=108
 - NodeIdleSeconds: Min=0 Mean=23 Max=114
 - NodeIdleSeconds: Min=0 Mean=24 Max=121
 - NodeIdleSeconds: Min=0 Mean=25 Max=127
 - NodeIdleSeconds: Min=0 Mean=27 Max=133
 - NodeIdleSeconds: Min=0 Mean=28 Max=140
 - NodeIdleSeconds: Min=0 Mean=29 Max=146
 - NodeIdleSeconds: Min=0 Mean=30 Max=152
 - NodeIdleSeconds: Min=0 Mean=32 Max=159
 - NodeIdleSeconds: Min=0 Mean=33 Max=165
 - NodeIdleSeconds: Min=0 Mean=34 Max=171
 - NodeIdleSeconds: Min=0 Mean=35 Max=178
 - NodeIdleSeconds: Min=0 Mean=37 Max=184

Then the extra worker that was brought up but never used by Ray Tune is removed from the cluster.

And then we see this at 2022-01-26 08:16:27,886:
 - NodeIdleSeconds: Min=0 Mean=0 Max=0
and these values stay stuck at all 0s for the rest of the Ray Tune run.  It is like the
removal of the unused node caused all the node idle history for the workers that
participated in the Ray Tune job to be wiped clean.

At 2022-01-26 08:50:23,208, after the job ends, the values climb up again and the
associated nodes are removed.
amholler commented 2 years ago

Could you maybe print placement_group_table in the gap period (i.e. during the time of 2/1 trials running, 8/9 terminated)? This would give us an idea of the placement groups requested.

Here is an example of what I think you want, from the period when only 1 trial is running:

example-cluster,default:2022-01-26 08:15:56,569 DEBUG resource_demand_scheduler.py:201 -- Placement group demands: []
example-cluster,default:2022-01-26 08:15:56,569 DEBUG resource_demand_scheduler.py:247 -- Resource demands: [{'bundle_group_c6b7c78b2813e3467c825ee335e91f59': 0.001}, {'bundle_group_bf6d839e903a2bcef6e25bd54b65fc50': 0.001}, {'bundle_group_3116b209b405aecc460bd44c166069fc': 0.001}, {'bundle_group_d67d3fcd68c89cea58085c5b1b191070': 0.001}, {'bundle_group_f2fd2a403e6cf780454857a52f5f504e': 0.001}, {'bundle_group_7d793a59e156378b15521f0b7c7b04bb': 0.001}]
example-cluster,default:2022-01-26 08:15:56,569 DEBUG resource_demand_scheduler.py:248 -- Unfulfilled demands: [{'bundle_group_f2fd2a403e6cf780454857a52f5f504e': 0.001}, {'bundle_group_d67d3fcd68c89cea58085c5b1b191070': 0.001}, {'bundle_group_c6b7c78b2813e3467c825ee335e91f59': 0.001}, {'bundle_group_bf6d839e903a2bcef6e25bd54b65fc50': 0.001}, {'bundle_group_7d793a59e156378b15521f0b7c7b04bb': 0.001}, {'bundle_group_3116b209b405aecc460bd44c166069fc': 0.001}]
example-cluster,default:2022-01-26 08:15:56,569 WARNING resource_demand_scheduler.py:746 -- The autoscaler could not find a node type to satisfy the request: [{'bundle_group_f2fd2a403e6cf780454857a52f5f504e': 0.001}, {'bundle_group_d67d3fcd68c89cea58085c5b1b191070': 0.001}, {'bundle_group_c6b7c78b2813e3467c825ee335e91f59': 0.001}, {'bundle_group_bf6d839e903a2bcef6e25bd54b65fc50': 0.001}, {'bundle_group_7d793a59e156378b15521f0b7c7b04bb': 0.001}, {'bundle_group_3116b209b405aecc460bd44c166069fc': 0.001}]. If this request is related to placement groups the resource request will resolve itself, otherwise please specify a node type with the necessary resource https://docs.ray.io/en/master/cluster/autoscaling.html#multiple-node-type-autoscaling.
example-cluster,default:2022-01-26 08:15:56,570 DEBUG resource_demand_scheduler.py:252 -- Final unfulfilled: [{'bundle_group_f2fd2a403e6cf780454857a52f5f504e': 0.001}, {'bundle_group_d67d3fcd68c89cea58085c5b1b191070': 0.001}, {'bundle_group_c6b7c78b2813e3467c825ee335e91f59': 0.001}, {'bundle_group_bf6d839e903a2bcef6e25bd54b65fc50': 0.001}, {'bundle_group_7d793a59e156378b15521f0b7c7b04bb': 0.001}, {'bundle_group_3116b209b405aecc460bd44c166069fc': 0.001}]
example-cluster,default:2022-01-26 08:15:56,687 DEBUG resource_demand_scheduler.py:271 -- Node requests: {}
example-cluster,default:2022-01-26 08:15:56,802 DEBUG gcs_utils.py:243 -- internal_kv_put b'__autoscaling_status' b'{"load_metrics_report": {"usage": {"node:10.0.3.143": [0.0, 1.0], "memory": [0.0, 121762322840.0], "CPU": [4.0, 35.0], "object_store_memory": [144143388.0, 51915278743.0], "bundle_group_0_c9c5bfa85076b90014e73f72213fa4c4": [0.0, 1000.0], "example-resource-b": [0.0, 4.0], "CPU_group_c9c5bfa85076b90014e73f72213fa4c4": [4.0, 4.0], "bundle_group_c9c5bfa85076b90014e73f72213fa4c4": [0.0, 1000.0], "GPU_group_c9c5bfa85076b90014e73f72213fa4c4": [1.0, 1.0], "GPU_group_0_c9c5bfa85076b90014e73f72213fa4c4": [1.0, 1.0], "accelerator_type:T4": [0.0, 4.0], "node:10.0.1.221": [0.0, 1.0], "GPU": [1.0, 4.0], "example-resource-a": [0.0, 4.0], "CPU_group_0_c9c5bfa85076b90014e73f72213fa4c4": [4.0, 4.0], "node:10.0.6.169": [0.0, 1.0], "node:10.0.3.241": [0.0, 1.0], "node:10.0.2.231": [0.0, 1.0]}, "resource_demand": [[{"bundle_group_c6b7c78b2813e3467c825ee335e91f59": 0.001}, 1], [{"bundle_group_bf6d839e903a2bcef6e25bd54b65fc50": 0.001}, 1], [{"bundle_group_3116b209b405aecc460bd44c166069fc": 0.001}, 1], [{"bundle_group_d67d3fcd68c89cea58085c5b1b191070": 0.001}, 1], [{"bundle_group_f2fd2a403e6cf780454857a52f5f504e": 0.001}, 1], [{"bundle_group_7d793a59e156378b15521f0b7c7b04bb": 0.001}, 1]], "pg_demand": [], "request_demand": [], "node_types": [[{"object_store_memory": 16082992742.0, "CPU": 7.0, "memory": 37580963840.0, "node:10.0.3.143": 1.0}, 1], [{"bundle_group_c9c5bfa85076b90014e73f72213fa4c4": 1000.0, "node:10.0.1.221": 1.0, "CPU_group_0_c9c5bfa85076b90014e73f72213fa4c4": 4.0, "CPU": 7.0, "GPU_group_0_c9c5bfa85076b90014e73f72213fa4c4": 1.0, "CPU_group_c9c5bfa85076b90014e73f72213fa4c4": 4.0, "memory": 21045339750.0, "accelerator_type:T4": 1.0, "object_store_memory": 8957606707.0, "example-resource-a": 1.0, "GPU_group_c9c5bfa85076b90014e73f72213fa4c4": 1.0, "bundle_group_0_c9c5bfa85076b90014e73f72213fa4c4": 1000.0, "GPU": 1.0, "example-resource-b": 1.0}, 1], [{"object_store_memory": 8958260428.0, "CPU": 7.0, "memory": 21045339750.0, "node:10.0.6.169": 1.0, "accelerator_type:T4": 1.0, "example-resource-a": 1.0, "GPU": 1.0, "example-resource-b": 1.0}, 1], [{"object_store_memory": 8958321868.0, "CPU": 7.0, "memory": 21045339750.0, "example-resource-a": 1.0, "node:10.0.3.241": 1.0, "accelerator_type:T4": 1.0, "GPU": 1.0, "example-resource-b": 1.0}, 1], [{"CPU": 7.0, "object_store_memory": 8958096998.0, "memory": 21045339750.0, "node:10.0.2.231": 1.0, "example-resource-a": 1.0, "accelerator_type:T4": 1.0, "GPU": 1.0, "example-resource-b": 1.0}, 1]], "head_ip": null}, "time": 1643213755.5013282, "monitor_pid": 7054, "autoscaler_report": {"active_nodes": {"head-node": 1, "worker-node": 4}, "pending_nodes": [], "pending_launches": {}, "failed_nodes": []}}' True None
xwjiang2010 commented 2 years ago

@wuisawesome Can you take a look as Dataplane oncall? I suspect this is not a Tune issue (asking Anne to print out placement_group_table to verify) but an autoscaler issue (See Anne's comments above). Can you advise on how to proceed?

xwjiang2010 commented 2 years ago

@amholler Just saw your last comment. I am not very familiar with this output. @rkooo567 to see if there is anything suspicious (especially the resource demand section; I see a lot of 0.001 placeholders. @rkooo567 is that concerning?). JSON-beautified result:

{
  "load_metrics_report": {
    "usage": {
      "node:10.0.3.143": [
        0,
        1
      ],
      "memory": [
        0,
        121762322840
      ],
      "CPU": [
        4,
        35
      ],
      "object_store_memory": [
        144143388,
        51915278743
      ],
      "bundle_group_0_c9c5bfa85076b90014e73f72213fa4c4": [
        0,
        1000
      ],
      "example-resource-b": [
        0,
        4
      ],
      "CPU_group_c9c5bfa85076b90014e73f72213fa4c4": [
        4,
        4
      ],
      "bundle_group_c9c5bfa85076b90014e73f72213fa4c4": [
        0,
        1000
      ],
      "GPU_group_c9c5bfa85076b90014e73f72213fa4c4": [
        1,
        1
      ],
      "GPU_group_0_c9c5bfa85076b90014e73f72213fa4c4": [
        1,
        1
      ],
      "accelerator_type:T4": [
        0,
        4
      ],
      "node:10.0.1.221": [
        0,
        1
      ],
      "GPU": [
        1,
        4
      ],
      "example-resource-a": [
        0,
        4
      ],
      "CPU_group_0_c9c5bfa85076b90014e73f72213fa4c4": [
        4,
        4
      ],
      "node:10.0.6.169": [
        0,
        1
      ],
      "node:10.0.3.241": [
        0,
        1
      ],
      "node:10.0.2.231": [
        0,
        1
      ]
    },
    "resource_demand": [
      [
        {
          "bundle_group_c6b7c78b2813e3467c825ee335e91f59": 0.001
        },
        1
      ],
      [
        {
          "bundle_group_bf6d839e903a2bcef6e25bd54b65fc50": 0.001
        },
        1
      ],
      [
        {
          "bundle_group_3116b209b405aecc460bd44c166069fc": 0.001
        },
        1
      ],
      [
        {
          "bundle_group_d67d3fcd68c89cea58085c5b1b191070": 0.001
        },
        1
      ],
      [
        {
          "bundle_group_f2fd2a403e6cf780454857a52f5f504e": 0.001
        },
        1
      ],
      [
        {
          "bundle_group_7d793a59e156378b15521f0b7c7b04bb": 0.001
        },
        1
      ]
    ],
    "pg_demand": [],
    "request_demand": [],
    "node_types": [
      [
        {
          "object_store_memory": 16082992742,
          "CPU": 7,
          "memory": 37580963840,
          "node:10.0.3.143": 1
        },
        1
      ],
      [
        {
          "bundle_group_c9c5bfa85076b90014e73f72213fa4c4": 1000,
          "node:10.0.1.221": 1,
          "CPU_group_0_c9c5bfa85076b90014e73f72213fa4c4": 4,
          "CPU": 7,
          "GPU_group_0_c9c5bfa85076b90014e73f72213fa4c4": 1,
          "CPU_group_c9c5bfa85076b90014e73f72213fa4c4": 4,
          "memory": 21045339750,
          "accelerator_type:T4": 1,
          "object_store_memory": 8957606707,
          "example-resource-a": 1,
          "GPU_group_c9c5bfa85076b90014e73f72213fa4c4": 1,
          "bundle_group_0_c9c5bfa85076b90014e73f72213fa4c4": 1000,
          "GPU": 1,
          "example-resource-b": 1
        },
        1
      ],
      [
        {
          "object_store_memory": 8958260428,
          "CPU": 7,
          "memory": 21045339750,
          "node:10.0.6.169": 1,
          "accelerator_type:T4": 1,
          "example-resource-a": 1,
          "GPU": 1,
          "example-resource-b": 1
        },
        1
      ],
      [
        {
          "object_store_memory": 8958321868,
          "CPU": 7,
          "memory": 21045339750,
          "example-resource-a": 1,
          "node:10.0.3.241": 1,
          "accelerator_type:T4": 1,
          "GPU": 1,
          "example-resource-b": 1
        },
        1
      ],
      [
        {
          "CPU": 7,
          "object_store_memory": 8958096998,
          "memory": 21045339750,
          "node:10.0.2.231": 1,
          "example-resource-a": 1,
          "accelerator_type:T4": 1,
          "GPU": 1,
          "example-resource-b": 1
        },
        1
      ]
    ],
    "head_ip": null
  },
  "time": 1643213755.5013282,
  "monitor_pid": 7054,
  "autoscaler_report": {
    "active_nodes": {
      "head-node": 1,
      "worker-node": 4
    },
    "pending_nodes": [],
    "pending_launches": {},
    "failed_nodes": []
  }
}

@amholler I was wondering if you could use ray.util.placement_group_table() to print out the placement groups at the moment when this is happening. Sample output looks like:

{'23e6d966b7b3748eacf8d0dc19fd54a5': {'bundles': {0: {'CPU': 1.0},
                                                  1: {'CPU': 4.0}},
                                      'name': '__tune_83e72cbd__a565778b',
                                      'placement_group_id': '23e6d966b7b3748eacf8d0dc19fd54a5',
                                      'state': 'REMOVED',
                                      'stats': {'end_to_end_creation_latency_ms': 0.0,
                                                'highest_retry_delay_ms': 1000.0,
                                                'scheduling_attempt': 236,
                                                'scheduling_latency_ms': 0.0,
                                                'scheduling_state': 'REMOVED'},
                                      'strategy': 'PACK'},
 '23f1bf68e9a140e0e8326ba9bf60c573': {'bundles': {0: {'CPU': 1.0},
                                                  1: {'CPU': 4.0}},
                                      'name': '__tune_3f706cb3__e2694e22',
                                      'placement_group_id': '23f1bf68e9a140e0e8326ba9bf60c573',
                                      'state': 'PENDING',
                                      'stats': {'end_to_end_creation_latency_ms': 0.0,
                                                'highest_retry_delay_ms': 1000.0,
                                                'scheduling_attempt': 365,
                                                'scheduling_latency_ms': 0.0,
                                                'scheduling_state': 'NO_RESOURCES'},
                                      'strategy': 'PACK'}}
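
If it helps, here is a small sketch for capturing this during the gap period (assuming it is run from a node that can reach the cluster):

import time
from pprint import pprint

import ray
from ray.util import placement_group_table

ray.init(address="auto")  # attach to the running cluster

# Dump the placement group table every 30 s so it can be lined up with the
# autoscaler log timestamps.
while True:
    pprint(placement_group_table())
    time.sleep(30)
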
amholler commented 2 years ago

It seems like a key clue at this point is the odd reset-and-stuck-at-zero behavior of NodeIdleSeconds I described earlier in this ticket, so I would prefer not to do additional runs until we know whether they are worthwhile, given this information.

Thoughts @ericl ?

richardliaw commented 2 years ago

@DmitriGekhtman can you take a look here? It seems to be an autoscaling/scheduling bug.

amholler commented 2 years ago

I think the key area of the autoscaler log is below, where NodeIdleSeconds goes to 0 after the idle node is removed:

example-cluster,default:2022-01-26 08:16:21,639 DEBUG legacy_info_string.py:24 -- Cluster status: 4 nodes
 - MostDelayedHeartbeats: {'10.0.1.221': 0.3429567813873291, '10.0.3.143': 0.3428807258605957, '10.0.3.241': 0.3428037166595459, '10.0.2.231': 0.34258198738098145, '10.0.6.169': 0.34249281883239746}
 - NodeIdleSeconds: Min=0 Mean=37 Max=184
 - ResourceUsage: 4.0/35.0 CPU, 4.0/4.0 CPU_group_0_c9c5bfa85076b90014e73f72213fa4c4, 4.0/4.0 CPU_group_c9c5bfa85076b90014e73f72213fa4c4, 1.0/4.0 GPU, 1.0/1.0 GPU_group_0_c9c5bfa85076b90014e73f72213fa4c4, 1.0/1.0 GPU_group_c9c5bfa85076b90014e73f72213fa4c4, 0.0/4.0 accelerator_type:T4, 0.0/1000.0 bundle_group_0_c9c5bfa85076b90014e73f72213fa4c4, 0.0/1000.0 bundle_group_c9c5bfa85076b90014e73f72213fa4c4, 0.0/4.0 example-resource-a, 0.0/4.0 example-resource-b, 0.0 GiB/113.4 GiB memory, 0.13 GiB/48.35 GiB object_store_memory
 - TimeSinceLastHeartbeat: Min=0 Mean=0 Max=0
Worker node types:
 - worker-node: 4
example-cluster,default:2022-01-26 08:16:21,856 INFO autoscaler.py:437 -- StandardAutoscaler: Terminating the node with id example-cluster-ray-worker-5c86z and ip 10.0.2.231. (idle)
example-cluster,default:2022-01-26 08:16:21,885 INFO autoscaler.py:500 -- Draining 1 raylet(s).
example-cluster,default:2022-01-26 08:16:21,886 INFO node_provider.py:171 -- KubernetesNodeProvider: calling delete_namespaced_pod
example-cluster,default:2022-01-26 08:16:21,950 DEBUG autoscaler.py:1134 -- example-cluster-ray-worker-l7rzf is not being updated and passes config check (can_update=True).
example-cluster,default:2022-01-26 08:16:21,988 DEBUG autoscaler.py:1134 -- example-cluster-ray-worker-vpckp is not being updated and passes config check (can_update=True).
example-cluster,default:2022-01-26 08:16:22,036 DEBUG autoscaler.py:1134 -- example-cluster-ray-worker-w8jw7 is not being updated and passes config check (can_update=True).
example-cluster,default:2022-01-26 08:16:22,091 DEBUG autoscaler.py:1134 -- example-cluster-ray-worker-l7rzf is not being updated and passes config check (can_update=True).
example-cluster,default:2022-01-26 08:16:22,174 DEBUG autoscaler.py:1134 -- example-cluster-ray-worker-vpckp is not being updated and passes config check (can_update=True).
example-cluster,default:2022-01-26 08:16:22,206 DEBUG autoscaler.py:1134 -- example-cluster-ray-worker-w8jw7 is not being updated and passes config check (can_update=True).
example-cluster,default:2022-01-26 08:16:22,383 DEBUG resource_demand_scheduler.py:189 -- Cluster resources: [{'memory': 37580963840.0, 'node:10.0.3.143': 1.0, 'CPU': 7.0, 'object_store_memory': 16046956895.0}, {'object_store_memory': 8922224581.0, 'CPU': 7.0, 'memory': 21045339750.0, 'GPU': 1.0, 'accelerator_type:T4': 1.0, 'example-resource-b': 1.0, 'example-resource-a': 1.0, 'node:10.0.6.169': 1.0}, {'CPU': 7.0, 'GPU': 1.0, 'accelerator_type:T4': 1.0, 'memory': 21045339750.0, 'example-resource-a': 1.0, 'node:10.0.3.241': 1.0, 'example-resource-b': 1.0, 'object_store_memory': 8922286021.0}, {'memory': 21045339750.0, 'example-resource-a': 1.0, 'bundle_group_0_c9c5bfa85076b90014e73f72213fa4c4': 1000.0, 'example-resource-b': 1.0, 'node:10.0.1.221': 1.0, 'accelerator_type:T4': 1.0, 'CPU': 3.0, 'bundle_group_c9c5bfa85076b90014e73f72213fa4c4': 1000.0, 'object_store_memory': 8921570860.0, 'GPU': 0.0, 'GPU_group_c9c5bfa85076b90014e73f72213fa4c4': 0.0, 'CPU_group_c9c5bfa85076b90014e73f72213fa4c4': 0.0, 'GPU_group_0_c9c5bfa85076b90014e73f72213fa4c4': 0.0, 'CPU_group_0_c9c5bfa85076b90014e73f72213fa4c4': 0.0}]
example-cluster,default:2022-01-26 08:16:22,383 DEBUG resource_demand_scheduler.py:190 -- Node counts: defaultdict(<class 'int'>, {'head-node': 1, 'worker-node': 3})
example-cluster,default:2022-01-26 08:16:22,383 DEBUG resource_demand_scheduler.py:201 -- Placement group demands: []
example-cluster,default:2022-01-26 08:16:22,384 DEBUG resource_demand_scheduler.py:247 -- Resource demands: [{'bundle_group_f2fd2a403e6cf780454857a52f5f504e': 0.001}, {'bundle_group_3116b209b405aecc460bd44c166069fc': 0.001}, {'bundle_group_d67d3fcd68c89cea58085c5b1b191070': 0.001}, {'bundle_group_c6b7c78b2813e3467c825ee335e91f59': 0.001}, {'bundle_group_bf6d839e903a2bcef6e25bd54b65fc50': 0.001}, {'bundle_group_7d793a59e156378b15521f0b7c7b04bb': 0.001}]
example-cluster,default:2022-01-26 08:16:22,384 DEBUG resource_demand_scheduler.py:248 -- Unfulfilled demands: [{'bundle_group_f2fd2a403e6cf780454857a52f5f504e': 0.001}, {'bundle_group_d67d3fcd68c89cea58085c5b1b191070': 0.001}, {'bundle_group_c6b7c78b2813e3467c825ee335e91f59': 0.001}, {'bundle_group_bf6d839e903a2bcef6e25bd54b65fc50': 0.001}, {'bundle_group_7d793a59e156378b15521f0b7c7b04bb': 0.001}, {'bundle_group_3116b209b405aecc460bd44c166069fc': 0.001}]
example-cluster,default:2022-01-26 08:16:22,384 WARNING resource_demand_scheduler.py:746 -- The autoscaler could not find a node type to satisfy the request: [{'bundle_group_f2fd2a403e6cf780454857a52f5f504e': 0.001}, {'bundle_group_d67d3fcd68c89cea58085c5b1b191070': 0.001}, {'bundle_group_c6b7c78b2813e3467c825ee335e91f59': 0.001}, {'bundle_group_bf6d839e903a2bcef6e25bd54b65fc50': 0.001}, {'bundle_group_7d793a59e156378b15521f0b7c7b04bb': 0.001}, {'bundle_group_3116b209b405aecc460bd44c166069fc': 0.001}]. If this request is related to placement groups the resource request will resolve itself, otherwise please specify a node type with the necessary resource https://docs.ray.io/en/master/cluster/autoscaling.html#multiple-node-type-autoscaling.
example-cluster,default:2022-01-26 08:16:22,384 DEBUG resource_demand_scheduler.py:252 -- Final unfulfilled: [{'bundle_group_f2fd2a403e6cf780454857a52f5f504e': 0.001}, {'bundle_group_d67d3fcd68c89cea58085c5b1b191070': 0.001}, {'bundle_group_c6b7c78b2813e3467c825ee335e91f59': 0.001}, {'bundle_group_bf6d839e903a2bcef6e25bd54b65fc50': 0.001}, {'bundle_group_7d793a59e156378b15521f0b7c7b04bb': 0.001}, {'bundle_group_3116b209b405aecc460bd44c166069fc': 0.001}]
example-cluster,default:2022-01-26 08:16:22,460 DEBUG resource_demand_scheduler.py:271 -- Node requests: {}
example-cluster,default:2022-01-26 08:16:22,542 INFO monitor.py:331 -- :event_summary:Removing 1 nodes of type worker-node (idle).
example-cluster,default:2022-01-26 08:16:22,542 DEBUG gcs_utils.py:243 -- internal_kv_put b'__autoscaling_status' b'{"load_metrics_report": {"usage": {"object_store_memory": [144143388.0, 51915278743.0], "node:10.0.3.143": [0.0, 1.0], "memory": [0.0, 121762322840.0], "CPU": [4.0, 35.0], "GPU": [1.0, 4.0], "bundle_group_0_c9c5bfa85076b90014e73f72213fa4c4": [0.0, 1000.0], "GPU_group_c9c5bfa85076b90014e73f72213fa4c4": [1.0, 1.0], "example-resource-b": [0.0, 4.0], "accelerator_type:T4": [0.0, 4.0], "bundle_group_c9c5bfa85076b90014e73f72213fa4c4": [0.0, 1000.0], "example-resource-a": [0.0, 4.0], "node:10.0.1.221": [0.0, 1.0], "CPU_group_c9c5bfa85076b90014e73f72213fa4c4": [4.0, 4.0], "GPU_group_0_c9c5bfa85076b90014e73f72213fa4c4": [1.0, 1.0], "CPU_group_0_c9c5bfa85076b90014e73f72213fa4c4": [4.0, 4.0], "node:10.0.6.169": [0.0, 1.0], "node:10.0.3.241": [0.0, 1.0], "node:10.0.2.231": [0.0, 1.0]}, "resource_demand": [[{"bundle_group_f2fd2a403e6cf780454857a52f5f504e": 0.001}, 1], [{"bundle_group_3116b209b405aecc460bd44c166069fc": 0.001}, 1], [{"bundle_group_d67d3fcd68c89cea58085c5b1b191070": 0.001}, 1], [{"bundle_group_c6b7c78b2813e3467c825ee335e91f59": 0.001}, 1], [{"bundle_group_bf6d839e903a2bcef6e25bd54b65fc50": 0.001}, 1], [{"bundle_group_7d793a59e156378b15521f0b7c7b04bb": 0.001}, 1]], "pg_demand": [], "request_demand": [], "node_types": [[{"object_store_memory": 16082992742.0, "CPU": 7.0, "memory": 37580963840.0, "node:10.0.3.143": 1.0}, 1], [{"bundle_group_c9c5bfa85076b90014e73f72213fa4c4": 1000.0, "node:10.0.1.221": 1.0, "CPU_group_0_c9c5bfa85076b90014e73f72213fa4c4": 4.0, "CPU": 7.0, "example-resource-b": 1.0, "GPU_group_0_c9c5bfa85076b90014e73f72213fa4c4": 1.0, "CPU_group_c9c5bfa85076b90014e73f72213fa4c4": 4.0, "memory": 21045339750.0, "accelerator_type:T4": 1.0, "object_store_memory": 8957606707.0, "example-resource-a": 1.0, "GPU_group_c9c5bfa85076b90014e73f72213fa4c4": 1.0, "GPU": 1.0, "bundle_group_0_c9c5bfa85076b90014e73f72213fa4c4": 1000.0}, 1], [{"object_store_memory": 8958260428.0, "CPU": 7.0, "memory": 21045339750.0, "node:10.0.6.169": 1.0, "accelerator_type:T4": 1.0, "example-resource-a": 1.0, "GPU": 1.0, "example-resource-b": 1.0}, 1], [{"object_store_memory": 8958321868.0, "CPU": 7.0, "memory": 21045339750.0, "accelerator_type:T4": 1.0, "example-resource-a": 1.0, "node:10.0.3.241": 1.0, "GPU": 1.0, "example-resource-b": 1.0}, 1], [{"CPU": 7.0, "object_store_memory": 8958096998.0, "memory": 21045339750.0, "node:10.0.2.231": 1.0, "accelerator_type:T4": 1.0, "example-resource-a": 1.0, "GPU": 1.0, "example-resource-b": 1.0}, 1]], "head_ip": null}, "time": 1643213781.1970084, "monitor_pid": 7054, "autoscaler_report": {"active_nodes": {"head-node": 1, "worker-node": 3}, "pending_nodes": [], "pending_launches": {}, "failed_nodes": []}}' True None
example-cluster,default:2022-01-26 08:16:27,545 DEBUG gcs_utils.py:228 -- internal_kv_get b'autoscaler_resource_request' None
example-cluster,default:2022-01-26 08:16:27,717 INFO load_metrics.py:171 -- LoadMetrics: Removed ip: 10.0.2.231.
example-cluster,default:2022-01-26 08:16:27,718 INFO load_metrics.py:177 -- LoadMetrics: Removed 1 stale ip mappings: {'10.0.2.231'} not in {'10.0.6.169', '10.0.3.241', '10.0.1.221', '10.0.3.143'}
example-cluster,default:2022-01-26 08:16:27,820 INFO autoscaler.py:303 -- 
======== Autoscaler status: 2022-01-26 08:16:27.820417 ========
Node status
---------------------------------------------------------------
Healthy:
 1 head-node
 3 worker-node
Pending:
 (no pending nodes)
Recent failures:
 (no failures)

Resources
---------------------------------------------------------------
Usage:
 4.0/28.0 CPU (4.0 used of 4.0 reserved in placement groups)
 1.0/3.0 GPU (1.0 used of 1.0 reserved in placement groups)
 0.0/3.0 accelerator_type:T4
 0.0/3.0 example-resource-a
 0.0/3.0 example-resource-b
 0.00/93.800 GiB memory
 0.13/40.007 GiB object_store_memory

Demands:
 (no resource demands)
example-cluster,default:2022-01-26 08:16:27,886 DEBUG gcs_utils.py:243 -- internal_kv_put b'__autoscaling_status_legacy' b"Cluster status: 3 nodes\n - MostDelayedHeartbeats: {'10.0.1.221': 0.2756786346435547, '10.0.3.143': 0.2756021022796631, '10.0.3.241': 0.2755157947540283, '10.0.6.169': 0.27544498443603516}\n - NodeIdleSeconds: Min=0 Mean=0 Max=0\n - ResourceUsage: 4.0/28.0 CPU, 4.0/4.0 CPU_group_0_c9c5bfa85076b90014e73f72213fa4c4, 4.0/4.0 CPU_group_c9c5bfa85076b90014e73f72213fa4c4, 1.0/3.0 GPU, 1.0/1.0 GPU_group_0_c9c5bfa85076b90014e73f72213fa4c4, 1.0/1.0 GPU_group_c9c5bfa85076b90014e73f72213fa4c4, 0.0/3.0 accelerator_type:T4, 0.0/1000.0 bundle_group_0_c9c5bfa85076b90014e73f72213fa4c4, 0.0/1000.0 bundle_group_c9c5bfa85076b90014e73f72213fa4c4, 0.0/3.0 example-resource-a, 0.0/3.0 example-resource-b, 0.0 GiB/93.8 GiB memory, 0.13 GiB/40.01 GiB object_store_memory\n - TimeSinceLastHeartbeat: Min=0 Mean=0 Max=0\nWorker node types:\n - worker-node: 3" True None
example-cluster,default:2022-01-26 08:16:27,887 DEBUG legacy_info_string.py:24 -- Cluster status: 3 nodes
 - MostDelayedHeartbeats: {'10.0.1.221': 0.2756786346435547, '10.0.3.143': 0.2756021022796631, '10.0.3.241': 0.2755157947540283, '10.0.6.169': 0.27544498443603516}
 - NodeIdleSeconds: Min=0 Mean=0 Max=0
 - ResourceUsage: 4.0/28.0 CPU, 4.0/4.0 CPU_group_0_c9c5bfa85076b90014e73f72213fa4c4, 4.0/4.0 CPU_group_c9c5bfa85076b90014e73f72213fa4c4, 1.0/3.0 GPU, 1.0/1.0 GPU_group_0_c9c5bfa85076b90014e73f72213fa4c4, 1.0/1.0 GPU_group_c9c5bfa85076b90014e73f72213fa4c4, 0.0/3.0 accelerator_type:T4, 0.0/1000.0 bundle_group_0_c9c5bfa85076b90014e73f72213fa4c4, 0.0/1000.0 bundle_group_c9c5bfa85076b90014e73f72213fa4c4, 0.0/3.0 example-resource-a, 0.0/3.0 example-resource-b, 0.0 GiB/93.8 GiB memory, 0.13 GiB/40.01 GiB object_store_memory
 - TimeSinceLastHeartbeat: Min=0 Mean=0 Max=0
Worker node types:
 - worker-node: 3
amholler commented 2 years ago

Maybe somehow after the removal of the idle ip, this if condition gets stuck triggering on updates for the remaining ips, until the Ray Tune job ends:

        if ip not in self.last_used_time_by_ip or \
                self.static_resources_by_ip[ip] != \
                self.dynamic_resources_by_ip[ip]:
            self.last_used_time_by_ip[ip] = now
amholler commented 2 years ago

One additional thing about my setup I just thought of is described below.

The Ray cluster deployed on the Kubernetes cluster is Ray 1.9.1, which is in the Docker image. However, I am now reflecting on the fact that in a Kubernetes cluster, the Ray operator contains the Ray Autoscaler. The Ray operator I am running in the K8s cluster was deployed 51 days ago from master.

Whenever I want to deploy a new Ray cluster, I undeploy/redeploy the existing Ray cluster, but I leave the Ray operator running as-is.

I'm not sure how/whether the behavior I am seeing in this bug could be caused by this, but I wanted to mention it. LMK if you think I should change something and rerun.

DmitriGekhtman commented 2 years ago

Updating the operator image is a good idea. Could you share the configuration you are using to deploy?

amholler commented 2 years ago
diff --git a/deploy/components/example_cluster.yaml b/deploy/components/example_cluster.yaml
index 1513e8fde..d4359fb77 100644
--- a/deploy/components/example_cluster.yaml
+++ b/deploy/components/example_cluster.yaml
@@ -4,14 +4,14 @@ metadata:
   name: example-cluster
 spec:
   # The maximum number of workers nodes to launch in addition to the head node.
-  maxWorkers: 3
+  maxWorkers: 9
   # The autoscaler will scale up the cluster faster with higher upscaling speed.
   # E.g., if the task requires adding more nodes then autoscaler will gradually
   # scale up the cluster in chunks of upscaling_speed*currently_running_nodes.
   # This number should be > 0.
   upscalingSpeed: 1.0
   # If a node is idle for this many minutes, it will be removed.
-  idleTimeoutMinutes: 5
+  idleTimeoutMinutes: 3
   # Specify the pod type for the ray head node (as configured below).
   headPodType: head-node
   # Optionally, configure ports for the Ray head service.
@@ -35,6 +35,8 @@ spec:
       metadata:
         # The operator automatically prepends the cluster name to this field.
         generateName: ray-head-
+        annotations:
+          node.elotl.co/instance-family-exclusions: "g3,g3s"
       spec:
         restartPolicy: Never

@@ -48,11 +50,25 @@ spec:
         containers:
         - name: ray-node
           imagePullPolicy: Always
-          image: rayproject/ray:latest
+          # image: rayproject/ray:latest
+          image: ludwigai/ludwig-ray-gpu:tf-legacy
           # Do not change this command - it keeps the pod alive until it is
           # explicitly killed.
           command: ["/bin/bash", "-c", "--"]
           args: ["trap : TERM INT; touch /tmp/raylogs; tail -f /tmp/raylogs; sleep infinity & wait;"]
+          env:
+          - name: AWS_ACCESS_KEY_ID
+            value: "<censored>"
+          - name: AWS_SECRET_ACCESS_KEY
+            value: "<censored>"
+          - name: AWS_DEFAULT_REGION
+            value: "us-west-2"
+          - name: TUNE_TRIAL_STARTUP_GRACE_PERIOD
+            value: "120.0"
+          - name: TUNE_TRIAL_RESULT_WAIT_TIME_S
+            value: "120"
+          - name: TUNE_STATE_REFRESH_PERIOD
+            value: "5"
           ports:
           - containerPort: 6379  # Redis port
           - containerPort: 10001  # Used by Ray Client
@@ -67,9 +83,9 @@ spec:
             name: dshm
           resources:
             requests:
-              cpu: 1000m
-              memory: 512Mi
-              ephemeral-storage: 1Gi
+              cpu: 7
+              memory: 50Gi
+              ephemeral-storage: 64Gi
             limits:
               # The maximum memory that this pod is allowed to use. The
               # limit will be detected by ray and split to use 10% for
@@ -78,21 +94,25 @@ spec:
               # the object store size is not set manually, ray will
               # allocate a very large object store in each pod that may
               # cause problems for other pods.
-              memory: 512Mi
+              memory: 50Gi
   - name: worker-node
     # Minimum number of Ray workers of this Pod type.
-    minWorkers: 2
+    minWorkers: 0
     # Maximum number of Ray workers of this Pod type. Takes precedence over minWorkers.
-    maxWorkers: 3
+    maxWorkers: 9
     # User-specified custom resources for use by Ray.
     # (Ray detects CPU and GPU from pod spec resource requests and limits, so no need to fill those here.)
     rayResources: {"example-resource-a": 1, "example-resource-b": 1}
+    setupCommands:
+    - pip install boto3==1.17.106 awscli==1.19.106 botocore==1.20.106 s3fs==2021.10.0 aiobotocore==1.4.2 s3transfer==0.4.0 fsspec==2021.10.0
     podConfig:
       apiVersion: v1
       kind: Pod
       metadata:
         # The operator automatically prepends the cluster name to this field.
         generateName: ray-worker-
+        annotations:
+          node.elotl.co/instance-family-exclusions: "g3,g3s"
       spec:
         restartPolicy: Never
         volumes:
@@ -102,9 +122,23 @@ spec:
         containers:
         - name: ray-node
           imagePullPolicy: Always
-          image: rayproject/ray:latest
+          # image: rayproject/ray:latest
+          image: ludwigai/ludwig-ray-gpu:tf-legacy
           command: ["/bin/bash", "-c", "--"]
           args: ["trap : TERM INT; touch /tmp/raylogs; tail -f /tmp/raylogs; sleep infinity & wait;"]
+          env:
+          - name: AWS_ACCESS_KEY_ID
+            value: "<censored>"
+          - name: AWS_SECRET_ACCESS_KEY
+            value: "<censored>"
+          - name: AWS_DEFAULT_REGION
+            value: "us-west-2"
+          - name: TUNE_TRIAL_STARTUP_GRACE_PERIOD
+            value: "120.0"
+          - name: TUNE_TRIAL_RESULT_WAIT_TIME_S
+            value: "120"
+          - name: TUNE_STATE_REFRESH_PERIOD
+            value: "5"
           # This volume allocates shared memory for Ray to use for its plasma
           # object store. If you do not provide this, Ray will fall back to
           # /tmp which cause slowdowns if is not a shared memory volume.
@@ -113,9 +147,9 @@ spec:
             name: dshm
           resources:
             requests:
-              cpu: 1000m
-              memory: 512Mi
-              ephemeral-storage: 1Gi
+              cpu: 7
+              memory: 28Gi
+              ephemeral-storage: 64Gi
             limits:
               # The maximum memory that this pod is allowed to use. The
               # limit will be detected by ray and split to use 10% for
@@ -124,7 +158,8 @@ spec:
               # the object store size is not set manually, ray will
               # allocate a very large object store in each pod that may
               # cause problems for other pods.
-              memory: 512Mi
+              memory: 28Gi
+              nvidia.com/gpu: 1 # requesting 1 GPU
   # Commands to start Ray on the head node. You don't need to change this.
   # Note dashboard-host is set to 0.0.0.0 so that Kubernetes can port forward.
   headStartRayCommands:
DmitriGekhtman commented 2 years ago

https://github.com/ray-project/ray/issues/21870#issuecomment-1023342183 The idle time of the unused node is 184 seconds (just over the timeout of 3 minutes). There are 5 nodes in the cluster (4 workers plus 1 head). The reported mean of 37 is roughly 184 / 5, suggesting that the other nodes' idle times are ~0, i.e., all of the nodes that were ever involved in the workload were considered active. So at least we know there was no surprising interaction between the scale-down of a worker and the registered idle times of the rest of the workers.

DmitriGekhtman commented 2 years ago

Maybe somehow after the removal of the idle ip, this if condition gets stuck triggering on updates for the remaining ips, until the Ray Tune job ends:

        if ip not in self.last_used_time_by_ip or \
                self.static_resources_by_ip[ip] != \
                self.dynamic_resources_by_ip[ip]:
            self.last_used_time_by_ip[ip] = now

This is where the last used time is determined.

Explanation:

"static resources" means "total available node resources" "dynamic resources" mean "currently available resources" (resources not currently in use by a Ray workload)

So the second condition in the "or" checks if there are resources that are currently in use. The first condition just marks the node as in use if we've never seen it before. That happens once per node lifetime.

The second condition in the "or" is likely the source of our problems. In other words, the Global Control Store (where the autoscaler polls info from) thinks there are resources in use when we don't expect resources to be in use.
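
Roughly, the logic amounts to this (a paraphrase of the load_metrics snippet quoted above, not the exact source):

# A node only accumulates idle time while its total ("static") and currently
# available ("dynamic") resources match exactly.
def update_last_used_time(last_used_time_by_ip, static_resources_by_ip,
                          dynamic_resources_by_ip, ip, now):
    static = static_resources_by_ip[ip]    # total node resources
    dynamic = dynamic_resources_by_ip[ip]  # currently available resources
    if ip not in last_used_time_by_ip or static != dynamic:
        # Any resource reported as in use (e.g., nonzero object_store_memory
        # usage) refreshes the "last used" timestamp for this node.
        last_used_time_by_ip[ip] = now
    # NodeIdleSeconds for the node is then roughly now - last_used_time_by_ip[ip].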

DmitriGekhtman commented 2 years ago

What we actually need to debug this is a periodically emitted node-by-node breakdown of used and available resources.

I don't think Ray has a public API for info at that level of granularity. I'll suggest a patch to get that info in the operator/autoscaler logs.
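
Something along these lines inside load_metrics.py would give that breakdown (a rough sketch only; the actual patch is linked in my next comment):

import logging

logger = logging.getLogger(__name__)

def log_resource_usage_by_node(static_resources_by_ip, dynamic_resources_by_ip):
    # For each node, log which resources are (partially) in use and by how much.
    for ip, static in static_resources_by_ip.items():
        dynamic = dynamic_resources_by_ip.get(ip, {})
        in_use = {
            name: total - dynamic.get(name, 0.0)
            for name, total in static.items()
            if total - dynamic.get(name, 0.0) > 0
        }
        if in_use:
            logger.debug("Node with ip %s is in use.", ip)
            for name, amount in in_use.items():
                logger.debug("Node at ip %s is using %s %s.", ip, amount, name)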

amholler commented 2 years ago

This is very interesting, thank you @DmitriGekhtman ! Am happy to apply a patch and rerun.

DmitriGekhtman commented 2 years ago

Here's some code that can be added to monitor.py and load_metrics.py in the operator image to get more detailed info in the autoscaler logs https://github.com/ray-project/ray/pull/21978/files

There are many ways to build an operator image with patched Python code. I usually use setup-dev.py as described here.

amholler commented 2 years ago
Thanks, Dmitri!  I reran the single Ray Tune workload with the Ray Operator Autoscaler modifications to get more detailed logging.
Below is a selected time-line with info from the Ray Tune job logging marked "Hyperp" and from the Ray Operator Autoscaler marked "Autosc".
The issue seems to be that the 2 idle nodes that previously ran workers in the Ray Tune job are considered to be in use because they
hold object copies in object_store_memory. But it seems like those copies should be considered uninteresting at this point.
[I can provide the full logs if desired]

********Selected time-line**********
Initial state: CPU head node [10.0.2.144] running

16:18:30 Hyperp started forest_cover 1hr gpu auto train, max_samples=10, max_concurrency=3
16:18:49 Hyperp pending: 1
16:24:50 Hyperp pending: 1, running: 1 [10.0.1.161]
16:30:22 Hyperp pending: 1, running: 2 [10.0.1.161, 10.0.1.107]
16:36:00 Hyperp pending: 0, running: 3 [10.0.1.161, 10.0.1.107, 10.0.3.161], terminated: 1

16:42:40 Autosc adding: 10.0.1.106
16:44:55 Hyperp pending: 0, running: 2 [10.0.1.161, 10.0.3.161],             terminated: 8
16:45:06 Hyperp pending: 0, running: 1 [10.0.1.161],                         terminated: 9
16:46:22 Autosc removed: 10.0.1.106 [never used]

2022-01-30 16:50:06,070 DEBUG load_metrics.py:142 -- Node with ip 10.0.2.144 is in use.
2022-01-30 16:50:06,070 DEBUG load_metrics.py:151 -- Node at ip 10.0.2.144 is using 36035847.0 object_store_memory.

2022-01-30 16:50:06,071 DEBUG load_metrics.py:142 -- Node with ip 10.0.1.161 is in use.
2022-01-30 16:50:06,071 DEBUG load_metrics.py:151 -- Node at ip 10.0.1.161 is using 1.0 GPU.
2022-01-30 16:50:06,071 DEBUG load_metrics.py:151 -- Node at ip 10.0.1.161 is using 4.0 CPU_group_9ed9ca343c8c9b3f7e011adedde1a85a.
2022-01-30 16:50:06,071 DEBUG load_metrics.py:151 -- Node at ip 10.0.1.161 is using 1.0 GPU_group_0_9ed9ca343c8c9b3f7e011adedde1a85a.
2022-01-30 16:50:06,071 DEBUG load_metrics.py:151 -- Node at ip 10.0.1.161 is using 4.0 CPU.
2022-01-30 16:50:06,071 DEBUG load_metrics.py:151 -- Node at ip 10.0.1.161 is using 1.0 GPU_group_9ed9ca343c8c9b3f7e011adedde1a85a.
2022-01-30 16:50:06,071 DEBUG load_metrics.py:151 -- Node at ip 10.0.1.161 is using 4.0 CPU_group_0_9ed9ca343c8c9b3f7e011adedde1a85a.
2022-01-30 16:50:06,071 DEBUG load_metrics.py:151 -- Node at ip 10.0.1.161 is using 36035847.0 object_store_memory.

2022-01-30 16:50:06,071 DEBUG load_metrics.py:142 -- Node with ip 10.0.1.107 is in use.
2022-01-30 16:50:06,071 DEBUG load_metrics.py:151 -- Node at ip 10.0.1.107 is using 36035847.0 object_store_memory.
2022-01-30 16:50:06,071 DEBUG load_metrics.py:142 -- Node with ip 10.0.3.161 is in use.
2022-01-30 16:50:06,071 DEBUG load_metrics.py:151 -- Node at ip 10.0.3.161 is using 36035847.0 object_store_memory.

17:18:50 Hyperp pending: 0, running: 0,                                      terminated: 10
17:21    Hyperp stopped
17:24:22 Autosc removed: 10.0.1.161, 10.0.1.107, 10.0.3.161
DmitriGekhtman commented 2 years ago

Thanks, Anne! Looks like we're getting closer to figuring this out.

Some of the Ray worker nodes are unexpectedly kept from scaling down despite not being involved in active Ray workloads. The reason is that they appear to still be using object store memory.

Polling experts on the Ray internals, @rkooo567 @wuisawesome @ericl
What are some potential mitigations or directions for further investigation?

wuisawesome commented 2 years ago

Hmmm seems like there are a few options here (I'm assuming that these nodes aren't storing anything useful in their object store).

  1. Somewhat eagerly flush the object store to bring it back down to 0 utilization.
  2. Report some kind of object ownership metric and have the autoscaler ignore object store utilization.
wuisawesome commented 2 years ago

cc. @scv119 as well

DmitriGekhtman commented 2 years ago

I'm assuming that these nodes aren't storing anything useful in their object store.

Correct.

ericl commented 2 years ago

Hmm do we know what created those objects and what is referencing them? "ray memory" can show you more information on this.

amholler commented 2 years ago
Logging shows the Ray head was using that amount of object_store_memory just after the Ray Tune job started:
example-cluster,default:2022-01-30 16:18:53,019 DEBUG load_metrics.py:151 -- Node at ip 10.0.2.144 is using 36035847.0 object_store_memory.

When each new worker was added and started running a trial, it started showing the same amount of object_store_memory in use:
example-cluster,default:2022-01-30 16:24:53,335 DEBUG load_metrics.py:151 -- Node at ip 10.0.1.161 is using 36035847.0 object_store_memory.
example-cluster,default:2022-01-30 16:30:23,386 DEBUG load_metrics.py:151 -- Node at ip 10.0.1.107 is using 36035847.0 object_store_memory.
example-cluster,default:2022-01-30 16:36:02,283 DEBUG load_metrics.py:151 -- Node at ip 10.0.3.161 is using 36035847.0 object_store_memory.

After the Ray Tune job ended at 17:21, the head and the 3 worker nodes stopped using any object_store_memory.
At that point, 2 of the 3 workers had actually been idle since around 16:45.
DmitriGekhtman commented 2 years ago

@richardliaw @xwjiang2010 Looking again at the summary here. At 16:45:06 the nodes at IPs 10.0.1.107 and 10.0.3.161 have finished their trials, yet they seem to continue using object store memory until 17:21, when the whole Tune run stops. That prevents those nodes from downscaling in a timely manner.

The amount of object store memory on the stuck workers 10.0.3.161 and 10.0.1.107 is the same (36035847.0) as on the head node 10.0.2.144, suggesting that the objects preventing scale-down could be copies of the same Tune-related artifacts.

Do you guys know if there's anything in Tune that could cause this?

ericl commented 2 years ago

I think it could be an artifact of reference counting in Ray. It sounds like all the trials are referencing a shared copy of an object (this 36MB object). However, Ray doesn't evict secondary copies of an object until all references fall out of scope. So the nodes are reporting memory in use, when the object could be evicted safely in this case.
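
As a minimal illustration of that mechanism (not the Tune internals themselves): an object put on the head node and consumed by tasks on several workers leaves a secondary copy in each worker's object store for as long as the driver's reference stays alive.

import numpy as np
import ray

ray.init(address="auto")

# Roughly a 36 MB object created on the head/driver node.
shared_ref = ray.put(np.zeros(36_000_000 // 8))

@ray.remote(num_cpus=4, num_gpus=1)
def trial_like_task(data):
    # Passing the ObjectRef makes Ray materialize the object on whichever node
    # runs this task, leaving a secondary copy in that node's object store.
    return data.nbytes

# Even after these tasks finish, each node that ran one keeps its copy (and
# reports object_store_memory as in use) until shared_ref goes out of scope
# on the driver.
sizes = ray.get([trial_like_task.remote(shared_ref) for _ in range(3)])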

There are some potential fixes:

scv119 commented 2 years ago

Don't report memory usage of secondary objects for the purposes of autoscaler downscaling.

I vote for this approach.
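
Concretely, it would mean something like this when deciding whether a node counts as busy for downscaling (a hypothetical sketch, not the actual change; secondary_copy_bytes is a made-up per-node figure for memory held only by secondary copies):

def node_in_use_for_downscaling(static, dynamic, secondary_copy_bytes):
    # static/dynamic: {resource_name: total/available amount} for one node.
    for name, total in static.items():
        used = total - dynamic.get(name, 0.0)
        if name == "object_store_memory":
            # Ignore object store memory that is only held by secondary copies.
            used -= secondary_copy_bytes
        if used > 0:
            return True
    return False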

DmitriGekhtman commented 2 years ago

^ Sounds like the correct solution. Is it straightforward to implement? Would it be difficult to distinguish secondary objects?

DmitriGekhtman commented 2 years ago

hah, prompt answer in <20 lines of code

amholler commented 2 years ago

Thanks for this fix! I am trying to test the fix in my setup, which is a deployment of Ray onto a K8s cluster running Ludwig AutoML. The image I deploy for the head and workers is ludwigai/ludwig-ray-gpu:tf-legacy. @tgaddair mentioned that I could add

    - pip install -U https://s3-us-west-2.amazonaws.com/ray-wheels/latest/ray-2.0.0.dev0-cp37-cp37m-manylinux2014_x86_64.whl

to the worker setupCommands to load the latest wheels, but when I do that, the workers don't come up; they report:

(base) MacBook-Air-2:trips anne$ kubectl logs -f example-cluster-ray-worker-zrv4q 
2022-02-02 09:49:07,438 INFO scripts.py:862 -- Local node IP: 10.0.1.248
Traceback (most recent call last):
  File "/home/ray/anaconda3/bin/ray", line 8, in <module>
    sys.exit(main())
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/scripts/scripts.py", line 2214, in main
    return cli()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/click/core.py", line 1137, in __call__
    return self.main(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/click/core.py", line 1062, in main
    rv = self.invoke(ctx)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/click/core.py", line 1668, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/ray/anaconda3/lib/python3.7/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/click/core.py", line 763, in invoke
    return __callback(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/autoscaler/_private/cli_logger.py", line 791, in wrapper
    return f(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/scripts/scripts.py", line 865, in start
    ray_params, head=False, shutdown_at_exit=block, spawn_reaper=block
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/node.py", line 189, in __init__
    "session_name", ray_constants.KV_NAMESPACE_SESSION
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/node.py", line 1540, in _internal_kv_get_with_retry
    f"Could not read '{key}' from GCS (redis). "
RuntimeError: Could not read 'b'session_name'' from GCS (redis). If using Redis, did Redis start successfully?
amholler commented 2 years ago

Does anyone know what might be going wrong and/or other ways I can run with this great change?

ericl commented 2 years ago

Did you also add it to the head setup commands? If that still doesn't work, it might be due to the Ray operator / image being out of date, cc @mwtian on possible redis-removal related compatibility issues with upgrading to nightly.

amholler commented 2 years ago

Head setup commands are not supported for Ray deployment on K8s, but I did run the head pip setup command manually and saw the same problem on the head itself, before any workers were even added for scaling.

In fact, after I saw that the head didn't work with that pip install, I stopped the Ray cluster, redeployed it without running the pip command on the head, and ran the setup command only on the workers; they hit the same problem I'd already seen on the head.

mwtian commented 2 years ago

What autoscaler version is running? It is possible that an earlier version (1.9?) of the autoscaler is not launching the workers with the correct flags. Are you able to see the flags that worker nodes are launched with?

amholler commented 2 years ago

The ray-operator node in my K8s cluster (which contains the autoscaler) is deployed with image rayproject/ray:nightly, so I assume it is very up to date, since I deployed it about 4 hours ago.

What do you mean by "flags that worker nodes are launched with"? Is that in the autoscaler log? Since the problem can be reproduced on the head itself, we could focus there. The head starts up fine and then immediately fails once the pip install of the updated wheel is executed on it.

amholler commented 2 years ago

I can crank up a cluster with minWorkers=0, and the head comes up fine. Then, on the head, if I execute the pip install for the updated Ray, the head starts complaining with the error message above.

amholler commented 2 years ago

Here are the processes that were running on the head:

(base) ray@example-cluster-ray-head-gnh2g:/ludwig$ ps auxw
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
ray          1  0.0  0.0  20132  3476 ?        Ss   09:29   0:00 /bin/bash -c -- trap : TERM INT; touch /tmp/raylogs; tail -f /tmp/raylogs; sleep infinity & wait;
ray          8  0.0  0.0   6324   832 ?        S    09:29   0:00 tail -f /tmp/raylogs
ray        153  0.1  0.0  74944  7424 ?        Sl   09:29   0:00 /home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/thirdparty/redis/src/redis-server *:6379
ray        158  0.0  0.0  74944  7156 ?        Sl   09:29   0:00 /home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/thirdparty/redis/src/redis-server *:58932
ray        163  0.9  0.0 1029800 27588 ?       Sl   09:29   0:01 /home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/gcs/gcs_server --redis_address=10.0.6.166 --redis_po
ray        180  1.1  0.1 766424 85360 ?        Sl   09:29   0:01 /home/ray/anaconda3/bin/python -m ray.util.client.server --redis-address=10.0.6.166:6379 --host=0.0.0.0 --port=10001 
ray        189  1.3  0.2 4390068 141256 ?      Sl   09:29   0:01 /home/ray/anaconda3/bin/python -u /home/ray/anaconda3/lib/python3.7/site-packages/ray/dashboard/dashboard.py --host=0
ray        223  0.8  0.0 17337936 22428 ?      Sl   09:29   0:00 /home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/raylet/raylet --raylet_socket_name=/tmp/ray/session_
ray        230  0.8  0.1 310732 81592 ?        Sl   09:29   0:00 /home/ray/anaconda3/bin/python -u /home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/log_monitor.py --redis
ray        261  1.5  0.2 3862168 138876 ?      Sl   09:29   0:01 /home/ray/anaconda3/bin/python -u /home/ray/anaconda3/lib/python3.7/site-packages/ray/dashboard/agent.py --node-ip-ad
ray        409  0.1  0.0  20396  3836 pts/0    Ss   09:31   0:00 bash
ray        443  0.0  0.0  36164  3336 pts/0    R+   09:31   0:00 ps auxw
amholler commented 2 years ago

I just noticed this in the ray log

Ray client servers support --address flag instead of --redis-address.

Does that mean the Ray client server command on the head is no longer working, since it uses --redis-address?

DmitriGekhtman commented 2 years ago

I'd recommend, if possible, running head and worker images with the changes built in, rather than running pip install at runtime or in setup commands.

amholler commented 2 years ago

Here is the autoscaler logging for a worker startup:

2022-02-02 09:50:28,732 INFO log_timer.py:30 -- NodeUpdater: example-cluster-ray-worker-hks5g: Got remote shell  [LogTimer=31832ms]
2022-02-02 09:50:28,750 INFO updater.py:374 -- Updating cluster configuration. [hash=18639bb6a3ba134e2e4110e55ab38ba97a8dcd39]
2022-02-02 09:50:28,794 INFO updater.py:380 -- New status: syncing-files
2022-02-02 09:50:28,794 INFO updater.py:238 -- [2/7] Processing file mounts
2022-02-02 09:50:28,794 INFO updater.py:256 -- [3/7] No worker file mounts to sync
2022-02-02 09:50:28,840 INFO updater.py:391 -- New status: setting-up
2022-02-02 09:50:28,840 INFO updater.py:434 -- [4/7] No initialization commands to run.
2022-02-02 09:50:28,840 INFO updater.py:439 -- [5/7] Initalizing command runner
2022-02-02 09:50:28,841 INFO updater.py:450 -- [6/7] Running setup commands
2022-02-02 09:50:28,841 INFO updater.py:470 -- (0/2) export RAY_HEAD_IP=10.0.6.166; pip install boto3==1.17.106 awscli==1.19.106 botocore==1.20.106 s3fs==2021.10.0 aiobotocore==1.4.2 s3transfer==0.4.0 fsspec==2021.10.0
2022-02-02 09:50:38,921 INFO updater.py:470 -- (1/2) export RAY_HEAD_IP=10.0.6.166; pip install -U https://s3-us-west-2.amazonaws.com/ray-wheels/latest/ray-2.0.0.dev0-cp37-cp37m-manylinux2014_x86_64.whl
2022-02-02 09:50:46,593 INFO log_timer.py:30 -- NodeUpdater: example-cluster-ray-worker-hks5g: Setup commands succeeded [LogTimer=17752ms]
2022-02-02 09:50:46,594 INFO updater.py:489 -- [7/7] Starting the Ray runtime
2022-02-02 09:51:29,755 INFO log_timer.py:30 -- NodeUpdater: example-cluster-ray-worker-hks5g: Ray start commands failed [LogTimer=43161ms]
2022-02-02 09:51:29,755 INFO log_timer.py:30 -- NodeUpdater: example-cluster-ray-worker-hks5g: Applied config 18639bb6a3ba134e2e4110e55ab38ba97a8dcd39  [LogTimer=92912ms]
2022-02-02 09:51:29,811 ERR updater.py:157 -- New status: update-failed
2022-02-02 09:51:29,811 ERR updater.py:159 -- !!!
2022-02-02 09:51:29,811 ERR updater.py:164 -- Setup command `kubectl -n default exec -it example-cluster-ray-worker-hks5g -- bash --login -c -i 'true && source ~/.bashrc && export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && (export RAY_OVERRIDE_RESOURCES='"'"'{"CPU":7,"GPU":1,"example-resource-a":1,"example-resource-b":1,"memory":21045339750}'"'"';export RAY_HEAD_IP=10.0.6.166; ulimit -n 65536; ray start --address=$RAY_HEAD_IP:6379 &> /tmp/raylogs)'` failed with exit code 1. stderr:
2022-02-02 09:51:29,811 ERR updater.py:171 -- !!!
2022-02-02 09:51:38,319 INFO updater.py:323 -- New status: waiting-for-ssh
2022-02-02 09:51:38,320 INFO updater.py:261 -- [1/7] Waiting for SSH to become available
2022-02-02 09:51:38,320 INFO updater.py:265 -- Running `uptime` as a test.
amholler commented 2 years ago
Installing collected packages: ray
  Attempting uninstall: ray
    Found existing installation: ray 1.9.2
    Uninstalling ray-1.9.2:
      Successfully uninstalled ray-1.9.2
Successfully installed ray-2.0.0.dev0
example-cluster,default:2022-02-02 09:50:46,594 INFO command_runner.py:179 -- NodeUpdater: example-cluster-ray-worker-hks5g: Running kubectl -n default exec -it example-cluster-ray-worker-hks5g -- bash --login -c -i 'true && source ~/.bashrc && export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && (export RAY_OVERRIDE_RESOURCES='"'"'{"CPU":7,"GPU":1,"example-resource-a":1,"example-resource-b":1,"memory":21045339750}'"'"';export RAY_HEAD_IP=10.0.6.166; ray stop)'
Unable to use a TTY - input is not a terminal or the right kind of file
bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
mwtian commented 2 years ago

Thanks for the detailed info. It looks like the failure is indeed from the autoscaler trying to start workers.

I can crank up a cluster with minWorkers=0, and the head comes up fine. Then on the head, if I execute the pip install for the updated Ray, the head starts complaining with the error message

When executing the pip install for the updated Ray, is the head node still running? Is the same error message generated, e.g. "RuntimeError: Could not read 'b'session_name'' from GCS (redis). If using Redis, did Redis start successfully?"? I wonder why the autoscaler is trying to start a worker in this case.

Does that mean the ray head client command is no longer working since it uses --redis-address?

In the latest Ray, the Ray client server accepts --address instead of --redis-address. Other components in Ray have been updated to use --address as well. The "Ray client servers support ..." message does not seem to be in the latest Ray repo, so I wonder where that log line comes from.

amholler commented 2 years ago

Here is a more methodical set of steps, which also pulls the autoscaler out of the picture.

*) I started up a fresh 1.9.2 Ray cluster on a k8s cluster with a cpu-only
   head and gpu workers, as in the previous case but with minWorkers=1

*) The head and 1 worker came up fine.  I logged into each and verified
   "ray --version" was 1.9.2 and "ray status" looked good.

*) Then I ran the following update on the head and on the worker:
   "pip install -U https://s3-us-west-2.amazonaws.com/ray-wheels/latest/ray-2.0.0.dev0-cp37-cp37m-manylinux2014_x86_64.whl"

*) Then I checked that "ray --version" was 2.0.0.dev0 and "ray status" was good on the head & worker.

*) Then I started a Ludwig AutoML job, which uses Ray Tune, with max_concurrency 1:
   nohup python run_auto_train_1hr_nodeless.py >run.1hr 2>&1 &

*) And the output produced was:
(base) ray@example-cluster-ray-head-57vww:~/experiments/automl/validation/forest_cover$ cat run.1hr 
nohup: ignoring input
covtype.data.gz: 10.7MB [00:00, 18.8MB/s]                            
Traceback (most recent call last):
  File "run_auto_train_1hr_nodeless.py", line 15, in <module>
    user_config={'hyperopt': {'executor': {'max_concurrent_trials': 1, 'gpu_resources_per_trial': 1, 'cpu_resources_per_trial': 4}}},
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ludwig/automl/automl.py", line 108, in auto_train
    dataset, target, time_limit_s, tune_for_memory, user_config, random_seed, use_reference_config, **kwargs)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ludwig/automl/automl.py", line 150, in create_auto_config
    default_configs = _create_default_config(dataset, target, time_limit_s, random_seed)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ludwig/automl/base_config.py", line 135, in _create_default_config
    _ray_init()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ludwig/automl/utils.py", line 88, in _ray_init
    ray.init('auto', ignore_reinit_error=True)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
    return func(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/worker.py", line 1107, in init
    connect_only=True,
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/node.py", line 190, in __init__
    "session_name", ray_constants.KV_NAMESPACE_SESSION
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/node.py", line 1542, in _internal_kv_get_with_retry
    f"Could not read '{key}' from GCS (redis). "
RuntimeError: Could not read 'b'session_name'' from GCS (redis). If using Redis, did Redis start successfully?

*) Here is the ludwig program I ran:
(base) ray@example-cluster-ray-head-57vww:~/experiments/automl/validation/forest_cover$ cat run_auto_train_1hr_nodeless.py
import logging
import pprint

from ludwig.datasets import forest_cover
from ludwig.automl import auto_train

forest_cover_df = forest_cover.load()

auto_train_results = auto_train(
    dataset=forest_cover_df,
    target='Cover_Type',
    time_limit_s=3600,
    tune_for_memory=False,
    output_directory='s3://predibase-runs/nodeless/forest_cover/hours1/',
    user_config={'hyperopt': {'executor': {'max_concurrent_trials': 1, 'gpu_resources_per_trial': 1, 'cpu_resources_per_trial': 4}}},
)

pprint.pprint(auto_train_results)
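
For illustration (not part of the original steps): since the traceback above fails inside ray.init('auto', ...), the Ludwig layers can likely be taken out of the picture too. A minimal sketch, assuming the same pip-upgraded head:

import ray

# Hypothetical minimal repro on the pip-upgraded head (assumption, not from
# the original report): this is the same call the Ludwig traceback fails in,
# so it would presumably hit the same "Could not read 'b'session_name'' from
# GCS (redis)" error while the 1.9.2 cluster processes are still running.
ray.init(address="auto", ignore_reinit_error=True)
print(ray.cluster_resources())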
amholler commented 2 years ago

Here's what's running on the head

(base) ray@example-cluster-ray-head-57vww:~/experiments/automl/validation/forest_cover$ ps auxfww
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
ray        408  0.0  0.0  20396  3892 pts/0    Ss   14:32   0:00 bash
ray        533  0.0  0.0  36164  3332 pts/0    R+   15:02   0:00  \_ ps auxfww
ray          1  0.0  0.0  20132  3552 ?        Ss   13:33   0:00 /bin/bash -c -- trap : TERM INT; touch /tmp/raylogs; tail -f /tmp/raylogs; sleep infinity & wait;
ray          8  0.0  0.0   6324   868 ?        S    13:33   0:00 tail -f /tmp/raylogs
ray        152  0.1  0.0  74944  7516 ?        Sl   13:33   0:09 /home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/thirdparty/redis/src/redis-server *:6379
ray        157  0.0  0.0  81088  7200 ?        Sl   13:33   0:03 /home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/thirdparty/redis/src/redis-server *:49684
ray        162  1.3  0.0 1029800 28052 ?       Sl   13:33   1:11 /home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/gcs/gcs_server --redis_address=10.0.5.197 --redis_port=6379 --log_dir=/tmp/ray/session_2022-02-02_13-33-22_063004_148/logs --config_list=eyJvYmplY3Rfc3BpbGxpbmdfY29uZmlnIjogIntcInR5cGVcIjogXCJmaWxlc3lzdGVtXCIsIFwicGFyYW1zXCI6IHtcImRpcmVjdG9yeV9wYXRoXCI6IFwiL3RtcC9yYXkvc2Vzc2lvbl8yMDIyLTAyLTAyXzEzLTMzLTIyXzA2MzAwNF8xNDhcIn19IiwgImlzX2V4dGVybmFsX3N0b3JhZ2VfdHlwZV9mcyI6IHRydWV9 --gcs_server_port=0 --metrics-agent-port=52535 --node-ip-address=10.0.5.197 --redis_password=5241590000000000
ray        179  0.0  0.1 766424 85352 ?        Sl   13:33   0:04 /home/ray/anaconda3/bin/python -m ray.util.client.server --redis-address=10.0.5.197:6379 --host=0.0.0.0 --port=10001 --mode=proxy --redis-password=5241590000000000 --metrics-agent-port=52535 --worker-shim-pid=179
ray        188  0.5  0.2 4390200 141664 ?      Sl   13:33   0:29 /home/ray/anaconda3/bin/python -u /home/ray/anaconda3/lib/python3.7/site-packages/ray/dashboard/dashboard.py --host=0.0.0.0 --port=8265 --port-retries=0 --redis-address=10.0.5.197:6379 --temp-dir=/tmp/ray --log-dir=/tmp/ray/session_2022-02-02_13-33-22_063004_148/logs --logging-rotate-bytes=536870912 --logging-rotate-backup-count=5 --redis-password 5241590000000000
ray        222  0.8  0.0 17338196 22816 ?      Sl   13:33   0:44 /home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/raylet/raylet --raylet_socket_name=/tmp/ray/session_2022-02-02_13-33-22_063004_148/sockets/raylet --store_socket_name=/tmp/ray/session_2022-02-02_13-33-22_063004_148/sockets/plasma_store --object_manager_port=0 --min_worker_port=10002 --max_worker_port=19999 --node_manager_port=0 --node_ip_address=10.0.5.197 --redis_address=10.0.5.197 --redis_port=6379 --maximum_startup_concurrency=7 --static_resource_list=node:10.0.5.197,1.0,CPU,7,memory,37580963840,object_store_memory,16083075072 --python_worker_command=/home/ray/anaconda3/bin/python /home/ray/anaconda3/lib/python3.7/site-packages/ray/workers/setup_worker.py /home/ray/anaconda3/lib/python3.7/site-packages/ray/workers/default_worker.py --node-ip-address=10.0.5.197 --node-manager-port=RAY_NODE_MANAGER_PORT_PLACEHOLDER --object-store-name=/tmp/ray/session_2022-02-02_13-33-22_063004_148/sockets/plasma_store --raylet-name=/tmp/ray/session_2022-02-02_13-33-22_063004_148/sockets/raylet --redis-address=10.0.5.197:6379 --temp-dir=/tmp/ray --metrics-agent-port=52535 --logging-rotate-bytes=536870912 --logging-rotate-backup-count=5 RAY_WORKER_DYNAMIC_OPTION_PLACEHOLDER --redis-password=5241590000000000 --java_worker_command= --cpp_worker_command= --native_library_path=/home/ray/anaconda3/lib/python3.7/site-packages/ray/cpp/lib --redis_password=5241590000000000 --temp_dir=/tmp/ray --session_dir=/tmp/ray/session_2022-02-02_13-33-22_063004_148 --log_dir=/tmp/ray/session_2022-02-02_13-33-22_063004_148/logs --resource_dir=/tmp/ray/session_2022-02-02_13-33-22_063004_148/runtime_resources --metrics-agent-port=52535 --metrics_export_port=62075 --object_store_memory=16083075072 --plasma_directory=/dev/shm --ray-debugger-external=0 --agent_command=/home/ray/anaconda3/bin/python -u /home/ray/anaconda3/lib/python3.7/site-packages/ray/dashboard/agent.py --node-ip-address=10.0.5.197 --redis-address=10.0.5.197:6379 --metrics-export-port=62075 --dashboard-agent-port=52535 --listen-port=0 --node-manager-port=RAY_NODE_MANAGER_PORT_PLACEHOLDER --object-store-name=/tmp/ray/session_2022-02-02_13-33-22_063004_148/sockets/plasma_store --raylet-name=/tmp/ray/session_2022-02-02_13-33-22_063004_148/sockets/raylet --temp-dir=/tmp/ray --session-dir=/tmp/ray/session_2022-02-02_13-33-22_063004_148 --runtime-env-dir=/tmp/ray/session_2022-02-02_13-33-22_063004_148/runtime_resources --log-dir=/tmp/ray/session_2022-02-02_13-33-22_063004_148/logs --logging-rotate-bytes=536870912 --logging-rotate-backup-count=5 --redis-password=5241590000000000
ray        260  0.6  0.2 3862572 138620 ?      Sl   13:33   0:32  \_ /home/ray/anaconda3/bin/python -u /home/ray/anaconda3/lib/python3.7/site-packages/ray/dashboard/agent.py --node-ip-address=10.0.5.197 --redis-address=10.0.5.197:6379 --metrics-export-port=62075 --dashboard-agent-port=52535 --listen-port=0 --node-manager-port=43305 --object-store-name=/tmp/ray/session_2022-02-02_13-33-22_063004_148/sockets/plasma_store --raylet-name=/tmp/ray/session_2022-02-02_13-33-22_063004_148/sockets/raylet --temp-dir=/tmp/ray --session-dir=/tmp/ray/session_2022-02-02_13-33-22_063004_148 --runtime-env-dir=/tmp/ray/session_2022-02-02_13-33-22_063004_148/runtime_resources --log-dir=/tmp/ray/session_2022-02-02_13-33-22_063004_148/logs --logging-rotate-bytes=536870912 --logging-rotate-backup-count=5 --redis-password=5241590000000000
ray        229  0.3  0.1 310732 81616 ?        Sl   13:33   0:17 /home/ray/anaconda3/bin/python -u /home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/log_monitor.py --redis-address=10.0.5.197:6379 --logs-dir=/tmp/ray/session_2022-02-02_13-33-22_063004_148/logs --logging-rotate-bytes=536870912 --logging-rotate-backup-count=5 --redis-password 5241590000000000