ray-project / ray

Ray is a unified framework for scaling AI and Python applications. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

[Bug] Ray Autoscaler is not spinning down idle nodes due to secondary object copies #21870

Closed by amholler 2 years ago

amholler commented 2 years ago

Ray Component

Ray Tune

What happened + What you expected to happen

Ray Autoscaler is not spinning down idle nodes if they ever ran a trial for the active Ray Tune job

The issue is seen with Ray Tune on a Ray 1.9.1 cluster with a CPU head node and GPU workers (min=0, max=9).
The Higgs Ray Tune job is set up to run up to 10 trials using async hyperband for 1 hour with a
max_concurrency of 3. I see at most 3 trials running at a time (each requiring 1 GPU and 4 CPUs).
Except in the first logging at job startup, no PENDING trials are reported.
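
For reference, a minimal sketch of the kind of Tune setup described above. The trainable, search space, metric name, and optimization direction are placeholders rather than the actual Higgs AutoML job (which is driven by the Ludwig script linked in the reproduction section), and max_concurrent_trials follows newer Ray Tune releases:

from ray import tune
from ray.tune.schedulers import AsyncHyperBandScheduler

def train_higgs(config):
    # Placeholder trainable: a real job would train a model here and report
    # its validation metric every iteration.
    tune.report(metric_score=0.5)

analysis = tune.run(
    train_higgs,
    config={"training.learning_rate": tune.loguniform(1e-3, 1e-1)},  # placeholder space
    num_samples=10,                            # up to 10 trials
    time_budget_s=3600,                        # stop the whole job after 1 hour
    max_concurrent_trials=3,                   # "max_concurrency of 3" (newer Ray API)
    resources_per_trial={"cpu": 4, "gpu": 1},  # 4 CPUs + 1 GPU per trial
    scheduler=AsyncHyperBandScheduler(),       # async hyperband
    metric="metric_score",
    mode="min",                                # direction assumed for illustration
)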

When the Higgs Ray Tune job is stopped at its 1hr time limit (12:43:48), the console log (see below) shows:
*) 3 nodes running Higgs trials (10.0.4.71, 10.0.6.5, 10.0.4.38)
*) 2 nodes that previously ran Higgs trials but are not doing so now (10.0.2.129, 10.0.3.245).
The latter 2 nodes last reported running trials at 12:21:30, so they should be spun down.

Note that, in this run, multiple Ray Tune jobs were running in the same Ray cluster with some overlap:
 MushroomEdibility Ray Tune 1hr job ran from 11:20-12:20
 ForestCover       Ray Tune 1hr job ran from 11:22-12:22
 Higgs             Ray Tune 1hr job ran from 11:45-12:45
After 12:22 there was no overlap of jobs, so the 2 idle workers that remained up had no use other than their earlier Higgs trials.
Two other nodes that became idle when MushroomEdibility and ForestCover completed were spun down at that point, leaving the 2 idle nodes that Higgs had used still running.
In the same kind of scenario later in the run, I observed that after the Higgs job itself completed, all Higgs trial workers were spun down.

Current time: 2022-01-24 12:43:48 (running for 00:59:30.27) ...
Number of trials: 9/10 (3 RUNNING, 6 TERMINATED)

+----------------+------------+-----------------+------------------------+--------------------------+----------------------+------------------------+------------------------------+-----------------+---------------------+-----------------------+-----------------------+------------------------+--------------------------+--------+------------------+----------------+
| Trial name     | status     | loc             |   combiner.bn_momentum |   combiner.bn_virtual_bs |   combiner.num_steps |   combiner.output_size |   combiner.relaxation_factor |   combiner.size |   combiner.sparsity |   training.batch_size |   training.decay_rate |   training.decay_steps |   training.learning_rate |   iter |   total time (s) |   metric_score |
|----------------+------------+-----------------+------------------------+--------------------------+----------------------+------------------------+------------------------------+-----------------+---------------------+-----------------------+-----------------------+------------------------+--------------------------+--------+------------------+----------------|
| trial_04bb0f22 | RUNNING    | 10.0.4.71:1938  |                   0.7  |                     2048 |                    4 |                      8 |                          1   |              32 |              0.0001 |                  8192 |                  0.95 |                    500 |                    0.025 |     18 |         3566.91  |       0.489641 |
| trial_3787a9c4 | RUNNING    | 10.0.6.5:17263  |                   0.9  |                     4096 |                    7 |                     24 |                          1.5 |              64 |              0      |                   256 |                  0.9  |                  10000 |                    0.01  |        |                  |                |
| trial_39a0ad6e | RUNNING    | 10.0.4.38:8657  |                   0.8  |                      256 |                    3 |                     16 |                          1.2 |              64 |              0.001  |                  4096 |                  0.95 |                   2000 |                    0.005 |      4 |         1268.2   |       0.50659  |
| trial_05396980 | TERMINATED | 10.0.2.129:2985 |                   0.8  |                      256 |                    9 |                    128 |                          1   |              32 |              0      |                  2048 |                  0.95 |                  10000 |                    0.005 |      1 |          913.295 |       0.53046  |
| trial_059befa6 | TERMINATED | 10.0.3.245:282  |                   0.98 |                     1024 |                    3 |                      8 |                          1   |               8 |              1e-06  |                  1024 |                  0.8  |                    500 |                    0.005 |      1 |          316.455 |       0.573849 |
| trial_c433a60c | TERMINATED | 10.0.3.245:281  |                   0.8  |                     1024 |                    7 |                     24 |                          2   |               8 |              0.001  |                   256 |                  0.95 |                  20000 |                    0.01  |      1 |         1450.99  |       0.568653 |
| trial_277d1a8a | TERMINATED | 10.0.4.38:8658  |                   0.9  |                      256 |                    5 |                     64 |                          1.5 |              64 |              0.0001 |                   512 |                  0.95 |                  20000 |                    0.005 |      1 |          861.914 |       0.56506  |
| trial_26f6b0b0 | TERMINATED | 10.0.2.129:3079 |                   0.6  |                      256 |                    3 |                     16 |                          1.2 |              16 |              0.01   |                  1024 |                  0.9  |                   8000 |                    0.005 |      1 |          457.482 |       0.56582  |
| trial_2acddc5e | TERMINATED | 10.0.3.245:504  |                   0.6  |                      512 |                    5 |                     32 |                          2   |               8 |              0      |                  2048 |                  0.95 |                  10000 |                    0.025 |      1 |          447.483 |       0.594953 |
+----------------+------------+-----------------+------------------------+--------------------------+----------------------+------------------------+------------------------------+-----------------+---------------------+-----------------------+-----------------------+------------------------+--------------------------+--------+------------------+----------------+

Versions / Dependencies

Ray 1.9.1

Reproduction script

https://github.com/ludwig-ai/experiments/blob/main/automl/validation/run_nodeless.sh, run with Ray deployed on a K8s cluster. I can provide the Ray deployment script if desired.

Anything else

This problem is highly reproducible for me.

amholler commented 2 years ago

Here's what's running on the worker

(base) ray@example-cluster-ray-worker-twfmp:/ludwig$ ps auxfww
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
ray        315  0.0  0.0  20396  3944 pts/0    Ss   14:33   0:00 bash
ray        387  0.0  0.0  36164  3292 pts/0    R+   15:03   0:00  \_ ps auxfww
ray          1  0.0  0.0  20132  3468 ?        Ss   13:39   0:00 /bin/bash -c -- trap : TERM INT; touch /tmp/raylogs; tail -f /tmp/raylogs; sleep infinity & wait;
ray          8  0.0  0.0   6332   828 ?        S    13:39   0:00 tail -f /tmp/raylogs
ray        216  0.6  0.0 10379972 22400 ?      Sl   13:39   0:32 /home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/raylet/raylet --raylet_socket_name=/tmp/ray/session_2022-02-02_13-33-22_063004_148/sockets/raylet --store_socket_name=/tmp/ray/session_2022-02-02_13-33-22_063004_148/sockets/plasma_store --object_manager_port=0 --min_worker_port=10002 --max_worker_port=19999 --node_manager_port=0 --node_ip_address=10.0.2.206 --redis_address=10.0.5.197 --redis_port=6379 --maximum_startup_concurrency=7 --static_resource_list=example-resource-a,1,example-resource-b,1,node:10.0.2.206,1.0,accelerator_type:T4,1,CPU,7,GPU,1,memory,21045339750,object_store_memory,8957975347 --python_worker_command=/home/ray/anaconda3/bin/python /home/ray/anaconda3/lib/python3.7/site-packages/ray/workers/setup_worker.py /home/ray/anaconda3/lib/python3.7/site-packages/ray/workers/default_worker.py --node-ip-address=10.0.2.206 --node-manager-port=RAY_NODE_MANAGER_PORT_PLACEHOLDER --object-store-name=/tmp/ray/session_2022-02-02_13-33-22_063004_148/sockets/plasma_store --raylet-name=/tmp/ray/session_2022-02-02_13-33-22_063004_148/sockets/raylet --redis-address=10.0.5.197:6379 --temp-dir=/tmp/ray --metrics-agent-port=62929 --logging-rotate-bytes=536870912 --logging-rotate-backup-count=5 RAY_WORKER_DYNAMIC_OPTION_PLACEHOLDER --redis-password=5241590000000000 --java_worker_command= --cpp_worker_command= --native_library_path=/home/ray/anaconda3/lib/python3.7/site-packages/ray/cpp/lib --redis_password=5241590000000000 --temp_dir=/tmp/ray --session_dir=/tmp/ray/session_2022-02-02_13-33-22_063004_148 --log_dir=/tmp/ray/session_2022-02-02_13-33-22_063004_148/logs --resource_dir=/tmp/ray/session_2022-02-02_13-33-22_063004_148/runtime_resources --metrics-agent-port=62929 --metrics_export_port=44729 --object_store_memory=8957975347 --plasma_directory=/dev/shm --ray-debugger-external=0 --agent_command=/home/ray/anaconda3/bin/python -u /home/ray/anaconda3/lib/python3.7/site-packages/ray/dashboard/agent.py --node-ip-address=10.0.2.206 --redis-address=10.0.5.197:6379 --metrics-export-port=44729 --dashboard-agent-port=62929 --listen-port=0 --node-manager-port=RAY_NODE_MANAGER_PORT_PLACEHOLDER --object-store-name=/tmp/ray/session_2022-02-02_13-33-22_063004_148/sockets/plasma_store --raylet-name=/tmp/ray/session_2022-02-02_13-33-22_063004_148/sockets/raylet --temp-dir=/tmp/ray --session-dir=/tmp/ray/session_2022-02-02_13-33-22_063004_148 --runtime-env-dir=/tmp/ray/session_2022-02-02_13-33-22_063004_148/runtime_resources --log-dir=/tmp/ray/session_2022-02-02_13-33-22_063004_148/logs --logging-rotate-bytes=536870912 --logging-rotate-backup-count=5 --redis-password=5241590000000000
ray        259  0.5  0.4 3869180 142784 ?      Sl   13:39   0:29  \_ /home/ray/anaconda3/bin/python -u /home/ray/anaconda3/lib/python3.7/site-packages/ray/dashboard/agent.py --node-ip-address=10.0.2.206 --redis-address=10.0.5.197:6379 --metrics-export-port=44729 --dashboard-agent-port=62929 --listen-port=0 --node-manager-port=38373 --object-store-name=/tmp/ray/session_2022-02-02_13-33-22_063004_148/sockets/plasma_store --raylet-name=/tmp/ray/session_2022-02-02_13-33-22_063004_148/sockets/raylet --temp-dir=/tmp/ray --session-dir=/tmp/ray/session_2022-02-02_13-33-22_063004_148 --runtime-env-dir=/tmp/ray/session_2022-02-02_13-33-22_063004_148/runtime_resources --log-dir=/tmp/ray/session_2022-02-02_13-33-22_063004_148/logs --logging-rotate-bytes=536870912 --logging-rotate-backup-count=5 --redis-password=5241590000000000
ray        223  0.2  0.2 310668 81604 ?        Sl   13:39   0:12 /home/ray/anaconda3/bin/python -u /home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/log_monitor.py --redis-address=10.0.5.197:6379 --logs-dir=/tmp/ray/session_2022-02-02_13-33-22_063004_148/logs --logging-rotate-bytes=536870912 --logging-rotate-backup-count=5 --redis-password 5241590000000000
fishbone commented 2 years ago

@amholler have you restarted the Ray cluster after upgrading? If not, you'll end up with the driver and the cluster running different Ray versions.

From the log, it says it failed to fetch the key; usually that happens when the Ray version and the autoscaler version are different.

*) Then I ran the following update on the head and on the worker:
   "pip install -U https://s3-us-west-2.amazonaws.com/ray-wheels/latest/ray-2.0.0.dev0-cp37-cp37m-manylinux2014_x86_64.whl"

This command needs to be run in setupCommands:, I think. Basically, the Ray version needs to be upgraded before the Ray cluster is started.

amholler commented 2 years ago

I did not restart the Ray cluster after upgrading. But "ray --version" showed the expected version, 2.0.0.dev0, after upgrading on both the head and the worker.

In a k8s deployment, the autoscaler is running in a separate operator node; that node shows that it is running version 2.0.0.dev0 as well.

I can run the pip install in the setupCommands for the workers, but unfortunately, Ray does not support specifying setupCommands for the head when deploying onto K8s. But what I think I can do is bring up the Ray head (with 0 workers), update the head, and then restart the head, if you think that may work.

fishbone commented 2 years ago

I think ray --version does not report the Ray version that is running, only the version currently installed. For example, you can run ray --version even without the cluster up.

What we need is to make sure all images are running the same version of Ray. I think you can give your approach a try, but it would still be better to have them all run the same version of Ray from the beginning.
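
One way to sanity-check this is the small probe below: a hedged sketch that pins a task to every node via the built-in node:<ip> resource and prints the Ray version each freshly started worker imports. Note that it reflects the installed package, so long-running processes started before the upgrade (raylet, GCS, autoscaler) still need a restart to pick up the new version:

import ray

ray.init(address="auto")  # connect to the running cluster, e.g. from the head pod

@ray.remote(num_cpus=0)
def imported_version():
    import ray
    return ray.__version__

# One probe task per alive node, pinned via the node's built-in node:<ip> resource.
for node in ray.nodes():
    if not node["Alive"]:
        continue
    ip = node["NodeManagerAddress"]
    ref = imported_version.options(resources={"node:" + ip: 0.001}).remote()
    print(ip, ray.get(ref))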

amholler commented 2 years ago

Thanks @iycheng, let me see if I can use the head-restart trick. (And BTW, it would be awesome if Ray supported setupCommands for the head node when deploying to K8s clusters, like it does for deployment to AWS VMs.)

mwtian commented 2 years ago

Yes, please let us know if restarting the head node helps; I think it will most likely fix the issue. I just reproduced the issue by running ray start --head with Ray 1.9.2, installing Ray nightly, and then running ray start --address=127.0.0.1:6379.

In future with GCS bootstrapping, it should be possible for us to add version checks when starting a worker.

amholler commented 2 years ago

Yes, thank you @mwtian and @iycheng, that worked! At least it worked fine in my single-worker-node repro scenario above, so hopefully it will work in general.

I was able to do the "pip install" for the workers using "setupCommands:", and I was able to do the "pip install" for the head here:

  headStartRayCommands:
    - ray stop
    - pip install -U https://s3-us-west-2.amazonaws.com/ray-wheels/latest/ray-2.0.0.dev0-cp37-cp37m-manylinux2014_x86_64.whl
    - ulimit -n 65536; ray start --head --no-monitor --dashboard-host 0.0.0.0 &> /tmp/raylogs
amholler commented 2 years ago

My experimental run last night with this change looked great, with the Ray Autoscaler scaling down as desired. Thank you very much!!

mwtian commented 2 years ago

Great to hear!

ericl commented 2 years ago

Re-opening since the feature flag has been disabled for now in master.

amholler commented 2 years ago

Hope this functionality can be turned back on by default soon! It made a huge positive impact on efficiency during Ray Tune async hyperband search. Thanks!!

mwtian commented 2 years ago

Note: fix by Eric in https://github.com/ray-project/ray/pull/22020, turned off in https://github.com/ray-project/ray/pull/22132 because of failure in pipelined_ingestion_1500_gb_15_windows.

ericl commented 2 years ago

Yep, for context: in master, setting RAY_scheduler_report_pinned_bytes_only=1 fixes the issue. However, it causes the pipelined ingest test to crash for unknown reasons, which is the only thing blocking enabling it by default.
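
For anyone who wants to try the flag locally, a minimal sketch, assuming the usual RAY_<config-name> environment override and that it is set before any Ray processes start:

import os
import ray

# The override must be in the environment of the processes that start the
# raylet/GCS. For a quick single-machine experiment, setting it before
# ray.init() (which spawns a local cluster) is enough; on a real cluster,
# export it before running `ray start` on every node instead.
os.environ["RAY_scheduler_report_pinned_bytes_only"] = "1"

ray.init()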

mwtian commented 2 years ago

pipelined_ingestion_1500_gb is still passing after re-enabling pinned-bytes-only reporting. The test running time is ~2600s compared to ~2800s before; not sure if that is related.

However, it seems autoscaling_shuffle_1tb_1000_partitions started failing by timing out. I'm bisecting but https://github.com/ray-project/ray/pull/22786 is likely the root cause. Error logs seen from driver:

2022-03-03 12:02:02,218 WARNING worker.py:1471 -- Traceback (most recent call last):
  File "python/ray/_raylet.pyx", line 894, in ray._raylet.spill_objects_handler
  File "python/ray/_raylet.pyx", line 897, in ray._raylet.spill_objects_handler
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/external_storage.py", line 554, in spill_objects
    return _external_storage.spill_objects(object_refs, owner_addresses)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/external_storage.py", line 303, in spill_objects
    return self._write_multiple_objects(f, object_refs, owner_addresses, url)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/external_storage.py", line 137, in _write_multiple_objects
    raise ValueError(error)
ValueError: object ref 8c5fdac94fbfecf4ffffffffffffffffffffffff0100000047010000 does not exist.
An unexpected internal error occurred while the IO worker was spilling objects: object ref 8c5fdac94fbfecf4ffffffffffffffffffffffff0100000047010000 does not exist.
......
(raylet) [2022-03-03 12:02:02,633 E 206 206] (raylet) local_object_manager.cc:33: Plasma object 4c6774f519f28c64ffffffffffffffffffffffff0100000083030000 was evicted before the raylet could pin it.
......
2022-03-03 11:34:55,042 WARNING worker.py:1471 -- A worker died or was killed while executing a task by an unexpected system error. To troubleshoot the problem, check the logs for the dead worker. RayTask ID: 2354d35098fa013a88c00590fbd05c465f36b4ad01000000 Worker ID: ead9645d4fbe2bf614624fb317c8e8a461e1af9186fb4177c380982f Node ID: 4f07e6898076803a928950bc3ce4763929345287b07a78de036889ba Worker IP address: 172.31.80.144 Worker port: 10002 Worker PID: 193
......

Looks like some assumptions in the code are broken. I will revert https://github.com/ray-project/ray/pull/22786 then investigate. Also, chatting with people familiar with the code path would be very helpful.

ericl commented 2 years ago

@mwtian any update on the root cause of this issue?

mwtian commented 2 years ago

Still looking at this. Will have a fix soon.

mwtian commented 2 years ago

Chatted with @stephanie-wang yesterday and learned a lot of useful background about the object store. Most likely the issue is that nodes with spilled objects but no pinned objects report 0 pinned memory usage, so after the feature flag is enabled, these nodes get drained and live objects on them are destroyed. https://github.com/ray-project/ray/pull/23425 has a proposed fix.
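
To make the failure mode concrete, a purely conceptual sketch (illustrative names only, not Ray's actual internals) of why pinned-bytes-only reporting can make a node that holds only spilled objects look idle:

# Conceptual sketch only; the names and structure are illustrative.
from dataclasses import dataclass

@dataclass
class NodeUsage:
    pinned_bytes: int     # primary copies pinned in the local plasma store
    spilled_bytes: int    # objects this node has spilled to external storage
    secondary_bytes: int  # secondary copies of objects owned elsewhere

def reported_object_store_bytes(n: NodeUsage, pinned_only: bool) -> int:
    # With pinned-bytes-only reporting, secondary copies no longer count,
    # which is what lets the autoscaler release the idle nodes in this issue.
    return n.pinned_bytes if pinned_only else n.pinned_bytes + n.secondary_bytes

def looks_idle(n: NodeUsage, pinned_only: bool) -> bool:
    # A node reporting zero usage becomes a downscaling candidate. The hazard:
    # a node whose only data is spilled (unpinned) reports 0 and may be drained,
    # destroying objects that are still live.
    return reported_object_store_bytes(n, pinned_only) == 0

node = NodeUsage(pinned_bytes=0, spilled_bytes=10 * 2**30, secondary_bytes=0)
print(looks_idle(node, pinned_only=True))  # True -> node could be drained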