Closed: hartikainen closed this issue 5 years ago
I think the trouble starts even earlier, at line 165825:

StandardAutoscaler: Terminating idle node: ray-humanoid-default-2-worker-3819cb87
StandardAutoscaler [2018-12-09 03:54:33.664806]: 12/11 target nodes (0 pending)
- NodeIdleSeconds: Min=0 Mean=27217 Max=117963
- NumNodesConnected: 13
- NumNodesUsed: 9.0
- ResourceUsage: 144.0/208.0 b'CPU', 0.0/0.0 b'GPU'
- TimeSinceLastHeartbeat: Min=0 Mean=0 Max=0
StandardAutoscaler: Terminating idle node: ray-humanoid-default-2-worker-3819cb87
URL being requested: DELETE https://www.googleapis.com/compute/v1/projects/sac-ray-test/zones/us-west1-a/instances/ray-humanoid-default-2-worker-3819cb87?alt=json
Waiting for operation operation-1544327673674-57c8ece67f910-9e40e51d-63333b5e to finish...
URL being requested: GET https://www.googleapis.com/compute/v1/projects/sac-ray-test/zones/us-west1-a/operations/operation-1544327673674-57c8ece67f910-9e40e51d-63333b5e?alt=json
URL being requested: GET https://www.googleapis.com/compute/v1/projects/sac-ray-test/zones/us-west1-a/operations/operation-1544327673674-57c8ece67f910-9e40e51d-63333b5e?alt=json
URL being requested: GET https://www.googleapis.com/compute/v1/projects/sac-ray-test/zones/us-west1-a/operations/operation-1544327673674-57c8ece67f910-9e40e51d-63333b5e?alt=json
W1209 03:54:44.689539 9036 monitor.cc:48] Client timed out: 4e9c74406c8a509aa5b66f73f003429c43258c4f
URL being requested: GET https://www.googleapis.com/compute/v1/projects/sac-ray-test/zones/us-west1-a/operations/operation-1544327673674-57c8ece67f910-9e40e51d-63333b5e?alt=json
URL being requested: GET https://www.googleapis.com/compute/v1/projects/sac-ray-test/zones/us-west1-a/operations/operation-1544327673674-57c8ece67f910-9e40e51d-63333b5e?alt=json
URL being requested: GET https://www.googleapis.com/compute/v1/projects/sac-ray-test/zones/us-west1-a/operations/operation-1544327673674-57c8ece67f910-9e40e51d-63333b5e?alt=json
URL being requested: GET https://www.googleapis.com/compute/v1/projects/sac-ray-test/zones/us-west1-a/operations/operation-1544327673674-57c8ece67f910-9e40e51d-63333b5e?alt=json
Done.
URL being requested: GET https://www.googleapis.com/compute/v1/projects/sac-ray-test/zones/us-west1-a/instances?filter=%28%28labels.ray-node-type+%3D+worker%29%29+AND+%28%28status+%3D+STAGING%29+OR+%28status+%3D+PROVISIONING%29+OR+%28status+%3D+RUNNING%29%29+AND+%28labels.ray-cluster-name+%3D+humanoid-default-2%29&alt=json
StandardAutoscaler [2018-12-09 03:55:05.181084]: 11/11 target nodes (0 pending)
- NodeIdleSeconds: Min=31 Mean=27249 Max=117995
- NumNodesConnected: 13
- NumNodesUsed: 9.0
- ResourceUsage: 144.0/208.0 b'CPU', 0.0/0.0 b'GPU'
- TimeSinceLastHeartbeat: Min=31 Mean=31 Max=31
StandardAutoscaler: No heartbeat from node ray-humanoid-default-2-worker-1c189152 in 31.834206104278564 seconds, restarting Ray to recover...
This is followed by restart requests issued to all the nodes.
It seems the blocking delete operation stalled the autoscaler for long enough (roughly 31 seconds pass between the 03:54:33 and 03:55:05 status updates above, exceeding the heartbeat timeout) that, by the time it resumed, every node appeared to have timed out.
A workaround would be to make the node-termination operation non-blocking (I believe it already is on AWS). The proper fix is to remove the heartbeat tracking from the autoscaler and instead rely on the actual health status reported by the C++ code. @robertnishihara, you brought this up before: is there a Python API / channel to subscribe to the canonical node health statuses?
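For illustration, a minimal sketch of the non-blocking workaround, assuming a node-provider object with a blocking terminate_node(node_id) method that wraps the instances().delete(...) call shown in the log above (the names here are hypothetical, not the actual autoscaler API):

import threading

def terminate_node_async(provider, node_id):
    # Run the blocking cloud API call on a background thread so the
    # autoscaler's update loop is not stalled waiting for the GCP
    # operation to finish.
    thread = threading.Thread(
        target=provider.terminate_node, args=(node_id,), daemon=True)
    thread.start()
    return thread

The autoscaler loop could then keep processing heartbeats while the delete operation completes in the background.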
This doesn't fully work yet, but it will look something like the following (notes inline).
import time

import ray

redis_client = ...
subscribe_client = redis_client.pubsub(ignore_subscribe_messages=True)
subscribe_client.subscribe(ray.gcs_utils.CLIENT_CHANNEL)
# Note that CLIENT_CHANNEL needs to be defined as
#     CLIENT_CHANNEL = str(TablePubsub.CLIENT).encode("ascii")
# in gcs_utils.py.

# Wait 10 seconds for the node to be marked dead before reading the message.
time.sleep(11)
message = subscribe_client.get_message()

# The parsing below is similar to the code implementing
# ray.global_state.client_table().
gcs_entry = ray.gcs_utils.GcsTableEntry.GetRootAsGcsTableEntry(
    message["data"], 0)
# Get the last entry.
client_info = ray.gcs_utils.ClientTableData.GetRootAsClientTableData(
    gcs_entry.Entries(gcs_entry.EntriesLength() - 1), 0)

if client_info.IsInsertion():
    # Ignore this case: a node was added, not removed.
    pass
else:
    # A node was removed. Which node was removed? Unfortunately this
    # doesn't work yet; I'll fix it in
    # https://github.com/ray-project/ray/pull/3370.
    client_info.NodeManagerAddress()
    client_info.NodeManagerPort()
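Rather than sleeping for a fixed interval, a caller would more likely poll until a message arrives. A small sketch of such a loop, assuming the subscribe_client from above (the timeout value is illustrative):

def wait_for_client_table_message(subscribe_client, timeout_s=30.0):
    # Poll the pubsub connection until a client-table message arrives or
    # the timeout expires; returns the raw message dict, or None on timeout.
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        message = subscribe_client.get_message()
        if message is not None:
            return message
        time.sleep(0.1)  # Avoid busy-waiting on Redis.
    return None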
System information

Describe the problem

Just ran a longer run on a GCP cluster and saw my whole cluster go down, possibly after a failed nodes() request in the node manager.

Source code / logs

Zipped /tmp/ray: https://drive.google.com/file/d/18uyg7Xbc0xUgsNKIvNgSy9KtrM1TZGqB/view?usp=sharing. I think the error begins at line 166256 in monitor.err. That seems to escalate to things like: