ray-project / ray

Ray is a unified framework for scaling AI and Python applications. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

[tune/autoscaler] Cluster restarts after terminating an idle node blocks the autoscaler for >60s #3501

Closed hartikainen closed 5 years ago

hartikainen commented 5 years ago

System information

Describe the problem

I just ran a longer run on a GCP cluster and saw my whole cluster going down, possibly after a failed nodes() request in the node manager.

Source code / logs

Zipped /tmp/ray: https://drive.google.com/file/d/18uyg7Xbc0xUgsNKIvNgSy9KtrM1TZGqB/view?usp=sharing.

I think the error begins at line 166256 in monitor.err:

Exception in thread Thread-1:
Traceback (most recent call last):
  File "/home/ubuntu/anaconda3/envs/softlearning/lib/python3.6/threading.py", line 916, in _bootstrap_inner
    self.run()
  File "/home/ubuntu/anaconda3/envs/softlearning/lib/python3.6/site-packages/ray/autoscaler/autoscaler.py", line 252, in run
    self._launch_node(config, count)
  File "/home/ubuntu/anaconda3/envs/softlearning/lib/python3.6/site-packages/ray/autoscaler/autoscaler.py", line 234, in _launch_node
    before = self.provider.nodes(tag_filters=tag_filters)
  File "/home/ubuntu/anaconda3/envs/softlearning/lib/python3.6/site-packages/ray/autoscaler/gcp/node_provider.py", line 88, in nodes
    filter=filter_expr,
  File "/home/ubuntu/anaconda3/envs/softlearning/lib/python3.6/site-packages/googleapiclient/_helpers.py", line 130, in positional_wrapper
    return wrapped(*args, **kwargs)
  File "/home/ubuntu/anaconda3/envs/softlearning/lib/python3.6/site-packages/googleapiclient/http.py", line 844, in execute
    method=str(self.method), body=self.body, headers=self.headers)
  File "/home/ubuntu/anaconda3/envs/softlearning/lib/python3.6/site-packages/googleapiclient/http.py", line 181, in _retry_request
    raise exception
  File "/home/ubuntu/anaconda3/envs/softlearning/lib/python3.6/site-packages/googleapiclient/http.py", line 162, in _retry_request
    resp, content = http.request(uri, method, *args, **kwargs)
  File "/home/ubuntu/anaconda3/envs/softlearning/lib/python3.6/site-packages/google_auth_httplib2.py", line 187, in request
    self._request, method, uri, request_headers)
  File "/home/ubuntu/anaconda3/envs/softlearning/lib/python3.6/site-packages/google/auth/credentials.py", line 122, in before_request
    self.refresh(request)
  File "/home/ubuntu/anaconda3/envs/softlearning/lib/python3.6/site-packages/google/auth/compute_engine/credentials.py", line 96, in refresh
    self._retrieve_info(request)
  File "/home/ubuntu/anaconda3/envs/softlearning/lib/python3.6/site-packages/google/auth/compute_engine/credentials.py", line 78, in _retrieve_info
    service_account=self._service_account_email)
  File "/home/ubuntu/anaconda3/envs/softlearning/lib/python3.6/site-packages/google/auth/compute_engine/_metadata.py", line 179, in get_service_account_info
    recursive=True)
  File "/home/ubuntu/anaconda3/envs/softlearning/lib/python3.6/site-packages/google/auth/compute_engine/_metadata.py", line 115, in get
    response = request(url=url, method='GET', headers=_METADATA_HEADERS)
  File "/home/ubuntu/anaconda3/envs/softlearning/lib/python3.6/site-packages/google_auth_httplib2.py", line 116, in __call__
    url, method=method, body=body, headers=headers, **kwargs)
  File "/home/ubuntu/anaconda3/envs/softlearning/lib/python3.6/site-packages/httplib2/__init__.py", line 1924, in request
    cachekey,
  File "/home/ubuntu/anaconda3/envs/softlearning/lib/python3.6/site-packages/httplib2/__init__.py", line 1595, in _request
    conn, request_uri, method, body, headers
  File "/home/ubuntu/anaconda3/envs/softlearning/lib/python3.6/site-packages/httplib2/__init__.py", line 1502, in _conn_request
    conn.request(method, request_uri, body, headers)
  File "/home/ubuntu/anaconda3/envs/softlearning/lib/python3.6/http/client.py", line 1239, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "/home/ubuntu/anaconda3/envs/softlearning/lib/python3.6/http/client.py", line 1285, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
  File "/home/ubuntu/anaconda3/envs/softlearning/lib/python3.6/http/client.py", line 1234, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)
  File "/home/ubuntu/anaconda3/envs/softlearning/lib/python3.6/http/client.py", line 1026, in _send_output
    self.send(msg)
  File "/home/ubuntu/anaconda3/envs/softlearning/lib/python3.6/http/client.py", line 986, in send
    self.sock.sendall(data)
BrokenPipeError: [Errno 32] Broken pipe

That seems to escalate to things like:

raylet_monitor: no process found
raylet_monitor: no process found
raylet_monitor: no process found
raylet_monitor: no process found
raylet_monitor: no process found
raylet_monitor: no process found
raylet_monitor: no process found
raylet_monitor: no process found
raylet_monitor: no process found
W1209 03:55:40.457123  9036 monitor.cc:48] Client timed out: 8209f70d7dac7d1ff72cecbbfb0b42a1d2fea03c
W1209 03:55:40.457242  9036 monitor.cc:48] Client timed out: a8ccda861864abbb7591fad9a481fc931fd2e85d
W1209 03:55:40.457264  9036 monitor.cc:48] Client timed out: 6e7979f06687be0285d79d36b82886f5739e4bb8
W1209 03:55:40.457294  9036 monitor.cc:48] Client timed out: 6388c2e09208d56654bd132239ab6a1aaa3cda2b
W1209 03:55:40.457314  9036 monitor.cc:48] Client timed out: df722e44ad2b1b307f2fe7e5b977aede804a06e2
W1209 03:55:40.457343  9036 monitor.cc:48] Client timed out: b3e005f19eefc1d8c92889183c54e5e0587ca383
W1209 03:55:40.457381  9036 monitor.cc:48] Client timed out: e9187643c10f71652faf28eb353acb7411b08e58
W1209 03:55:40.557929  9036 monitor.cc:48] Client timed out: 175712e4405c7c862cd4b23d28932c63789dd215
W1209 03:55:40.558032  9036 monitor.cc:48] Client timed out: ac95abd524990bb95aed6b9e1c602c542005d328
W1209 03:55:41.863957  9036 monitor.cc:48] Client timed out: b1455fed6a92b3a630937556670f72c0cde5c034
W1209 03:55:41.964691  9036 monitor.cc:48] Client timed out: cb60b47f6319ae441cb6427232bfc5230a89a87c
W1209 03:55:41.964812  9036 monitor.cc:48] Client timed out: c6b7b6c7b3574e76444459bc100b83765b467fb9
W1209 03:55:41.964834  9036 monitor.cc:48] Client timed out: c1157e7dc5eef9f930b1c506704c1766379a20fd
W1209 03:55:41.964869  9036 monitor.cc:48] Client timed out: 32f155cbd6c2281af5648c818df3855d43c3a268
W1209 03:55:42.065784  9036 monitor.cc:48] Client timed out: 6fad2650b50089f8bb06a45aed40e3c9e3bc6ccb
W1209 03:55:42.065918  9036 monitor.cc:48] Client timed out: 57411fa29f0b599b17ed15d904fba1a0b8b61f79
W1209 03:55:42.166649  9036 monitor.cc:48] Client timed out: a09871dee70a5f62320a8aafc00e5922aa636bc4
W1209 03:55:42.367646  9036 monitor.cc:48] Client timed out: 5185485c2977071b20ef69b1c6d1f2d13fd74def
ericl commented 5 years ago

I think the trouble starts even earlier, at line 165825: StandardAutoscaler: Terminating idle node: ray-humanoid-default-2-worker-3819cb87

StandardAutoscaler [2018-12-09 03:54:33.664806]: 12/11 target nodes (0 pending)
 - NodeIdleSeconds: Min=0 Mean=27217 Max=117963
 - NumNodesConnected: 13
 - NumNodesUsed: 9.0
 - ResourceUsage: 144.0/208.0 b'CPU', 0.0/0.0 b'GPU'
 - TimeSinceLastHeartbeat: Min=0 Mean=0 Max=0
StandardAutoscaler: Terminating idle node: ray-humanoid-default-2-worker-3819cb87
URL being requested: DELETE https://www.googleapis.com/compute/v1/projects/sac-ray-test/zones/us-west1-a/instances/ray-humanoid-default-2-worker-3819cb87?alt=json
Waiting for operation operation-1544327673674-57c8ece67f910-9e40e51d-63333b5e to finish...
URL being requested: GET https://www.googleapis.com/compute/v1/projects/sac-ray-test/zones/us-west1-a/operations/operation-1544327673674-57c8ece67f910-9e40e51d-63333b5e?alt=json
URL being requested: GET https://www.googleapis.com/compute/v1/projects/sac-ray-test/zones/us-west1-a/operations/operation-1544327673674-57c8ece67f910-9e40e51d-63333b5e?alt=json
URL being requested: GET https://www.googleapis.com/compute/v1/projects/sac-ray-test/zones/us-west1-a/operations/operation-1544327673674-57c8ece67f910-9e40e51d-63333b5e?alt=json
W1209 03:54:44.689539  9036 monitor.cc:48] Client timed out: 4e9c74406c8a509aa5b66f73f003429c43258c4f
URL being requested: GET https://www.googleapis.com/compute/v1/projects/sac-ray-test/zones/us-west1-a/operations/operation-1544327673674-57c8ece67f910-9e40e51d-63333b5e?alt=json
URL being requested: GET https://www.googleapis.com/compute/v1/projects/sac-ray-test/zones/us-west1-a/operations/operation-1544327673674-57c8ece67f910-9e40e51d-63333b5e?alt=json
URL being requested: GET https://www.googleapis.com/compute/v1/projects/sac-ray-test/zones/us-west1-a/operations/operation-1544327673674-57c8ece67f910-9e40e51d-63333b5e?alt=json
URL being requested: GET https://www.googleapis.com/compute/v1/projects/sac-ray-test/zones/us-west1-a/operations/operation-1544327673674-57c8ece67f910-9e40e51d-63333b5e?alt=json
Done.
URL being requested: GET https://www.googleapis.com/compute/v1/projects/sac-ray-test/zones/us-west1-a/instances?filter=%28%28labels.ray-node-type+%3D+worker%29%29+AND+%28%28status+%3D+STAGING%29+OR+%28status+%3D+PROVISIONING%29+OR+%28status+%3D+RUNNING%29%29+AND+%28labels.ray-cluster-name+%3D+humanoid-default-2%29&alt=json
StandardAutoscaler [2018-12-09 03:55:05.181084]: 11/11 target nodes (0 pending)
 - NodeIdleSeconds: Min=31 Mean=27249 Max=117995
 - NumNodesConnected: 13
 - NumNodesUsed: 9.0
 - ResourceUsage: 144.0/208.0 b'CPU', 0.0/0.0 b'GPU'
 - TimeSinceLastHeartbeat: Min=31 Mean=31 Max=31
StandardAutoscaler: No heartbeat from node ray-humanoid-default-2-worker-1c189152 in 31.834206104278564 seconds, restarting Ray to recover...

This is followed by restart requests issued to all the nodes.

It seems like the delete operation stalled the autoscaler for long enough that, by the time it got back, all of the nodes looked timed out.
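The failure mode can be sketched in miniature: if the autoscaler's main loop blocks for longer than the heartbeat timeout, every node's last-seen timestamp is stale by the time the loop resumes. The names and the timeout value below are illustrative, not Ray's actual internals:

```python
import time

HEARTBEAT_TIMEOUT_S = 0.1  # stand-in for the autoscaler's ~30s timeout


def timed_out_nodes(last_heartbeat, now, timeout):
    """Return the nodes whose last heartbeat is older than `timeout`."""
    return sorted(n for n, t in last_heartbeat.items() if now - t > timeout)


last_heartbeat = {"worker-1": time.monotonic(), "worker-2": time.monotonic()}

# Simulate the autoscaler loop blocking on a slow cloud API call
# (in the real issue: a synchronous instance DELETE that took >30 seconds).
time.sleep(0.2)

# On the next iteration, *every* node looks dead, even the healthy ones.
print(timed_out_nodes(last_heartbeat, time.monotonic(), HEARTBEAT_TIMEOUT_S))
# → ['worker-1', 'worker-2']
```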

A workaround would be to make the node-termination operation non-blocking (I believe it already is on AWS). The proper fix is to remove heartbeat tracking from the autoscaler and instead rely on the actual health status reported by the C++ code. @robertnishihara you brought this up before — is there a Python API / channel to subscribe to the canonical node health statuses?
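As a sketch of the workaround, the blocking terminate call could be pushed onto a background thread so the autoscaler loop keeps running. The helper and the dummy provider below are illustrative, not the actual NodeProvider implementation:

```python
import threading
import time


def terminate_node_async(provider, node_id):
    """Fire off the provider's terminate call without blocking the caller.

    `provider` is any object with a blocking `terminate_node(node_id)`
    method; the returned Thread can be joined or left to finish on its own.
    """
    thread = threading.Thread(
        target=provider.terminate_node, args=(node_id,), daemon=True)
    thread.start()
    return thread


# Example with a dummy provider whose terminate call is slow:
class SlowProvider:
    def __init__(self):
        self.terminated = []

    def terminate_node(self, node_id):
        time.sleep(0.1)  # stand-in for a slow cloud DELETE operation
        self.terminated.append(node_id)


provider = SlowProvider()
t = terminate_node_async(provider, "worker-3819cb87")
# The caller returns immediately; the delete completes in the background.
t.join()
print(provider.terminated)  # → ['worker-3819cb87']
```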

robertnishihara commented 5 years ago

This doesn't fully work yet, but it will look something like the following (notes inline).

import time

import ray
import redis

redis_client = ...
subscribe_client = redis_client.pubsub(ignore_subscribe_messages=True)
subscribe_client.subscribe(ray.gcs_utils.CLIENT_CHANNEL)

# Note that CLIENT_CHANNEL needs to be defined as
#     CLIENT_CHANNEL = str(TablePubsub.CLIENT).encode("ascii")
# in gcs_utils.py.

# Wait (slightly more than) 10 seconds for the node to be marked dead.
time.sleep(11)

message = subscribe_client.get_message()

# The code for parsing this is similar to the code implementing
# ray.global_state.client_table().
gcs_entry = ray.gcs_utils.GcsTableEntry.GetRootAsGcsTableEntry(message["data"], 0)

# Get the last entry.
client_info = ray.gcs_utils.ClientTableData.GetRootAsClientTableData(
    gcs_entry.Entries(gcs_entry.EntriesLength() - 1), 0)

if client_info.IsInsertion():
    # Ignore this case because a node was added, not removed.
    pass
else:
    # A node was removed.

    # Which node was removed? Unfortunately this doesn't work yet.
    # I'll fix this in https://github.com/ray-project/ray/pull/3370.
    client_info.NodeManagerAddress()
    client_info.NodeManagerPort()