ray-project / ray

Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

[autoscaler] Autoscaler prints (harmless) errors every 30 mins and then kills the workers in GCP cluster #9368

Open martinrebane opened 4 years ago

martinrebane commented 4 years ago

What is the problem?

I am running a Ray cluster on Google Cloud Platform, using RaySGD to train my model. Every 30 minutes I see an error in the ray monitor [yaml] log: Error during autoscaling (stack trace attached). By itself this error does no harm and the cluster keeps working fine; nothing is scaled or killed. But after AUTOSCALER_MAX_NUM_FAILURES attempts, the error is followed by StandardAutoscaler: Too many errors, abort. and all the worker nodes are killed.

Ray version and other system information (Python version, TensorFlow version, OS): Ray 0.8.6, Ubuntu 18.04 on GCP, PyTorch 1.4, RaySGD TorchTrainer via a remote() call

Reproduction (REQUIRED)

Please provide a script that can be run to reproduce the issue. The script should have no external library dependencies (i.e., use fake or mock data / environments).

I am running my project via a remote RaySGD trainer:

stats = ray.get(trainer.train.remote(max_retries=100, num_steps=50))
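
For context, a minimal sketch of how such a remote trainer handle can be set up (this is not my exact script; the model, data, and resource settings are placeholders, and it assumes the standard RaySGD creator-style TorchTrainer API of this release):

import ray
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset
from ray.util.sgd import TorchTrainer

ray.init(address="auto")  # connect to the existing cluster

def model_creator(config):
    return nn.Linear(1, 1)  # placeholder model

def optimizer_creator(model, config):
    return torch.optim.SGD(model.parameters(), lr=1e-2)

def data_creator(config):
    x = torch.randn(256, 1)  # placeholder data: y = 2x
    return DataLoader(TensorDataset(x, 2 * x), batch_size=32)

# Wrap the trainer in an actor so train() can be invoked via .remote()
RemoteTrainer = ray.remote(num_cpus=0)(TorchTrainer)
trainer = RemoteTrainer.remote(
    model_creator=model_creator,
    data_creator=data_creator,
    optimizer_creator=optimizer_creator,
    loss_creator=nn.MSELoss,
    num_workers=2,
    use_gpu=True,
)

stats = ray.get(trainer.train.remote(max_retries=100, num_steps=50))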

The error happens exactly every 30 minutes:

[last 3 shown]
2020-07-08 23:27:17,537 ERROR autoscaler.py:112 -- StandardAutoscaler: Error during autoscaling.
2020-07-08 23:57:43,427 ERROR autoscaler.py:112 -- StandardAutoscaler: Error during autoscaling.
2020-07-09 00:27:35,044 ERROR autoscaler.py:112 -- StandardAutoscaler: Error during autoscaling.

and then

2020-07-09 00:27:35,046 CRITICAL autoscaler.py:116 -- StandardAutoscaler: Too many errors, abort.

The recurring Error during autoscaling itself does not seem to do any harm at all (all workers keep doing their job), but the killing of the workers is of course very problematic, as new ones won't be created.

Full stack trace for the last autoscaling error and for the error that causes the abort:

2020-07-09 00:27:20,049 INFO autoscaler.py:187 -- Ending bringup phase
2020-07-09 00:27:20,049 INFO autoscaler.py:389 -- StandardAutoscaler: 2/2 target nodes (0 pending) (2 failed to update)
2020-07-09 00:27:20,049 INFO autoscaler.py:390 -- LoadMetrics: MostDelayedHeartbeats={'10.164.15.233': 0.34097719192504883, '10.164.15.235': 0.3409607410430908, '10.164.0.14': 0.34094786643981934}, NodeIdleSeconds=Min=0 Mean=4477 Max=13432, NumNodesConnected=3, NumNodesUsed=2.0, ResourceUsage=2.0/3.0 CPU, 2.0/2.0 GPU, 0.0 GiB/26.06 GiB memory, 0.0/1.0 node:10.164.0.14, 0.0/1.0 node:10.164.15.233, 0.0/1.0 node:10.164.15.235, 0.0 GiB/7.92 GiB object_store_memory, TimeSinceLastHeartbeat=Min=0 Mean=0 Max=0
2020-07-09 00:27:25,142 INFO autoscaler.py:389 -- StandardAutoscaler: 2/2 target nodes (0 pending) (2 failed to update)
2020-07-09 00:27:25,143 INFO autoscaler.py:390 -- LoadMetrics: MostDelayedHeartbeats={'10.164.0.14': 0.4190995693206787, '10.164.15.233': 0.4190821647644043, '10.164.15.235': 0.4190678596496582}, NodeIdleSeconds=Min=0 Mean=4479 Max=13437, NumNodesConnected=3, NumNodesUsed=2.0, ResourceUsage=2.0/3.0 CPU, 2.0/2.0 GPU, 0.0 GiB/26.06 GiB memory, 0.0/1.0 node:10.164.0.14, 0.0/1.0 node:10.164.15.233, 0.0/1.0 node:10.164.15.235, 0.0 GiB/7.92 GiB object_store_memory, TimeSinceLastHeartbeat=Min=0 Mean=0 Max=0
2020-07-09 00:27:25,144 INFO autoscaler.py:187 -- Ending bringup phase
2020-07-09 00:27:25,144 INFO autoscaler.py:389 -- StandardAutoscaler: 2/2 target nodes (0 pending) (2 failed to update)
2020-07-09 00:27:25,144 INFO autoscaler.py:390 -- LoadMetrics: MostDelayedHeartbeats={'10.164.0.14': 0.420978307723999, '10.164.15.233': 0.4209609031677246, '10.164.15.235': 0.4209465980529785}, NodeIdleSeconds=Min=0 Mean=4479 Max=13437, NumNodesConnected=3, NumNodesUsed=2.0, ResourceUsage=2.0/3.0 CPU, 2.0/2.0 GPU, 0.0 GiB/26.06 GiB memory, 0.0/1.0 node:10.164.0.14, 0.0/1.0 node:10.164.15.233, 0.0/1.0 node:10.164.15.235, 0.0 GiB/7.92 GiB object_store_memory, TimeSinceLastHeartbeat=Min=0 Mean=0 Max=0
2020-07-09 00:27:30,226 INFO autoscaler.py:389 -- StandardAutoscaler: 2/2 target nodes (0 pending) (2 failed to update)
2020-07-09 00:27:30,227 INFO autoscaler.py:390 -- LoadMetrics: MostDelayedHeartbeats={'10.164.0.14': 0.40857815742492676, '10.164.15.233': 0.4085671901702881, '10.164.15.235': 0.4085569381713867}, NodeIdleSeconds=Min=0 Mean=4481 Max=13442, NumNodesConnected=3, NumNodesUsed=2.0, ResourceUsage=2.0/3.0 CPU, 2.0/2.0 GPU, 0.0 GiB/26.06 GiB memory, 0.0/1.0 node:10.164.0.14, 0.0/1.0 node:10.164.15.233, 0.0/1.0 node:10.164.15.235, 0.0 GiB/7.92 GiB object_store_memory, TimeSinceLastHeartbeat=Min=0 Mean=0 Max=0
2020-07-09 00:27:30,228 INFO autoscaler.py:187 -- Ending bringup phase
2020-07-09 00:27:30,228 INFO autoscaler.py:389 -- StandardAutoscaler: 2/2 target nodes (0 pending) (2 failed to update)
2020-07-09 00:27:30,229 INFO autoscaler.py:390 -- LoadMetrics: MostDelayedHeartbeats={'10.164.0.14': 0.4104330539703369, '10.164.15.233': 0.41042208671569824, '10.164.15.235': 0.4104118347167969}, NodeIdleSeconds=Min=0 Mean=4481 Max=13442, NumNodesConnected=3, NumNodesUsed=2.0, ResourceUsage=2.0/3.0 CPU, 2.0/2.0 GPU, 0.0 GiB/26.06 GiB memory, 0.0/1.0 node:10.164.0.14, 0.0/1.0 node:10.164.15.233, 0.0/1.0 node:10.164.15.235, 0.0 GiB/7.92 GiB object_store_memory, TimeSinceLastHeartbeat=Min=0 Mean=0 Max=0
2020-07-09 00:27:35,044 ERROR autoscaler.py:112 -- StandardAutoscaler: Error during autoscaling.
Traceback (most recent call last):
  File "/opt/conda/lib/python3.7/site-packages/ray/autoscaler/autoscaler.py", line 110, in update
    self._update()
  File "/opt/conda/lib/python3.7/site-packages/ray/autoscaler/autoscaler.py", line 129, in _update
    nodes = self.workers()
  File "/opt/conda/lib/python3.7/site-packages/ray/autoscaler/autoscaler.py", line 385, in workers
    tag_filters={TAG_RAY_NODE_TYPE: NODE_TYPE_WORKER})
  File "/opt/conda/lib/python3.7/site-packages/ray/autoscaler/gcp/node_provider.py", line 85, in non_terminated_nodes
    filter=filter_expr,
  File "/opt/conda/lib/python3.7/site-packages/googleapiclient/_helpers.py", line 134, in positional_wrapper
    return wrapped(*args, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/googleapiclient/http.py", line 901, in execute
    headers=self.headers,
  File "/opt/conda/lib/python3.7/site-packages/googleapiclient/http.py", line 204, in _retry_request
    raise exception
  File "/opt/conda/lib/python3.7/site-packages/googleapiclient/http.py", line 177, in _retry_request
    resp, content = http.request(uri, method, *args, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/google_auth_httplib2.py", line 187, in request
    self._request, method, uri, request_headers)
  File "/opt/conda/lib/python3.7/site-packages/google/auth/credentials.py", line 124, in before_request
    self.refresh(request)
  File "/opt/conda/lib/python3.7/site-packages/google/auth/compute_engine/credentials.py", line 96, in refresh
    self._retrieve_info(request)
  File "/opt/conda/lib/python3.7/site-packages/google/auth/compute_engine/credentials.py", line 77, in _retrieve_info
    request, service_account=self._service_account_email
  File "/opt/conda/lib/python3.7/site-packages/google/auth/compute_engine/_metadata.py", line 226, in get_service_account_info
    recursive=True,
  File "/opt/conda/lib/python3.7/site-packages/google/auth/compute_engine/_metadata.py", line 144, in get
    response = request(url=url, method="GET", headers=_METADATA_HEADERS)
  File "/opt/conda/lib/python3.7/site-packages/google_auth_httplib2.py", line 116, in __call__
    url, method=method, body=body, headers=headers, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/httplib2/__init__.py", line 1994, in request
    cachekey,
  File "/opt/conda/lib/python3.7/site-packages/httplib2/__init__.py", line 1651, in _request
    conn, request_uri, method, body, headers
  File "/opt/conda/lib/python3.7/site-packages/httplib2/__init__.py", line 1558, in _conn_request
    conn.request(method, request_uri, body, headers)
  File "/opt/conda/lib/python3.7/http/client.py", line 1252, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "/opt/conda/lib/python3.7/http/client.py", line 1298, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
  File "/opt/conda/lib/python3.7/http/client.py", line 1247, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)
  File "/opt/conda/lib/python3.7/http/client.py", line 1026, in _send_output
    self.send(msg)
  File "/opt/conda/lib/python3.7/http/client.py", line 987, in send
    self.sock.sendall(data)
BrokenPipeError: [Errno 32] Broken pipe
2020-07-09 00:27:35,046 CRITICAL autoscaler.py:116 -- StandardAutoscaler: Too many errors, abort.
Error in monitor loop
Traceback (most recent call last):
  File "/opt/conda/lib/python3.7/site-packages/ray/monitor.py", line 283, in run
    self._run()
  File "/opt/conda/lib/python3.7/site-packages/ray/monitor.py", line 238, in _run
    self.autoscaler.update()
  File "/opt/conda/lib/python3.7/site-packages/ray/autoscaler/autoscaler.py", line 118, in update
    raise e
  File "/opt/conda/lib/python3.7/site-packages/ray/autoscaler/autoscaler.py", line 110, in update
    self._update()
  File "/opt/conda/lib/python3.7/site-packages/ray/autoscaler/autoscaler.py", line 129, in _update
    nodes = self.workers()
  File "/opt/conda/lib/python3.7/site-packages/ray/autoscaler/autoscaler.py", line 385, in workers
    tag_filters={TAG_RAY_NODE_TYPE: NODE_TYPE_WORKER})
  File "/opt/conda/lib/python3.7/site-packages/ray/autoscaler/gcp/node_provider.py", line 85, in non_terminated_nodes
    filter=filter_expr,
  File "/opt/conda/lib/python3.7/site-packages/googleapiclient/_helpers.py", line 134, in positional_wrapper
    return wrapped(*args, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/googleapiclient/http.py", line 901, in execute
    headers=self.headers,
  File "/opt/conda/lib/python3.7/site-packages/googleapiclient/http.py", line 204, in _retry_request
    raise exception
  File "/opt/conda/lib/python3.7/site-packages/googleapiclient/http.py", line 177, in _retry_request
    resp, content = http.request(uri, method, *args, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/google_auth_httplib2.py", line 187, in request
    self._request, method, uri, request_headers)
  File "/opt/conda/lib/python3.7/site-packages/google/auth/credentials.py", line 124, in before_request
    self.refresh(request)
  File "/opt/conda/lib/python3.7/site-packages/google/auth/compute_engine/credentials.py", line 96, in refresh
    self._retrieve_info(request)
  File "/opt/conda/lib/python3.7/site-packages/google/auth/compute_engine/credentials.py", line 77, in _retrieve_info
    request, service_account=self._service_account_email
  File "/opt/conda/lib/python3.7/site-packages/google/auth/compute_engine/_metadata.py", line 226, in get_service_account_info
    recursive=True,
  File "/opt/conda/lib/python3.7/site-packages/google/auth/compute_engine/_metadata.py", line 144, in get
    response = request(url=url, method="GET", headers=_METADATA_HEADERS)
  File "/opt/conda/lib/python3.7/site-packages/google_auth_httplib2.py", line 116, in __call__
    url, method=method, body=body, headers=headers, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/httplib2/__init__.py", line 1994, in request
    cachekey,
  File "/opt/conda/lib/python3.7/site-packages/httplib2/__init__.py", line 1651, in _request
    conn, request_uri, method, body, headers
  File "/opt/conda/lib/python3.7/site-packages/httplib2/__init__.py", line 1558, in _conn_request
    conn.request(method, request_uri, body, headers)
  File "/opt/conda/lib/python3.7/http/client.py", line 1252, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "/opt/conda/lib/python3.7/http/client.py", line 1298, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
  File "/opt/conda/lib/python3.7/http/client.py", line 1247, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)
  File "/opt/conda/lib/python3.7/http/client.py", line 1026, in _send_output
    self.send(msg)
  File "/opt/conda/lib/python3.7/http/client.py", line 987, in send
    self.sock.sendall(data)
BrokenPipeError: [Errno 32] Broken pipe
2020-07-09 00:27:35,047 ERROR autoscaler.py:415 -- StandardAutoscaler: kill_workers triggered
2020-07-09 00:27:35,325 INFO node_provider.py:207 -- NodeProvider: ray-sgd-pytorch-worker-65665abb: Terminating node
2020-07-09 00:27:35,953 INFO node_provider.py:22 -- wait_for_compute_zone_operation: Waiting for operation operation-1594254455341-5a9f74a7e0197-43345963-b0c773d2 to finish...
2020-07-09 00:27:57,311 INFO node_provider.py:33 -- wait_for_compute_zone_operation: Operation operation-1594254455341-5a9f74a7e0197-43345963-b0c773d2 finished.
2020-07-09 00:27:57,312 INFO node_provider.py:207 -- NodeProvider: ray-sgd-pytorch-worker-75eafb86: Terminating node
2020-07-09 00:27:57,811 INFO node_provider.py:22 -- wait_for_compute_zone_operation: Waiting for operation operation-1594254477325-5a9f74bcd7556-4ba10d54-e3ddb3f4 to finish...
2020-07-09 00:28:24,453 INFO node_provider.py:33 -- wait_for_compute_zone_operation: Operation operation-1594254477325-5a9f74bcd7556-4ba10d54-e3ddb3f4 finished.
2020-07-09 00:28:24,454 ERROR autoscaler.py:420 -- StandardAutoscaler: terminated 2 node(s)
2020-07-09 00:28:25,344 INFO commands.py:94 -- teardown_cluster: Keeping 2 nodes...
Monitor: Cleanup exception. Trying again...
2020-07-09 00:28:28,161 INFO commands.py:94 -- teardown_cluster: Keeping 2 nodes...
Monitor: Cleanup exception. Trying again...
2020-07-09 00:28:31,082 INFO commands.py:94 -- teardown_cluster: Keeping 2 nodes...
Monitor: Cleanup exception. Trying again...
2020-07-09 00:28:34,096 INFO commands.py:94 -- teardown_cluster: Keeping 2 nodes...

My cluster setup YAML:

cluster_name: sgd-pytorch

provider:
    type: gcp
    region: europe-west4
    availability_zone: europe-west4-b
    project_id: [MYID] # Globally unique project id

auth: {ssh_user: ubuntu}

min_workers: 2
initial_workers: 2
max_workers: 2
idle_timeout_minutes: 25

head_node:
    machineType: n1-standard-2
    disks:
      - boot: true
        autoDelete: true
        type: PERSISTENT
        initializeParams:
          diskSizeGb: 80
          # See https://cloud.google.com/compute/docs/images for more images
          sourceImage: projects/deeplearning-platform-release/global/images/family/pytorch-latest-cpu
    scheduling:
      - onHostMaintenance: TERMINATE

worker_nodes:
    machineType: n1-standard-4
    disks:
      - boot: true
        autoDelete: true
        type: PERSISTENT
        initializeParams:
          diskSizeGb: 50
          # See https://cloud.google.com/compute/docs/images for more images
          sourceImage: projects/deeplearning-platform-release/global/images/family/pytorch-latest-gpu

    guestAccelerators:
      - acceleratorType: projects/[MYPROJECT]/zones/europe-west4-b/acceleratorTypes/nvidia-tesla-t4
        acceleratorCount: 1
    metadata:
      items:
        - key: install-nvidia-driver
          value: "True"
    # Run workers on preemptible instances by default.
    # Comment this out to use on-demand.
    scheduling:
      - preemptible: true
      - onHostMaintenance: TERMINATE

file_mounts: {
    "/home/ubuntu/mypointnet": "/home/[LOCAL]/project/[project]",
    "/home/ubuntu/model/": "/home/[local]/model/"
}

# List of commands that will be run before `setup_commands`. If docker is
# enabled, these commands will run outside the container and before docker
# is setup.
initialization_commands: #[]
    - sudo apt-get -y install wget nano unzip zip man

# List of shell commands to run to set up nodes.
setup_commands:
    # Note: if you're developing Ray, you probably want to create an AMI that
    # has your Ray repo pre-cloned. Then, you can replace the pip installs
    # below with a git checkout <your_sha> (and possibly a recompile).
    # - echo 'export PATH="$HOME/anaconda3/envs/tensorflow_p36/bin:$PATH"' >> ~/.bashrc

    - pip install --upgrade pip
    - pip install -U oauth2client
    - pip install -U visdom jsonpatch
    - pip install -U shapely descartes ray[rllib]
    - cd [MYFOLDER] && pip install -e . # install my package
    - wget https://storage.googleapis.com/[MYFILE].zip
    - mkdir -p /home/ubuntu/data
    - unzip -uq [MYFILE].zip -d /home/ubuntu/data
    - rm [MYFILE].zip

# Custom commands that will be run on the head node after common setup.
head_setup_commands:
  - pip install -U google-api-python-client
  - git config --global user.email "[email]"
  - git config --global user.name "[name]"
  - visdom -port=8999 &> /dev/null &
  - mkdir -p /home/ubuntu/checkpoint

# Custom commands that will be run on worker nodes after common setup.
worker_setup_commands: []

# Command to start ray on the head node. You don't need to change this.
# NCCL_TIMEOUT_S is the maximum time one training (ray submit) can run
head_start_ray_commands:
    - ray stop
    - >-
      ulimit -n 65536;
      export NCCL_TIMEOUT_S=1209600;
      export NCCL_BUFFSIZE=8388608;
      ray start
      --head
      --redis-port=6379
      --object-manager-port=8076
      --autoscaling-config=~/ray_bootstrap_config.yaml
      --num-cpus=1
      --num-redis-shards=3

# Command to start ray on worker nodes. You don't need to change this.
worker_start_ray_commands:
    - ray stop
    - >-
      ulimit -n 65536;
      export NCCL_TIMEOUT_S=604800;
      export NCCL_BUFFSIZE=8388608;
      ray start
      --address=$RAY_HEAD_IP:6379
      --object-manager-port=8076
      --num-cpus=1
martinrebane commented 4 years ago

PS! I do not know if this is relevant, but I am submitting the job with the tmux option:

ray submit train_gcp.yaml ray_trainer.py --tmux
rkooo567 commented 4 years ago

cc @ijrsvt @henktillman

richardliaw commented 4 years ago

cc @martinrebane did you end up resolving this?

martinrebane commented 4 years ago

@richardliaw I solved it by adding export AUTOSCALER_MAX_NUM_FAILURES=100; to my cluster YAML, before the ray start commands (I added it for both the head and the workers). This works well for me, but for new users, discovering the problem and the solution might be rather frustrating.

So the error still pops up, but because the maximum number of failures is high, it does not kill the job.
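
For anyone hitting the same issue, this is roughly what the relevant part of my cluster YAML looks like with the workaround applied (worker_start_ray_commands gets the same export in front of its ray start line):

# Raise the autoscaler's failure tolerance so a transient GCP API error does not abort it
head_start_ray_commands:
    - ray stop
    - >-
      ulimit -n 65536;
      export AUTOSCALER_MAX_NUM_FAILURES=100;
      export NCCL_TIMEOUT_S=1209600;
      export NCCL_BUFFSIZE=8388608;
      ray start
      --head
      --redis-port=6379
      --object-manager-port=8076
      --autoscaling-config=~/ray_bootstrap_config.yaml
      --num-cpus=1
      --num-redis-shards=3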

thomasdesr commented 4 years ago

https://github.com/googleapis/google-api-python-client/issues/218 this seems related? I don't see an obvious workaround though :S