ray-project / ray

Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

[Bug] [Ray Autoscaler] [Core] Ray Worker Node Relaunching during 'ray up' #20402

Open michaelzhiluo opened 2 years ago

michaelzhiluo commented 2 years ago

Search before asking

Ray Component

Ray Clusters

What happened + What you expected to happen

The Ray Autoscaler will relaunch the worker node even if the head and worker nodes are both healthy and their file systems are identical. This can be reproduced by repeatedly running ray up on most Autoscaler configuration files.

@concretevitamin @ericl

Versions / Dependencies

Most recent version of Ray and Ray Autoscaler.

Reproduction script

Autoscaler config provided below. Run ray up -y config/aws-distributed.yml --no-config-cache once and wait (important!) until the worker is fully set up, as reported by ray status. Rinse and repeat on the same configuration file. Eventually, on one of the runs, the Autoscaler will relaunch the worker node. A sketch of this repro loop is included after the config.

auth:
  ssh_user: ubuntu
available_node_types:
  ray.head.default:
    node_config:
      BlockDeviceMappings:
      - DeviceName: /dev/sda1
        Ebs:
          VolumeSize: 500
      ImageId: ami-04b343a85ab150b2d
      InstanceType: p3.2xlarge
    resources: {}
  ray.worker.default:
    max_workers: 1
    min_workers: 1
    node_config:
      BlockDeviceMappings:
      - DeviceName: /dev/sda1
        Ebs:
          VolumeSize: 500
      ImageId: ami-04b343a85ab150b2d
      InstanceType: p3.2xlarge
    resources: {}
cluster_name: temp-aws
docker:
  container_name: ''
  image: ''
  pull_before_run: true
  run_options:
  - --ulimit nofile=65536:65536
  - -p 8008:8008
file_mounts: {}
head_node_type: ray.head.default
head_start_ray_commands:
- ray stop
- ray start --head --port=6379 --object-manager-port=8076 --autoscaling-config=~/ray_bootstrap_config.yaml
idle_timeout_minutes: 5
initialization_commands: []
max_workers: 1
provider:
  cache_stopped_nodes: true
  region: us-east-2
  type: aws
rsync_exclude:
- '**/.git'
- '**/.git/**'
rsync_filter:
- .gitignore
setup_commands:
- pip3 install ray
- mkdir -p /tmp/workdir && cd /tmp/workdir && pip3 install --upgrade pip && pip3
  install ray[default]
upscaling_speed: 1.0
worker_start_ray_commands:
- ray stop
- ray start --address=$RAY_HEAD_IP:6379 --object-manager-port=8076
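
A rough sketch of that repro loop (the sleep is a placeholder; in practice, keep checking ray status until the worker is fully up before the next iteration):

# Repro sketch: re-run `ray up` against the same config and watch for the worker
# being torn down and relaunched on one of the iterations.
for i in 1 2 3 4 5; do
  ray up -y config/aws-distributed.yml --no-config-cache
  sleep 600                                          # placeholder: wait until the worker is fully set up
  ray exec config/aws-distributed.yml 'ray status'   # confirm the worker shows up as healthy before looping
done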

Anything else

No response

Are you willing to submit a PR?

DmitriGekhtman commented 2 years ago

By "relaunches the worker", do you mean restarts Ray on the worker?

Ray up restarts Ray across the cluster by default. To avoid the restart, add the flag --no-restart.
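
For example, with the config from the repro above:

ray up -y config/aws-distributed.yml --no-restart --no-config-cache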

Let me know if that makes sense / solves the issue.

michaelzhiluo commented 2 years ago

Thanks for the quick reply! By relaunching the worker, I mean that the Autoscaler stops the EC2 worker node and restarts it again. The image below shows what we are trying to avoid: when we run ray up again, we want to prevent the worker from relaunching from scratch.

[Screenshot: Screen Shot 2021-11-15 at 6:45:38 PM]

DmitriGekhtman commented 2 years ago

Got it. Yeah, that's a bug. Could you post autoscaler logs after running ray up the second time? (ray monitor cluster.yaml, or the contents of /tmp/ray/session_latest/logs/monitor.*) Those should have some lines explaining why the worker was taken down.
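
For example, either of these should surface them (config path as in the repro above):

ray monitor config/aws-distributed.yml
ray exec config/aws-distributed.yml 'tail -n 200 /tmp/ray/session_latest/logs/monitor.*'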

michaelzhiluo commented 2 years ago

2021-11-16 00:59:35,192 WARNING worker.py:1227 -- The actor or task with ID fd3463a596384e93cc2f4c914d291ea41a67faa92ae5ef73 cannot be scheduled right now. It requires {CPU: 1.000000}, {GPU: 1.000000}, {node:172.31.28.203: 1.000000} for placement, however the cluster currently cannot provide the requested resources. The required resources may be added as autoscaling takes place or placement groups are scheduled. Otherwise, consider reducing the resource requirements of the task.
(task-172.31.8.251 pid=24866) .
(task-172.31.8.251 pid=24866) ..
(autoscaler +20s) Tip: use `ray status` to view detailed autoscaling status. To disable autoscaler event messages, you can set AUTOSCALER_EVENTS=0.
(autoscaler +20s) Restarting 1 nodes of type ray.worker.default (lost contact with raylet).
(raylet, ip=172.31.28.203) E1116 00:59:18.757578546   13713 server_chttp2.cc:49]        {"created":"@1637024358.757519244","description":"No address added out of total 1 resolved","file":"external/com_github_grpc_grpc/src/core/ext/transport/chttp2/server/chttp2_server.cc","file_line":872,"referenced_errors":[{"created":"@1637024358.757513873","description":"Failed to add any wildcard listeners","file":"external/com_github_grpc_grpc/src/core/lib/iomgr/tcp_server_posix.cc","file_line":340,"referenced_errors":[{"created":"@1637024358.757496636","description":"Unable to configure socket","fd":30,"file":"external/com_github_grpc_grpc/src/core/lib/iomgr/tcp_server_utils_posix_common.cc","file_line":216,"referenced_errors":[{"created":"@1637024358.757490826","description":"Address already in use","errno":98,"file":"external/com_github_grpc_grpc/src/core/lib/iomgr/tcp_server_utils_posix_common.cc","file_line":190,"os_error":"Address already in use","syscall":"bind"}]},{"created":"@1637024358.757512644","description":"Unable to configure socket","fd":30,"file":"external/com_github_grpc_grpc/src/core/lib/iomgr/tcp_server_utils_posix_common.cc","file_line":216,"referenced_errors":[{"created":"@1637024358.757509731","description":"Address already in use","errno":98,"file":"external/com_github_grpc_grpc/src/core/lib/iomgr/tcp_server_utils_posix_common.cc","file_line":190,"os_error":"Address already in use","syscall":"bind"}]}]}]}
(raylet, ip=172.31.28.203) [2021-11-16 00:59:18,803 C 13713 13713] grpc_server.cc:82:  Check failed: server_ Failed to start the grpc server. The specified port is 8076. This means that Ray's core components will not be able to function correctly. If the server startup error message is `Address already in use`, it indicates the server fails to start because the port is already used by other processes (such as --node-manager-port, --object-manager-port, --gcs-server-port, and ports between --min-worker-port, --max-worker-port). Try running lsof -i :8076 to check if there are other processes listening to the port.
(raylet, ip=172.31.28.203) *** StackTrace Information ***
(raylet, ip=172.31.28.203)     ray::SpdLogMessage::Flush()
(raylet, ip=172.31.28.203)     ray::RayLog::~RayLog()
(raylet, ip=172.31.28.203)     ray::rpc::GrpcServer::Run()
(raylet, ip=172.31.28.203)     ray::ObjectManager::ObjectManager()
(raylet, ip=172.31.28.203)     ray::raylet::NodeManager::NodeManager()
(raylet, ip=172.31.28.203)     ray::raylet::Raylet::Raylet()
(raylet, ip=172.31.28.203)     main::{lambda()#1}::operator()()
(raylet, ip=172.31.28.203)     std::_Function_handler<>::_M_invoke()
(raylet, ip=172.31.28.203)     std::_Function_handler<>::_M_invoke()
(raylet, ip=172.31.28.203)     std::_Function_handler<>::_M_invoke()
(raylet, ip=172.31.28.203)     ray::rpc::ClientCallImpl<>::OnReplyReceived()
(raylet, ip=172.31.28.203)     std::_Function_handler<>::_M_invoke()
(raylet, ip=172.31.28.203)     boost::asio::detail::completion_handler<>::do_complete()
(raylet, ip=172.31.28.203)     boost::asio::detail::scheduler::do_run_one()
(raylet, ip=172.31.28.203)     boost::asio::detail::scheduler::run()
(raylet, ip=172.31.28.203)     boost::asio::io_context::run()
(raylet, ip=172.31.28.203)     main
(raylet, ip=172.31.28.203)     __libc_start_main
(raylet, ip=172.31.28.203)
(raylet, ip=172.31.28.203) E1116 01:00:01.855151317   32369 server_chttp2.cc:49]        {"created":"@1637024401.855092228","description":"No address added out of total 1 resolved","file":"external/com_github_grpc_grpc/src/core/ext/transport/chttp2/server/chttp2_server.cc","file_line":872,"referenced_errors":[{"created":"@1637024401.855086343","description":"Failed to add any wildcard listeners","file":"external/com_github_grpc_grpc/src/core/lib/iomgr/tcp_server_posix.cc","file_line":340,"referenced_errors":[{"created":"@1637024401.855067914","description":"Unable to configure socket","fd":30,"file":"external/com_github_grpc_grpc/src/core/lib/iomgr/tcp_server_utils_posix_common.cc","file_line":216,"referenced_errors":[{"created":"@1637024401.855061141","description":"Address already in use","errno":98,"file":"external/com_github_grpc_grpc/src/core/lib/iomgr/tcp_server_utils_posix_common.cc","file_line":190,"os_error":"Address already in use","syscall":"bind"}]},{"created":"@1637024401.855085081","description":"Unable to configure socket","fd":30,"file":"external/com_github_grpc_grpc/src/core/lib/iomgr/tcp_server_utils_posix_common.cc","file_line":216,"referenced_errors":[{"created":"@1637024401.855082214","description":"Address already in use","errno":98,"file":"external/com_github_grpc_grpc/src/core/lib/iomgr/tcp_server_utils_posix_common.cc","file_line":190,"os_error":"Address already in use","syscall":"bind"}]}]}]}
(raylet, ip=172.31.28.203) [2021-11-16 01:00:01,896 C 32369 32369] grpc_server.cc:82:  Check failed: server_ Failed to start the grpc server. The specified port is 8076. This means that Ray's core components will not be able to function correctly. If the server startup error message is `Address already in use`, it indicates the server fails to start because the port is already used by other processes (such as --node-manager-port, --object-manager-port, --gcs-server-port, and ports between --min-worker-port, --max-worker-port). Try running lsof -i :8076 to check if there are other processes listening to the port.
(raylet, ip=172.31.28.203) *** StackTrace Information ***
(raylet, ip=172.31.28.203)     ray::SpdLogMessage::Flush()
(raylet, ip=172.31.28.203)     ray::RayLog::~RayLog()
(raylet, ip=172.31.28.203)     ray::rpc::GrpcServer::Run()
(raylet, ip=172.31.28.203)     ray::ObjectManager::ObjectManager()
(raylet, ip=172.31.28.203)     ray::raylet::NodeManager::NodeManager()
(raylet, ip=172.31.28.203)     ray::raylet::Raylet::Raylet()
(raylet, ip=172.31.28.203)     main::{lambda()#1}::operator()()
(raylet, ip=172.31.28.203)     std::_Function_handler<>::_M_invoke()
(raylet, ip=172.31.28.203)     std::_Function_handler<>::_M_invoke()
(raylet, ip=172.31.28.203)     std::_Function_handler<>::_M_invoke()
(raylet, ip=172.31.28.203)     ray::rpc::ClientCallImpl<>::OnReplyReceived()
(raylet, ip=172.31.28.203)     std::_Function_handler<>::_M_invoke()
(raylet, ip=172.31.28.203)     boost::asio::detail::completion_handler<>::do_complete()
(raylet, ip=172.31.28.203)     boost::asio::detail::scheduler::do_run_one()
(raylet, ip=172.31.28.203)     boost::asio::detail::scheduler::run()
(raylet, ip=172.31.28.203)     boost::asio::io_context::run()
(raylet, ip=172.31.28.203)     main
(raylet, ip=172.31.28.203)     __libc_start_main
(raylet, ip=172.31.28.203)
(autoscaler +1m4s) Removing 1 nodes of type ray.worker.default (launch failed).
(autoscaler +1m9s) Adding 1 nodes of type ray.worker.default.
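
(For reference, the lsof check suggested in the error above can be run on the worker directly, e.g. from the head node; the address is taken from the log lines above, and SSH key setup is assumed:)

ssh ubuntu@172.31.28.203 'lsof -i :8076'   # check whether a stale process is still bound to the object manager port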
DmitriGekhtman commented 2 years ago

I had a typo in the path: it's /tmp/ray/session_latest/logs/monitor.* on the head node for the autoscaler logs.

Those look like driver logs (as opposed to autoscaler logs).

Those logs are helpful, though. What we're seeing is that Ray on the worker is failing to get restarted, so the autoscaler freaks out and shuts the worker down before launching a new one to satisfy the min_workers constraint.

Logs for the thread that is supposed to restart Ray on the worker are, I think, in /tmp/ray/session_latest/logs/monitor.out.

DmitriGekhtman commented 2 years ago

OK, I'm seeing the weirdness with the default example configs.

DmitriGekhtman commented 2 years ago

ray start output when attempting to restart Ray on the worker during the second ray up:

Local node IP: 10.0.1.18
[2021-11-15 23:05:38,600 I 224 224] global_state_accessor.cc:394: This node has an IP address of 10.0.1.18, while we can not found the matched Raylet address. This maybe come from when you connect the Ray cluster with a different IP address or connect a container.
DmitriGekhtman commented 2 years ago

@kfstorm @wuisawesome What does the error message in the last comment mean? I see it mentions containers -- we do have those here.

concretevitamin commented 2 years ago

Thanks for the investigation @DmitriGekhtman. FYI, this showed up even when Docker is not used, e.g., with the docker: section removed from the YAML.

kfstorm commented 2 years ago

@kfstorm @wuisawesome What does the error message in the last comment mean? I see it mentions containers -- we do have those here.

I'm not sure about this. It seems that the registered IP address of Raylet doesn't match the one detected by the driver. So the driver cannot find the local Raylet instance to connect to.
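
One way to see the mismatch (a sketch; run the Python one-liner from the head node while the cluster is up):

python -c "import ray; ray.init(address='auto'); print([n['NodeManagerAddress'] for n in ray.nodes()])"
hostname -I   # run on the affected worker; its address should appear in the list above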

@ConeyLiu Any thoughts?

DmitriGekhtman commented 2 years ago

This looks pretty bad -- I'm seeing this in other contexts where we try to restart Ray on a node.

concretevitamin commented 2 years ago

Any update? We could work around this by delaying the second ray up as much as possible, but at some point it does need to be run again.

DmitriGekhtman commented 2 years ago

Leaving this exclusively to @wuisawesome, since this issue appears to have a Ray-internal component, and that's a good enough reason to disqualify myself.

michaelzhiluo commented 2 years ago

Sgtm. We still encounter this issue pretty frequently, and it'd be great if it could be resolved soon.

EricCousineau-TRI commented 2 years ago

Possibly related to #19834?

EricCousineau-TRI commented 2 years ago

Yeah, I'm fairly confident https://github.com/ray-project/ray/issues/19834#issuecomment-1054897153 is related.

Basically, yeah, restarting ray on workers makes the worker + head nodes sad.

Is this because the ray stop in worker_start_ray_commands may not always stop Ray correctly? Perhaps it leaves a lingering raylet? https://docs.ray.io/en/releases-1.9.2/cluster/config.html#cluster-configuration-worker-start-ray-commands
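
(A quick way to check for that, as a sketch: run ray stop on a worker and then look for leftover raylet processes:)

ray stop
sleep 5                                              # give the processes a moment to exit
pgrep -af raylet || echo "no raylet left running"    # anything printed here is a lingering raylet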

EricCousineau-TRI commented 2 years ago

A dumb workaround is to issue an extra ray stop || true command before running ray up. It doesn't seem perfect, but it lowers the chance of running into this: https://github.com/EricCousineau-TRI/repro/blob/b63b25f4683dd0afd7582748c2adfe7dc8aa0c6f/python/ray_example/run_all.sh#L20-L21

See the surrounding code + files for the full repro.
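
A minimal sketch of that pre-stop workaround (the config path is the example from this issue; how you reach the worker nodes will vary, and <worker-ip> is a placeholder):

ray exec config/aws-distributed.yml 'ray stop || true'   # head node; `ray stop --force` is an option if plain stop leaves processes behind
ssh ubuntu@<worker-ip> 'ray stop || true'                # repeat for each worker
ray up -y config/aws-distributed.yml --no-config-cache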

EricCousineau-TRI commented 2 years ago

Yeah, not sure if this is new info, but ray stop does not always seem to stop the server. I confirmed this in a setup I was just running, using my hacky ray_exec_all script. Output: https://gist.github.com/EricCousineau-TRI/f2f67c488b75956bbb9d105cc4794ebc#file-ray-stop-failure-sh-L40-L58

Script: https://github.com/EricCousineau-TRI/repro/blob/b63b25f4683dd0afd7582748c2adfe7dc8aa0c6f/python/ray_example/ray_exec_all.py