michaelzhiluo opened 2 years ago
By "relaunches the worker," do you mean it restarts Ray on the worker? ray up restarts Ray across the cluster by default. To avoid the restart, add the --no-restart flag.
Let me know if that makes sense / solves the issue.
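For example (a minimal sketch; cluster.yaml stands in for whatever config file is actually in use):
# Re-run ray up against the existing cluster without restarting Ray on nodes that are already running
ray up -y cluster.yaml --no-restart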
Thanks for the quick reply! By relaunching the worker, I mean that the Autoscaler stops the EC2 worker node and restarts it again. Here is what we are trying to avoid in the image below. When we are running ray up again, we want to prevent the worker from relaunching from scratch.
Got it.
Yeah, that's a bug. Could you post the autoscaler logs after running ray up the second time? (ray monitor cluster.yaml, or the contents of /tmp/ray/session_latest/logs/monitor.*) Those should have some lines explaining why the worker was taken down.
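Two ways to get at those (a sketch; cluster.yaml is a placeholder for the actual config file):
# Stream the autoscaler/monitor output from the head node
ray monitor cluster.yaml
# Or pull the raw monitor logs off the head node
ray exec cluster.yaml 'tail -n 200 /tmp/ray/session_latest/logs/monitor.*'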
2021-11-16 00:59:35,192 WARNING worker.py:1227 -- The actor or task with ID fd3463a596384e93cc2f4c914d291ea41a67faa92ae5ef73 cannot be scheduled right now. It requires {CPU: 1.000000}, {GPU: 1.000000}, {node:172.31.28.203: 1.000000} for placement, however the cluster currently cannot provide the requested resources. The required resources may be added as autoscaling takes place or placement groups are scheduled. Otherwise, consider reducing the resource requirements of the task.
(task-172.31.8.251 pid=24866) .
(task-172.31.8.251 pid=24866) ..
(autoscaler +20s) Tip: use `ray status` to view detailed autoscaling status. To disable autoscaler event messages, you can set AUTOSCALER_EVENTS=0.
(autoscaler +20s) Restarting 1 nodes of type ray.worker.default (lost contact with raylet).
(raylet, ip=172.31.28.203) E1116 00:59:18.757578546 13713 server_chttp2.cc:49] {"created":"@1637024358.757519244","description":"No address added out of total 1 resolved","file":"external/com_github_grpc_grpc/src/core/ext/transport/chttp2/server/chttp2_server.cc","file_line":872,"referenced_errors":[{"created":"@1637024358.757513873","description":"Failed to add any wildcard listeners","file":"external/com_github_grpc_grpc/src/core/lib/iomgr/tcp_server_posix.cc","file_line":340,"referenced_errors":[{"created":"@1637024358.757496636","description":"Unable to configure socket","fd":30,"file":"external/com_github_grpc_grpc/src/core/lib/iomgr/tcp_server_utils_posix_common.cc","file_line":216,"referenced_errors":[{"created":"@1637024358.757490826","description":"Address already in use","errno":98,"file":"external/com_github_grpc_grpc/src/core/lib/iomgr/tcp_server_utils_posix_common.cc","file_line":190,"os_error":"Address already in use","syscall":"bind"}]},{"created":"@1637024358.757512644","description":"Unable to configure socket","fd":30,"file":"external/com_github_grpc_grpc/src/core/lib/iomgr/tcp_server_utils_posix_common.cc","file_line":216,"referenced_errors":[{"created":"@1637024358.757509731","description":"Address already in use","errno":98,"file":"external/com_github_grpc_grpc/src/core/lib/iomgr/tcp_server_utils_posix_common.cc","file_line":190,"os_error":"Address already in use","syscall":"bind"}]}]}]}
(raylet, ip=172.31.28.203) [2021-11-16 00:59:18,803 C 13713 13713] grpc_server.cc:82: Check failed: server_ Failed to start the grpc server. The specified port is 8076. This means that Ray's core components will not be able to function correctly. If the server startup error message is `Address already in use`, it indicates the server fails to start because the port is already used by other processes (such as --node-manager-port, --object-manager-port, --gcs-server-port, and ports between --min-worker-port, --max-worker-port). Try running lsof -i :8076 to check if there are other processes listening to the port.
(raylet, ip=172.31.28.203) *** StackTrace Information ***
(raylet, ip=172.31.28.203) ray::SpdLogMessage::Flush()
(raylet, ip=172.31.28.203) ray::RayLog::~RayLog()
(raylet, ip=172.31.28.203) ray::rpc::GrpcServer::Run()
(raylet, ip=172.31.28.203) ray::ObjectManager::ObjectManager()
(raylet, ip=172.31.28.203) ray::raylet::NodeManager::NodeManager()
(raylet, ip=172.31.28.203) ray::raylet::Raylet::Raylet()
(raylet, ip=172.31.28.203) main::{lambda()#1}::operator()()
(raylet, ip=172.31.28.203) std::_Function_handler<>::_M_invoke()
(raylet, ip=172.31.28.203) std::_Function_handler<>::_M_invoke()
(raylet, ip=172.31.28.203) std::_Function_handler<>::_M_invoke()
(raylet, ip=172.31.28.203) ray::rpc::ClientCallImpl<>::OnReplyReceived()
(raylet, ip=172.31.28.203) std::_Function_handler<>::_M_invoke()
(raylet, ip=172.31.28.203) boost::asio::detail::completion_handler<>::do_complete()
(raylet, ip=172.31.28.203) boost::asio::detail::scheduler::do_run_one()
(raylet, ip=172.31.28.203) boost::asio::detail::scheduler::run()
(raylet, ip=172.31.28.203) boost::asio::io_context::run()
(raylet, ip=172.31.28.203) main
(raylet, ip=172.31.28.203) __libc_start_main
(raylet, ip=172.31.28.203)
(raylet, ip=172.31.28.203) E1116 01:00:01.855151317 32369 server_chttp2.cc:49] {"created":"@1637024401.855092228","description":"No address added out of total 1 resolved","file":"external/com_github_grpc_grpc/src/core/ext/transport/chttp2/server/chttp2_server.cc","file_line":872,"referenced_errors":[{"created":"@1637024401.855086343","description":"Failed to add any wildcard listeners","file":"external/com_github_grpc_grpc/src/core/lib/iomgr/tcp_server_posix.cc","file_line":340,"referenced_errors":[{"created":"@1637024401.855067914","description":"Unable to configure socket","fd":30,"file":"external/com_github_grpc_grpc/src/core/lib/iomgr/tcp_server_utils_posix_common.cc","file_line":216,"referenced_errors":[{"created":"@1637024401.855061141","description":"Address already in use","errno":98,"file":"external/com_github_grpc_grpc/src/core/lib/iomgr/tcp_server_utils_posix_common.cc","file_line":190,"os_error":"Address already in use","syscall":"bind"}]},{"created":"@1637024401.855085081","description":"Unable to configure socket","fd":30,"file":"external/com_github_grpc_grpc/src/core/lib/iomgr/tcp_server_utils_posix_common.cc","file_line":216,"referenced_errors":[{"created":"@1637024401.855082214","description":"Address already in use","errno":98,"file":"external/com_github_grpc_grpc/src/core/lib/iomgr/tcp_server_utils_posix_common.cc","file_line":190,"os_error":"Address already in use","syscall":"bind"}]}]}]}
(raylet, ip=172.31.28.203) [2021-11-16 01:00:01,896 C 32369 32369] grpc_server.cc:82: Check failed: server_ Failed to start the grpc server. The specified port is 8076. This means that Ray's core components will not be able to function correctly. If the server startup error message is `Address already in use`, it indicates the server fails to start because the port is already used by other processes (such as --node-manager-port, --object-manager-port, --gcs-server-port, and ports between --min-worker-port, --max-worker-port). Try running lsof -i :8076 to check if there are other processes listening to the port.
(raylet, ip=172.31.28.203) *** StackTrace Information ***
(raylet, ip=172.31.28.203) ray::SpdLogMessage::Flush()
(raylet, ip=172.31.28.203) ray::RayLog::~RayLog()
(raylet, ip=172.31.28.203) ray::rpc::GrpcServer::Run()
(raylet, ip=172.31.28.203) ray::ObjectManager::ObjectManager()
(raylet, ip=172.31.28.203) ray::raylet::NodeManager::NodeManager()
(raylet, ip=172.31.28.203) ray::raylet::Raylet::Raylet()
(raylet, ip=172.31.28.203) main::{lambda()#1}::operator()()
(raylet, ip=172.31.28.203) std::_Function_handler<>::_M_invoke()
(raylet, ip=172.31.28.203) std::_Function_handler<>::_M_invoke()
(raylet, ip=172.31.28.203) std::_Function_handler<>::_M_invoke()
(raylet, ip=172.31.28.203) ray::rpc::ClientCallImpl<>::OnReplyReceived()
(raylet, ip=172.31.28.203) std::_Function_handler<>::_M_invoke()
(raylet, ip=172.31.28.203) boost::asio::detail::completion_handler<>::do_complete()
(raylet, ip=172.31.28.203) boost::asio::detail::scheduler::do_run_one()
(raylet, ip=172.31.28.203) boost::asio::detail::scheduler::run()
(raylet, ip=172.31.28.203) boost::asio::io_context::run()
(raylet, ip=172.31.28.203) main
(raylet, ip=172.31.28.203) __libc_start_main
(raylet, ip=172.31.28.203)
(autoscaler +1m4s) Removing 1 nodes of type ray.worker.default (launch failed).
(autoscaler +1m9s) Adding 1 nodes of type ray.worker.default.
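The raylet crashes above are the worker failing to re-bind port 8076 (the object manager port in this config) because something is still listening on it. As the error message itself suggests, a quick way to confirm on the worker (a sketch; assumes SSH access to the worker node):
# On the worker: see which process is still bound to port 8076
lsof -i :8076
# or, equivalently
ss -lntp | grep 8076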
Had a typo in the path earlier: the autoscaler logs are at /tmp/ray/session_latest/logs/monitor.* on the head node.
Those look like driver logs (as opposed to autoscaler logs).
Those logs are helpful, though. What we're seeing is that Ray on the worker is failing to get restarted, so the autoscaler freaks out and shuts the worker down before launching a new one to satisfy the min_workers constraint.
Logs for the thread that is supposed to restart Ray on the worker are, I think, in /tmp/ray/session_latest/logs/monitor.out.
OK, I'm seeing the weirdness with the default example configs.
Ray start output when attempting to restart the worker's Ray on the second ray up:
Local node IP: 10.0.1.18
[2021-11-15 23:05:38,600 I 224 224] global_state_accessor.cc:394: This node has an IP address of 10.0.1.18, while we can not found the matched Raylet address. This maybe come from when you connect the Ray cluster with a different IP address or connect a container.
@kfstorm @wuisawesome What does the error message in the last comment mean? I see it mentions containers -- we do have those here.
Thanks for the investigation @DmitriGekhtman. FYI, this showed up even when Docker is not used, e.g., with docker: removed from the yaml.
@kfstorm @wuisawesome What does the error message in the last comment mean? I see it mentions containers -- we do have those here.
I'm not sure about this. It seems that the registered IP address of Raylet doesn't match the one detected by the driver. So the driver cannot find the local Raylet instance to connect to.
@ConeyLiu Any thoughts?
This looks pretty bad -- I'm seeing this in other contexts where we try to restart Ray on a node.
Any update? We could work around this by delaying the second ray up as much as possible; however, at some point it does need to be run again.
Leaving this exclusively to @wuisawesome, since this issue appears to have a Ray-internal component, and that's a good enough reason to disqualify myself.
Sgtm. We still encounter this issue pretty frequently, and it'd be great if it could be resolved soon.
Possibly related to #19834?
Yeah, fairly confident https://github.com/ray-project/ray/issues/19834#issuecomment-1054897153 is related
Basically, yeah, restarting ray on workers makes the worker + head nodes sad.
Is this because the ray stop in worker_start_ray_commands may not always stop Ray correctly? Perhaps it leaves a lingering raylet?
https://docs.ray.io/en/releases-1.9.2/cluster/config.html#cluster-configuration-worker-start-ray-commands
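For reference, the worker start commands in the stock example configs amount to roughly these shell commands run on each worker (a paraphrase, not an exact copy of the YAML; note the object-manager port matches the 8076 in the logs above):
# roughly what worker_start_ray_commands executes on a worker
ray stop
ulimit -n 65536; ray start --address=$RAY_HEAD_IP:6379 --object-manager-port=8076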
A dumb workaround is to issue an extra ray stop || true command before running ray up. Doesn't seem perfect, but it lowers the chance of running into this:
https://github.com/EricCousineau-TRI/repro/blob/b63b25f4683dd0afd7582748c2adfe7dc8aa0c6f/python/ray_example/run_all.sh#L20-L21
See the surrounding code + files for the full repro.
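Something along these lines (a hedged sketch, not the exact script from the link; <worker-ip> and the ubuntu user are placeholders for whatever the cluster actually uses):
# best-effort stop of any lingering Ray processes on the worker, then bring the cluster up again
ssh ubuntu@<worker-ip> 'ray stop || true'
ray up -y cluster.yaml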
Yeah, not sure if this is new info, but ray stop does not always seem to stop the server. Confirmed that in a setup I was just running, using my hacky ray_exec_all script. Output:
https://gist.github.com/EricCousineau-TRI/f2f67c488b75956bbb9d105cc4794ebc#file-ray-stop-failure-sh-L40-L58
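A quick way to check whether ray stop actually took everything down on a node (a sketch; run on the node itself):
ray stop
# anything that survived will still show up here
pgrep -af 'raylet|gcs_server' || echo 'no ray processes left'
# more aggressive variant that kills stragglers
ray stop --force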
Search before asking
Ray Component
Ray Clusters
What happened + What you expected to happen
Ray Autoscaler will relaunch the worker even if the head and worker node are both healthy and their file systems are identical. This can be replicated by running ray up on most Autoscaler configuration files over and over again. @concretevitamin @ericl
Versions / Dependencies
Most recent version of Ray and Ray Autoscaler.
Reproduction script
Autoscaler config provided below. Run ray up -y config/aws-distributed.yml --no-config-cache once and wait (important!) until the worker is fully set up, checking via ray status. Rinse and repeat on the same configuration file. Eventually, on one of the runs, the Autoscaler will relaunch the worker node.
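As a rough repro loop (a sketch; the config path is the one above, and the sleep is just a guess at how long worker setup takes):
for i in 1 2 3; do
  ray up -y config/aws-distributed.yml --no-config-cache
  # wait until the worker is fully set up before the next iteration
  ray exec config/aws-distributed.yml 'ray status'
  sleep 300
done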
Anything else
No response
Are you willing to submit a PR?