ray-project / ray

Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

[core] Segfault happens when continuously disconnecting and reconnecting a ray node #34637

Open rickyyx opened 1 year ago

rickyyx commented 1 year ago

What happened + What you expected to happen

python3 apps/tpc-h/tpch.py
(raylet, ip=172.31.62.47) *** SIGSEGV received at time=1681697165 on cpu 3 ***
(pid=19414, ip=172.31.62.47) PC: @ 0x7f060ca5fd20 (unknown) absl::lts_20211102::Mutex::Lock()
(pid=19414, ip=172.31.62.47) @ 0x7f060d880090 3504 (unknown)
(pid=19414, ip=172.31.62.47) @ 0x7f060c4dfe1f 192 ray::gcs::NodeInfoAccessor::HandleNotification()
(pid=19414, ip=172.31.62.47) @ 0x7f060c47dc0f 64 std::_Function_handler<>::_M_invoke()
(pid=19414, ip=172.31.62.47) @ 0x7f060c4b68f5 176 std::_Function_handler<>::_M_invoke()
(pid=19414, ip=172.31.62.47) @ 0x7f060c4de940 112 ray::rpc::GcsRpcClient::GetAllNodeInfo()::{lambda()#2}::operator()()
(pid=19414, ip=172.31.62.47) @ 0x7f060c47f595 64 ray::rpc::ClientCallImpl<>::OnReplyReceived()
(pid=19414, ip=172.31.62.47) @ 0x7f060c345ff5 32 std::_Function_handler<>::_M_invoke()
(pid=19414, ip=172.31.62.47) @ 0x7f060c6006a6 96 EventTracker::RecordExecution()
(pid=19414, ip=172.31.62.47) @ 0x7f060c5b95ee 48 std::_Function_handler<>::_M_invoke()
(pid=19414, ip=172.31.62.47) @ 0x7f060c5b9766 112 boost::asio::detail::completion_handler<>::do_complete()
(pid=19414, ip=172.31.62.47) @ 0x7f060ca3157b 128 boost::asio::detail::scheduler::do_run_one()
(pid=19414, ip=172.31.62.47) @ 0x7f060ca327b1 192 boost::asio::detail::scheduler::run()
(pid=19414, ip=172.31.62.47) @ 0x7f060ca32a20 64 boost::asio::io_context::run()
(pid=19414, ip=172.31.62.47) @ 0x7f060c3c387d 240 ray::core::CoreWorker::RunIOService()
(pid=19414, ip=172.31.62.47) @ 0x7f060cb5e6d0 (unknown) execute_native_thread_routine
(pid=19414, ip=172.31.62.47) @ 0x20d3850 182129536 (unknown)
(pid=19414, ip=172.31.62.47) @ 0x7f060c325ba0 (unknown) (unknown)
(pid=19414, ip=172.31.62.47) @ 0x9000838b51e90789 (unknown) (unknown)
(pid=19414, ip=172.31.62.47) [2023-04-17 02:06:05,049 E 19414 19462] logging.cc:361: *** SIGSEGV received at time=1681697165 on cpu 3 ***
(pid=19414, ip=172.31.62.47) [2023-04-17 02:06:05,049 E 19414 19462] logging.cc:361: PC: @ 0x7f060ca5fd20 (unknown) absl::lts_20211102::Mutex::Lock()
(pid=19414, ip=172.31.62.47) [2023-04-17 02:06:05,050 E 19414 19462] logging.cc:361: @ 0x7f060d880090 3504 (unknown)
(pid=19414, ip=172.31.62.47) [2023-04-17 02:06:05,050 E 19414 19462] logging.cc:361: @ 0x7f060c4dfe1f 192 ray::gcs::NodeInfoAccessor::HandleNotification()
(pid=19414, ip=172.31.62.47) [2023-04-17 02:06:05,050 E 19414 19462] logging.cc:361: @ 0x7f060c47dc0f 64 std::_Function_handler<>::_M_invoke()
(pid=19414, ip=172.31.62.47) [2023-04-17 02:06:05,050 E 19414 19462] logging.cc:361: @ 0x7f060c4b68f5 176 std::_Function_handler<>::_M_invoke()
(pid=19414, ip=172.31.62.47) [2023-04-17 02:06:05,050 E 19414 19462] logging.cc:361: @ 0x7f060c4de940 112 ray::rpc::GcsRpcClient::GetAllNodeInfo()::{lambda()#2}::operator()()
(pid=19414, ip=172.31.62.47) [2023-04-17 02:06:05,050 E 19414 19462] logging.cc:361: @ 0x7f060c47f595 64 ray::rpc::ClientCallImpl<>::OnReplyReceived()
(pid=19414, ip=172.31.62.47) [2023-04-17 02:06:05,050 E 19414 19462] logging.cc:361: @ 0x7f060c345ff5 32 std::_Function_handler<>::_M_invoke()
(pid=19414, ip=172.31.62.47) [2023-04-17 02:06:05,050 E 19414 19462] logging.cc:361: @ 0x7f060c6006a6 96 EventTracker::RecordExecution()
(pid=19414, ip=172.31.62.47) [2023-04-17 02:06:05,050 E 19414 19462] logging.cc:361: @ 0x7f060c5b95ee 48 std::_Function_handler<>::_M_invoke()
(pid=19414, ip=172.31.62.47) [2023-04-17 02:06:05,050 E 19414 19462] logging.cc:361: @ 0x7f060c5b9766 112 boost::asio::detail::completion_handler<>::do_complete()
(pid=19414, ip=172.31.62.47) [2023-04-17 02:06:05,050 E 19414 19462] logging.cc:361: @ 0x7f060ca3157b 128 boost::asio::detail::scheduler::do_run_one()
(pid=19414, ip=172.31.62.47) [2023-04-17 02:06:05,050 E 19414 19462] logging.cc:361: @ 0x7f060ca327b1 192 boost::asio::detail::scheduler::run()
(pid=19414, ip=172.31.62.47) [2023-04-17 02:06:05,050 E 19414 19462] logging.cc:361: @ 0x7f060ca32a20 64 boost::asio::io_context::run()
(pid=19414, ip=172.31.62.47) [2023-04-17 02:06:05,050 E 19414 19462] logging.cc:361: @ 0x7f060c3c387d 240 ray::core::CoreWorker::RunIOService()
(pid=19414, ip=172.31.62.47) [2023-04-17 02:06:05,050 E 19414 19462] logging.cc:361: @ 0x7f060cb5e6d0 (unknown) execute_native_thread_routine
(pid=19414, ip=172.31.62.47) [2023-04-17 02:06:05,050 E 19414 19462] logging.cc:361: @ 0x20d3850 182129536 (unknown)
(pid=19414, ip=172.31.62.47) [2023-04-17 02:06:05,052 E 19414 19462] logging.cc:361: @ 0x7f060c325ba0 (unknown) (unknown)
(pid=19414, ip=172.31.62.47) [2023-04-17 02:06:05,053 E 19414 19462] logging.cc:361: @ 0x9000838b51e90789 (unknown) (unknown)
(pid=19414, ip=172.31.62.47) Fatal Python error: Segmentation fault

Originally reported at https://discuss.ray.io/t/very-rare-error-that-occurs-when-nodes-disconnect-and-then-reconnect/10256/3

Versions / Dependencies

master

Reproduction script

Start a Ray cluster and, while a job is running, log into one of the worker nodes and run `ray stop`. After the job completes, reconnect the worker node and relaunch the job. A stand-in driver script is sketched below.
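For reference, a minimal sketch of a driver loop that keeps tasks flowing while a worker node is stopped and reconnected. The task body, batch size, and GCS port are illustrative assumptions, not the original tpch.py workload:

```python
# Sketch of a stand-in driver to exercise worker-node disconnect/reconnect.
import time

import ray

ray.init(address="auto")  # connect to the running cluster from the head node


@ray.remote
def busy_task(i):
    # Placeholder work; the real report used a TPC-H workload.
    time.sleep(1)
    return i


# While this loop runs, log into a worker node and run `ray stop`; after a
# batch completes, re-attach that node with `ray start --address=<head-ip>:6379`
# (port is an assumption) and keep the driver running.
while True:
    results = ray.get([busy_task.remote(i) for i in range(100)])
    print(f"completed batch of {len(results)} tasks")
```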

TODO

Issue Severity

None

rkooo567 commented 1 year ago

We will try to reproduce it.

rkooo567 commented 1 year ago

@rickyyx can you provide me a repro script?

rkooo567 commented 1 year ago

cc @rickyyx any follow-up on the repro? We will tag it as P2 until a repro is found.

rickyyx commented 1 year ago

Sorry - I will try to work on a repro ASAP.

cread commented 1 year ago

We have started to observe this too. I've not tried to build a repro yet, but we see it on large production clusters where there is a lot of worker churn due to spot availability.

We are running Ray 2.6.3 conda packages.

rkooo567 commented 1 year ago

Hmm, @rickyyx do you think you will have time in Ray 2.9 to start working on a repro script?