ray-project / ray

Ray is a unified framework for scaling AI and Python applications. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

[air] `tune_worker_fault_tolerance_test` is failing #34508

Closed · xwjiang2010 closed this 1 year ago

xwjiang2010 commented 1 year ago

https://buildkite.com/ray-project/release-tests-branch/builds/1568#01879142-b943-4c90-bae3-4d97254a660d

xwjiang2010 commented 1 year ago

The failed run does have a suspicious segfault in the log, but I can't seem to tie it to Tune code. @Yard1 any ideas?

Yard1 commented 1 year ago

It looks like the segfault was thrown after the test completed successfully. It may be a Ray Core issue, or an issue with how we run the tests; I have seen this happen before with unrelated workloads, so I suspect the former.

xwjiang2010 commented 1 year ago

Still seeing the segfault after rerunning:

/tmp/ray/session_2023-04-17_15-18-41_784543_194/runtime_resources/working_dir_files/s3_ray-release-automation-results_working_dirs_tune_worker_fault_tolerance_hbszehhffu__anyscale_pkg_5ba9b0a57b63f4bf1b3a43bcf3e118cf/workload.sh: line 1:   809 Segmentation fault      (core dumped) python workloads/test_tune_worker_fault_tolerance.py
Subprocess return code: 139
[INFO 2023-04-17 16:18:52,478] anyscale_job_wrapper.py: 191  Process 807 exited with return code 139.
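For context, a shell return code of 139 is 128 + 11, i.e. the child was killed by signal 11 (SIGSEGV). A minimal sketch of decoding such a code in Python, assuming a bash wrapper like the one above (the helper name here is hypothetical, not the actual anyscale_job_wrapper.py code):

```python
import signal
import subprocess

def describe_return_code(code: int) -> str:
    """Decode a shell-style exit code (hypothetical helper, for illustration)."""
    if code > 128:
        sig = signal.Signals(code - 128)
        return f"killed by {sig.name} (signal {sig.value})"
    return f"exited normally with code {code}"

result = subprocess.run(["bash", "workload.sh"])
print(describe_return_code(result.returncode))  # 139 -> killed by SIGSEGV (signal 11)
```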
xwjiang2010 commented 1 year ago

@rkooo567 @jjyao @rickyyx Any ideas?

rickyyx commented 1 year ago

Hmm, looks like the Python core worker received a SIGTERM while exiting:

ab0dc02858b6e137cc20fc757a4a, IsAlive = 0"}
16816:{"filename":"python-core-driver-04000000ffffffffffffffffffffffffffffffffffffffffffffffff_260.log","message":"[2023-04-17 16:20:21,378 I 260 13791] core_worker.cc:293: Node failure from 423822a5dfc8437aea274c4f64ecab0dc02858b6e137cc20fc757a4a. All objects pinned on that node will be lost if object reconstruction is not enabled."}
16817:{"filename":"python-core-driver-04000000ffffffffffffffffffffffffffffffffffffffffffffffff_260.log","message":"[2023-04-17 16:20:21,484 E 260 260] logging.cc:361: *** SIGTERM received at time=1681773621 on cpu 0 ***"}
16818:{"filename":"python-core-driver-04000000ffffffffffffffffffffffffffffffffffffffffffffffff_260.log","message":"[2023-04-17 16:20:21,485 E 260 260] logging.cc:361: PC: @     0x7f00d257746e  (unknown)  epoll_wait"}
16819:{"filename":"python-core-driver-04000000ffffffffffffffffffffffffffffffffffffffffffffffff_260.log","message":"[2023-04-17 16:20:21,486 E 260 260] logging.cc:361:     @     0x7f00d27b8420    1307008  (unknown)"}
16820:{"filename":"python-core-driver-04000000ffffffffffffffffffffffffffffffffffffffffffffffff_260.log","message":"[2023-04-17 16:20:21,488 E 260 260] logging.cc:361:     @           0x6ddca0        144  (unknown)"}
16821:{"filename":"python-core-driver-04000000ffffffffffffffffffffffffffffffffffffffffffffffff_260.log","message":"[2023-04-17 16:20:21,491 E 260 260] logging.cc:361:     @           0x6dd960        384  (unknown)"}
16822:{"filename":"python-core-driver-04000000ffffffffffffffffffffffffffffffffffffffffffffffff_260.log","message":"[2023-04-17 16:20:21,495 E 260 260] logging.cc:361:     @           0x6a6bc0  (unknown)  (unknown)"}
16823:{"filename":"python-core-driver-04000000ffffffffffffffffffffffffffffffffffffffffffffffff_260.log","message":"[2023-04-17 16:20:21,509 E 260 260] logging.cc:361:     @           0x6dd960  (unknown)  (unknown)"}

rickyyx commented 1 year ago

Same trace from the raylet:

*** SIGTERM received at time=1681773621 on cpu 0 ***
PC: @     0x7fe36fc0d46e  (unknown)  epoll_wait
    @     0x7fe36fe4e420  126120032  (unknown)
    @           0x6ddca0  (unknown)  (unknown)
[2023-04-17 16:20:21,480 E 362 362] logging.cc:361: *** SIGTERM received at time=1681773621 on cpu 0 ***
[2023-04-17 16:20:21,480 E 362 362] logging.cc:361: PC: @     0x7fe36fc0d46e  (unknown)  epoll_wait
[2023-04-17 16:20:21,483 E 362 362] logging.cc:361:     @     0x7fe36fe4e420  126120032  (unknown)
[2023-04-17 16:20:21,497 E 362 362] logging.cc:361:     @           0x6ddca0  (unknown)  (unknown)

rickyyx commented 1 year ago

Might be related to how the job was killed?

Yard1 commented 1 year ago

The script finished successfully, as indicated by the print statement. The segfault happens at the very end, after everything in the test workload has been run.

The workload is put into a bash file:

python workload.py

which is then run through `subprocess.run("bash -x workload.sh")`. I don't think this would cause a segfault anywhere.
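For reference, how the segfault surfaces depends on which layer observes it: `subprocess.run` reports a child killed directly by a signal as a negative return code, while bash folds the signal into 128 + N, which matches the 139 above. A minimal sketch, assuming nothing about the actual harness:

```python
import subprocess

# Stand-in for a real crash: the child sends itself SIGSEGV.
crasher = "import os, signal; os.kill(os.getpid(), signal.SIGSEGV)"

# Run directly: the signal shows up as a negative return code.
direct = subprocess.run(["python", "-c", crasher])
print(direct.returncode)  # -11

# Run through bash, as the harness does: bash reports 128 + 11.
via_bash = subprocess.run(["bash", "-c", f"python -c '{crasher}'"])
print(via_bash.returncode)  # 139
```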

rickyyx commented 1 year ago

@architkulkarni - could it be related to how the job manager kills the job?

@xwjiang2010 also, does this show up consistently with this test? Maybe I can try to repro it locally.

xwjiang2010 commented 1 year ago

@rickyyx yes, consistently. I even retried and got the same signature. See the link above.

architkulkarni commented 1 year ago

The job manager kills the job with SIGTERM, followed by SIGKILL if the processes haven't exited within 3 seconds. However, this only happens when the `ray job stop` API is called (or the equivalent Ray Job SDK method), which I would guess isn't happening here.
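As a sketch of that graceful-then-forceful pattern (not the actual job manager code; only the 3-second grace period is taken from the description above):

```python
import subprocess

def stop_job(proc: subprocess.Popen, grace_period_s: float = 3.0) -> int:
    """Send SIGTERM, then escalate to SIGKILL if the process outlives the grace period."""
    proc.terminate()  # SIGTERM: gives the process a chance to clean up
    try:
        return proc.wait(timeout=grace_period_s)
    except subprocess.TimeoutExpired:
        proc.kill()  # SIGKILL: cannot be caught or ignored
        return proc.wait()
```

The `*** SIGTERM received ***` lines in the traces above are consistent with the first half of this pattern, which is why it matters whether `ray job stop` was actually invoked.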

xwjiang2010 commented 1 year ago

@rickyyx assigning to you for a Core-side investigation for now. Feel free to assign it back to the ML team if it turns out not to be a Core issue.

xwjiang2010 commented 1 year ago

cc @krfricke as the next ML oncall.

rickyyx commented 1 year ago

Bisecting here: https://buildkite.com/ray-project/release-tests-bisect/builds/78

rickyyx commented 1 year ago

Finally, it looks like we got a stack trace, from: https://console.anyscale-staging.com/o/anyscale-internal/jobs/prodjob_22p2kmm9dffq31a8dk43vbidvb?pjd-section=ray-logs-download-cli-tab

[2023-04-21 15:59:04,500 E 223 235] (gcs_server) ray_syncer-inl.h:305: Failed to read the message from: 3fd2c4710521be4a5d038c6f535bdbf7780e6b39e2707ce56ba2431f
[2023-04-21 16:00:11,585 C 223 223] (gcs_server) container_util.h:88: Key 4825743636673892829 doesn't exist
*** StackTrace Information ***
/home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/gcs/gcs_server(+0x9d9bea) [0x55879aa24bea] ray::operator<<()
/home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/gcs/gcs_server(+0x9db6c2) [0x55879aa266c2] ray::SpdLogMessage::Flush()
/home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/gcs/gcs_server(+0x9db9d7) [0x55879aa269d7] ray::RayLog::~RayLog()
/home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/gcs/gcs_server(+0x395cd7) [0x55879a3e0cd7] ray::map_find_or_die<>()
/home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/gcs/gcs_server(+0x395d15) [0x55879a3e0d15] ray::ClusterResourceManager::GetNodeResources()
/home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/gcs/gcs_server(+0x30dfbe) [0x55879a358fbe] ray::gcs::GcsPlacementGroupScheduler::TryReleasingBundleResources()
/home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/gcs/gcs_server(+0x30eb6a) [0x55879a359b6a] ray::gcs::GcsPlacementGroupScheduler::ReturnBundleResources()
/home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/gcs/gcs_server(+0x310158) [0x55879a35b158] ray::gcs::GcsPlacementGroupScheduler::OnAllBundlePrepareRequestReturned()
/home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/gcs/gcs_server(+0x31225c) [0x55879a35d25c] std::_Function_handler<>::_M_invoke()
/home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/gcs/gcs_server(+0x37e9bb) [0x55879a3c99bb] ray::rpc::ClientCallImpl<>::OnReplyReceived()
/home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/gcs/gcs_server(+0x207f95) [0x55879a252f95] std::_Function_handler<>::_M_invoke()
/home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/gcs/gcs_server(+0x5abe66) [0x55879a5f6e66] EventTracker::RecordExecution()
/home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/gcs/gcs_server(+0x55791e) [0x55879a5a291e] std::_Function_handler<>::_M_invoke()
/home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/gcs/gcs_server(+0x557d96) [0x55879a5a2d96] boost::asio::detail::completion_handler<>::do_complete()
/home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/gcs/gcs_server(+0xacc9ab) [0x55879ab179ab] boost::asio::detail::scheduler::do_run_one()
/home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/gcs/gcs_server(+0xacea99) [0x55879ab19a99] boost::asio::detail::scheduler::run()
/home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/gcs/gcs_server(+0xacef52) [0x55879ab19f52] boost::asio::io_context::run()
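The fatal line comes from `container_util.h` via `ray::map_find_or_die`: `ClusterResourceManager::GetNodeResources()` looks up a node ID that has already been removed from its internal map, and the failed lookup is a fatal check that aborts the whole `gcs_server`. A loose Python analogue of the pattern (for illustration only, not the actual C++):

```python
def map_find_or_die(mapping: dict, key):
    """Loose analogue of ray::map_find_or_die (container_util.h)."""
    if key not in mapping:
        # In the C++ code this is a fatal log that kills the process, so a
        # single stale node ID in a callback can take down the entire GCS.
        raise SystemExit(f"Key {key} doesn't exist")
    return mapping[key]
```

Given the `Failed to read the message from: 3fd2...` line just before the crash, this looks like a placement-group bundle-release callback racing with the removal of a failed node.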
can-anyscale commented 1 year ago

This test has been passing on master for a while now; can you help confirm, @rickyyx? Thanks!

rickyyx commented 1 year ago

We should close this after the linked PR is merged.

can-anyscale commented 1 year ago

@rickyyx, @rkooo567: should the fix be cherry-picked as well? Thanks!