ray-project / ray

Ray is a unified framework for scaling AI and Python applications. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

[Core] Task manager check failed caused GCS coredump. #43658

Open MissiontoMars opened 4 months ago

MissiontoMars commented 4 months ago

What happened + What you expected to happen

[2024-03-04 01:18:44,055 W 156 235] (gcs_server) gcs_task_manager.cc:286: Max number of tasks event (100000) allowed is reached. Old task events will be overwritten. Set RAY_task_events_max_num_task_in_gcs to a higher value to store more.
[2024-03-04 01:18:48,602 C 156 235] (gcs_server) gcs_task_manager.cc:116: Check failed: idx_itr != task_attempt_index_.end() Task attempt of task: NIL_ID, attempt_number: 0 should have task events in the buffer but missing.
StackTrace Information
/usr/local/lib/python3.9/dist-packages/ray/core/src/ray/gcs/gcs_server(+0x9a70aa) [0x56089868b0aa] ray::operator<<()
/usr/local/lib/python3.9/dist-packages/ray/core/src/ray/gcs/gcs_server(+0x9a8b82) [0x56089868cb82] ray::SpdLogMessage::Flush()
/usr/local/lib/python3.9/dist-packages/ray/core/src/ray/gcs/gcs_server(+0x9a8e97) [0x56089868ce97] ray::RayLog::~RayLog()
/usr/local/lib/python3.9/dist-packages/ray/core/src/ray/gcs/gcs_server(+0x296e27) [0x560897f7ae27] ray::gcs::GcsTaskManager::GcsTaskManagerStorage::GetTaskEvent()
/usr/local/lib/python3.9/dist-packages/ray/core/src/ray/gcs/gcs_server(+0x296ecf) [0x560897f7aecf] ray::gcs::GcsTaskManager::GcsTaskManagerStorage::MarkTaskAttemptFailed()
/usr/local/lib/python3.9/dist-packages/ray/core/src/ray/gcs/gcs_server(+0x29762e) [0x560897f7b62e] ray::gcs::GcsTaskManager::GcsTaskManagerStorage::MarkTasksFailed()
/usr/local/lib/python3.9/dist-packages/ray/core/src/ray/gcs/gcs_server(+0x29790a) [0x560897f7b90a] boost::asio::detail::wait_handler<>::do_complete()
/usr/local/lib/python3.9/dist-packages/ray/core/src/ray/gcs/gcs_server(+0xa9926b) [0x56089877d26b] boost::asio::detail::scheduler::do_run_one()
/usr/local/lib/python3.9/dist-packages/ray/core/src/ray/gcs/gcs_server(+0xa9a501) [0x56089877e501] boost::asio::detail::scheduler::run()
/usr/local/lib/python3.9/dist-packages/ray/core/src/ray/gcs/gcs_server(+0xa9a770) [0x56089877e770] boost::asio::io_context::run()
/usr/local/lib/python3.9/dist-packages/ray/core/src/ray/gcs/gcs_server(+0x1f8c8e) [0x560897edcc8e] std::thread::_State_impl<>::_M_run()
/usr/local/lib/python3.9/dist-packages/ray/core/src/ray/gcs/gcs_server(+0xafadb0) [0x5608987dedb0] execute_native_thread_routine
/lib/x86_64-linux-gnu/libpthread.so.0(+0x7ea7) [0x7f355d15fea7] start_thread
/lib/x86_64-linux-gnu/libc.so.6(clone+0x3f) [0x7f355cd48a2f] __clone
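
The warning at the top of the log suggests raising RAY_task_events_max_num_task_in_gcs. Below is a minimal sketch of one way to pass that setting, assuming the GCS server is started through `ray start --head` and picks the value up from its environment. The helper name and the value 200000 are illustrative assumptions, not taken from this report, and raising the limit would only delay the overwrite of old task events; it is not confirmed to prevent the check failure.

```python
import os


def env_with_task_event_limit(limit, base_env=None):
    """Return a copy of the environment with the GCS task event limit set.

    `limit` is a hypothetical value chosen by the operator; the variable
    name is the one printed in the gcs_task_manager.cc warning.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["RAY_task_events_max_num_task_in_gcs"] = str(limit)
    return env


# Assumption: 200000 is an arbitrary example, double the 100000 limit in the log.
head_env = env_with_task_event_limit(200_000)

# The head node could then be launched with this environment, for example:
#   import subprocess
#   subprocess.run(["ray", "start", "--head"], env=head_env, check=True)
```

The launch call is left commented out so that the sketch runs without a Ray installation.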

Versions / Dependencies

Ray 2.3.1

Reproduction script

Hard to reproduce.

Issue Severity

Medium: It is a significant difficulty but I can work around it.

MissiontoMars commented 4 months ago

@rickyyx Could you take a look at this problem?

jjyao commented 2 weeks ago

@MissiontoMars Is this still an issue with the latest Ray?