ray-project / ray

Ray is a unified framework for scaling AI and Python applications. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

Ray core: `AttributeError: 'Worker' object has no attribute 'core_worker'` #47759

Open · denadai2 opened this issue 1 week ago

denadai2 commented 1 week ago

What happened + What you expected to happen

I'm getting the following error:

(RayTrainWorker pid=43515) [rank1]: Traceback (most recent call last):
(RayTrainWorker pid=43515) [rank1]:   File "python/ray/_raylet.pyx", line 2250, in ray._raylet.task_execution_handler
(RayTrainWorker pid=43515) [rank1]:   File "python/ray/_raylet.pyx", line 2081, in ray._raylet.execute_task_with_cancellation_handler
(RayTrainWorker pid=43515) [rank1]: AttributeError: 'Worker' object has no attribute 'core_worker'
(RayTrainWorker pid=43515)
(RayTrainWorker pid=43515) [rank1]: During handling of the above exception, another exception occurred:
(RayTrainWorker pid=43515)
(RayTrainWorker pid=43515) [rank1]: Traceback (most recent call last):
(RayTrainWorker pid=43515) [rank1]:   File "python/ray/_raylet.pyx", line 2289, in ray._raylet.task_execution_handler
(RayTrainWorker pid=43515) [rank1]:   File "/Users/mdenadai/.local/share/virtualenvs/mine-ec3snymA/lib/python3.12/site-packages/ray/_private/utils.py", line 178, in push_error_to_driver
(RayTrainWorker pid=43515) [rank1]:     worker.core_worker.push_error(job_id, error_type, message, time.time())
(RayTrainWorker pid=43515) [rank1]:     ^^^^^^^^^^^^^^^^^^
(RayTrainWorker pid=43515) [rank1]: AttributeError: 'Worker' object has no attribute 'core_worker'
(RayTrainWorker pid=43515) Exception ignored in: 'ray._raylet.task_execution_handler'
(RayTrainWorker pid=43515) Traceback (most recent call last):
(RayTrainWorker pid=43515)   File "python/ray/_raylet.pyx", line 2289, in ray._raylet.task_execution_handler
(RayTrainWorker pid=43515)   File "/Users/mdenadai/.local/share/virtualenvs/mine-ec3snymA/lib/python3.12/site-packages/ray/_private/utils.py", line 178, in push_error_to_driver
(RayTrainWorker pid=43515)     worker.core_worker.push_error(job_id, error_type, message, time.time())
(RayTrainWorker pid=43515)     ^^^^^^^^^^^^^^^^^^
(RayTrainWorker pid=43515) AttributeError: 'Worker' object has no attribute 'core_worker'
(RayTrainWorker pid=43514) [rank0]: Traceback (most recent call last):
(RayTrainWorker pid=43514) [rank0]:   File "python/ray/_raylet.pyx", line 2250, in ray._raylet.task_execution_handler
(RayTrainWorker pid=43514) [rank0]:   File "python/ray/_raylet.pyx", line 2081, in ray._raylet.execute_task_with_cancellation_handler
(RayTrainWorker pid=43514) [rank0]: AttributeError: 'Worker' object has no attribute 'core_worker'
(RayTrainWorker pid=43514)
(RayTrainWorker pid=43514) [rank0]: During handling of the above exception, another exception occurred:
(RayTrainWorker pid=43514)
(RayTrainWorker pid=43514) [rank0]: Traceback (most recent call last):
(RayTrainWorker pid=43514) [rank0]:   File "python/ray/_raylet.pyx", line 2289, in ray._raylet.task_execution_handler
(RayTrainWorker pid=43514) [rank0]:   File "/Users/mdenadai/.local/share/virtualenvs/mine-ec3snymA/lib/python3.12/site-packages/ray/_private/utils.py", line 178, in push_error_to_driver
(RayTrainWorker pid=43514) [rank0]:     worker.core_worker.push_error(job_id, error_type, message, time.time())
(RayTrainWorker pid=43514) [rank0]:     ^^^^^^^^^^^^^^^^^^
(RayTrainWorker pid=43514) [rank0]: AttributeError: 'Worker' object has no attribute 'core_worker'
(RayTrainWorker pid=43514) Exception ignored in: 'ray._raylet.task_execution_handler'
(RayTrainWorker pid=43514) Traceback (most recent call last):
(RayTrainWorker pid=43514)   File "python/ray/_raylet.pyx", line 2289, in ray._raylet.task_execution_handler
(RayTrainWorker pid=43514)   File "/Users/mdenadai/.local/share/virtualenvs/mine-ec3snymA/lib/python3.12/site-packages/ray/_private/utils.py", line 178, in push_error_to_driver
(RayTrainWorker pid=43514)     worker.core_worker.push_error(job_id, error_type, message, time.time())
(RayTrainWorker pid=43514)     ^^^^^^^^^^^^^^^^^^
(RayTrainWorker pid=43514) AttributeError: 'Worker' object has no attribute 'core_worker'
(raylet) A worker died or was killed while executing a task by an unexpected system error. To troubleshoot the problem, check the logs for the dead worker. RayTask ID: fffffffffffffffff2520e7442c3968c34db544c01000000 Worker ID: a91f8545715a14d51667aa684e46a2c9e2cf68cbd4ebdb62eadfb416 Node ID: 6b191420c0de3aa47f1e601f4a4cb5b47d016d9bddcb21eebe6e7166 Worker IP address: 127.0.0.1 Worker port: 56517 Worker PID: 43514 Worker exit type: SYSTEM_ERROR Worker exit detail: Worker unexpectedly exits with a connection error code 2. End of file. There are some potential root causes. (1) The process is killed by SIGKILL by OOM killer due to high memory usage. (2) ray stop --force is called. (3) The worker is crashed unexpectedly due to SIGSEGV or other unexpected errors.
(raylet) A worker died or was killed while executing a task by an unexpected system error. To troubleshoot the problem, check the logs for the dead worker. RayTask ID: ffffffffffffffff01b1ad84bd513d3cc1518fa001000000 Worker ID: 6997ec909d56ae315609cdadb6aa7cabe7df4730cc51d98ef93504d0 Node ID: 6b191420c0de3aa47f1e601f4a4cb5b47d016d9bddcb21eebe6e7166 Worker IP address: 127.0.0.1 Worker port: 56513 Worker PID: 43515 Worker exit type: SYSTEM_ERROR Worker exit detail: Worker unexpectedly exits with a connection error code 2. End of file. There are some potential root causes. (1) The process is killed by SIGKILL by OOM killer due to high memory usage. (2) ray stop --force is called. (3) The worker is crashed unexpectedly due to SIGSEGV or other unexpected errors.
(TorchTrainer pid=43504) Worker 0 has failed.
(RayTrainWorker pid=43515) [2024-09-20 15:32:10,002 C 43515 274022] task_receiver.cc:213:  Check failed: objects_valid
(RayTrainWorker pid=43515) *** StackTrace Information ***
(RayTrainWorker pid=43515) 0   _raylet.so                          0x0000000103eb1428 _ZN3raylsERNSt3__113basic_ostreamIcNS0_11char_traitsIcEEEERKNS_10StackTraceE + 84 ray::operator<<()
(RayTrainWorker pid=43515) 1   _raylet.so                          0x0000000103eb4d68 _ZN3ray6RayLogD2Ev + 84 ray::RayLog::~RayLog()
(RayTrainWorker pid=43515) 2   _raylet.so                          0x00000001036f4438 _ZNSt3__110__function6__funcIZN3ray4core12TaskReceiver10HandleTaskERKNS2_3rpc15PushTaskRequestEPNS5_13PushTaskReplyENS_8functionIFvNS2_6StatusENSB_IFvvEEESE_EEEE3$_0NS_9allocatorISH_EEFvSG_EEclEOSG_ + 5204 std::__1::__function::__func<>::operator()()
(RayTrainWorker pid=43515) 3   _raylet.so                          0x00000001036ad314 _ZN3ray4core14InboundRequest6AcceptEv + 128 ray::core::InboundRequest::Accept()
(RayTrainWorker pid=43515) 4   _raylet.so                          0x00000001036a7498 _ZN3ray4core20ActorSchedulingQueue31AcceptRequestOrRejectIfCanceledENS_6TaskIDERNS0_14InboundRequestE + 484 ray::core::ActorSchedulingQueue::AcceptRequestOrRejectIfCanceled()
(RayTrainWorker pid=43515) 5   _raylet.so                          0x00000001036a6700 _ZN3ray4core20ActorSchedulingQueue16ScheduleRequestsEv + 1872 ray::core::ActorSchedulingQueue::ScheduleRequests()
(RayTrainWorker pid=43515) 6   _raylet.so                          0x00000001036a5be4 _ZN3ray4core20ActorSchedulingQueue3AddExxNSt3__18functionIFvNS3_IFvNS_6StatusENS3_IFvvEEES6_EEEEEENS3_IFvRKS4_S8_EEES8_RKNS2_12basic_stringIcNS2_11char_traitsIcEENS2_9allocatorIcEEEERKNS2_10shared_ptrINS_27FunctionDescriptorInterfaceEEENS_6TaskIDERKNS2_6vectorINS_3rpc15ObjectReferenceENSI_ISV_EEEE + 1752 ray::core::ActorSchedulingQueue::Add()
(RayTrainWorker pid=43515) 7   _raylet.so                          0x00000001036f11a4 _ZN3ray4core12TaskReceiver10HandleTaskERKNS_3rpc15PushTaskRequestEPNS2_13PushTaskReplyENSt3__18functionIFvNS_6StatusENS9_IFvvEEESC_EEE + 3600 ray::core::TaskReceiver::HandleTask()
(RayTrainWorker pid=43515) 8   _raylet.so                          0x0000000103627bc8 _ZNSt3__110__function6__funcIZN3ray4core10CoreWorker14HandlePushTaskENS2_3rpc15PushTaskRequestEPNS5_13PushTaskReplyENS_8functionIFvNS2_6StatusENS9_IFvvEEESC_EEEE4$_47NS_9allocatorISF_EESB_EclEv + 476 std::__1::__function::__func<>::operator()()
(RayTrainWorker pid=43515) 9   _raylet.so                          0x000000010394d934 _ZN12EventTracker15RecordExecutionERKNSt3__18functionIFvvEEENS0_10shared_ptrI11StatsHandleEE + 232 EventTracker::RecordExecution()
(RayTrainWorker pid=43515) 10  _raylet.so                          0x0000000103947cdc _ZNSt3__110__function6__funcIZN23instrumented_io_context4postENS_8functionIFvvEEENS_12basic_stringIcNS_11char_traitsIcEENS_9allocatorIcEEEExE3$_0NS9_ISC_EES4_EclEv + 56 std::__1::__function::__func<>::operator()()
(RayTrainWorker pid=43515) 11  _raylet.so                          0x0000000103947514 _ZN5boost4asio6detail18completion_handlerINSt3__18functionIFvvEEENS0_10io_context19basic_executor_typeINS3_9allocatorIvEELm0EEEE11do_completeEPvPNS1_19scheduler_operationERKNS_6system10error_codeEm + 192 boost::asio::detail::completion_handler<>::do_complete()
(RayTrainWorker pid=43515) 12  _raylet.so                          0x0000000103f40500 _ZN5boost4asio6detail9scheduler10do_run_oneERNS1_27conditionally_enabled_mutex11scoped_lockERNS1_21scheduler_thread_infoERKNS_6system10error_codeE + 624 boost::asio::detail::scheduler::do_run_one()
(RayTrainWorker pid=43515) 13  _raylet.so                          0x0000000103f35078 _ZN5boost4asio6detail9scheduler3runERNS_6system10error_codeE + 200 boost::asio::detail::scheduler::run()
(RayTrainWorker pid=43515) 14  _raylet.so                          0x0000000103f34f60 _ZN5boost4asio10io_context3runEv + 32 boost::asio::io_context::run()
(RayTrainWorker pid=43515) 15  _raylet.so                          0x0000000103579b50 _ZN3ray4core10CoreWorker20RunTaskExecutionLoopEv + 204 ray::core::CoreWorker::RunTaskExecutionLoop()
(RayTrainWorker pid=43515) 16  _raylet.so                          0x0000000103630d08 _ZN3ray4core21CoreWorkerProcessImpl26RunWorkerTaskExecutionLoopEv + 268 ray::core::CoreWorkerProcessImpl::RunWorkerTaskExecutionLoop()
(RayTrainWorker pid=43515) 17  _raylet.so                          0x0000000103630bd4 _ZN3ray4core17CoreWorkerProcess20RunTaskExecutionLoopEv + 32 ray::core::CoreWorkerProcess::RunTaskExecutionLoop()
(RayTrainWorker pid=43515) 18  _raylet.so                          0x0000000103434224 _ZL50__pyx_pw_3ray_7_raylet_10CoreWorker_7run_task_loopP7_objectS0_ + 24 __pyx_pw_3ray_7_raylet_10CoreWorker_7run_task_loop()
(RayTrainWorker pid=43515) 19  Python                              0x0000000100d7d450 method_vectorcall_NOARGS + 120 method_vectorcall_NOARGS
(RayTrainWorker pid=43515) 20  Python                              0x0000000100e65420 _PyEval_EvalFrameDefault + 43648 _PyEval_EvalFrameDefault
(RayTrainWorker pid=43515) 21  Python                              0x0000000100e5a730 PyEval_EvalCode + 184 PyEval_EvalCode
(RayTrainWorker pid=43515) 22  Python                              0x0000000100ebbb3c run_eval_code_obj + 88 run_eval_code_obj
(RayTrainWorker pid=43515) 23  Python                              0x0000000100eb9bd4 run_mod + 132 run_mod
(RayTrainWorker pid=43515) 24  Python                              0x0000000100eb90e4 pyrun_file + 156 pyrun_file
(RayTrainWorker pid=43515) 25  Python                              0x0000000100eb84c4 _PyRun_SimpleFileObject + 288 _PyRun_SimpleFileObject
(RayTrainWorker pid=43515) 26  Python                              0x0000000100eb8110 _PyRun_AnyFileObject + 80 _PyRun_AnyFileObject
(RayTrainWorker pid=43515) 27  Python                              0x0000000100edce74 pymain_run_file_obj + 164 pymain_run_file_obj
(RayTrainWorker pid=43515) 28  Python                              0x0000000100edcbec pymain_run_file + 72 pymain_run_file
(RayTrainWorker pid=43515) 29  Python                              0x0000000100edc1b0 Py_RunMain + 756 Py_RunMain
(RayTrainWorker pid=43515) 30  Python                              0x0000000100edc664 pymain_main + 304 pymain_main
(RayTrainWorker pid=43515) 31  Python                              0x0000000100edc704 Py_BytesMain + 40 Py_BytesMain
(RayTrainWorker pid=43515) 32  dyld                                0x000000018157c274 start + 2840 start
(RayTrainWorker pid=43515)
(RayTrainWorker pid=43514) [2024-09-20 15:32:10,003 C 43514 274023] task_receiver.cc:213:  Check failed: objects_valid
(RayTrainWorker pid=43514) *** StackTrace Information ***
(RayTrainWorker pid=43514) 0   _raylet.so                          0x0000000105c3d428 _ZN3raylsERNSt3__113basic_ostreamIcNS0_11char_traitsIcEEEERKNS_10StackTraceE + 84 ray::operator<<()
(RayTrainWorker pid=43514) 1   _raylet.so                          0x0000000105c40d68 _ZN3ray6RayLogD2Ev + 84 ray::RayLog::~RayLog()
(RayTrainWorker pid=43514) 2   _raylet.so                          0x0000000105480438 _ZNSt3__110__function6__funcIZN3ray4core12TaskReceiver10HandleTaskERKNS2_3rpc15PushTaskRequestEPNS5_13PushTaskReplyENS_8functionIFvNS2_6StatusENSB_IFvvEEESE_EEEE3$_0NS_9allocatorISH_EEFvSG_EEclEOSG_ + 5204 std::__1::__function::__func<>::operator()()
(RayTrainWorker pid=43514) 3   _raylet.so                          0x0000000105439314 _ZN3ray4core14InboundRequest6AcceptEv + 128 ray::core::InboundRequest::Accept()
(RayTrainWorker pid=43514) 4   _raylet.so                          0x0000000105433498 _ZN3ray4core20ActorSchedulingQueue31AcceptRequestOrRejectIfCanceledENS_6TaskIDERNS0_14InboundRequestE + 484 ray::core::ActorSchedulingQueue::AcceptRequestOrRejectIfCanceled()
(RayTrainWorker pid=43514) 5   _raylet.so                          0x0000000105432700 _ZN3ray4core20ActorSchedulingQueue16ScheduleRequestsEv + 1872 ray::core::ActorSchedulingQueue::ScheduleRequests()
(RayTrainWorker pid=43514) 6   _raylet.so                          0x0000000105431be4 _ZN3ray4core20ActorSchedulingQueue3AddExxNSt3__18functionIFvNS3_IFvNS_6StatusENS3_IFvvEEES6_EEEEEENS3_IFvRKS4_S8_EEES8_RKNS2_12basic_stringIcNS2_11char_traitsIcEENS2_9allocatorIcEEEERKNS2_10shared_ptrINS_27FunctionDescriptorInterfaceEEENS_6TaskIDERKNS2_6vectorINS_3rpc15ObjectReferenceENSI_ISV_EEEE + 1752 ray::core::ActorSchedulingQueue::Add()
(RayTrainWorker pid=43514) 7   _raylet.so                          0x000000010547d1a4 _ZN3ray4core12TaskReceiver10HandleTaskERKNS_3rpc15PushTaskRequestEPNS2_13PushTaskReplyENSt3__18functionIFvNS_6StatusENS9_IFvvEEESC_EEE + 3600 ray::core::TaskReceiver::HandleTask()
(RayTrainWorker pid=43514) 8   _raylet.so                          0x00000001053b3bc8 _ZNSt3__110__function6__funcIZN3ray4core10CoreWorker14HandlePushTaskENS2_3rpc15PushTaskRequestEPNS5_13PushTaskReplyENS_8functionIFvNS2_6StatusENS9_IFvvEEESC_EEEE4$_47NS_9allocatorISF_EESB_EclEv + 476 std::__1::__function::__func<>::operator()()
(RayTrainWorker pid=43514) 9   _raylet.so                          0x00000001056d9934 _ZN12EventTracker15RecordExecutionERKNSt3__18functionIFvvEEENS0_10shared_ptrI11StatsHandleEE + 232 EventTracker::RecordExecution()
(RayTrainWorker pid=43514) 10  _raylet.so                          0x00000001056d3cdc _ZNSt3__110__function6__funcIZN23instrumented_io_context4postENS_8functionIFvvEEENS_12basic_stringIcNS_11char_traitsIcEENS_9allocatorIcEEEExE3$_0NS9_ISC_EES4_EclEv + 56 std::__1::__function::__func<>::operator()()
(RayTrainWorker pid=43514) 11  _raylet.so                          0x00000001056d3514 _ZN5boost4asio6detail18completion_handlerINSt3__18functionIFvvEEENS0_10io_context19basic_executor_typeINS3_9allocatorIvEELm0EEEE11do_completeEPvPNS1_19scheduler_operationERKNS_6system10error_codeEm + 192 boost::asio::detail::completion_handler<>::do_complete()
(RayTrainWorker pid=43514) 12  _raylet.so                          0x0000000105ccc500 _ZN5boost4asio6detail9scheduler10do_run_oneERNS1_27conditionally_enabled_mutex11scoped_lockERNS1_21scheduler_thread_infoERKNS_6system10error_codeE + 624 boost::asio::detail::scheduler::do_run_one()
(RayTrainWorker pid=43514) 13  _raylet.so                          0x0000000105cc1078 _ZN5boost4asio6detail9scheduler3runERNS_6system10error_codeE + 200 boost::asio::detail::scheduler::run()
(RayTrainWorker pid=43514) 14  _raylet.so                          0x0000000105cc0f60 _ZN5boost4asio10io_context3runEv + 32 boost::asio::io_context::run()
(RayTrainWorker pid=43514) 15  _raylet.so                          0x0000000105305b50 _ZN3ray4core10CoreWorker20RunTaskExecutionLoopEv + 204 ray::core::CoreWorker::RunTaskExecutionLoop()
(RayTrainWorker pid=43514) 16  _raylet.so                          0x00000001053bcd08 _ZN3ray4core21CoreWorkerProcessImpl26RunWorkerTaskExecutionLoopEv + 268 ray::core::CoreWorkerProcessImpl::RunWorkerTaskExecutionLoop()
(RayTrainWorker pid=43514) 17  _raylet.so                          0x00000001053bcbd4 _ZN3ray4core17CoreWorkerProcess20RunTaskExecutionLoopEv + 32 ray::core::CoreWorkerProcess::RunTaskExecutionLoop()
(RayTrainWorker pid=43514) 18  _raylet.so                          0x00000001051c0224 _ZL50__pyx_pw_3ray_7_raylet_10CoreWorker_7run_task_loopP7_objectS0_ + 24 __pyx_pw_3ray_7_raylet_10CoreWorker_7run_task_loop()
(RayTrainWorker pid=43514) 19  Python                              0x0000000102b09450 method_vectorcall_NOARGS + 120 method_vectorcall_NOARGS
(RayTrainWorker pid=43514) 20  Python                              0x0000000102bf1420 _PyEval_EvalFrameDefault + 43648 _PyEval_EvalFrameDefault
(RayTrainWorker pid=43514) 21  Python                              0x0000000102be6730 PyEval_EvalCode + 184 PyEval_EvalCode
(RayTrainWorker pid=43514) 22  Python                              0x0000000102c47b3c run_eval_code_obj + 88 run_eval_code_obj
(RayTrainWorker pid=43514) 23  Python                              0x0000000102c45bd4 run_mod + 132 run_mod
(RayTrainWorker pid=43514) 24  Python                              0x0000000102c450e4 pyrun_file + 156 pyrun_file
(RayTrainWorker pid=43514) 25  Python                              0x0000000102c444c4 _PyRun_SimpleFileObject + 288 _PyRun_SimpleFileObject
(RayTrainWorker pid=43514) 26  Python                              0x0000000102c44110 _PyRun_AnyFileObject + 80 _PyRun_AnyFileObject
(RayTrainWorker pid=43514) 27  Python                              0x0000000102c68e74 pymain_run_file_obj + 164 pymain_run_file_obj
(RayTrainWorker pid=43514) 28  Python                              0x0000000102c68bec pymain_run_file + 72 pymain_run_file
(RayTrainWorker pid=43514) 29  Python                              0x0000000102c681b0 Py_RunMain + 756 Py_RunMain
(RayTrainWorker pid=43514) 30  Python                              0x0000000102c68664 pymain_main + 304 pymain_main
(RayTrainWorker pid=43514) 31  Python                              0x0000000102c68704 Py_BytesMain + 40 Py_BytesMain
(RayTrainWorker pid=43514) 32  dyld                                0x000000018157c274 start + 2840 start
(RayTrainWorker pid=43514)
(RayTrainWorker pid=43514) /opt/homebrew/Cellar/python@3.12/3.12.4/Frameworks/Python.framework/Versions/3.12/lib/python3.12/multiprocessing/resource_tracker.py:254: UserWarning: resource_tracker: There appear to be 1 leaked semaphore objects to clean up at shutdown
(RayTrainWorker pid=43514)   warnings.warn('resource_tracker: There appear to be %d '
2024-09-20 15:32:10,202 ERROR tune_controller.py:1331 -- Trial task failed for trial TorchTrainer_b9807_00000
Traceback (most recent call last):
  File "/Users/mdenadai/.local/share/virtualenvs/mine-ec3snymA/lib/python3.12/site-packages/ray/air/execution/_internal/event_manager.py", line 110, in resolve_future
    result = ray.get(future)
             ^^^^^^^^^^^^^^^
  File "/Users/mdenadai/.local/share/virtualenvs/mine-ec3snymA/lib/python3.12/site-packages/ray/_private/auto_init_hook.py", line 21, in auto_init_wrapper
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/Users/mdenadai/.local/share/virtualenvs/mine-ec3snymA/lib/python3.12/site-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/Users/mdenadai/.local/share/virtualenvs/mine-ec3snymA/lib/python3.12/site-packages/ray/_private/worker.py", line 2661, in get
    values, debugger_breakpoint = worker.get_objects(object_refs, timeout=timeout)
                                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/mdenadai/.local/share/virtualenvs/mine-ec3snymA/lib/python3.12/site-packages/ray/_private/worker.py", line 871, in get_objects
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(ActorDiedError): ray::_Inner.train() (pid=43504, ip=127.0.0.1, actor_id=c4d42c4bfcec8eb72c70d88d01000000, repr=TorchTrainer)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/mdenadai/.local/share/virtualenvs/mine-ec3snymA/lib/python3.12/site-packages/ray/tune/trainable/trainable.py", line 331, in train
    raise skipped from exception_cause(skipped)
  File "/Users/mdenadai/.local/share/virtualenvs/mine-ec3snymA/lib/python3.12/site-packages/ray/air/_internal/util.py", line 98, in run
    self._ret = self._target(*self._args, **self._kwargs)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/mdenadai/.local/share/virtualenvs/mine-ec3snymA/lib/python3.12/site-packages/ray/tune/trainable/function_trainable.py", line 45, in <lambda>
    training_func=lambda: self._trainable_func(self.config),
                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/mdenadai/.local/share/virtualenvs/mine-ec3snymA/lib/python3.12/site-packages/ray/train/base_trainer.py", line 799, in _trainable_func
    super()._trainable_func(self._merged_config)
  File "/Users/mdenadai/.local/share/virtualenvs/mine-ec3snymA/lib/python3.12/site-packages/ray/tune/trainable/function_trainable.py", line 250, in _trainable_func
    output = fn()
             ^^^^
  File "/Users/mdenadai/.local/share/virtualenvs/mine-ec3snymA/lib/python3.12/site-packages/ray/train/base_trainer.py", line 107, in _train_coordinator_fn
    trainer.training_loop()
  File "/Users/mdenadai/.local/share/virtualenvs/mine-ec3snymA/lib/python3.12/site-packages/ray/train/data_parallel_trainer.py", line 471, in training_loop
    self._run_training(training_iterator)
  File "/Users/mdenadai/.local/share/virtualenvs/mine-ec3snymA/lib/python3.12/site-packages/ray/train/data_parallel_trainer.py", line 370, in _run_training
    for training_results in training_iterator:
  File "/Users/mdenadai/.local/share/virtualenvs/mine-ec3snymA/lib/python3.12/site-packages/ray/train/trainer.py", line 124, in __next__
    next_results = self._run_with_error_handling(self._fetch_next_result)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/mdenadai/.local/share/virtualenvs/mine-ec3snymA/lib/python3.12/site-packages/ray/train/trainer.py", line 89, in _run_with_error_handling
    return func()
           ^^^^^^
  File "/Users/mdenadai/.local/share/virtualenvs/mine-ec3snymA/lib/python3.12/site-packages/ray/train/trainer.py", line 156, in _fetch_next_result
    results = self._backend_executor.get_next_results()
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/mdenadai/.local/share/virtualenvs/mine-ec3snymA/lib/python3.12/site-packages/ray/train/_internal/backend_executor.py", line 600, in get_next_results
    results = self.get_with_failure_handling(futures)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/mdenadai/.local/share/virtualenvs/mine-ec3snymA/lib/python3.12/site-packages/ray/train/_internal/backend_executor.py", line 700, in get_with_failure_handling
    self._increment_failures()
  File "/Users/mdenadai/.local/share/virtualenvs/mine-ec3snymA/lib/python3.12/site-packages/ray/train/_internal/backend_executor.py", line 762, in _increment_failures
    raise failure
  File "/Users/mdenadai/.local/share/virtualenvs/mine-ec3snymA/lib/python3.12/site-packages/ray/train/_internal/utils.py", line 53, in check_for_failure
    ray.get(object_ref)
           ^^^^^^^^^^^^^^^^^^^
           ^^^^^^^^^^^^^^^^^^^^^
                                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ray.exceptions.ActorDiedError: The actor died unexpectedly before finishing this task.
    class_name: RayTrainWorker
    actor_id: f2520e7442c3968c34db544c01000000
    pid: 43514
    namespace: 8789b499-0377-4077-b340-4afddfd0a648
    ip: 127.0.0.1
The actor is dead because its worker process has died. Worker exit type: SYSTEM_ERROR Worker exit detail: Worker unexpectedly exits with a connection error code 2. End of file. There are some potential root causes. (1) The process is killed by SIGKILL by OOM killer due to high memory usage. (2) ray stop --force is called. (3) The worker is crashed unexpectedly due to SIGSEGV or other unexpected errors.

Why, and under what circumstances, does the `Worker` object not have a `core_worker` attribute?

Thank you

Versions / Dependencies

Python 3.12.4 (main, Jun 6 2024, 18:26:44) [Clang 15.0.0 (clang-1500.3.9.4)] on darwin

ray.__version__
'2.35.0'

Reproduction script

I cannot share the script at the moment, but it only uses a `TorchTrainer` with a dataloader coming from PyTorch Geometric (see the sketch below).
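A minimal sketch of the shape of the setup, not the actual script: it assumes a stock `TorchTrainer` with a PyTorch Geometric `DataLoader`, and the dataset (`FakeDataset`), model, and hyperparameters below are hypothetical placeholders.

```python
import torch
from torch_geometric.datasets import FakeDataset  # placeholder for the real dataset
from torch_geometric.loader import DataLoader

import ray.train.torch
from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer


def train_loop_per_worker(config):
    # Placeholder graph dataset and PyG DataLoader; the real script uses its own data.
    dataset = FakeDataset(num_graphs=128)
    loader = DataLoader(dataset, batch_size=config["batch_size"], shuffle=True)

    # Placeholder model; prepare_model wraps it for distributed training.
    model = torch.nn.Linear(dataset.num_features, 1)
    model = ray.train.torch.prepare_model(model)
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

    for epoch in range(config["epochs"]):
        for batch in loader:
            optimizer.zero_grad()
            loss = model(batch.x).mean()  # dummy objective for illustration
            loss.backward()
            optimizer.step()
        ray.train.report({"epoch": epoch})


trainer = TorchTrainer(
    train_loop_per_worker,
    train_loop_config={"batch_size": 8, "epochs": 2},
    scaling_config=ScalingConfig(num_workers=2),
)
result = trainer.fit()
```

In my run, the `AttributeError` and the `Check failed: objects_valid` crash show up in the `RayTrainWorker` processes spawned by `trainer.fit()`, not in the driver script itself.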

Issue Severity

Medium: It is a significant difficulty but I can work around it.

denadai2 commented 2 days ago

FYI, this happens with 2.37 as well.