Open denadai2 opened 1 week ago
What happened + What you expected to happen

I have this error:
(RayTrainWorker pid=43515) [rank1]: Traceback (most recent call last):
(RayTrainWorker pid=43515) [rank1]:   File "python/ray/_raylet.pyx", line 2250, in ray._raylet.task_execution_handler
(RayTrainWorker pid=43515) [rank1]:   File "python/ray/_raylet.pyx", line 2081, in ray._raylet.execute_task_with_cancellation_handler
(RayTrainWorker pid=43515) [rank1]: AttributeError: 'Worker' object has no attribute 'core_worker'
(RayTrainWorker pid=43515)
(RayTrainWorker pid=43515) [rank1]: During handling of the above exception, another exception occurred:
(RayTrainWorker pid=43515)
(RayTrainWorker pid=43515) [rank1]: Traceback (most recent call last):
(RayTrainWorker pid=43515) [rank1]:   File "python/ray/_raylet.pyx", line 2289, in ray._raylet.task_execution_handler
(RayTrainWorker pid=43515) [rank1]:   File "/Users/mdenadai/.local/share/virtualenvs/mine-ec3snymA/lib/python3.12/site-packages/ray/_private/utils.py", line 178, in push_error_to_driver
(RayTrainWorker pid=43515) [rank1]:     worker.core_worker.push_error(job_id, error_type, message, time.time())
(RayTrainWorker pid=43515) [rank1]:     ^^^^^^^^^^^^^^^^^^
(RayTrainWorker pid=43515) [rank1]: AttributeError: 'Worker' object has no attribute 'core_worker'
(RayTrainWorker pid=43515) Exception ignored in: 'ray._raylet.task_execution_handler'
(RayTrainWorker pid=43515) Traceback (most recent call last):
(RayTrainWorker pid=43515)   File "python/ray/_raylet.pyx", line 2289, in ray._raylet.task_execution_handler
(RayTrainWorker pid=43515)   File "/Users/mdenadai/.local/share/virtualenvs/mine-ec3snymA/lib/python3.12/site-packages/ray/_private/utils.py", line 178, in push_error_to_driver
(RayTrainWorker pid=43515)     worker.core_worker.push_error(job_id, error_type, message, time.time())
(RayTrainWorker pid=43515)     ^^^^^^^^^^^^^^^^^^
(RayTrainWorker pid=43515) AttributeError: 'Worker' object has no attribute 'core_worker'
(RayTrainWorker pid=43514) [rank0]: Traceback (most recent call last):
(RayTrainWorker pid=43514) [rank0]:   File "python/ray/_raylet.pyx", line 2250, in ray._raylet.task_execution_handler
(RayTrainWorker pid=43514) [rank0]:   File "python/ray/_raylet.pyx", line 2081, in ray._raylet.execute_task_with_cancellation_handler
(RayTrainWorker pid=43514) [rank0]: AttributeError: 'Worker' object has no attribute 'core_worker'
(RayTrainWorker pid=43514)
(RayTrainWorker pid=43514) [rank0]: During handling of the above exception, another exception occurred:
(RayTrainWorker pid=43514)
(RayTrainWorker pid=43514) [rank0]: Traceback (most recent call last):
(RayTrainWorker pid=43514) [rank0]:   File "python/ray/_raylet.pyx", line 2289, in ray._raylet.task_execution_handler
(RayTrainWorker pid=43514) [rank0]:   File "/Users/mdenadai/.local/share/virtualenvs/mine-ec3snymA/lib/python3.12/site-packages/ray/_private/utils.py", line 178, in push_error_to_driver
(RayTrainWorker pid=43514) [rank0]:     worker.core_worker.push_error(job_id, error_type, message, time.time())
(RayTrainWorker pid=43514) [rank0]:     ^^^^^^^^^^^^^^^^^^
(RayTrainWorker pid=43514) [rank0]: AttributeError: 'Worker' object has no attribute 'core_worker'
(RayTrainWorker pid=43514) Exception ignored in: 'ray._raylet.task_execution_handler'
(RayTrainWorker pid=43514) Traceback (most recent call last):
(RayTrainWorker pid=43514)   File "python/ray/_raylet.pyx", line 2289, in ray._raylet.task_execution_handler
(RayTrainWorker pid=43514)   File "/Users/mdenadai/.local/share/virtualenvs/mine-ec3snymA/lib/python3.12/site-packages/ray/_private/utils.py", line 178, in push_error_to_driver
(RayTrainWorker pid=43514)     worker.core_worker.push_error(job_id, error_type, message, time.time())
(RayTrainWorker pid=43514)     ^^^^^^^^^^^^^^^^^^
(RayTrainWorker pid=43514) AttributeError: 'Worker' object has no attribute 'core_worker'
(raylet) A worker died or was killed while executing a task by an unexpected system error. To troubleshoot the problem, check the logs for the dead worker.
(raylet)   RayTask ID: fffffffffffffffff2520e7442c3968c34db544c01000000
(raylet)   Worker ID: a91f8545715a14d51667aa684e46a2c9e2cf68cbd4ebdb62eadfb416
(raylet)   Node ID: 6b191420c0de3aa47f1e601f4a4cb5b47d016d9bddcb21eebe6e7166
(raylet)   Worker IP address: 127.0.0.1
(raylet)   Worker port: 56517
(raylet)   Worker PID: 43514
(raylet)   Worker exit type: SYSTEM_ERROR
(raylet)   Worker exit detail: Worker unexpectedly exits with a connection error code 2. End of file. There are some potential root causes. (1) The process is killed by SIGKILL by OOM killer due to high memory usage. (2) ray stop --force is called. (3) The worker is crashed unexpectedly due to SIGSEGV or other unexpected errors.
(raylet) A worker died or was killed while executing a task by an unexpected system error. To troubleshoot the problem, check the logs for the dead worker.
(raylet)   RayTask ID: ffffffffffffffff01b1ad84bd513d3cc1518fa001000000
(raylet)   Worker ID: 6997ec909d56ae315609cdadb6aa7cabe7df4730cc51d98ef93504d0
(raylet)   Node ID: 6b191420c0de3aa47f1e601f4a4cb5b47d016d9bddcb21eebe6e7166
(raylet)   Worker IP address: 127.0.0.1
(raylet)   Worker port: 56513
(raylet)   Worker PID: 43515
(raylet)   Worker exit type: SYSTEM_ERROR
(raylet)   Worker exit detail: Worker unexpectedly exits with a connection error code 2. End of file. There are some potential root causes. (1) The process is killed by SIGKILL by OOM killer due to high memory usage. (2) ray stop --force is called. (3) The worker is crashed unexpectedly due to SIGSEGV or other unexpected errors.
(TorchTrainer pid=43504) Worker 0 has failed.
(RayTrainWorker pid=43515) [2024-09-20 15:32:10,002 C 43515 274022] task_receiver.cc:213: Check failed: objects_valid
(RayTrainWorker pid=43515) *** StackTrace Information ***
(RayTrainWorker pid=43515)  0  _raylet.so  0x0000000103eb1428  _ZN3raylsERNSt3__113basic_ostreamIcNS0_11char_traitsIcEEEERKNS_10StackTraceE + 84  ray::operator<<()
(RayTrainWorker pid=43515)  1  _raylet.so  0x0000000103eb4d68  _ZN3ray6RayLogD2Ev + 84  ray::RayLog::~RayLog()
(RayTrainWorker pid=43515)  2  _raylet.so  0x00000001036f4438  _ZNSt3__110__function6__funcIZN3ray4core12TaskReceiver10HandleTaskERKNS2_3rpc15PushTaskRequestEPNS5_13PushTaskReplyENS_8functionIFvNS2_6StatusENSB_IFvvEEESE_EEEE3$_0NS_9allocatorISH_EEFvSG_EEclEOSG_ + 5204  std::__1::__function::__func<>::operator()()
(RayTrainWorker pid=43515)  3  _raylet.so  0x00000001036ad314  _ZN3ray4core14InboundRequest6AcceptEv + 128  ray::core::InboundRequest::Accept()
(RayTrainWorker pid=43515)  4  _raylet.so  0x00000001036a7498  _ZN3ray4core20ActorSchedulingQueue31AcceptRequestOrRejectIfCanceledENS_6TaskIDERNS0_14InboundRequestE + 484  ray::core::ActorSchedulingQueue::AcceptRequestOrRejectIfCanceled()
(RayTrainWorker pid=43515)  5  _raylet.so  0x00000001036a6700  _ZN3ray4core20ActorSchedulingQueue16ScheduleRequestsEv + 1872  ray::core::ActorSchedulingQueue::ScheduleRequests()
(RayTrainWorker pid=43515)  6  _raylet.so  0x00000001036a5be4  _ZN3ray4core20ActorSchedulingQueue3AddExxNSt3__18functionIFvNS3_IFvNS_6StatusENS3_IFvvEEES6_EEEEEENS3_IFvRKS4_S8_EEES8_RKNS2_12basic_stringIcNS2_11char_traitsIcEENS2_9allocatorIcEEEERKNS2_10shared_ptrINS_27FunctionDescriptorInterfaceEEENS_6TaskIDERKNS2_6vectorINS_3rpc15ObjectReferenceENSI_ISV_EEEE + 1752  ray::core::ActorSchedulingQueue::Add()
(RayTrainWorker pid=43515)  7  _raylet.so  0x00000001036f11a4  _ZN3ray4core12TaskReceiver10HandleTaskERKNS_3rpc15PushTaskRequestEPNS2_13PushTaskReplyENSt3__18functionIFvNS_6StatusENS9_IFvvEEESC_EEE + 3600  ray::core::TaskReceiver::HandleTask()
(RayTrainWorker pid=43515)  8  _raylet.so  0x0000000103627bc8  _ZNSt3__110__function6__funcIZN3ray4core10CoreWorker14HandlePushTaskENS2_3rpc15PushTaskRequestEPNS5_13PushTaskReplyENS_8functionIFvNS2_6StatusENS9_IFvvEEESC_EEEE4$_47NS_9allocatorISF_EESB_EclEv + 476  std::__1::__function::__func<>::operator()()
(RayTrainWorker pid=43515)  9  _raylet.so  0x000000010394d934  _ZN12EventTracker15RecordExecutionERKNSt3__18functionIFvvEEENS0_10shared_ptrI11StatsHandleEE + 232  EventTracker::RecordExecution()
(RayTrainWorker pid=43515) 10  _raylet.so  0x0000000103947cdc  _ZNSt3__110__function6__funcIZN23instrumented_io_context4postENS_8functionIFvvEEENS_12basic_stringIcNS_11char_traitsIcEENS_9allocatorIcEEEExE3$_0NS9_ISC_EES4_EclEv + 56  std::__1::__function::__func<>::operator()()
(RayTrainWorker pid=43515) 11  _raylet.so  0x0000000103947514  _ZN5boost4asio6detail18completion_handlerINSt3__18functionIFvvEEENS0_10io_context19basic_executor_typeINS3_9allocatorIvEELm0EEEE11do_completeEPvPNS1_19scheduler_operationERKNS_6system10error_codeEm + 192  boost::asio::detail::completion_handler<>::do_complete()
(RayTrainWorker pid=43515) 12  _raylet.so  0x0000000103f40500  _ZN5boost4asio6detail9scheduler10do_run_oneERNS1_27conditionally_enabled_mutex11scoped_lockERNS1_21scheduler_thread_infoERKNS_6system10error_codeE + 624  boost::asio::detail::scheduler::do_run_one()
(RayTrainWorker pid=43515) 13  _raylet.so  0x0000000103f35078  _ZN5boost4asio6detail9scheduler3runERNS_6system10error_codeE + 200  boost::asio::detail::scheduler::run()
(RayTrainWorker pid=43515) 14  _raylet.so  0x0000000103f34f60  _ZN5boost4asio10io_context3runEv + 32  boost::asio::io_context::run()
(RayTrainWorker pid=43515) 15  _raylet.so  0x0000000103579b50  _ZN3ray4core10CoreWorker20RunTaskExecutionLoopEv + 204  ray::core::CoreWorker::RunTaskExecutionLoop()
(RayTrainWorker pid=43515) 16  _raylet.so  0x0000000103630d08  _ZN3ray4core21CoreWorkerProcessImpl26RunWorkerTaskExecutionLoopEv + 268  ray::core::CoreWorkerProcessImpl::RunWorkerTaskExecutionLoop()
(RayTrainWorker pid=43515) 17  _raylet.so  0x0000000103630bd4  _ZN3ray4core17CoreWorkerProcess20RunTaskExecutionLoopEv + 32  ray::core::CoreWorkerProcess::RunTaskExecutionLoop()
(RayTrainWorker pid=43515) 18  _raylet.so  0x0000000103434224  _ZL50__pyx_pw_3ray_7_raylet_10CoreWorker_7run_task_loopP7_objectS0_ + 24  __pyx_pw_3ray_7_raylet_10CoreWorker_7run_task_loop()
(RayTrainWorker pid=43515) 19  Python  0x0000000100d7d450  method_vectorcall_NOARGS + 120  method_vectorcall_NOARGS
(RayTrainWorker pid=43515) 20  Python  0x0000000100e65420  _PyEval_EvalFrameDefault + 43648  _PyEval_EvalFrameDefault
(RayTrainWorker pid=43515) 21  Python  0x0000000100e5a730  PyEval_EvalCode + 184  PyEval_EvalCode
(RayTrainWorker pid=43515) 22  Python  0x0000000100ebbb3c  run_eval_code_obj + 88  run_eval_code_obj
(RayTrainWorker pid=43515) 23  Python  0x0000000100eb9bd4  run_mod + 132  run_mod
(RayTrainWorker pid=43515) 24  Python  0x0000000100eb90e4  pyrun_file + 156  pyrun_file
(RayTrainWorker pid=43515) 25  Python  0x0000000100eb84c4  _PyRun_SimpleFileObject + 288  _PyRun_SimpleFileObject
(RayTrainWorker pid=43515) 26  Python  0x0000000100eb8110  _PyRun_AnyFileObject + 80  _PyRun_AnyFileObject
(RayTrainWorker pid=43515) 27  Python  0x0000000100edce74  pymain_run_file_obj + 164  pymain_run_file_obj
(RayTrainWorker pid=43515) 28  Python  0x0000000100edcbec  pymain_run_file + 72  pymain_run_file
(RayTrainWorker pid=43515) 29  Python  0x0000000100edc1b0  Py_RunMain + 756  Py_RunMain
(RayTrainWorker pid=43515) 30  Python  0x0000000100edc664  pymain_main + 304  pymain_main
(RayTrainWorker pid=43515) 31  Python  0x0000000100edc704  Py_BytesMain + 40  Py_BytesMain
(RayTrainWorker pid=43515) 32  dyld    0x000000018157c274  start + 2840  start
(RayTrainWorker pid=43515)
(RayTrainWorker pid=43514) [2024-09-20 15:32:10,003 C 43514 274023] task_receiver.cc:213: Check failed: objects_valid
(RayTrainWorker pid=43514) *** StackTrace Information ***
(RayTrainWorker pid=43514)  0  _raylet.so  0x0000000105c3d428  _ZN3raylsERNSt3__113basic_ostreamIcNS0_11char_traitsIcEEEERKNS_10StackTraceE + 84  ray::operator<<()
(RayTrainWorker pid=43514)  1  _raylet.so  0x0000000105c40d68  _ZN3ray6RayLogD2Ev + 84  ray::RayLog::~RayLog()
(RayTrainWorker pid=43514)  2  _raylet.so  0x0000000105480438  _ZNSt3__110__function6__funcIZN3ray4core12TaskReceiver10HandleTaskERKNS2_3rpc15PushTaskRequestEPNS5_13PushTaskReplyENS_8functionIFvNS2_6StatusENSB_IFvvEEESE_EEEE3$_0NS_9allocatorISH_EEFvSG_EEclEOSG_ + 5204  std::__1::__function::__func<>::operator()()
(RayTrainWorker pid=43514)  3  _raylet.so  0x0000000105439314  _ZN3ray4core14InboundRequest6AcceptEv + 128  ray::core::InboundRequest::Accept()
(RayTrainWorker pid=43514)  4  _raylet.so  0x0000000105433498  _ZN3ray4core20ActorSchedulingQueue31AcceptRequestOrRejectIfCanceledENS_6TaskIDERNS0_14InboundRequestE + 484  ray::core::ActorSchedulingQueue::AcceptRequestOrRejectIfCanceled()
(RayTrainWorker pid=43514)  5  _raylet.so  0x0000000105432700  _ZN3ray4core20ActorSchedulingQueue16ScheduleRequestsEv + 1872  ray::core::ActorSchedulingQueue::ScheduleRequests()
(RayTrainWorker pid=43514)  6  _raylet.so  0x0000000105431be4  _ZN3ray4core20ActorSchedulingQueue3AddExxNSt3__18functionIFvNS3_IFvNS_6StatusENS3_IFvvEEES6_EEEEEENS3_IFvRKS4_S8_EEES8_RKNS2_12basic_stringIcNS2_11char_traitsIcEENS2_9allocatorIcEEEERKNS2_10shared_ptrINS_27FunctionDescriptorInterfaceEEENS_6TaskIDERKNS2_6vectorINS_3rpc15ObjectReferenceENSI_ISV_EEEE + 1752  ray::core::ActorSchedulingQueue::Add()
(RayTrainWorker pid=43514)  7  _raylet.so  0x000000010547d1a4  _ZN3ray4core12TaskReceiver10HandleTaskERKNS_3rpc15PushTaskRequestEPNS2_13PushTaskReplyENSt3__18functionIFvNS_6StatusENS9_IFvvEEESC_EEE + 3600  ray::core::TaskReceiver::HandleTask()
(RayTrainWorker pid=43514)  8  _raylet.so  0x00000001053b3bc8  _ZNSt3__110__function6__funcIZN3ray4core10CoreWorker14HandlePushTaskENS2_3rpc15PushTaskRequestEPNS5_13PushTaskReplyENS_8functionIFvNS2_6StatusENS9_IFvvEEESC_EEEE4$_47NS_9allocatorISF_EESB_EclEv + 476  std::__1::__function::__func<>::operator()()
(RayTrainWorker pid=43514)  9  _raylet.so  0x00000001056d9934  _ZN12EventTracker15RecordExecutionERKNSt3__18functionIFvvEEENS0_10shared_ptrI11StatsHandleEE + 232  EventTracker::RecordExecution()
(RayTrainWorker pid=43514) 10  _raylet.so  0x00000001056d3cdc  _ZNSt3__110__function6__funcIZN23instrumented_io_context4postENS_8functionIFvvEEENS_12basic_stringIcNS_11char_traitsIcEENS_9allocatorIcEEEExE3$_0NS9_ISC_EES4_EclEv + 56  std::__1::__function::__func<>::operator()()
(RayTrainWorker pid=43514) 11  _raylet.so  0x00000001056d3514  _ZN5boost4asio6detail18completion_handlerINSt3__18functionIFvvEEENS0_10io_context19basic_executor_typeINS3_9allocatorIvEELm0EEEE11do_completeEPvPNS1_19scheduler_operationERKNS_6system10error_codeEm + 192  boost::asio::detail::completion_handler<>::do_complete()
(RayTrainWorker pid=43514) 12  _raylet.so  0x0000000105ccc500  _ZN5boost4asio6detail9scheduler10do_run_oneERNS1_27conditionally_enabled_mutex11scoped_lockERNS1_21scheduler_thread_infoERKNS_6system10error_codeE + 624  boost::asio::detail::scheduler::do_run_one()
(RayTrainWorker pid=43514) 13  _raylet.so  0x0000000105cc1078  _ZN5boost4asio6detail9scheduler3runERNS_6system10error_codeE + 200  boost::asio::detail::scheduler::run()
(RayTrainWorker pid=43514) 14  _raylet.so  0x0000000105cc0f60  _ZN5boost4asio10io_context3runEv + 32  boost::asio::io_context::run()
(RayTrainWorker pid=43514) 15  _raylet.so  0x0000000105305b50  _ZN3ray4core10CoreWorker20RunTaskExecutionLoopEv + 204  ray::core::CoreWorker::RunTaskExecutionLoop()
(RayTrainWorker pid=43514) 16  _raylet.so  0x00000001053bcd08  _ZN3ray4core21CoreWorkerProcessImpl26RunWorkerTaskExecutionLoopEv + 268  ray::core::CoreWorkerProcessImpl::RunWorkerTaskExecutionLoop()
(RayTrainWorker pid=43514) 17  _raylet.so  0x00000001053bcbd4  _ZN3ray4core17CoreWorkerProcess20RunTaskExecutionLoopEv + 32  ray::core::CoreWorkerProcess::RunTaskExecutionLoop()
(RayTrainWorker pid=43514) 18  _raylet.so  0x00000001051c0224  _ZL50__pyx_pw_3ray_7_raylet_10CoreWorker_7run_task_loopP7_objectS0_ + 24  __pyx_pw_3ray_7_raylet_10CoreWorker_7run_task_loop()
(RayTrainWorker pid=43514) 19  Python  0x0000000102b09450  method_vectorcall_NOARGS + 120  method_vectorcall_NOARGS
(RayTrainWorker pid=43514) 20  Python  0x0000000102bf1420  _PyEval_EvalFrameDefault + 43648  _PyEval_EvalFrameDefault
(RayTrainWorker pid=43514) 21  Python  0x0000000102be6730  PyEval_EvalCode + 184  PyEval_EvalCode
(RayTrainWorker pid=43514) 22  Python  0x0000000102c47b3c  run_eval_code_obj + 88  run_eval_code_obj
(RayTrainWorker pid=43514) 23  Python  0x0000000102c45bd4  run_mod + 132  run_mod
(RayTrainWorker pid=43514) 24  Python  0x0000000102c450e4  pyrun_file + 156  pyrun_file
(RayTrainWorker pid=43514) 25  Python  0x0000000102c444c4  _PyRun_SimpleFileObject + 288  _PyRun_SimpleFileObject
(RayTrainWorker pid=43514) 26  Python  0x0000000102c44110  _PyRun_AnyFileObject + 80  _PyRun_AnyFileObject
(RayTrainWorker pid=43514) 27  Python  0x0000000102c68e74  pymain_run_file_obj + 164  pymain_run_file_obj
(RayTrainWorker pid=43514) 28  Python  0x0000000102c68bec  pymain_run_file + 72  pymain_run_file
(RayTrainWorker pid=43514) 29  Python  0x0000000102c681b0  Py_RunMain + 756  Py_RunMain
(RayTrainWorker pid=43514) 30  Python  0x0000000102c68664  pymain_main + 304  pymain_main
(RayTrainWorker pid=43514) 31  Python  0x0000000102c68704  Py_BytesMain + 40  Py_BytesMain
(RayTrainWorker pid=43514) 32  dyld    0x000000018157c274  start + 2840  start
(RayTrainWorker pid=43514)
(RayTrainWorker pid=43514) /opt/homebrew/Cellar/python@3.12/3.12.4/Frameworks/Python.framework/Versions/3.12/lib/python3.12/multiprocessing/resource_tracker.py:254: UserWarning: resource_tracker: There appear to be 1 leaked semaphore objects to clean up at shutdown
(RayTrainWorker pid=43514)   warnings.warn('resource_tracker: There appear to be %d '
2024-09-20 15:32:10,202 ERROR tune_controller.py:1331 -- Trial task failed for trial TorchTrainer_b9807_00000
Traceback (most recent call last):
  File "/Users/mdenadai/.local/share/virtualenvs/mine-ec3snymA/lib/python3.12/site-packages/ray/air/execution/_internal/event_manager.py", line 110, in resolve_future
    result = ray.get(future)
             ^^^^^^^^^^^^^^^
  File "/Users/mdenadai/.local/share/virtualenvs/mine-ec3snymA/lib/python3.12/site-packages/ray/_private/auto_init_hook.py", line 21, in auto_init_wrapper
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/Users/mdenadai/.local/share/virtualenvs/mine-ec3snymA/lib/python3.12/site-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/Users/mdenadai/.local/share/virtualenvs/mine-ec3snymA/lib/python3.12/site-packages/ray/_private/worker.py", line 2661, in get
    values, debugger_breakpoint = worker.get_objects(object_refs, timeout=timeout)
                                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/mdenadai/.local/share/virtualenvs/mine-ec3snymA/lib/python3.12/site-packages/ray/_private/worker.py", line 871, in get_objects
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(ActorDiedError): ray::_Inner.train() (pid=43504, ip=127.0.0.1, actor_id=c4d42c4bfcec8eb72c70d88d01000000, repr=TorchTrainer)
  File "/Users/mdenadai/.local/share/virtualenvs/mine-ec3snymA/lib/python3.12/site-packages/ray/tune/trainable/trainable.py", line 331, in train
    raise skipped from exception_cause(skipped)
  File "/Users/mdenadai/.local/share/virtualenvs/mine-ec3snymA/lib/python3.12/site-packages/ray/air/_internal/util.py", line 98, in run
    self._ret = self._target(*self._args, **self._kwargs)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/mdenadai/.local/share/virtualenvs/mine-ec3snymA/lib/python3.12/site-packages/ray/tune/trainable/function_trainable.py", line 45, in <lambda>
    training_func=lambda: self._trainable_func(self.config),
                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/mdenadai/.local/share/virtualenvs/mine-ec3snymA/lib/python3.12/site-packages/ray/train/base_trainer.py", line 799, in _trainable_func
    super()._trainable_func(self._merged_config)
  File "/Users/mdenadai/.local/share/virtualenvs/mine-ec3snymA/lib/python3.12/site-packages/ray/tune/trainable/function_trainable.py", line 250, in _trainable_func
    output = fn()
             ^^^^
  File "/Users/mdenadai/.local/share/virtualenvs/mine-ec3snymA/lib/python3.12/site-packages/ray/train/base_trainer.py", line 107, in _train_coordinator_fn
    trainer.training_loop()
  File "/Users/mdenadai/.local/share/virtualenvs/mine-ec3snymA/lib/python3.12/site-packages/ray/train/data_parallel_trainer.py", line 471, in training_loop
    self._run_training(training_iterator)
  File "/Users/mdenadai/.local/share/virtualenvs/mine-ec3snymA/lib/python3.12/site-packages/ray/train/data_parallel_trainer.py", line 370, in _run_training
    for training_results in training_iterator:
  File "/Users/mdenadai/.local/share/virtualenvs/mine-ec3snymA/lib/python3.12/site-packages/ray/train/trainer.py", line 124, in __next__
    next_results = self._run_with_error_handling(self._fetch_next_result)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/mdenadai/.local/share/virtualenvs/mine-ec3snymA/lib/python3.12/site-packages/ray/train/trainer.py", line 89, in _run_with_error_handling
    return func()
           ^^^^^^
  File "/Users/mdenadai/.local/share/virtualenvs/mine-ec3snymA/lib/python3.12/site-packages/ray/train/trainer.py", line 156, in _fetch_next_result
    results = self._backend_executor.get_next_results()
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/mdenadai/.local/share/virtualenvs/mine-ec3snymA/lib/python3.12/site-packages/ray/train/_internal/backend_executor.py", line 600, in get_next_results
    results = self.get_with_failure_handling(futures)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/mdenadai/.local/share/virtualenvs/mine-ec3snymA/lib/python3.12/site-packages/ray/train/_internal/backend_executor.py", line 700, in get_with_failure_handling
    self._increment_failures()
  File "/Users/mdenadai/.local/share/virtualenvs/mine-ec3snymA/lib/python3.12/site-packages/ray/train/_internal/backend_executor.py", line 762, in _increment_failures
    raise failure
  File "/Users/mdenadai/.local/share/virtualenvs/mine-ec3snymA/lib/python3.12/site-packages/ray/train/_internal/utils.py", line 53, in check_for_failure
    ray.get(object_ref)
ray.exceptions.ActorDiedError: The actor died unexpectedly before finishing this task.
	class_name: RayTrainWorker
	actor_id: f2520e7442c3968c34db544c01000000
	pid: 43514
	namespace: 8789b499-0377-4077-b340-4afddfd0a648
	ip: 127.0.0.1
The actor is dead because its worker process has died. Worker exit type: SYSTEM_ERROR Worker exit detail: Worker unexpectedly exits with a connection error code 2. End of file. There are some potential root causes. (1) The process is killed by SIGKILL by OOM killer due to high memory usage. (2) ray stop --force is called. (3) The worker is crashed unexpectedly due to SIGSEGV or other unexpected errors.
Why and when doesn't Worker have a core_worker attribute?
Thank you
Versions / Dependencies

Python 3.12.4 (main, Jun 6 2024, 18:26:44) [Clang 15.0.0 (clang-1500.3.9.4)] on darwin
ray.__version__ '2.35.0'
Reproduction script

I cannot share the script at the moment, but it just uses a TorchTrainer with a DataLoader coming from PyTorch Geometric.
Issue Severity

Medium: It is a significant difficulty but I can work around it.
FYI, this happens with Ray 2.37 as well.