Open 502122559 opened 1 year ago
@architkulkarni could you triage it? Thanks!
@502122559 how often does this issue occur? It's hard to debug without more information. Could you share the zipped logs from your session? By default these are at /tmp/ray/session_[...]/logs
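For anyone collecting the requested logs, here is a minimal sketch of zipping a session's log directory with the Python standard library. The function name `zip_ray_logs` and both of its parameters are hypothetical. The session directory is deliberately left as an argument, since the exact path depends on your session.

```python
import shutil
from pathlib import Path


def zip_ray_logs(log_dir: str, archive_base: str) -> str:
    """Zip everything under log_dir into archive_base + '.zip'.

    log_dir is the session's logs directory (passed in by the caller;
    this sketch does not try to guess the session name).
    Returns the path of the archive that was written.
    """
    logs = Path(log_dir)
    if not logs.is_dir():
        raise FileNotFoundError(f"log directory not found: {logs}")
    # make_archive appends the '.zip' suffix itself and returns the full path.
    return shutil.make_archive(archive_base, "zip", root_dir=str(logs))
```

Pick an `archive_base` outside the log directory so the archive is not written into the tree being zipped.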
Although we don't know the root cause, and I'm not sure it's the same issue, another user who was encountering segfaults in JobSupervisor reported that they no longer happen in Ray 2.4.0. @502122559 please let us know if you still encounter the issue in Ray 2.4.0.
Hi, I'm a bot from the Ray team :)
To help human contributors to focus on more relevant issues, I will automatically add the stale label to issues that have had no activity for more than 4 months.
If there is no further activity in the next 14 days, the issue will be closed!
You can always ask for help on our discussion forum or Ray's public Slack channel.
What happened + What you expected to happen
The Ray job failed to submit a task. The job returns the following error message:

Unexpected error occurred: The actor died unexpectedly before finishing this task.
    class_name: JobSupervisor
    actor_id: 74c977e8da43355ea9e5d33802000000
    pid: 544
    name: _ray_internal_job_actor_raysubmit_BAjkPGxZPf1PrcCH
    namespace: SUPERVISOR_ACTOR_RAY_NAMESPACE
    ip: 10.244.20.40
The actor is dead because its worker process has died. Worker exit type: SYSTEM_ERROR Worker exit detail: Worker unexpectedly exits with a connection error code 2. End of file. There are some potential root causes. (1) The process is killed by SIGKILL by OOM killer due to high memory usage. (2) ray stop --force is called. (3) The worker is crashed unexpectedly due to SIGSEGV or other unexpected errors.
I see the following error log on the dashboard:
:job_id:02000000
:actor_name:JobSupervisor
SIGSEGV received at time=1680758266 on cpu 33
PC: @ 0x7ff1ed49b1a8 (unknown) _PyTrash_thread_destroy_chain
    @ 0x7ff1ed0f8090 1029688304 (unknown)
    @ 0x100000001 (unknown) (unknown)
[2023-04-06 13:17:46,772 E 686 825] logging.cc:361: SIGSEGV received at time=1680758266 on cpu 33
[2023-04-06 13:17:46,772 E 686 825] logging.cc:361: PC: @ 0x7ff1ed49b1a8 (unknown) _PyTrash_thread_destroy_chain
[2023-04-06 13:17:46,774 E 686 825] logging.cc:361: @ 0x7ff1ed0f8090 1029688304 (unknown)
[2023-04-06 13:17:46,777 E 686 825] logging.cc:361: @ 0x100000001 (unknown) (unknown)
Fatal Python error: Segmentation fault

Stack (most recent call first):
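To answer the "how often" question, it can help to count crash markers across the collected logs. The following is a rough sketch, assuming the crash lines keep the `SIGSEGV received at time=... on cpu ...` shape seen in the log above. The helper name `find_sigsegv_events` is hypothetical and is not part of Ray.

```python
import re
from typing import List, Tuple

# Matches the crash marker as it appears in the log above.
_SIGSEGV_RE = re.compile(r"SIGSEGV received at time=(\d+) on cpu (\d+)")


def find_sigsegv_events(log_text: str) -> List[Tuple[int, int]]:
    """Return the distinct (unix_time, cpu) pairs of SIGSEGV markers in log_text.

    The same crash is usually logged more than once (raw line plus the
    logging.cc line), so duplicates are collapsed before sorting.
    """
    events = {(int(t), int(cpu)) for t, cpu in _SIGSEGV_RE.findall(log_text)}
    return sorted(events)
```

Running this over each worker log file and summing the lengths of the returned lists gives a crude crash count per session.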
Versions / Dependencies
ray==2.3.0 python==3.8.13
Reproduction script
Issue Severity
High: It blocks me from completing my task.