ray-project / ray

Ray is a unified framework for scaling AI and Python applications. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

[RAY JOB] [Clusters] ray job failed to submit a task #34127

Open 502122559 opened 1 year ago

502122559 commented 1 year ago

What happened + What you expected to happen

Ray job submission failed to run a task. The job returns the following error message:

Unexpected error occurred: The actor died unexpectedly before finishing this task.
    class_name: JobSupervisor
    actor_id: 74c977e8da43355ea9e5d33802000000
    pid: 544
    name: _ray_internal_job_actor_raysubmit_BAjkPGxZPf1PrcCH
    namespace: SUPERVISOR_ACTOR_RAY_NAMESPACE
    ip: 10.244.20.40
The actor is dead because its worker process has died. Worker exit type: SYSTEM_ERROR Worker exit detail: Worker unexpectedly exits with a connection error code 2. End of file. There are some potential root causes. (1) The process is killed by SIGKILL by OOM killer due to high memory usage. (2) ray stop --force is called. (3) The worker is crashed unexpectedly due to SIGSEGV or other unexpected errors.

I see the following error log on the dashboard:

:job_id:02000000
:actor_name:JobSupervisor
SIGSEGV received at time=1680758266 on cpu 33
PC: @ 0x7ff1ed49b1a8 (unknown) _PyTrash_thread_destroy_chain
    @ 0x7ff1ed0f8090 1029688304 (unknown)
    @ 0x100000001 (unknown) (unknown)
[2023-04-06 13:17:46,772 E 686 825] logging.cc:361: SIGSEGV received at time=1680758266 on cpu 33
[2023-04-06 13:17:46,772 E 686 825] logging.cc:361: PC: @ 0x7ff1ed49b1a8 (unknown) _PyTrash_thread_destroy_chain
[2023-04-06 13:17:46,774 E 686 825] logging.cc:361: @ 0x7ff1ed0f8090 1029688304 (unknown)
[2023-04-06 13:17:46,777 E 686 825] logging.cc:361: @ 0x100000001 (unknown) (unknown)
Fatal Python error: Segmentation fault

Stack (most recent call first):
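A minimal sketch for pulling a bit more detail about the failed job than the summary error above. The dashboard port (8265, Ray's default) is an assumption; the IP address and submission ID are taken from the error message.

from ray.job_submission import JobSubmissionClient

# Assumed dashboard address (head node IP from the error, default port 8265)
# and the submission ID embedded in the JobSupervisor actor name above.
client = JobSubmissionClient("http://10.244.20.40:8265")
info = client.get_job_info("raysubmit_BAjkPGxZPf1PrcCH")

print(info.status)   # terminal status, e.g. FAILED
print(info.message)  # the actor-death detail quoted above
print(client.get_job_logs("raysubmit_BAjkPGxZPf1PrcCH"))  # driver stdout/stderr, if any was captured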

Versions / Dependencies

ray==2.3.0 python==3.8.13

Reproduction script

from ray.job_submission import JobSubmissionClient, JobStatus
import sys
import time
import argparse

def wait_until_status(job_id, status_to_wait_for, timeout_seconds=300):
    # Poll the job status until it reaches one of the given terminal states
    # (or until the timeout expires, in which case the function simply returns).
    start = time.time()
    while time.time() - start <= timeout_seconds:
        status = client.get_job_status(job_id)
        print(f"status: {status}")
        if status in status_to_wait_for:
            if status == JobStatus.SUCCEEDED:
                print("Successfully connected to the cluster.")
                sys.exit(0)
            else:
                print(client.get_job_logs(job_id))
                print("Failed to connect to the cluster, please redeploy.")
                sys.exit(1)
        time.sleep(1)

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument("--host", type=str, required=True)
    parser.add_argument("--dashboard_port", type=str, required=True)
    parser.add_argument("--client_port", type=str, required=True)
    arg_parser = parser.parse_args()
    # Submit the job through the cluster's dashboard (job submission) endpoint.
    client = JobSubmissionClient(f"http://{arg_parser.host}:{arg_parser.dashboard_port}")
    job_id = client.submit_job(
        entrypoint="python entrypoint.py",
        runtime_env={"working_dir": "./cluster_connect_test",
                     "env_vars": {"HOST": arg_parser.host, "PORT": arg_parser.client_port}}
    )
    wait_until_status(job_id, {JobStatus.SUCCEEDED, JobStatus.STOPPED, JobStatus.FAILED})
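
For reference, the script above might be invoked like this (the script filename and port values are placeholders; 8265 is Ray's default dashboard port and 10001 the default Ray Client port):

python submit_job_test.py --host 10.244.20.40 --dashboard_port 8265 --client_port 10001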

Issue Severity

High: It blocks me from completing my task.

kevin85421 commented 1 year ago

@architkulkarni could you triage it? Thanks!

architkulkarni commented 1 year ago

@502122559 how often does this issue occur? It's hard to debug without more information; could you share the zipped logs from your session? By default these are at /tmp/ray/session_[...]/logs.
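
In case it's useful, a minimal sketch for collecting those logs on the head node, assuming Ray's default temp directory (/tmp/ray) and the session_latest symlink it maintains:

import shutil

# Zip the log directory of the most recent Ray session into ray_session_logs.zip.
# /tmp/ray/session_latest points at the current session directory by default.
shutil.make_archive("ray_session_logs", "zip", "/tmp/ray/session_latest/logs")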

architkulkarni commented 1 year ago

We don't know the root cause yet, and I'm not sure it's the same issue, but another user who hit segfaults in JobSupervisor reported that the problem no longer happens in Ray 2.4.0. @502122559 please let us know if you still encounter the issue in Ray 2.4.0.
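
If it helps, a quick sanity check after upgrading (e.g. with pip install -U "ray[default]==2.4.0") to confirm the environment actually picked up the new version:

import ray

print(ray.__version__)
# Expect at least 2.4.0 on the client and on every cluster node.
assert tuple(int(x) for x in ray.__version__.split(".")[:2]) >= (2, 4)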

stale[bot] commented 8 months ago

Hi, I'm a bot from the Ray team :)

To help human contributors focus on more relevant issues, I will automatically add the stale label to issues that have had no activity for more than 4 months.

If there is no further activity within the next 14 days, the issue will be closed!

You can always ask for help on our discussion forum or Ray's public Slack channel.