What happened + What you expected to happen
I am trying to run this RLlib example using a Ray Actor (the goal is to run the same example on multiple nodes), submitted via the Ray Job client. The job always fails with the following error:
worker died or was killed while executing a task by an unexpected system error. To troubleshoot the problem, check the logs for the dead worker.
RayTask ID: ffffffffffffffff6494322ded841c228d89f5100f000000
Worker ID: bf8491b7c315c8381e08bde3940062dc85157eecf08a5a994ddba369
Node ID: ee483896ab7698d3fd4dc194e8b14ff98c7afcc2fe0015091808dbad
Worker IP address: 10.124.1.4
Worker port: 10014
Worker PID: 1653
Worker exit type: SYSTEM_ERROR
Worker exit detail: Worker exits unexpectedly. Worker exits with an exit code 2. The worker may have exceeded K8s pod memory limits.
The worker logs do not indicate the cause of the issue. The same example runs fine when executed directly, without wrapping it in a Ray Actor (or Task).
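For context, the actor wrapper in the entrypoint script looks roughly like this (a minimal sketch; TrainerActor and the PPO CartPole body are placeholders, not the exact contents of actor_rllib_test.py):

import ray
from ray.rllib.algorithms.ppo import PPOConfig

@ray.remote
class TrainerActor:
    def train(self):
        # The example's training code runs inside the actor; the same code
        # completes without errors when called directly in the driver.
        config = PPOConfig().environment("CartPole-v1")
        algo = config.build()
        return algo.train()

if __name__ == "__main__":
    ray.init()
    actor = TrainerActor.remote()
    print(ray.get(actor.train.remote()))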
Versions / Dependencies
Ray = 2.33.0
Python = 3.10
Reproduction script
I submit the Ray job using JobSubmissionClient; RAY_ADDRESS is the address of my remote Ray cluster. The submission script is attached below.
from ray.job_submission import JobSubmissionClient

# RAY_ADDRESS points at the remote cluster's dashboard endpoint.
client = JobSubmissionClient(RAY_ADDRESS)
job_id = client.submit_job(
    # Entrypoint shell command to execute
    entrypoint="python actor_rllib_test.py",
    # Path to the local directory that contains the python script file.
    runtime_env={
        "working_dir": "./rllib_scripts",
        "pip": [
            "ray[rllib]",
            "tensorflow",
            "torch",
            "numpy",
        ],
    },
)
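Since the exit detail mentions K8s pod memory limits: for completeness, Ray lets you attach an explicit memory request to an actor so it is only scheduled on a node with that much memory available. A minimal sketch, not something from my actual script (the 4 GiB figure is arbitrary):

import ray

# Request 4 GiB for the actor; this is a scheduling constraint, so Ray will
# only place the actor on a node reporting at least that much memory.
@ray.remote(memory=4 * 1024 ** 3)
class MemoryBoundActor:
    def run(self):
        ...  # the example code would run here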
Issue Severity
High: It blocks me from completing my task.