ray-project / ray

Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

[core] ray job submit job failed #40696

Closed. kadisi closed this issue 1 month ago

kadisi commented 1 year ago

What happened + What you expected to happen

When I run RAY_ADDRESS='http://127.0.0.1:8265' ray job submit --working-dir . -- python demo.py, it reports that the job failed:

Job submission server address: http://127.0.0.1:8265
2023-10-26 12:03:45,399 INFO dashboard_sdk.py:338 -- Uploading package gcs://_ray_pkg_221be757bb71f486.zip.
2023-10-26 12:03:45,399 INFO packaging.py:518 -- Creating a file package for local directory '.'.

-------------------------------------------------------
Job 'raysubmit_zU9BFLQ4RxXdxY6j' submitted successfully
-------------------------------------------------------

Next steps
  Query the logs of the job:
    ray job logs raysubmit_zU9BFLQ4RxXdxY6j
  Query the status of the job:
    ray job status raysubmit_zU9BFLQ4RxXdxY6j
  Request the job to be stopped:
    ray job stop raysubmit_zU9BFLQ4RxXdxY6j

Tailing logs until the job exits (disable with --no-wait):

---------------------------------------
Job 'raysubmit_zU9BFLQ4RxXdxY6j' failed
---------------------------------------

Status message: Unexpected error occurred: The actor died unexpectedly before finishing this task.

Versions / Dependencies

ray: 2.7.1
ubuntu: 22.04 (kernel 5.15.0-83-generic)
python: 3.10.12
Driver Version: 545.23.06
CUDA Version: 12.3
aiohttp: 3.8.6

Reproduction script

I have 3 Ubuntu VMs:

head node (172.20.159.147): ray start --head --num-cpus=0 --num-gpus=0 --node-ip-address=0.0.0.0 --dashboard-host=0.0.0.0 -v

worker node 1 (172.20.159.145): ray start --address='172.20.159.147:6379'

worker node 2 (172.20.159.146): ray start --address='172.20.159.147:6379'

ray status
======== Autoscaler status: 2023-10-26 12:11:27.626579 ========
Node status
---------------------------------------------------------------
Healthy:
 1 node_b6d1faa241d3a779ced0b5b88170fb9f059d36955af8f268ea765047
 1 node_8b0ef2f35483b32388a27dfc0d1350be1b9c58411eb7a4e4fa2de2fa
 1 node_6531a98abac1d92e6ada0c53a63e4ec0ab420a37925a7cfc69e098c6
Pending:
 (no pending nodes)
Recent failures:
 (no failures)

Resources
---------------------------------------------------------------
Usage:
 0.0/16.0 CPU
 0.0/2.0 GPU
 0B/58.76GiB memory
 0B/26.44GiB object_store_memory

Demands:
 (no resource demands)

On the head node, I run this command:

RAY_ADDRESS='http://127.0.0.1:8265' ray job submit --working-dir . -- python demo.py

Job submission server address: http://127.0.0.1:8265
2023-10-26 12:12:29,172 INFO dashboard_sdk.py:385 -- Package gcs://_ray_pkg_221be757bb71f486.zip already exists, skipping upload.

-------------------------------------------------------
Job 'raysubmit_7VbTzaeLkdSHugG3' submitted successfully
-------------------------------------------------------

Next steps
  Query the logs of the job:
    ray job logs raysubmit_7VbTzaeLkdSHugG3
  Query the status of the job:
    ray job status raysubmit_7VbTzaeLkdSHugG3
  Request the job to be stopped:
    ray job stop raysubmit_7VbTzaeLkdSHugG3

Tailing logs until the job exits (disable with --no-wait):

---------------------------------------
Job 'raysubmit_7VbTzaeLkdSHugG3' failed
---------------------------------------

Status message: Unexpected error occurred: The actor died unexpectedly before finishing this task.
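The failure status above can also be polled programmatically through the same dashboard REST endpoint that the CLI uses; a minimal sketch using only the standard library (the /api/jobs/<submission_id> path matches the requests in dashboard.log, but the "status" field name in the JSON response is an assumption):

```python
import json
from urllib.request import urlopen

DASHBOARD = "http://127.0.0.1:8265"

def job_status_url(job_id: str) -> str:
    # Same endpoint shape as the GET /api/jobs/<submission_id> requests
    # logged by the dashboard web server.
    return f"{DASHBOARD}/api/jobs/{job_id}"

def fetch_status(job_id: str) -> str:
    # Assumes the response JSON carries a top-level "status" field.
    with urlopen(job_status_url(job_id), timeout=5) as resp:
        return json.load(resp)["status"]
```
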

The contents of demo.py are:

import time

import ray

@ray.remote
def hello_world():
    return "hello world"

# Automatically connect to the running Ray cluster.
#ray.init(address="auto")
ray.init()
print("############")
print(ray.get(hello_world.remote()))
#while True:
#    print(ray.get(hello_world.remote()))
#    time.sleep(1)
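One way to check whether the actor death is transient is to retry the failing call a few times. A library-agnostic sketch of that pattern (in a real Ray script the retriable exception would be ray.exceptions.RayActorError, and Ray's built-in max_retries option may be preferable; helper name is illustrative):

```python
def call_with_retry(fn, attempts=3, retriable=(RuntimeError,)):
    # Call fn() up to `attempts` times, re-raising the last error if
    # every attempt fails. Prints each failure for diagnosis.
    last_err = None
    for i in range(attempts):
        try:
            return fn()
        except retriable as e:
            last_err = e
            print(f"attempt {i + 1} failed: {e}")
    raise last_err
```
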

There are some errors in /tmp/ray/session_latest/logs/dashboard.log:


2023-10-26 12:14:09,301 INFO web_log.py:206 -- 127.0.0.1 [26/Oct/2023:04:14:09 +0000] 'GET /api/version HTTP/1.1' 200 256 bytes 498 us '-' 'python-requests/2.31.0'
2023-10-26 12:14:09,304 INFO web_log.py:206 -- 127.0.0.1 [26/Oct/2023:04:14:09 +0000] 'GET /api/version HTTP/1.1' 200 256 bytes 338 us '-' 'python-requests/2.31.0'
2023-10-26 12:14:09,308 INFO web_log.py:206 -- 127.0.0.1 [26/Oct/2023:04:14:09 +0000] 'GET /api/packages/gcs/_ray_pkg_221be757bb71f486.zip HTTP/1.1' 200 150 bytes 1174 us '-' 'python-requests/2.31.0'
2023-10-26 12:14:09,319 INFO web_log.py:206 -- 127.0.0.1 [26/Oct/2023:04:14:09 +0000] 'POST /api/jobs/ HTTP/1.1' 200 245 bytes 8447 us '-' 'python-requests/2.31.0'
2023-10-26 12:14:09,322 INFO web_log.py:206 -- 127.0.0.1 [26/Oct/2023:04:14:09 +0000] 'GET /api/version HTTP/1.1' 200 256 bytes 346 us '-' 'python-requests/2.31.0'
2023-10-26 12:14:12,342 ERROR web_protocol.py:403 -- Error handling request
Traceback (most recent call last):
  File "/root/.virtualenvs/tensorflow/lib/python3.10/site-packages/aiohttp/web_protocol.py", line 433, in _handle_request
    resp = await request_handler(request)
  File "/root/.virtualenvs/tensorflow/lib/python3.10/site-packages/aiohttp/web_app.py", line 504, in _handle
    resp = await handler(request)
  File "/root/.virtualenvs/tensorflow/lib/python3.10/site-packages/aiohttp/web_middlewares.py", line 117, in impl
    return await handler(request)
  File "/root/.virtualenvs/tensorflow/lib/python3.10/site-packages/ray/dashboard/http_server_head.py", line 137, in metrics_middleware
    status_tag = f"{floor(response.status / 100)}xx"
AttributeError: 'NoneType' object has no attribute 'status'
2023-10-26 12:14:12,350 INFO web_log.py:206 -- 127.0.0.1 [26/Oct/2023:04:14:12 +0000] 'GET /api/jobs/raysubmit_mbXhqy9kcSCXS1M2 HTTP/1.1' 200 636 bytes 5135 us '-' 'python-requests/2.31.0'
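The AttributeError in the traceback comes from the middleware assuming the handler always returns a response object. A defensive sketch of the guard (hypothetical fix for illustration, not the actual Ray patch):

```python
from math import floor

def status_tag(response):
    # metrics_middleware builds f"{floor(response.status / 100)}xx" and
    # crashes when `response` is None. Guard against a missing response
    # and treat it as a server-side error.
    if response is None or getattr(response, "status", None) is None:
        return "5xx"
    return f"{floor(response.status / 100)}xx"
```
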

But when I run demo.py directly on the head node, it works fine:

python demo.py
2023-10-26 12:18:24,889 INFO worker.py:1458 -- Connecting to existing Ray cluster at address: 0.0.0.0:6379...
2023-10-26 12:18:24,901 INFO worker.py:1633 -- Connected to Ray cluster. View the dashboard at 0.0.0.0:8265
############

Issue Severity

None

anyscalesam commented 1 year ago

@architkulkarni can you please triage?

architkulkarni commented 1 year ago

@kadisi Does this happen every time, or only sometimes? I'm not sure if it's possible to get more information about why the actor died, since the error "The actor died unexpectedly before finishing this task." is very general (@jjyao do you know?), but if you could zip up and attach the logs, that would potentially be helpful.

The AttributeError: 'NoneType' object has no attribute 'status' is a separate issue which I believe shouldn't actually cause the job to fail. The error message is the same as https://github.com/ray-project/ray/issues/29632 so the root cause might be similar.

kadisi commented 1 year ago

Does this happen every time, or only sometimes?

It happens every time.

architkulkarni commented 1 year ago

@kadisi Thanks for the info, it would be helpful if you could share a zip of the logs on each node. By default they're in /tmp/ray/session_latest/logs on each node.
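Bundling each node's logs for attachment can be scripted; a small sketch using the standard library (helper name and defaults are illustrative, run once per node):

```python
import shutil

def zip_ray_logs(session_dir="/tmp/ray/session_latest", out="ray_logs"):
    # Archives <session_dir>/logs into <out>.zip and returns the
    # path of the created archive.
    return shutil.make_archive(out, "zip", f"{session_dir}/logs")
```
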