ray-project / ray

Ray is a unified framework for scaling AI and Python applications. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

[Ray Core] Exit Actor within asyncio.Task might lead to SYSTEM_EXIT #32849

Open Mazyod opened 1 year ago

Mazyod commented 1 year ago

What happened + What you expected to happen

When ray.actor.exit_actor() is called inside an actor from within an asyncio.Task while a client is awaiting that task, Ray can report the worker exit as a system error (SYSTEM_ERROR in the output below) instead of an intended exit, and it dumps the whole stack trace. In our project, it also caused ray.init to be called again for some reason, which raised an error saying that Ray had already been initialized.

The issue is already described here, but the code is included below anyway. Note that awaiting the asyncio.Task after exit_actor() has been called seems to be fine; the problem may occur only when the task is awaited before the exit.
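For comparison, here is a plain-asyncio sketch with no Ray involved. It may help separate ordinary asyncio timing from Ray-specific behavior: in plain asyncio, an exception raised inside a task reaches whoever awaits it, whether the await starts before or after the task has ended. The names `StandInExit` and `await_task` are hypothetical, and the sketch does not model how ray.actor.exit_actor() actually terminates the worker process.

```python
import asyncio
from typing import List


class StandInExit(Exception):
    """Hypothetical stand-in for the actor exiting; not a Ray exception."""


async def run_task(fail_after: float) -> None:
    # Mirrors the shape of run_task in the report: sleep, then "exit".
    await asyncio.sleep(fail_after)
    raise StandInExit("background task ended")


async def await_task(fail_after: float, wait_before: float) -> str:
    # Start the background task, then begin awaiting it after `wait_before` seconds.
    task = asyncio.create_task(run_task(fail_after))
    await asyncio.sleep(wait_before)
    # Record whether the task had already ended when the await began.
    timing = "after" if task.done() else "before"
    try:
        await task
    except StandInExit:
        return f"awaited {timing} exit: exception delivered to the awaiter"
    return f"awaited {timing} exit: no exception"


async def main() -> List[str]:
    return [
        await await_task(fail_after=0.2, wait_before=0.0),  # await starts first
        await await_task(fail_after=0.0, wait_before=0.2),  # task ends first
    ]


results = asyncio.run(main())
for line in results:
    print(line)
```

Both orderings deliver the exception to the awaiter here, so any difference between them in the Ray transcript would come from how the worker process shuts down rather than from asyncio itself.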

Versions / Dependencies

Python 3.10.9, Ray 2.2.0, WSL 2 (Ubuntu 20.04)

Reproduction script


version 0.24 on top of Python 3.10.9 /home/[…]/.cache/pypoetry/virtualenvs/[…]-ecZlKSFR-py3.10/bin/python
>>> import ray
>>> import asyncio
>>>
>>> @ray.remote
... class Actor:
...     def __init__(self, fail_after: int):
...         self.task = asyncio.create_task(self.run_task(fail_after))
...
...     async def run_task(self, fail_after: int):
...         await asyncio.sleep(fail_after)
...         ray.actor.exit_actor()
...
...     async def wait_for_task(self):
...         await self.task
...
...
...
>>> async def test_flow(fail_after: int):
...     actor = Actor.remote(fail_after)
...     await asyncio.sleep(2)
...     await actor.wait_for_task.remote()
...     print("Done!")
...
...
>>> ray.init()
2023-02-23 12:10:53,936 INFO worker.py:1538 -- Started a local Ray instance.
RayContext(dashboard_url='', python_version='3.10.9', ray_version='2.2.0', ray_commit='b6af0887ee5f2e460202133791ad941a41f15beb', address_info={'node_ip_address': '172.20.112.29', 'raylet_ip_address': '172.20.112.29', 'redis_address': None, 'object_store_address': '/tmp/ray/session_2023-02-23_12-10-52_222376_9462/sockets/plasma_store', 'raylet_socket_name': '/tmp/ray/session_2023-02-23_12-10-52_222376_9462/sockets/raylet', 'webui_url': '', 'session_dir': '/tmp/ray/session_2023-02-23_12-10-52_222376_9462', 'metrics_export_port': 64329, 'gcs_address': '172.20.112.29:58132', 'address': '172.20.112.29:58132', 'dashboard_agent_listen_port': 52365, 'node_id': '54813d21ecddaa6145dd2be897c025a32da5ac1b4f3d987180631d33'})

>>> asyncio.run(test_flow(3))
Traceback (most recent call last):
File "<input>", line 1, in <module>
   asyncio.run(test_flow(3))
File "/usr/lib/python3.10/asyncio/runners.py", line 44, in run
   return loop.run_until_complete(main)
File "/usr/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete
   return future.result()
File "<input>", line 4, in test_flow
   await actor.wait_for_task.remote()
ray.exceptions.RayActorError: The actor died unexpectedly before finishing this task.
       class_name: Actor
       actor_id: 50b29daf8b55816bc011ba3401000000
       pid: 10853
       namespace: abcb4eed-9a6d-4647-9e81-1ecd3bd7c1e5
       ip: 172.20.112.29
The actor is dead because its worker process has died. Worker exit type: INTENDED_USER_EXIT Worker exit detail: Worker exits by an user request. exit_actor() is called.
>>> 
>>> 
>>> asyncio.run(test_flow(3))
Traceback (most recent call last):
File "<input>", line 1, in <module>
   asyncio.run(test_flow(3))
File "/usr/lib/python3.10/asyncio/runners.py", line 44, in run
   return loop.run_until_complete(main)
File "/usr/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete
   return future.result()
File "<input>", line 4, in test_flow
   await actor.wait_for_task.remote()
ray.exceptions.RayActorError: The actor died unexpectedly before finishing this task.
       class_name: Actor
       actor_id: e902d6517fd90dd991526d4201000000
       pid: 10915
       namespace: abcb4eed-9a6d-4647-9e81-1ecd3bd7c1e5
       ip: 172.20.112.29
The actor is dead because its worker process has died. Worker exit type: INTENDED_USER_EXIT Worker exit detail: Worker exits by an user request. exit_actor() is called.
>>> 
>>> 
>>> asyncio.run(test_flow(1))
Traceback (most recent call last):
File "<input>", line 1, in <module>
   asyncio.run(test_flow(1))
File "/usr/lib/python3.10/asyncio/runners.py", line 44, in run
   return loop.run_until_complete(main)
File "/usr/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete
   return future.result()
File "<input>", line 4, in test_flow
   await actor.wait_for_task.remote()
ray.exceptions.RayActorError: The actor died unexpectedly before finishing this task.
       class_name: Actor
       actor_id: 59fa8e5c98e6d5aa14c8efb501000000
       pid: 10977
       namespace: abcb4eed-9a6d-4647-9e81-1ecd3bd7c1e5
       ip: 172.20.112.29
The actor is dead because its worker process has died. Worker exit type: INTENDED_USER_EXIT Worker exit detail: Worker exits by a user request. exit_actor() is called.
>>> 
>>> 
>>> asyncio.run(test_flow(0))
2023-02-23 12:11:50,693 WARNING worker.py:1851 -- A worker died or was killed while executing a task by an unexpected system error. To troubleshoot the problem, check the logs for the dead worker. RayTask ID: ffffffffffffffffa7b983ceec144a3ecdae053301000000 Worker ID: 23931a8a70fbcbc6acc411033656779b51dc5eeb6e58a31e629eb065 Node ID: 54813d21ecddaa6145dd2be897c025a32da5ac1b4f3d987180631d33 Worker IP address: 172.20.112.29 Worker port: 39009 Worker PID: 11039 Worker exit type: SYSTEM_ERROR Worker exit detail: Worker unexpectedly exits with a connection error code 2. End of file. There are some potential root causes. (1) The process is killed by SIGKILL by OOM killer due to high memory usage. (2) ray stop --force is called. (3) The worker is crashed unexpectedly due to SIGSEGV or other unexpected errors.
--- Logging error ---
Exception in thread Exception in threading.excepthook:Exception ignored in thread started byException ignored in sys.unraisablehook
Traceback (most recent call last):
File "<input>", line 1, in <module>
   asyncio.run(test_flow(0))
File "/usr/lib/python3.10/asyncio/runners.py", line 44, in run
   return loop.run_until_complete(main)
File "/usr/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete
   return future.result()
File "<input>", line 4, in test_flow
   await actor.wait_for_task.remote()
ray.exceptions.RayActorError: The actor died unexpectedly before finishing this task.
       class_name: Actor
       actor_id: a7b983ceec144a3ecdae053301000000
       pid: 11039
       namespace: abcb4eed-9a6d-4647-9e81-1ecd3bd7c1e5
       ip: 172.20.112.29
The actor is dead because its worker process has died. Worker exit type: SYSTEM_ERROR Worker exit detail: Worker unexpectedly exits with a connection error code 2. End of file. There are some potential root causes. (1) The process is killed by SIGKILL by OOM killer due to high memory usage. (2) ray stop --force is called. (3) The worker is crashed unexpectedly due to SIGSEGV or other unexpected errors.
>>>
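To compare the runs above programmatically, the exit type can be pulled out of the error text. This is a minimal sketch that assumes the message keeps the "Worker exit type: ..." wording shown in this transcript (it may differ in other Ray versions). `worker_exit_type` is a hypothetical helper, not part of Ray, and it works on any string, for example the text of a caught exception.

```python
import re
from typing import Optional

# Matches the "Worker exit type: <TYPE>" fragment seen in the error text above.
EXIT_TYPE_PATTERN = re.compile(r"Worker exit type: ([A-Z_]+)")


def worker_exit_type(error_text: str) -> Optional[str]:
    """Return the exit type named in the error text, or None if absent."""
    match = EXIT_TYPE_PATTERN.search(error_text)
    return match.group(1) if match else None


# Sample messages copied from the transcript above.
intended = (
    "The actor is dead because its worker process has died. "
    "Worker exit type: INTENDED_USER_EXIT Worker exit detail: "
    "Worker exits by an user request. exit_actor() is called."
)
system = (
    "The actor is dead because its worker process has died. "
    "Worker exit type: SYSTEM_ERROR Worker exit detail: "
    "Worker unexpectedly exits with a connection error code 2. End of file."
)

print(worker_exit_type(intended))
print(worker_exit_type(system))
```

In the runs above, fail_after values of 3 and 1 report INTENDED_USER_EXIT, while fail_after=0 reports SYSTEM_ERROR.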

Issue Severity

Medium: It is a significant difficulty but I can work around it.

stale[bot] commented 1 year ago

Hi, I'm a bot from the Ray team :)

To help human contributors to focus on more relevant issues, I will automatically add the stale label to issues that have had no activity for more than 4 months.

If there is no further activity in the next 14 days, the issue will be closed!

You can always ask for help on our discussion forum or Ray's public Slack channel.