Open stbnps opened 5 months ago
@stbnps are you able to repro on a newer ray version?
With Ray 2.24.0 and Python 3.11.9 I cannot even call done, pending = await asyncio.wait(self.worker_tasks, timeout=0.001); it throws the following error:
Traceback (most recent call last):
File "/media/stbn/2TB/git_chep/ray-migration/tmp.py", line 50, in <module>
asyncio.run(main())
File "/usr/lib/python3.11/asyncio/runners.py", line 190, in run
return runner.run(main)
^^^^^^^^^^^^^^^^
File "/usr/lib/python3.11/asyncio/runners.py", line 118, in run
return self._loop.run_until_complete(task)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.11/asyncio/base_events.py", line 654, in run_until_complete
return future.result()
^^^^^^^^^^^^^^^
File "/media/stbn/2TB/git_chep/ray-migration/tmp.py", line 48, in main
await supervisor.run.remote()
ray.exceptions.RayTaskError(AttributeError): ray::Supervisor.run() (pid=38702, ip=192.168.68.32, actor_id=08a78a513a863842cede879e01000000, repr=<tmp.Supervisor object at 0x7fed29e92710>)
File "/usr/lib/python3.11/concurrent/futures/_base.py", line 456, in result
return self.__get_result()
^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.11/concurrent/futures/_base.py", line 401, in __get_result
raise self._exception
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/media/stbn/2TB/git_chep/ray-migration/tmp.py", line 41, in run
await self.check_workers()
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/media/stbn/2TB/git_chep/ray-migration/tmp.py", line 27, in check_workers
done, pending = await asyncio.wait(self.worker_tasks, timeout=0.001)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.11/asyncio/tasks.py", line 428, in wait
return await _wait(fs, timeout, return_when, loop)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.11/asyncio/tasks.py", line 532, in _wait
f.add_done_callback(_on_completion)
^^^^^^^^^^^^^^^^^^^
AttributeError: 'ray._raylet.ObjectRef' object has no attribute 'add_done_callback'
However, in the docs I see the following example, which indicates that I'd be able to call asyncio.wait on an ObjectRef:
import ray
import asyncio

@ray.remote
def some_task():
    return 1

async def await_obj_ref():
    await some_task.remote()
    await asyncio.wait([some_task.remote()])

asyncio.run(await_obj_ref())
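A possible explanation for the AttributeError, sketched with a hypothetical FakeRef class rather than a real Ray ObjectRef: since Python 3.11, asyncio.wait no longer passes its arguments through ensure_future, so an object that only implements __await__ lacks the add_done_callback method that asyncio.wait calls. Wrapping each awaitable with asyncio.ensure_future first avoids this for the stand-in; whether ObjectRef in Ray 2.24 behaves the same way is an assumption not verified here.

```python
import asyncio


class FakeRef:
    """Hypothetical stand-in: awaitable, but not an asyncio Future."""

    def __init__(self, value):
        self.value = value

    def __await__(self):
        # Yield control to the event loop once, then produce the value.
        yield from asyncio.sleep(0).__await__()
        return self.value


async def wait_bare():
    # Python 3.11+: raises AttributeError (no add_done_callback).
    # Python 3.10 and earlier: asyncio.wait wraps the awaitable itself.
    try:
        done, _ = await asyncio.wait([FakeRef(1)])
        return [t.result() for t in done]
    except AttributeError as exc:
        return exc


async def wait_wrapped():
    # ensure_future turns each awaitable into a Task, which asyncio.wait
    # accepts on every supported Python version.
    tasks = [asyncio.ensure_future(FakeRef(i)) for i in range(3)]
    done, pending = await asyncio.wait(tasks, timeout=1.0)
    return sorted(t.result() for t in done), len(pending)


bare_outcome = asyncio.run(wait_bare())
wrapped_outcome = asyncio.run(wait_wrapped())
print(repr(bare_outcome))
print(wrapped_outcome)
```

If this reading is right, the crash on 3.11 would come from the asyncio change and would be separate from the memory growth seen on 3.10.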
When testing Ray 2.24 with Python 3.11 using asyncio.wrap_future(worker.run.remote().future()), the code seems to work and I don't see any memory increase, like in my previous comment.
Instead, when using Python 3.10 and Ray 2.24, worker_task = worker.run.remote() does not crash, but we still see the memory increase. We don't see the memory increasing when using asyncio.wrap_future, though.
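A minimal sketch of the wrap_future polling pattern described above, using a ThreadPoolExecutor as a stand-in for Ray. Here pool.submit(...) plays the role of worker.run.remote().future(), since both return a concurrent.futures.Future; slow_square and the pool size are hypothetical, and the sketch says nothing about Ray's memory behavior.

```python
import asyncio
import time
from concurrent.futures import ThreadPoolExecutor


def slow_square(x):
    # Hypothetical workload standing in for a remote actor method.
    time.sleep(0.05)
    return x * x


async def main():
    with ThreadPoolExecutor(max_workers=4) as pool:
        # wrap_future converts each concurrent.futures.Future into an
        # asyncio.Future bound to the running loop.
        pending = {
            asyncio.wrap_future(pool.submit(slow_square, i)) for i in range(4)
        }
        results = []
        while pending:
            # Poll with a short timeout, as in the check_workers call.
            done, pending = await asyncio.wait(pending, timeout=0.001)
            results.extend(f.result() for f in done)
        return sorted(results)


results = asyncio.run(main())
print(results)
```

The wrapped objects are real asyncio futures, so asyncio.wait can attach its completion callbacks to them directly.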
Are we now seeing 2 different issues? (The memory increase and the add_done_callback error)
@dentiny can you take a look at this one, if your time permits?
I could try to repro :)
For faster debugging and help, please join ray.slack.com and ask in #ray-contributors; message me on Slack too (@anyscalesam) if you have questions!
What happened + What you expected to happen
When running the script below, the memory utilization of the Supervisor actor increases slowly over time. This usually leads to OOM for more complex applications.
This memory increase does not happen when we use worker_task = asyncio.wrap_future(worker.run.remote().future()) instead of worker_task = worker.run.remote()
Versions / Dependencies
Python 3.8.19, Ray 2.10.0, NumPy 1.24.4
Reproduction script
Issue Severity
High: It blocks me from completing my task.