Open rkooo567 opened 1 week ago
Need to support ray.wait on CompiledDAGFuture and a list of CompiledDAGFuture as well.
@rkooo567, is this blocking you from adopting the new execute_async API in vLLM? I'm trying to assess the priority of this issue.
In vLLM we don't need it (we only await the first ref). It is mainly for the beta release (aiming for mid-October).
Also, one quick note: we don't support ray.wait on futures in Ray (we use asyncio.gather instead).
Another thing to mention is that ray.wait may be more useful when you have multiple DAGs and wait on them concurrently.
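To make the "multiple DAGs, wait concurrently" case concrete, here is a minimal sketch of ray.wait-like semantics built only on plain asyncio. It does not use Ray at all: run_dag and wait_first are hypothetical names, and the sleeps merely stand in for awaiting a compiled DAG's output.

```python
import asyncio


async def run_dag(name: str, delay: float) -> str:
    # Hypothetical stand-in for awaiting one compiled DAG's output.
    await asyncio.sleep(delay)
    return name


async def wait_first(num_returns: int = 1):
    # One task per (simulated) DAG execution.
    tasks = [
        asyncio.ensure_future(run_dag("dag_slow", 0.2)),
        asyncio.ensure_future(run_dag("dag_fast", 0.01)),
    ]
    ready, pending = [], set(tasks)
    # Keep waiting until at least num_returns results are ready.
    while len(ready) < num_returns and pending:
        done, pending = await asyncio.wait(
            pending, return_when=asyncio.FIRST_COMPLETED
        )
        ready.extend(done)
    ready_names = [t.result() for t in ready]
    # Cancel whatever has not finished so the demo exits promptly.
    for t in pending:
        t.cancel()
    await asyncio.gather(*pending, return_exceptions=True)
    return ready_names, len(pending)
```

Unlike asyncio.gather, which returns only once every awaitable has finished, this returns as soon as the fastest execution completes and reports how many are still pending.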
Got it, will prioritize this over the deserialization issue.
Hey @jeffreyjeffreywang are you working on this issue? Is it okay if I assign you?
@stephanie-wang Yep, I'm working on this. Feel free to assign this to me.
Documenting my progress as well:
1. The reason asyncio.gather(*refs) doesn't work is that there is a race condition in CompiledDAGFuture.__await__(). I was able to get around the issue by introducing a lock.
2. With some inputs to execute_async, everything works normally, including both sequential await and gather. However, when a numpy array is passed to execute_async, the sequential await case hangs at the second iteration. I think zero-copy deserialization comes into play and blocks us from reading from the output channel.
Okay awesome, thanks for the update!
For now I think it'd be good to push a small PR with the asyncio fix, and we can deal with the zero-copy issue separately. Also, you can check if the same issue appears with non-asyncio execution; if it does then it's probably a duplicate of another issue.
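For readers following along, the lock-based workaround mentioned above could look roughly like the sketch below. This is not Ray's implementation: OutputReader and SketchFuture are hypothetical names, and an asyncio.Queue stands in for the DAG's ordered output channel. The idea is that only one awaiter at a time drains the channel, and results read on behalf of other futures are cached by execution index.

```python
import asyncio


class OutputReader:
    """Hypothetical reader over an ordered output channel."""

    def __init__(self, channel: asyncio.Queue):
        self._channel = channel
        self._lock = asyncio.Lock()
        self._next_index = 0  # execution index of the next channel value
        self._cache = {}      # results read on behalf of other futures

    async def get(self, index: int):
        # Serialize channel reads so each value is matched to its
        # execution index by exactly one reader at a time.
        async with self._lock:
            while index not in self._cache:
                value = await self._channel.get()
                self._cache[self._next_index] = value
                self._next_index += 1
            return self._cache.pop(index)


class SketchFuture:
    """Hypothetical awaitable tied to one execution index."""

    def __init__(self, reader: OutputReader, index: int):
        self._reader = reader
        self._index = index

    def __await__(self):
        return self._reader.get(self._index).__await__()


async def demo():
    channel = asyncio.Queue()
    reader = OutputReader(channel)
    futures = [SketchFuture(reader, i) for i in range(3)]
    for i in range(3):
        channel.put_nowait(i * 10)
    # Gathering out of order still yields each execution's own result.
    return await asyncio.gather(*reversed(futures))
```

The sketch only covers the ordering problem; it says nothing about the numpy hang, which involves buffer lifetimes rather than read ordering.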
I'm worried about the hang issue which will always occur when passing in numpy arrays under my current implementation (with the lock). It also breaks existing tests. The hang issue is specifically for async execution. To be candid, my current implementation might not be exactly correct.
Ah gotcha, thanks! Sounds likely that there is a leaked reference in Python then. gc.get_referrers might be useful.
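As a rough illustration of that suggestion, gc.get_referrers can list the objects that still hold a reference to a value. In this sketch Buffer, leaked_cache, and find_holders are hypothetical names; Buffer merely stands in for a zero-copy deserialized value such as a numpy array.

```python
import gc


class Buffer:
    """Hypothetical stand-in for a zero-copy deserialized value."""


def find_holders(obj):
    # Keep only container referrers (dicts and lists), which are the
    # usual places where a stray reference ends up being stored.
    return [r for r in gc.get_referrers(obj) if isinstance(r, (dict, list))]


leaked_cache = {}  # hypothetical place where a reference leaks
buf = Buffer()
leaked_cache["last_result"] = buf

holders = find_holders(buf)
for holder in holders:
    # Printing only the type keeps the output short; the namespace dict
    # that binds the name `buf` will typically show up here as well.
    print(type(holder).__name__, len(holder))
```

Comparing the referrers before and after the second await may help narrow down which object keeps the buffer alive.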
Nice, thank you @stephanie-wang, will do a bit more debugging and update this thread.
What happened + What you expected to happen
When we have multiple output refs, we want to wait on the results as a batch. For the sync case, I verified that ray.get works. But for the async case, asyncio.gather doesn't seem to work.
Versions / Dependencies
master
Reproduction script
Issue Severity
None