ray-project / ray-llm

RayLLM - LLMs on Ray
https://aviary.anyscale.com

VLLM Ray Workers are being killed by GCS #88

Closed rtwang1997 closed 7 months ago

rtwang1997 commented 10 months ago

Hi, we are running into a weird issue where the Ray workers created by vLLM are being killed even though the deployment itself stays alive. As a result, making a request to a model deployment fails with the error below, because the workers are already dead but AviaryLLMEngine still holds references to them through its workers attribute:

return (yield from awaitable.__await__())
ray.exceptions.RayActorError: The actor died unexpectedly before finishing this task.
    class_name: RayWorker
    actor_id: 1963424394259fade0c44c5501000000
    pid: 714
    namespace: _ray_internal_dashboard
    ip: 100.64.144.72
The actor is dead because all references to the actor were removed

(...)

    raise AsyncEngineDeadError(
vllm.engine.async_llm_engine.AsyncEngineDeadError: Task finished unexpectedly. This should never happen! Please open an issue on Github. See stack trace above for the actual cause.
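For reference, the failure mode on its own is easy to demonstrate. The sketch below is separate from our deployment (the Worker class is made up); it just shows that calling through a handle whose actor has already been killed surfaces as RayActorError:

import ray
from ray.exceptions import RayActorError

@ray.remote
class Worker:
    def ping(self):
        return "pong"

ray.init()
worker = Worker.remote()
ray.kill(worker)  # the actor process is gone, but the Python handle still exists
try:
    ray.get(worker.ping.remote())  # calling through the stale handle
except RayActorError as err:
    print("actor is dead:", err)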

In the logs, it appears that the Ray Workers created by VLLM are being killed by GCS with the following message:

[2023-10-27 15:56:14,963 I 2203 2281] core_worker.cc:3737: Force kill actor request has received. exiting immediately... The actor is dead because all references to the actor were removed.
[2023-10-27 15:56:14,963 W 2203 2281] core_worker.cc:857: Force exit the process.  Details: Worker exits because the actor is killed. The actor is dead because all references to the actor were removed.
[2023-10-27 15:56:14,965 I 2203 2281] core_worker.cc:759: Try killing all child processes of this worker as it exits. Child process pids:
[2023-10-27 15:56:14,965 I 2203 2281] core_worker.cc:718: Disconnecting to the raylet.
[2023-10-27 15:56:14,965 I 2203 2281] raylet_client.cc:163: RayletClient::Disconnect, exit_type=INTENDED_SYSTEM_EXIT, exit_detail=Worker exits because the actor is killed. The actor is dead because all references to the actor were removed., has creation_task_exception_pb_bytes=0

More specifically, the part of the message that reads "The actor is dead because all references to the actor were removed" appears to indicate that GCS is killing the Ray workers because it believes no references to them are left anywhere.

However, I don't understand how this could be the case, since the top-level Ray Serve deployment is still alive and holds a reference to the VLLMEngine. The VLLMEngine holds the AviaryAsyncLLMEngine as self.engine, and the AviaryAsyncLLMEngine holds the AviaryLLMEngine, which in turn references the workers as self.workers.

If the top-level deployment hasn't died, I don't see how the reference count on the workers could have been decremented, or why GCS would think these actors are out of scope.
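For context, the lifetime model in question can be sketched as follows (the names below are illustrative, and a detached actor is only a possible workaround, not necessarily the right fix): by default an actor is reference-counted and GCS kills it once every handle to it goes out of scope, whereas a named detached actor lives until it is explicitly killed.

import ray

@ray.remote
class Actor:
    def __call__(self):
        return "hello world"

# Default lifetime: reference-counted, killed by GCS once no handle remains.
plain = Actor.remote()

# Detached lifetime: survives regardless of handle reference counts,
# until ray.kill() or cluster shutdown; can be re-fetched by name.
pinned = Actor.options(name="pinned_actor", lifetime="detached").remote()
same = ray.get_actor("pinned_actor")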

shrekris-anyscale commented 9 months ago

Could you provide a minimal repro for your setup? Do the vLLM workers die randomly, or is there a pattern?

rtwang1997 commented 9 months ago

Hi,

We are finding that the vLLM workers die after running for around 1 hour (give or take), consistently.

We tried this with a very simple Ray service, where a deployment creates an actor in its __init__ function, and we see the same behaviour: the actor dies after ~1 hour because all references to it are removed. Here's the sample code:

import ray
from ray import serve
from fastapi import FastAPI

app = FastAPI()

@ray.remote
class Actor:
    def __call__(self):
        return "hello world"

@serve.deployment
@serve.ingress(app)
class APIIngress:
    def __init__(self):
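        # Create a plain (reference-counted) actor and keep its handle on this replica.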
        self.actor = Actor.remote()

    @app.get("/health")
    async def healthcheck(self):
        """
        checks if server is up.
        """
        return {"status": "server up"}

    @app.get("/test")
    async def test(self):
        return await self.actor.remote()

SERVE_APP = APIIngress.bind()
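To run the repro locally, something like the following works (the filename repro.py and the sleep duration are arbitrary; in our actual setup this runs as a Ray service):

import time

from ray import serve

from repro import SERVE_APP  # the APIIngress.bind() application defined above

serve.run(SERVE_APP)     # deploys APIIngress, whose __init__ creates the Actor
time.sleep(2 * 60 * 60)  # keep the driver alive long enough to observe the ~1 hour kill
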
shrekris-anyscale commented 9 months ago

Thanks for the repro! What Ray version are you using?

rtwang1997 commented 9 months ago

Hi,

We are using Ray version 2.7.1

shrekris-anyscale commented 8 months ago

Are there any updates here @rtwang1997? I synced with one of your coworkers on Slack and proposed some approaches, but I'm not sure how it went.

shrekris-anyscale commented 7 months ago

@rtwang1997 I'll close this issue for now. Feel free to reopen if you're still running into the problem.