ray-project / ray

Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

[vLLM/Serve] Create polished vLLM example on a Serve deployment #36650

Open cadedaniel opened 1 year ago

cadedaniel commented 1 year ago

The example should show tensor parallelism. I am not sure if Serve + vLLM + tensor parallelism works at the moment because the Serve deployment will request N GPUs, then each vLLM worker will request a GPU, duplicating the GPU resource request.
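For reference, a minimal sketch of the kind of deployment this issue is about (not the polished example being requested; the class name mirrors the one discussed below, and it assumes a vLLM version that exposes worker_use_ray on AsyncEngineArgs). It shows where the duplicated GPU request comes from:

```python
from ray import serve
from vllm.engine.arg_utils import AsyncEngineArgs
from vllm.engine.async_llm_engine import AsyncLLMEngine


# Serve reserves 2 GPUs for the replica itself...
@serve.deployment(ray_actor_options={"num_gpus": 2})
class VLLMPredictDeployment:
    def __init__(self, model: str):
        engine_args = AsyncEngineArgs(
            model=model,
            tensor_parallel_size=2,  # ...and vLLM's RayWorker actors then request 2 more,
            worker_use_ray=True,     # which duplicates the GPU resource request described above.
        )
        self.engine = AsyncLLMEngine.from_engine_args(engine_args)
        # Request handling (engine.generate, streaming responses, etc.) omitted for brevity.
```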

Extremys commented 1 year ago

Hello, any progress on your side? Do you have any examples we could draw inspiration from? Thanks.

Oliver-ss commented 1 year ago

mark

Any ideas about this problem? I tried it, and Ray Serve with vLLM does double the GPU consumption.

shixianc commented 1 year ago

Is vLLM tensor parallelism (TP) support in progress?

rtwang1997 commented 1 year ago

For TP, I have a workaround that seems to work in my case...

I removed the num_gpus option for the deployment itself (VLLMPredictDeployment). That option seems to exist mostly for scheduling (to tell Serve to run the deployment on a node with a GPU available), since under the hood vLLM also creates RayWorker actors that actually use the GPUs. The AsyncLLMEngine itself doesn't need a GPU to run, since it just makes remote calls to those RayWorkers, which run the workload on the GPUs.

What you could do instead is add a custom resource to your GPU nodes and have the vLLM deployment require that custom resource, so the top-level deployment still gets scheduled onto a node with a GPU. Another alternative is to set num_cpus in such a way that the scheduler will always try to place the vLLM deployment onto a GPU node, without actually reserving a GPU for it. A sketch of the custom-resource variant is below.
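A minimal sketch of that workaround, under the same assumptions as the sketch above (the custom resource name gpu_node is purely illustrative, e.g. added on GPU nodes via `ray start --resources='{"gpu_node": 1}'`):

```python
from ray import serve
from vllm.engine.arg_utils import AsyncEngineArgs
from vllm.engine.async_llm_engine import AsyncLLMEngine


@serve.deployment(
    ray_actor_options={
        "num_gpus": 0,                    # don't reserve a GPU for the replica itself
        "resources": {"gpu_node": 0.01},  # but still land on a GPU node (illustrative resource name)
    }
)
class VLLMPredictDeployment:
    def __init__(self, model: str):
        engine_args = AsyncEngineArgs(
            model=model,
            tensor_parallel_size=2,  # GPUs are reserved only by vLLM's RayWorker actors
            worker_use_ray=True,
        )
        self.engine = AsyncLLMEngine.from_engine_args(engine_args)
```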

However, I am running into another issue: for a long-running vLLM deployment, the RayWorker actors that vLLM creates die at some point and are never recreated. Once the RayWorkers are all dead, VLLMPredictDeployment gives the following error:

```
return (yield from awaitable.__await__())
ray.exceptions.RayActorError: The actor died unexpectedly before finishing this task.
    class_name: RayWorker
    actor_id: 1963424394259fade0c44c5501000000
    pid: 714
    namespace: _ray_internal_dashboard
    ip: 100.64.144.72
The actor is dead because all references to the actor were removed

(...)
    raise AsyncEngineDeadError(
vllm.engine.async_llm_engine.AsyncEngineDeadError: Task finished unexpectedly. This should never happen! Please open an issue on Github. See stack trace above for the actual cause.
```