cadedaniel opened this issue 1 year ago · Status: Open
Hello, any progress on your side? Do you have any examples we could draw inspiration from? Thanks.
mark
Any ideas about this problem? I tried it, and Ray Serve with vLLM does double the GPU consumption.
Is vLLM TP support in progress?
For TP, I have a workaround that seems to work in my case: I remove the `num_gpus` option for the deployment itself (`VLLMPredictDeployment`). This option mostly affects scheduling (it tells Ray to run the deployment on a node with a GPU available); under the hood, vLLM also creates `RayWorker` actors that actually use the GPUs. The `AsyncLLMEngine` itself doesn't need a GPU to run, since it calls `ray.remote` on those `RayWorker`s, which run the workload on GPU.

What you could do instead is add a custom resource to GPU nodes and have the vLLM deployment require that custom resource, so the top-level deployment still lands on a node with a GPU. Another alternative is to set `num_cpus` in such a way that the scheduler will always try to place the vLLM deployment onto a GPU node, without actually reserving a GPU for itself. A sketch of the custom-resource variant follows below.
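For reference, here is a minimal sketch of the custom-resource variant. It is not taken verbatim from this thread: the resource name `worker_gpu_node`, the model, and the engine arguments are illustrative assumptions, and the exact `AsyncEngineArgs` fields may differ across vLLM versions.

```python
# Sketch of the workaround: no num_gpus on the Serve deployment; instead,
# pin it to GPU nodes via a custom resource. Assumes GPU nodes were started
# with e.g. `ray start --resources '{"worker_gpu_node": 1}'` (the resource
# name is an arbitrary choice for this example).
from ray import serve
from vllm.engine.arg_utils import AsyncEngineArgs
from vllm.engine.async_llm_engine import AsyncLLMEngine


@serve.deployment(
    ray_actor_options={
        # No "num_gpus" here: the AsyncLLMEngine actor itself does no GPU
        # work; vLLM spawns its own RayWorker actors that reserve the GPUs.
        "resources": {"worker_gpu_node": 0.01},
    },
)
class VLLMPredictDeployment:
    def __init__(self):
        engine_args = AsyncEngineArgs(
            model="facebook/opt-13b",  # illustrative model
            tensor_parallel_size=2,    # vLLM creates 2 GPU-backed RayWorkers
            worker_use_ray=True,       # run the workers as Ray actors
        )
        self.engine = AsyncLLMEngine.from_engine_args(engine_args)

    async def __call__(self, request):
        # Request handling (tokenization, engine.generate, streaming, ...)
        # is unchanged from the usual VLLMPredictDeployment example.
        ...


app = VLLMPredictDeployment.bind()
```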
However, I am running into another issue: for a long-running vLLM deployment, the `RayWorker` actors that vLLM creates die at some point and are never recreated. When that happens (all the `RayWorker`s are dead), `VLLMPredictDeployment` gives the following error:
```
return (yield from awaitable.__await__())
ray.exceptions.RayActorError: The actor died unexpectedly before finishing this task.
	class_name: RayWorker
	actor_id: 1963424394259fade0c44c5501000000
	pid: 714
	namespace: _ray_internal_dashboard
	ip: 100.64.144.72
The actor is dead because all references to the actor were removed.
(...)
    raise AsyncEngineDeadError(
vllm.engine.async_llm_engine.AsyncEngineDeadError: Task finished unexpectedly. This should never happen! Please open an issue on Github. See stack trace above for the actual cause.
```
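One possible mitigation (a sketch of my own, not something confirmed in this thread): Ray Serve periodically calls a user-defined `check_health` method on the replica and restarts it if the method raises. The probe on `self.engine.is_running` is an assumption about the vLLM version in use; substitute whatever liveness signal your engine exposes.

```python
# Sketch: let Ray Serve recycle the replica once the vLLM engine's
# background loop has died (the AsyncEngineDeadError state above).
from ray import serve
from vllm.engine.arg_utils import AsyncEngineArgs
from vllm.engine.async_llm_engine import AsyncLLMEngine


@serve.deployment(
    ray_actor_options={"resources": {"worker_gpu_node": 0.01}},
    health_check_period_s=10,   # how often Serve calls check_health
    health_check_timeout_s=30,  # how long before a probe counts as failed
)
class VLLMPredictDeployment:
    def __init__(self):
        # Engine construction as in the earlier sketch.
        self.engine = AsyncLLMEngine.from_engine_args(
            AsyncEngineArgs(model="facebook/opt-13b", tensor_parallel_size=2)
        )

    async def check_health(self):
        # Raising here marks the replica unhealthy; Serve tears it down
        # and starts a fresh one, which recreates the RayWorker actors.
        if not self.engine.is_running:
            raise RuntimeError("vLLM engine loop died; restarting replica")
```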
The example should show tensor parallelism. I am not sure whether Serve + vLLM + tensor parallelism works at the moment, because the Serve deployment will request N GPUs and then each vLLM worker will request a GPU of its own, duplicating the GPU resource request.
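To make the duplication concrete, here is an illustrative (hypothetical, not recommended) config showing the failure mode: with `num_gpus=2` on the deployment and `tensor_parallel_size=2`, the cluster must supply four GPUs' worth of resources for a two-GPU model.

```python
# Anti-pattern illustrating the double GPU request described above.
from ray import serve
from vllm.engine.arg_utils import AsyncEngineArgs
from vllm.engine.async_llm_engine import AsyncLLMEngine


@serve.deployment(ray_actor_options={"num_gpus": 2})  # reserves 2 GPUs...
class VLLMPredictDeployment:
    def __init__(self):
        self.engine = AsyncLLMEngine.from_engine_args(
            # ...and TP=2 makes vLLM's RayWorkers request 2 more,
            # so 4 GPUs are needed for a 2-way-TP model.
            AsyncEngineArgs(model="facebook/opt-13b", tensor_parallel_size=2)
        )
```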