Open TrafalgarZZZ opened 7 months ago
I think this might just be the right solution for KubeRay autoscaling. But let's cross check with @pcmoritz and @Yard1
@simon-mo @pcmoritz @Yard1 What's up with this? It's really not fun to have to tear down and rebuild the engine every time the cluster needs to be resized. Even using Ray Serve requires the GPU count to be known at startup.
Maybe there should be an engine arg to disable eager GPU count checks? Maybe also functions to add/remove GPUs?
@rkooo567
The recommended pattern for this is to use Ray Serve with its replica placement group options. You can see it in the code sample in the link above:
# Declare the bundles the replica (and the GPU workers it spawns) will need,
# so Ray Serve reserves them up front via a placement group.
return VLLMDeployment.options(
    placement_group_bundles=pg_resources,
    placement_group_strategy="STRICT_PACK",
).bind(
    engine_args,
    parsed_args.response_role,
    parsed_args.lora_modules,
    parsed_args.chat_template,
)
Here, we are telling Ray Serve what resources are required to schedule the replica including any actors it may start (for vLLM, the GPU workers). Ray Serve will allocate the specified placement group and only schedule the replica actor (for vLLM, the code running the engine). Therefore vLLM won't be started until the requisite GPUs are available and won't bump into the issue above.
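Typically, `pg_resources` here is one small bundle for the Serve replica itself plus one CPU+GPU bundle per tensor-parallel worker. A sketch, assuming a tensor parallel size of 2 (in the real example this would come from `engine_args.tensor_parallel_size`):

```python
# Assumption: tp_size mirrors engine_args.tensor_parallel_size in the real example.
tp_size = 2

pg_resources = [{"CPU": 1}]                        # bundle for the Serve replica / engine driver
pg_resources += [{"CPU": 1, "GPU": 1}] * tp_size   # one CPU+GPU bundle per vLLM GPU worker
```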
Is it possible to merge this into vLLM? I'm hoping for official support for vLLM with Ray Serve.
🚀 The feature, motivation and pitch
Hi, I'm deploying vLLM distributed serving in a Kubernetes environment. To make it work, I installed KubeRay to help me manage the Ray cluster in Kubernetes. vLLM works well when the Ray cluster has enough GPU resources. For example, if `ray status` reports that 2 GPUs are currently available, then vLLM launches successfully with a standard tensor-parallel launch command (sketched below).
I also noticed that KubeRay supports autoscaling, so I would like to leverage the autoscaling feature to save more money on GPU instances.
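For concreteness, a minimal sketch of such a launch from Python, assuming the `vllm.LLM` entrypoint (the model name and settings are placeholders, not my exact command):

```python
from vllm import LLM, SamplingParams

# Placeholder model and settings; with tensor_parallel_size=2 the model is
# sharded across the 2 GPUs that `ray status` reports on the Ray cluster.
llm = LLM(
    model="facebook/opt-13b",   # placeholder model name
    tensor_parallel_size=2,
)

outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=16))
print(outputs[0].outputs[0].text)
```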
What I expect is that when no more GPUs are available in the (Kube)Ray cluster, launching vLLM should trigger scaling out some Ray worker pods with available GPUs in them, and then wait for the Ray cluster to schedule its `RayLLMWorker` actors. Instead, it failed with an error message. It looks like vLLM eagerly checks the available GPU resources at start-up time and fails fast, which makes it impossible to leverage KubeRay's autoscaling feature.
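Conceptually, the eager check seems to amount to something like the following (a sketch in plain Ray terms, not vLLM's actual code, whose exact location varies across versions):

```python
import ray

# Conceptual sketch of a fail-fast resource check: if the cluster does not
# already report enough GPUs, raise immediately instead of giving the KubeRay
# autoscaler a chance to add GPU worker pods.
ray.init(address="auto")

required_gpus = 2  # e.g. the tensor parallel size
available_gpus = ray.cluster_resources().get("GPU", 0)
if available_gpus < required_gpus:
    raise ValueError(
        f"Required {required_gpus} GPUs but only {available_gpus} are "
        f"available in the Ray cluster."
    )
```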
Alternatives
I made some attempts at the problem. One simple solution is to delete the eager check on the Ray cluster's currently available resources. With that check removed, it works well, but I think that's just a quick workaround, not a proper solution.
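My understanding of why removing the check can work: once vLLM proceeds, its pending placement group and actor requests express a GPU demand that the KubeRay autoscaler can satisfy by adding GPU worker pods. A standalone sketch of that idea (not the actual vLLM code path):

```python
import ray
from ray.util.placement_group import placement_group

# A pending placement group is a resource demand the Ray/KubeRay autoscaler
# reacts to: it can add GPU worker pods to satisfy it. ray.get(pg.ready())
# then blocks until the bundles are actually reserved.
ray.init(address="auto")

pg = placement_group([{"CPU": 1, "GPU": 1}] * 2, strategy="PACK")
ray.get(pg.ready())  # blocks while the autoscaler provisions GPU nodes
print("Placement group ready; GPU workers can now be scheduled.")
```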
Additional context
No response