I'm taking ray-llm out for a spin as we need to shard Mixtral-8x7B AWQ quantized across 2 GPU nodes for our use case (vLLM has been working great for us on a single node).
I can't see the model config for this model inside the directory, so I began to put together a config, but it got me thinking: there is no one-size-fits-all for the vLLM `engine_kwargs`. A few of our deployed models rely on different kwargs here to optimise for throughput given particular LLM tasks.
My question is: is it possible to override the defaults inside the `RayService` definition? It would seem extremely limiting if this were not the case, so I'm sure you can - but some confirmation would be great here before I start provisioning node groups and doing some pre-work. Thanks guys :)
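For concreteness, here is the kind of per-model override I have in mind. The field names are my guess based on my reading of the ray-llm example configs - treat this as a sketch of the intent, not a verified schema, and the AWQ checkpoint name is just an example:

```yaml
# Hypothetical model config sketch -- field names inferred from the ray-llm
# example configs, not verified against the actual schema.
engine_config:
  model_id: TheBloke/Mixtral-8x7B-Instruct-v0.1-AWQ  # example AWQ checkpoint
  type: VLLMEngine
  engine_kwargs:
    quantization: awq
    tensor_parallel_size: 2        # shard across the 2 GPU nodes
    max_num_batched_tokens: 8192   # tuned per model/task for throughput
```

The key question is whether something like the `engine_kwargs` section above can be supplied per model from the `RayService` side rather than only baked into the shipped defaults.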