nstogner opened 2 months ago
This feature would be very useful. Are there any estimates for the integration?
We have a few competing priorities right now:
Will work to define some dates soon. Sounds like LoRA would be your number 1 priority out of those?
Yes, serving dynamically loaded LoRA adapters is our number 1 priority.
Can you provide some details on your use case so that we can make sure that we will solve it? Where do you store the adapters? What would be the total expected number of adapter variants you would have for a given model? How are you serving them today?
At the moment we have 2 instances of Lorax deployed, each with a different base model and 4-5 adapters per instance. We're storing all adapters on S3.
Started work on this; we will tackle it as our next big feature.
KubeAI should be able to serve dynamically loaded LoRA adapters. Eventually KubeAI could also produce these adapters by supporting a finetuning API endpoint; however, that can be implemented separately.
I see 2 primary options:
Option A
KubeAI handles shipping adapters to the different server Pods (e.g. via `kubectl cp` or a shared filesystem) and keeps track of which Pods have which adapters (see the sketch below).
Option B
Server backends handle dynamic loading of adapters themselves, and KubeAI just keeps track of which Pods already have which adapters in order to load balance effectively.
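To make Option A concrete, here is a rough sketch of the adapter-shipping flow using plain CLI tools. The bucket, pod name, namespace, and target directory are hypothetical placeholders, not an agreed-on layout:

```bash
# Pull the adapter artifacts from S3 (illustrative bucket/path).
aws s3 cp s3://my-adapters/llama-3-8b/customer-a/ ./customer-a/ --recursive

# Copy the adapter into a running server Pod (hypothetical pod name/namespace).
kubectl cp ./customer-a/ kubeai/model-llama-3-8b-abc123:/adapters/customer-a

# KubeAI would then record that this Pod now holds the "customer-a" adapter
# and route matching requests to it.
```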
Not yet supported in vLLM: https://github.com/vllm-project/vllm/issues/6275
Currently supported by Lorax: https://github.com/predibase/lorax
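For reference, the Option B style is roughly what Lorax exposes today: the server pulls an adapter on demand when a request names it, so KubeAI would only need to track which Pods have already loaded which adapters. A minimal sketch, assuming a Lorax server reachable at `lorax-service:8080`; the adapter ID and S3 source below are placeholders for illustration:

```bash
curl http://lorax-service:8080/generate \
  -X POST \
  -H "Content-Type: application/json" \
  -d '{
    "inputs": "Summarize this ticket: ...",
    "parameters": {
      "max_new_tokens": 64,
      "adapter_id": "customer-a/summarizer",
      "adapter_source": "s3"
    }
  }'
```

In this model the load balancer's job is mainly affinity: sending repeat requests for the same adapter to a Pod that has already loaded it, to avoid paying the download/load cost again.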