nstogner opened 2 months ago
This feature would be very useful. Are there any estimates for the integration?
We have a few competing priorities right now:
Will work to define some dates soon. Sounds like LoRA would be your number 1 priority out of those?
Yes, serving dynamically loaded LoRA adapters is our number 1 priority.
Can you provide some details on your use case so that we can make sure that we will solve it? Where do you store the adapters? What would be the total expected number of adapter variants you would have for a given model? How are you serving them today?
At the moment we have 2 instances of Lorax deployed, each with a different base model and 4-5 adapters per instance. We're storing all adapters on S3.
Started work on this; we will tackle it as our next big feature.
KubeAI should be able to serve dynamically loaded LoRA adapters. Eventually KubeAI could also produce these adapters by supporting a finetuning API endpoint; however, that can be implemented separately.
I see 2 primary options:
Option A
KubeAI handles shipping adapters to the different server Pods (e.g. via `kubectl cp` or a shared filesystem) and keeps track of which Pods have which adapters (see the sketch below).
Option B
Server backends handle dynamic loading of adapters themselves, and KubeAI just keeps track of which Pods already have which adapters in order to load balance effectively.
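To make Option A concrete, here is a rough sketch of the adapter-shipping flow using plain CLI tools. The bucket, pod name, namespace, and target directory are hypothetical placeholders, not an agreed-on layout:

```bash
# Pull the adapter artifacts from S3 (illustrative bucket/path).
aws s3 cp s3://my-adapters/llama-3-8b/customer-a/ ./customer-a/ --recursive

# Copy the adapter into a running server Pod (hypothetical pod name/namespace).
kubectl cp ./customer-a/ kubeai/model-llama-3-8b-abc123:/adapters/customer-a

# KubeAI would then record that this Pod now holds the "customer-a" adapter
# and route matching requests to it.
```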
Not yet supported in vLLM: https://github.com/vllm-project/vllm/issues/6275
Currently supported by Lorax: https://github.com/predibase/lorax
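For reference, the Option B style is roughly what Lorax exposes today: the server pulls an adapter on demand when a request names it, so KubeAI would only need to track which Pods have already loaded which adapters. A minimal sketch, assuming a Lorax server reachable at `lorax-service:8080`; the adapter ID and S3 source below are placeholders for illustration:

```bash
curl http://lorax-service:8080/generate \
  -X POST \
  -H "Content-Type: application/json" \
  -d '{
    "inputs": "Summarize this ticket: ...",
    "parameters": {
      "max_new_tokens": 64,
      "adapter_id": "customer-a/summarizer",
      "adapter_source": "s3"
    }
  }'
```

In this model the load balancer's job is mainly affinity: sending repeat requests for the same adapter to a Pod that has already loaded it, to avoid paying the download/load cost again.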