huggingface / text-generation-inference

Large Language Model Text Generation Inference
http://hf.co/docs/text-generation-inference
Apache License 2.0

What would it take to support multiple LoRAs with a single backbone? #907

Closed ToddMorrill closed 1 year ago

ToddMorrill commented 1 year ago

Feature request

An increasingly common question is how to support inference for multiple LoRA models running against a single backbone model. What's preventing TGI from implementing a feature like this? I realize there may be a million little reasons, but what are the core blockers (e.g. routing, continuous batching)?
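For concreteness, here is a minimal sketch (illustrative only, not TGI code) of why several adapters can share one backbone: the full base weight is loaded once, and each adapter only contributes two small low-rank matrices, so serving N adapters costs far less memory than serving N fully fine-tuned copies of the model.

```python
import torch
import torch.nn as nn

class MultiAdapterLinear(nn.Module):
    """One shared base weight plus several small LoRA adapters (sketch, not TGI code)."""
    def __init__(self, in_features, out_features, rank=16, num_adapters=2):
        super().__init__()
        # One full-size base weight, shared by every adapter (frozen at inference time).
        self.base = nn.Linear(in_features, out_features, bias=False)
        # Per-adapter low-rank factors: (num_adapters, rank, in) and (num_adapters, out, rank).
        self.lora_A = nn.Parameter(torch.randn(num_adapters, rank, in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(num_adapters, out_features, rank))

    def forward(self, x, adapter_id):
        # y = W x + B_a (A_a x) for the chosen adapter.
        delta = x @ self.lora_A[adapter_id].T @ self.lora_B[adapter_id].T
        return self.base(x) + delta

layer = MultiAdapterLinear(4096, 4096, rank=16, num_adapters=2)
x = torch.randn(1, 4096)
y0 = layer(x, adapter_id=0)  # request served with adapter 0
y1 = layer(x, adapter_id=1)  # same backbone, different adapter
```

With these sizes the shared base layer holds ~16.8M parameters while each adapter adds only ~131k, which is where the memory saving comes from.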

Motivation

The value proposition here is that you can reduce the memory footprint required to serve several models on a single server. You might also get better hardware utilization since demand patterns for different models may complement one another.

Your contribution

I have a decent sense for the optimizations you're applying to the models (e.g. tensor parallelism, kernel fusion, etc.) but less so around routing and continuous batching. I'd have some learning to do before I could contribute a feature like this.

Narsil commented 1 year ago

This is something we have in mind, although we're not entirely sure it fits TGI directly.

Feature creep is the biggest reason why we're not sure about it.

The other thing is that supporting multiple LoRAs will add more memory shifting to a system that's already memory-bound, which adds latency, and latency is the thing we care about most.

That being said, it would indeed be nice to use the reduced memory requirements to serve multiple models from one backbone.

Code-wise it's rather tricky, because everything is so ingrained as a single batch with no padding that applying a different adapter to different parts of the batch will require some clever thinking.
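To make the "clever thinking" concrete, here is a rough sketch of one possible way to handle a mixed batch (an assumption about the design space, not TGI internals): keep the base matmul as one dense GEMM for the whole batch, and gather each request's small A/B factors for a per-row low-rank update.

```python
import torch

def mixed_batch_lora(x, W, lora_A, lora_B, adapter_ids):
    """
    x:           (batch, in_features)        hidden states, one row per request
    W:           (out_features, in_features) shared base weight
    lora_A:      (num_adapters, rank, in_features)
    lora_B:      (num_adapters, out_features, rank)
    adapter_ids: (batch,)                    which adapter each request uses
    """
    base_out = x @ W.T                       # one dense GEMM for every request
    A = lora_A[adapter_ids]                  # (batch, rank, in_features)
    B = lora_B[adapter_ids]                  # (batch, out_features, rank)
    # Per-row low-rank update: B_i @ (A_i @ x_i) for each request i.
    delta = torch.bmm(B, torch.bmm(A, x.unsqueeze(-1))).squeeze(-1)
    return base_out + delta

# Two requests in one batch, each on a different adapter.
x = torch.randn(2, 4096)
W = torch.randn(4096, 4096)
lora_A = torch.randn(4, 16, 4096) * 0.01
lora_B = torch.zeros(4, 4096, 16)
out = mixed_batch_lora(x, W, lora_A, lora_B, torch.tensor([0, 3]))
```

The point of this layout is that only the tiny low-rank part becomes per-request work; the expensive base projection stays fully batched.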

RonanKMcGovern commented 1 year ago

Can't you just apply a LoRA and then apply another one? (Sorry, probably a naive point.)

Narsil commented 1 year ago

If you have 2 requests on 2 different LoRAs stacked in the same batch?

RonanKMcGovern commented 1 year ago

> If you have 2 requests on 2 different LoRAs stacked in the same batch?

Ah ok. Yeah, if the requests are for different LoRAs then that's tricky. I misunderstood.
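For context on why the naive approach only works for a homogeneous batch: merging an adapter folds B @ A into the base weight, so the merged matrix is specific to that one adapter, and un-merging/re-merging between requests would serialize the batch. A small sketch with hypothetical weights (not TGI code):

```python
import torch

torch.manual_seed(0)
W = torch.randn(64, 64)                            # shared base weight
A1, B1 = torch.randn(8, 64), torch.randn(64, 8)    # adapter 1
A2, B2 = torch.randn(8, 64), torch.randn(64, 8)    # adapter 2

W_merged_1 = W + B1 @ A1                 # fine if *every* request uses adapter 1

x = torch.randn(1, 64)
y_want_adapter2 = x @ (W + B2 @ A2).T    # what a request on adapter 2 should get
y_from_merged1 = x @ W_merged_1.T        # what it would get from the merged weight
print(torch.allclose(y_want_adapter2, y_from_merged1))  # False: merged weight is wrong
```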

ToddMorrill commented 1 year ago

Glad to know you guys are thinking about it. Thanks.

garython commented 5 months ago

Now that vLLM supports this as an experimental feature, can TGI also consider supporting multi-LoRA? https://docs.google.com/presentation/d/12mI2sKABnUw5RBWXDYY-HtHth4iMSNcEoQ10jDQbxgA/edit#slide=id.g265286f78bf_0_482
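For reference, vLLM's experimental multi-LoRA usage looks roughly like the following (adapted from the pattern in its docs at the time; the model name and adapter path here are placeholders, and exact argument names may differ across versions):

```python
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

# Load one shared backbone with LoRA support enabled (model name is a placeholder).
llm = LLM(model="meta-llama/Llama-2-7b-hf", enable_lora=True)
sampling_params = SamplingParams(temperature=0.0, max_tokens=64)

# Each request can name a different adapter; vLLM batches them against the same backbone.
outputs = llm.generate(
    ["Translate to SQL: how many users signed up last week?"],
    sampling_params,
    # (adapter name, integer id, path to the adapter weights) -- path is a placeholder.
    lora_request=LoRARequest("sql-adapter", 1, "/path/to/sql-lora"),
)
print(outputs[0].outputs[0].text)
```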