predibase / lorax

Multi-LoRA inference server that scales to 1000s of fine-tuned LLMs
https://loraexchange.ai
Apache License 2.0
1.86k stars 126 forks

Want Lorax with newer version of TGI #329

Open yangelaboy opened 3 months ago

yangelaboy commented 3 months ago

Feature request

Hello, our models are deployed with TGI (v1.4.3), and we also want to use LoRAX. However, I find that the TGI version LoRAX is based on is very different from TGI v1.4.3. We are trying to integrate LoRAX (v0.8) into TGI (v1.4.3). Would it be possible to upgrade the TGI code that LoRAX is based on, or to contribute LoRAX to TGI?

Motivation

Use new features of TGI together with LoRAX.

Your contribution

We are trying to integrate LoRAX (v0.8) into TGI (v1.4.3), but both LoRAX and TGI are moving targets!

tgaddair commented 3 months ago

Hi @yangelaboy, thanks for trying out LoRAX. I'd love to incorporate more upstream work from TGI, but since they changed their license last year, we can no longer pull their code into our repo.

That said, we have implemented many of the same features recently (though in slightly different ways). Are there specific features you're using in TGI you want to see in LoRAX? If so, we can definitely prioritize getting those added.

One thing in TGI we're working to add very soon is speculative decoding. We think our implementation will be particularly interesting, as we'll be able to handle multiple speculation models at once. Let me know if there are other features you're interested in.
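For readers unfamiliar with the technique mentioned above: in speculative decoding, a cheap draft model proposes several tokens that the target model then checks in a single batched forward pass. A minimal greedy-verification sketch (the function name and inputs are illustrative, not LoRAX's actual API):

```python
def accept_draft(draft_tokens, target_tokens):
    """Greedy speculative-decoding verification (sketch).

    draft_tokens:  tokens proposed by the draft model.
    target_tokens: the token the target model would emit at each of
                   those positions (obtained from one batched forward pass).

    Accepts the longest matching prefix of the draft; at the first
    mismatch it appends the target model's own token instead, so every
    verification step produces at least one correct token.
    """
    accepted = []
    for d, t in zip(draft_tokens, target_tokens):
        if d != t:
            accepted.append(t)  # target's correction at the mismatch
            return accepted
        accepted.append(d)
    return accepted
```

When the draft is entirely correct, all of its tokens are emitted for the cost of one target-model pass, which is where the speedup comes from.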

yangelaboy commented 3 months ago

@tgaddair Thanks for the detailed reply. We are using features such as speculative decoding (ngram & Medusa) and quantization, and we're also interested in many of TGI's other optimizations. We have also added features to TGI ourselves, such as a shared prefix prompt cache. Ultimately, we want a framework that can serve different adapter models and Medusa models on the same self-trained base model with a shared prefix prompt cache. I will keep an eye on LoRAX.
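The "ngram" variant mentioned here (often called prompt-lookup decoding) needs no extra model: draft tokens are proposed by matching the last few generated tokens against earlier occurrences in the sequence. A minimal sketch, with hypothetical parameter names:

```python
def ngram_draft(tokens, ngram_size=3, num_draft=5):
    """Propose draft tokens via prompt lookup (sketch).

    Matches the last `ngram_size` tokens against earlier occurrences in
    `tokens`, scanning from the most recent occurrence backwards, and
    returns up to `num_draft` tokens that followed that occurrence.
    Returns an empty list when no earlier match exists.
    """
    if len(tokens) < ngram_size:
        return []
    tail = tokens[-ngram_size:]
    # Scan right-to-left so the most recent earlier match wins.
    for start in range(len(tokens) - ngram_size - 1, -1, -1):
        if tokens[start:start + ngram_size] == tail:
            follow = tokens[start + ngram_size:start + ngram_size + num_draft]
            if follow:
                return follow
    return []
```

The draft is then verified by the target model exactly as in standard speculative decoding; this tends to pay off on repetitive text such as code or retrieval-augmented prompts.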

tgaddair commented 3 months ago

Hey @yangelaboy, thanks for this context! The good news is all of the things you listed are on our near-term roadmap.

I'll definitely let you know when speculative decoding is ready to test out!

abhibst commented 3 months ago

Thanks @tgaddair, we are also waiting for speculative decoding 👍

giyaseddin commented 2 months ago

The license is back to Apache-2.0 https://github.com/huggingface/text-generation-inference/commit/ff42d33e9944832a19171967d2edd6c292bdb2d6 @tgaddair