predibase / lorax

Multi-LoRA inference server that scales to 1000s of fine-tuned LLMs
https://loraexchange.ai
Apache License 2.0

how does this differ from S-LoRA? #90

Open priyankat99 opened 7 months ago

priyankat99 commented 7 months ago

Really cool project! I'm wondering how it's different from S-LoRA? https://github.com/S-LoRA/S-LoRA

tgaddair commented 7 months ago

Great question, @priyankat99!

The short answer is that LoRAX is intended to be a production LLM serving system for handling many fine-tuned LLMs at once, while S-LoRA is a novel algorithm that can allow multi-adapter inference to scale well under heavy load. As an analogy, LoRAX is more like fully-featured inference servers like TGI, vLLM, etc., while S-LoRA is closer to specific algorithms used by those systems like Paged Attention or Flash Attention.

In the S-LoRA repo, they do implement a proof of concept of a fully working serving system, but it has a lot of caveats at the moment. Compared to S-LoRA, LoRAX:

I do want to caveat this by saying that this isn't to disparage S-LoRA; rather, the goals of the projects are quite different. There doesn't seem to be a desire to fix these issues in the S-LoRA repo itself; rather, the authors have stated that their intention is to add S-LoRA's capabilities to the existing vLLM project. So if and when that happens, many of these issues may be resolved.

I'll also say that one thing S-LoRA has that LoRAX does not today is an optimized multi-rank kernel (efficiently handling adapters of different ranks in a single batch). We support mixing adapter ranks in a batch as well, but their kernel is highly optimized for this use case, and our plan is to add their kernel to LoRAX in the near future.
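To make "multi-rank" concrete, here's a rough NumPy sketch (not LoRAX's or S-LoRA's actual kernel; names and shapes are purely illustrative) of a batch where each request applies a LoRA delta of a different rank on top of one shared base weight. The job of an optimized multi-rank kernel is to fuse the per-request loop below into a single batched GPU launch:

```python
import numpy as np

hidden = 64

# Base weight shared by every request in the batch.
W = np.random.randn(hidden, hidden)

# Each request may reference a different adapter, each with its own rank r:
# delta_W = B @ A, with A of shape (r, hidden) and B of shape (hidden, r).
adapters = {
    "adapter-rank-8": (np.random.randn(8, hidden), np.random.randn(hidden, 8)),
    "adapter-rank-16": (np.random.randn(16, hidden), np.random.randn(hidden, 16)),
}
request_adapters = ["adapter-rank-8", "adapter-rank-16", None]  # None = base model only

x = np.random.randn(len(request_adapters), hidden)

# One shared base projection for the whole batch.
y = x @ W.T

# Naive per-request adapter application: correct, but a Python loop over
# heterogeneous ranks. A fused multi-rank kernel does this work in one launch.
for i, name in enumerate(request_adapters):
    if name is None:
        continue
    A, B = adapters[name]
    y[i] += (x[i] @ A.T) @ B.T

print(y.shape)  # (3, 64)
```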

Looking forward, we'll need to see how the integration with vLLM shakes out, though it's not clear what the timeline is on that. I will say, though, that we also considered adding these features to an existing LLM inference library, rather than forking TGI and building LoRAX as a separate project, but found that there were significant architecture changes that needed to be made to support dynamic adapter inference. Going forward, we plan to do a lot more with this idea, including supporting embedding model adapters, classification head adapters, etc. All of this means that I suspect LoRAX will continue to diverge from general purpose LLM inference systems like vLLM over time, rather than these systems converging cleanly. But time will tell!

Hope that answers your question.

priyankat99 commented 7 months ago

Ah ok, thanks so much, this was very helpful and very interesting stuff! To clarify, since LoRAX is its own framework, if I were using vLLM or ExLlama, I would have to migrate off those to use LoRAX?

tgaddair commented 7 months ago

Hi @priyankat99, yes, LoRAX is a standalone solution for serving. In some cases it could make sense to, for example, serve your base models with vLLM and then serve your fine-tuned LoRAs with LoRAX, but LoRAX allows you to do both. We include many of the main features of those libraries, so LoRAX should be at or near performance parity with those systems (even without multi-LoRA serving).
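To give a feel for what that looks like from the client side, here's a minimal sketch of hitting a single LoRAX deployment with and without an adapter. The endpoint, port, and adapter ID below are placeholders; check the LoRAX docs for the exact request schema.

```python
from typing import Optional

import requests

LORAX_URL = "http://localhost:8080/generate"  # assumed local deployment

def generate(prompt: str, adapter_id: Optional[str] = None) -> str:
    """Send one generation request, optionally selecting a LoRA adapter."""
    payload = {"inputs": prompt, "parameters": {"max_new_tokens": 64}}
    if adapter_id is not None:
        # The adapter is loaded on demand and batched alongside other adapters.
        payload["parameters"]["adapter_id"] = adapter_id
    resp = requests.post(LORAX_URL, json=payload, timeout=60)
    resp.raise_for_status()
    return resp.json()["generated_text"]

# Base model and a fine-tuned adapter, served by the same process.
print(generate("Explain LoRA in one sentence."))
print(generate("Explain LoRA in one sentence.", adapter_id="my-org/my-lora-adapter"))
```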

lizzzcai commented 5 months ago

Hi @tgaddair, just noticed this project in one of the TGI issues. It is amazing.

> for example, serve your base models with vLLM and then serve your fine-tuned LoRAs with LoRAX, but LoRAX allows you to do both

I would like to check if it is possible to decouple the adapters and base model in LoRAX (two micro-services).

Example: multiple LoRAX instances hosting multiple adapters, with each instance connected to a different shared base model (Llama, Mistral, etc., hosted via LoRAX, vLLM, HF, etc.).

The use case is that we have multiple users and adapters cannot be shared across users. The existing way of loading adapters cannot prevent one user from accessing another user's adapter, and updating S3 credentials triggers a rollout restart as well. In this case, having a LoRAX instance hosting the adapters per user could help solve the above issue. Not sure if you have a similar use case as well, thanks.

tgaddair commented 5 months ago

Hi @lizzzcai, your use case sounds very similar to the one we have at Predibase. I'd definitely love to collaborate on this to make sure it's well supported for you.

It's certainly possible to have one LoRAX instance per tenant, but I consider that pretty heavyweight. What we opt for instead is having the user provide an access token, via the LoRAX api_token parameter, on each request that uses an adapter. This works for the Predibase adapter source, but does not currently work for the more general S3 source.

One thing we could consider doing is allowing the user to pass their S3 credentials using this same api_token param (possibly encrypted, with LoRAX initialized with the appropriate decryption key). But let me know if you want to find some time to discuss this use case in more depth!
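As a rough sketch of that per-request flow (the field names and values below are illustrative, not a documented schema), each tenant would attach its own token to the requests that use its adapter, so the shared instance never needs every tenant's credentials baked into the deployment:

```python
import requests

def generate_for_tenant(prompt: str, adapter_id: str, tenant_token: str) -> str:
    """One request carrying the tenant's own credential for its private adapter."""
    payload = {
        "inputs": prompt,
        "parameters": {
            "max_new_tokens": 64,
            "adapter_id": adapter_id,
            # Supplied per request, so one shared LoRAX instance can serve
            # many tenants without holding all of their credentials.
            "api_token": tenant_token,
        },
    }
    resp = requests.post("http://localhost:8080/generate", json=payload, timeout=60)
    resp.raise_for_status()
    return resp.json()["generated_text"]
```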

lizzzcai commented 5 months ago

Thanks @tgaddair for your reply. I think the existing solution and this api_token approach already help a lot! I am happy to discuss it further; I will drop you an email. Thanks.

gjfdklfjd commented 3 months ago

Could the user load more than one adapter at a time, so I could serve multiple tasks for different customers with multiple adapters? I noticed that adapter_id's format is str, not a list. There is a way with merged adapters, but that might have interference problems; will the interference be significant? Checking for interference might be a bit troublesome.

tgaddair commented 3 months ago

Hey @gjfdklfjd, yes, LoRAX can serve multiple tasks to different customers at the same time.

When you say you want to provide multiple adapters per request, what is the expected behavior in that case? Do you want the adapters to be merged together?

gjfdklfjd commented 3 months ago

Hi tgaddair, thank you for your reply. Right now I need to serve multiple tasks to different customers at the same time. Merged adapters might be an alternative if independent adapters do not work. I just want to use one base model with multiple independent adapters, taking up roughly the GPU memory of a single base model, so I can save GPU memory while serving different customers with their corresponding adapters. How could I do this? Thank you.
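For reference, a minimal sketch of the pattern described earlier in the thread: one LoRAX deployment keeps a single copy of the base model in GPU memory, and each request selects a customer's independent adapter via adapter_id, so no merging (and no merge interference) is needed. The client package, API shape, and adapter IDs below are assumptions based on the LoRAX README and this thread, not a verified recipe.

```python
# pip install lorax-client  (assumed package name; see the LoRAX README)
from lorax import Client

client = Client("http://127.0.0.1:8080")  # one deployment, one base model in GPU memory

# Hypothetical per-customer adapters; only the small LoRA weights differ
# per customer, while the base model weights are shared by every request.
CUSTOMER_ADAPTERS = {
    "customer-a": "my-org/support-bot-lora",
    "customer-b": "my-org/contract-summarizer-lora",
}

def generate_for_customer(customer: str, prompt: str) -> str:
    # Each request picks that customer's adapter, so independent adapters
    # are served concurrently without merging them into one set of weights.
    response = client.generate(
        prompt,
        adapter_id=CUSTOMER_ADAPTERS[customer],
        max_new_tokens=64,
    )
    return response.generated_text
```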