sekhar-hari opened this issue 8 months ago
Hey @sekhar-hari, this is a cool idea. I would say that currently LoRAX does not do any swapping of the base model at runtime. But in the future, there are a few things we'd be looking to explore:
But happy to explore this use case more if you feel there would be a way to reuse some components of a shared set of base parameters.
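For context, today the server keeps a single base model resident and only LoRA adapters are swapped per request. A rough sketch with the Python client (assuming the `lorax-client` package and a server already launched with one `--model-id`; the adapter ID below is just a placeholder):

```python
# How LoRAX works today: one base model stays loaded on the GPU(s), and each
# request can pick a LoRA adapter to apply on top of those shared base weights.
# Assumes the `lorax-client` package and a server running at localhost:8080;
# the adapter ID below is just a placeholder.
from lorax import Client

client = Client("http://127.0.0.1:8080")

# Plain request against the base model (no adapter).
print(client.generate("Write a haiku about GPUs.").generated_text)

# Same server and same base weights, with a fine-tuned adapter applied
# only for this request.
print(
    client.generate(
        "Write a haiku about GPUs.",
        adapter_id="your-org/your-lora-adapter",
    ).generated_text
)
```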
Thanks @tgaddair. Swapping common components of a base MoE model would be very helpful. I'm using the Mixtral 8x7B base model, and it requires two full NVIDIA A100 80GB GPUs to load and run inference. I also have CodeLlama 70B, which requires three GPUs to load and serve with a reasonable response time. All of the models I'm using are foundation (base) models (not quantized) straight from Hugging Face, and I haven't done any LoRA fine-tuning on them.
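As a rough back-of-the-envelope check on those GPU counts (weights only, fp16, ignoring KV cache and activation overhead; the parameter counts are approximate):

```python
# Rough fp16 memory estimate for the models mentioned above (weights only;
# KV cache and activation memory add more, so real requirements are higher).
BYTES_PER_PARAM_FP16 = 2
A100_MEMORY_GB = 80

models = {
    "Mixtral 8x7B (~46.7B total params)": 46.7e9,
    "CodeLlama 70B (~70B params)": 70e9,
}

for name, params in models.items():
    gb = params * BYTES_PER_PARAM_FP16 / 1e9
    print(f"{name}: ~{gb:.0f} GB of weights -> more than {gb / A100_MEMORY_GB:.1f} A100-80GB GPUs")
```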
Requiring multiple high-end GPUs like this will be cost-prohibitive for any generative AI solution going forward, especially when these solutions are offered on-prem (cloud, data center, etc.).
I know LoRAX supports swapping of fine-tuned LoRA adapters, but if swapping of full base models could be included as well, it would be a significant leap for LoRAX. Such a feature would certainly enable large-scale adoption of generative AI solutions.
Do you have any timeline in mind to possibly release this capability in LoRAX? Definitely looking forward to it.
Hi @sekhar-hari, we don't have a definite timeline yet, but MoE is something we're looking to explore either later this quarter or next quarter.
Ok, understood. Really looking forward to seeing those features. I'll be closely following LoRAX development here. Thanks.
Model description
Hi - I'm new to LoRAX. I've just started reading the docs and the GitHub repo. Here's my situation: I have four locally hosted foundation (base) open-source LLMs running across multiple GPUs. These LLMs are LLaMA derivatives / Qwen derivatives (e.g., Smaug-72B).
Compute requirements are growing and I have a tight budget. I'm looking to see whether LoRAX can help load all four of these base LLMs onto a single GPU, or at most two, at the same time, so that I can choose the base model I want at runtime through my chat UI.
Please note that I have not fine-tuned any of these LLMs, so I don't have any LoRA adapters. I just want to chat with the foundation LLMs themselves, e.g. Mixtral 8x7B.
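In case it helps frame the question, here is a minimal sketch of the kind of runtime selection I have in mind, assuming the `lorax-client` Python package and one LoRAX server per base model (the endpoint URLs and model keys are placeholders). Note that this approach still keeps every base model fully resident on its own GPUs, which is exactly the cost I'm hoping to avoid:

```python
# Hypothetical client-side router for the setup described above: one LoRAX
# server per base model, with the chat UI picking an endpoint per request.
# Endpoint URLs and model keys are placeholders, not real deployments.
from lorax import Client

ENDPOINTS = {
    "mixtral-8x7b": "http://gpu-node-1:8080",
    "smaug-72b": "http://gpu-node-2:8080",
}

def chat(model_key: str, prompt: str) -> str:
    """Send the prompt to whichever base model the UI selected."""
    client = Client(ENDPOINTS[model_key])
    return client.generate(prompt, max_new_tokens=256).generated_text

if __name__ == "__main__":
    print(chat("mixtral-8x7b", "Summarize what LoRAX does in one sentence."))
```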
Hope someone can let me know whether this is possible with the latest version of LoRAX without trading off throughput or accuracy. Many thanks.
Open source status
Provide useful links for the implementation
No response