predibase / lorax

Multi-LoRA inference server that scales to 1000s of fine-tuned LLMs
https://loraexchange.ai
Apache License 2.0
2.19k stars, 143 forks

Question regarding Punica integration #107

Open psych0v0yager opened 11 months ago

psych0v0yager commented 11 months ago

The acknowledgements of this project mention the SGMV kernels created by the Punica project. Is there a way to run multiple adapters simultaneously with LoRAX, similar to what the Punica example shows? Can this be done via the AsyncClient?

tgaddair commented 11 months ago

Hi @psych0v0yager, yes, there are a few ways to run multiple adapters in a single batch:

  1. Multiple clients making requests at the same time (this is the most common situation we see in production)
  2. Making multiple requests using AsyncClient and then awaiting at the end (batch request submission)
  3. Using another concurrency system like threading and making an HTTP request directly to the endpoint

We have a very simple example of (3) here, but I'll make a note to add more examples of how to do this using AsyncClient.
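
In the meantime, option (2) can be sketched roughly as follows. This is a minimal illustration rather than the official client usage: `fake_generate` is a hypothetical stand-in for the real call, and the `(prompt, adapter_id)` signature and the adapter IDs are assumptions made for the example. To run it against a live server, replace `fake_generate` with a thin wrapper around the AsyncClient's generation call.

```python
import asyncio
from typing import Awaitable, Callable, Dict, List

# Hypothetical signature for this sketch: (prompt, adapter_id) -> generated text.
GenerateFn = Callable[[str, str], Awaitable[str]]


async def fake_generate(prompt: str, adapter_id: str) -> str:
    """Stand-in for a real client call; it only simulates request latency."""
    await asyncio.sleep(0.01)
    return f"[{adapter_id}] {prompt}"


async def generate_with_adapters(
    generate: GenerateFn, prompt: str, adapter_ids: List[str]
) -> Dict[str, str]:
    """Submit one request per adapter, then await them all at the end."""
    # Building the coroutines does not block. gather() runs them concurrently,
    # so all requests are in flight at once and the server can batch them.
    pending = [generate(prompt, adapter_id) for adapter_id in adapter_ids]
    outputs = await asyncio.gather(*pending)  # results keep the input order
    return dict(zip(adapter_ids, outputs))


if __name__ == "__main__":
    adapters = ["adapter-a", "adapter-b", "adapter-c"]  # placeholder IDs
    results = asyncio.run(generate_with_adapters(fake_generate, "Hello", adapters))
    for adapter_id, text in results.items():
        print(adapter_id, "->", text)
```

The key point is that nothing is awaited until every request has been created, so the requests overlap instead of running one after another.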

Hope that answers your question!

psych0v0yager commented 11 months ago

Thanks for the fast reply and multiple solutions! I'll be sure to check out your example for number 3, and I look forward to seeing more documentation on the AsyncClient. IMO the AsyncClient seems like the most convenient option for an MoE-type situation.

tgaddair commented 11 months ago

Awesome, @psych0v0yager. To help me understand your use case a little better: for the MoE situation you're describing, are you interested in generating a different sequence for each adapter and then combining them, or in mixing multiple adapters for the same request and generating a single sequence? It sounds like the first one (generating a different sequence for each adapter), but I wanted to confirm, as both are use cases we want to support.

psych0v0yager commented 11 months ago

@tgaddair thanks for the reply! I was interested in the first one (generating a different sequence for each adapter).

Specifically, I was imagining running 5 adapters concurrently, each of them generating a different sequence. Once the batch of 5 is done, I want to feed all 5 sequences to a 6th adapter that is fine-tuned to select the best sequence.
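
Roughly, the flow I have in mind looks like the sketch below. Everything here is hypothetical: `fake_generate` stands in for whatever client call ends up being used, the adapter names are placeholders, and the selector's "reply with a number" protocol is just one assumed way for the 6th adapter to report its choice.

```python
import asyncio
from typing import Awaitable, Callable, List

# Hypothetical signature for this sketch: (prompt, adapter_id) -> generated text.
GenerateFn = Callable[[str, str], Awaitable[str]]


async def fake_generate(prompt: str, adapter_id: str) -> str:
    """Stand-in for a real client call; the 'selector' stub always answers 2."""
    await asyncio.sleep(0.01)
    if adapter_id == "selector":
        return "2"
    return f"draft from {adapter_id}"


def build_selector_prompt(task: str, candidates: List[str]) -> str:
    """Number the candidates so the selector can answer with a single index."""
    numbered = "\n".join(f"{i}: {text}" for i, text in enumerate(candidates))
    return (
        f"Task: {task}\nCandidates:\n{numbered}\n"
        "Reply with the number of the best candidate."
    )


async def best_of_adapters(
    generate: GenerateFn, task: str, worker_ids: List[str], selector_id: str
) -> str:
    """Fan out to the worker adapters, then ask the selector adapter to pick one."""
    # Stage 1: all worker requests are in flight at the same time.
    candidates = await asyncio.gather(*(generate(task, a) for a in worker_ids))
    # Stage 2: a single follow-up request to the selector adapter.
    reply = await generate(build_selector_prompt(task, list(candidates)), selector_id)
    index = int(reply.strip())  # raises ValueError if the reply is not a number
    if not 0 <= index < len(candidates):
        raise ValueError(f"selector returned an out-of-range index: {index}")
    return candidates[index]


if __name__ == "__main__":
    workers = [f"worker-{i}" for i in range(5)]  # placeholder adapter IDs
    print(asyncio.run(best_of_adapters(fake_generate, "Summarize X", workers, "selector")))
```

The second stage only starts once all five sequences are back, so the selector always sees the full set of candidates.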