psych0v0yager opened this issue 11 months ago (status: Open)
Hi @psych0v0yager, yes, there are a few ways to run multiple adapters in a single batch:
We have a very simple example of (3) here, but I'll make a note to add more examples of how to do this using AsyncClient.
Hope that answers your question!
Thanks for the fast reply and the multiple solutions! I'll be sure to check out your example for number 3, and I look forward to seeing more documentation on the AsyncClient. IMO the AsyncClient seems like the most convenient option for a MoE-type situation.
Awesome, @psych0v0yager. To help me understand your use case a little better: for the MoE situation you're describing, are you interested in generating a different sequence for each adapter and then combining them, or in mixing multiple adapters for the same request and generating a single sequence? It sounds like the first one (generating a different sequence for each adapter), but I wanted to confirm, as both are use cases we want to support.
@tgaddair thanks for the reply! I was interested in the first one (generating a different sequence for each adapter).
Specifically, I was imagining running 5 adapters concurrently, each of them generating a different sequence. Once the batch of 5 is done, I want to feed all 5 sequences to a 6th adapter that is fine-tuned to select the best sequence.
The acknowledgements of this project mention the SGMV kernels created by the Punica project. Is there a way we can run multiple adapters simultaneously using LoRAX in a similar way shown in the Punica example? Can this be done via the AsyncClient?
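For reference, the fan-out-then-select flow described above could be sketched roughly as follows. This is only an illustration, not a confirmed LoRAX recipe: `generate` is a hypothetical async callable taking `(prompt, adapter_id)` and returning text, `echo_generate` is a stand-in so the sketch runs without a server, and the selector prompt format is invented. With LoRAX, one would presumably wrap the AsyncClient's generate call (passing the adapter ID per request) in such a callable, but that wiring is an assumption here, and whether the concurrent requests end up in the same batch is decided by the server.

```python
import asyncio
from typing import Awaitable, Callable, List, Sequence, Tuple

# Hypothetical signature: (prompt, adapter_id) -> generated text.
GenerateFn = Callable[[str, str], Awaitable[str]]


async def echo_generate(prompt: str, adapter_id: str) -> str:
    """Stand-in for a real client call; tags the prompt with the adapter ID."""
    await asyncio.sleep(0)  # yield control, as a network call would
    return f"<{adapter_id}> {prompt}"


async def fan_out_then_select(
    generate: GenerateFn,
    prompt: str,
    candidate_adapters: Sequence[str],
    selector_adapter: str,
) -> Tuple[List[str], str]:
    """Run one request per candidate adapter concurrently, then ask a
    selector adapter to pick among the results."""
    # Issue all candidate requests at once; gather keeps input order.
    candidates = await asyncio.gather(
        *(generate(prompt, adapter) for adapter in candidate_adapters)
    )
    # Build a single prompt listing every candidate (format is illustrative).
    numbered = "\n".join(f"[{i}] {text}" for i, text in enumerate(candidates))
    selector_prompt = (
        f"Prompt:\n{prompt}\n\nCandidates:\n{numbered}\n\nBest candidate:"
    )
    choice = await generate(selector_prompt, selector_adapter)
    return list(candidates), choice


if __name__ == "__main__":
    adapters = [f"adapter-{i}" for i in range(1, 6)]  # placeholder adapter IDs
    results, best = asyncio.run(
        fan_out_then_select(echo_generate, "Hello", adapters, "selector-adapter")
    )
    print(results)
    print(best)
```

Passing the generation function in as a parameter keeps the orchestration independent of any particular client, so the same flow can be exercised with a stub before pointing it at a real endpoint.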