Closed · zihaolucky closed this 6 months ago
Thanks for the kind words!
Thanks for the kind words!

First, a quick clarification: tui-multi-lora.py demonstrates a mixed usage of 6 requests on 4 models. There are 3 demo applications (gsm8k, sqlctx, and viggo). Each of the apps runs in two different ways: once with its finetuned LoRA model and once with the base model (i.e., LoRA weights set to zero). Therefore you get 3*2=6 requests in the batch, and 3+1=4 models (3 finetuned models + 1 zero weight, i.e., the base model).
As for why there is a performance gain in the Distinct case (i.e., N requests for N LoRA models), I'd say it mostly comes from utilizing more compute units. x@A@B is a small computation that cannot fully utilize the GPU on its own. Batching increases the degree of parallelism and thus improves performance. However, this free lunch does not last forever, as shown in Figures 8 and 9 in our paper. For more analysis, see Figure 7 (the roofline plot) and the related text in Section 7.1.
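The intuition above can be sketched numerically. The snippet below (a minimal illustration using numpy, not Punica's actual SGMV kernel; all names are hypothetical) shows that N distinct per-request LoRA computations x_i @ A_i @ B_i can be gathered into one grouped operation, which is the kind of batching that exposes N-way parallelism to the hardware:

```python
import numpy as np

# Hypothetical setup: N "Distinct" requests, each routed to its own
# LoRA adapter pair (A_i, B_i). Dimensions are illustrative only.
N, d, r = 4, 64, 8                        # requests, hidden dim, LoRA rank
rng = np.random.default_rng(0)
X = rng.standard_normal((N, d))           # one input row per request
A = rng.standard_normal((N, d, r))        # per-request LoRA A matrices
B = rng.standard_normal((N, r, d))        # per-request LoRA B matrices

# Naive approach: one tiny x @ A_i @ B_i matmul per request.
# Each is far too small to saturate a GPU's compute units.
naive = np.stack([X[i] @ A[i] @ B[i] for i in range(N)])

# Grouped (SGMV-style) approach: all N low-rank products computed
# as one batched operation, increasing the degree of parallelism.
batched = np.einsum('nr,nrd->nd', np.einsum('nd,ndr->nr', X, A), B)

# Both paths produce identical results; only the execution shape differs.
assert np.allclose(naive, batched)
```

On a GPU, the grouped form lets the hardware schedule all N small low-rank products concurrently, which is where the Distinct-case speedup comes from until compute is saturated.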
Thanks! I'll read the paper later.
Closing for inactivity. Feel free to reopen if you have any more questions.
Great project! It's very useful in industry applications.

In this example (https://github.com/punica-ai/punica/blob/master/examples/tui-multi-lora.py), all requests are the same across the different LoRA models, so the computational efficiency gained via SGMV is easy to understand.

But how can "Distinct" requests for different LoRA models still yield a performance boost? That is, the case where each request targets a different LoRA model.