punica-ai / punica

Serving multiple LoRA finetuned LLM as one
https://arxiv.org/abs/2310.18547
Apache License 2.0

A question on serving multiple-lora models in one server #24

Closed · zihaolucky closed this 6 months ago

zihaolucky commented 7 months ago

Great project! It's very useful for industry applications.

In this example (https://github.com/punica-ai/punica/blob/master/examples/tui-multi-lora.py), all requests are the same across the different LoRA models, so the computational efficiency gained via SGMV is easy to understand.

But how can "Distinct" requests, where each request goes to a different LoRA model, still get a performance boost?

abcdabcd987 commented 7 months ago

Thanks for the kind words!

First, a quick clarification: tui-multi-lora.py demonstrates a mixed usage of 6 requests on 4 models. There are 3 demo applications (gsm8k, sqlctx, and viggo). Each app runs in two different ways:

  1. Run the LoRA finetuned model
  2. Run the original base model, with some prompt learning.

Therefore you get 3*2=6 requests in the batch, and 3+1=4 models (3 finetuned models + 1 zero-weight LoRA, i.e., the base model).
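To make that composition concrete, here's a rough Python sketch of how the batch is put together; the names are illustrative only, not the actual code in tui-multi-lora.py:

```python
# Hypothetical sketch of the batch composition in tui-multi-lora.py.
# Each request carries a LoRA identifier; None stands for the
# zero-weight LoRA slot, i.e., running the base model only.
apps = ["gsm8k", "sqlctx", "viggo"]

requests = []
for app in apps:
    requests.append({"app": app, "lora": app})   # LoRA-finetuned model
    requests.append({"app": app, "lora": None})  # base model + prompting

# 6 requests in the batch, touching 3 finetuned LoRAs + 1 base model = 4 models.
print(len(requests), len({r["lora"] for r in requests}))
```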

As for why there is a performance gain in the Distinct case (i.e., N requests for N different LoRA models), I'd say it mostly comes from utilizing more compute units. x@A@B is a small computation that cannot fully utilize the GPU on its own. Batching increases the degree of parallelism and thus improves performance. However, this free lunch doesn't last forever, as shown in Figures 8 and 9 in our paper. For more analysis, see Figure 7 (the roofline plot) and the related text in Section 7.1.
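For intuition on where the gain comes from, here's a minimal plain-PyTorch sketch of the per-request computation that SGMV fuses into one kernel. This is not the punica kernel itself; the shapes and sizes are illustrative:

```python
import torch

# Illustrative sizes: hidden dim, LoRA rank, number of requests/adapters.
h, r, n = 4096, 16, 8
device = "cuda" if torch.cuda.is_available() else "cpu"

# One (A_i, B_i) low-rank pair per LoRA adapter.
A = torch.randn(n, h, r, device=device) * 0.01  # shrink projection
B = torch.randn(n, r, h, device=device) * 0.01  # expand projection
x = torch.randn(n, h, device=device)            # one token per request

# Naive: n tiny matmuls, each far too small to saturate the GPU.
y_loop = torch.stack([x[i] @ A[i] @ B[i] for i in range(n)])

# Batched: the same math expressed as batched matmuls, so all n
# adapters are processed in parallel (conceptually what SGMV does,
# minus the gather/scatter and kernel fusion).
y_bmm = torch.bmm(torch.bmm(x.unsqueeze(1), A), B).squeeze(1)

assert torch.allclose(y_loop, y_bmm, atol=1e-4)
```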

zihaolucky commented 7 months ago

Thanks! I'll read the paper later.

abcdabcd987 commented 6 months ago

Closing for inactivity. Feel free to reopen if you have any more questions.