punica-ai / punica

Serving multiple LoRA finetuned LLM as one
https://arxiv.org/abs/2310.18547
Apache License 2.0

Why is punica faster than the pretrained LLM? #7

Closed shushengyuan closed 7 months ago

shushengyuan commented 7 months ago

I notice that in your figure, LoRA with SGMV is even faster than the pretrained LLM. It looks like punica is faster than an LLM with no LoRA. Can you explain the details of the experiment and evaluation?

(figure: backbone vs. LoRA latency benchmark)
abcdabcd987 commented 7 months ago

Good question. One thing to clarify here is that in this figure, "LoRA" only computes the LoRA add-on part (Xi @ Ai @ Bi), not the base model (X @ W). So you should think of it as the overhead of LoRA.

More concretely, X @ W would be something like [32, 4096] @ [4096, 11008], whereas LoRA would be something like [32, 1, 4096] @ [32, 4096, 16] @ [32, 16, 11008]. The latter involves far less computation because its inner dimension is the LoRA rank (16) rather than the hidden size (4096).
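As a rough sanity check of those shapes, here is a minimal sketch in plain PyTorch (not punica's kernels; the tensor names and sizes are just taken from the comment above for illustration):

```python
import torch

batch, hidden, out, rank = 32, 4096, 11008, 16

# Base model projection: one big matmul shared by all requests.
X = torch.randn(batch, hidden)
W = torch.randn(hidden, out)
base = X @ W                               # [32, 11008]

# LoRA add-on: each request i has its own low-rank pair (A_i, B_i).
Xi = torch.randn(batch, 1, hidden)         # [32, 1, 4096]
A = torch.randn(batch, hidden, rank)       # [32, 4096, 16]
B = torch.randn(batch, rank, out)          # [32, 16, 11008]
addon = (Xi @ A) @ B                       # [32, 1, 11008]

# FLOP count: the inner dimension drops from 4096 to the LoRA rank 16.
base_flops = 2 * batch * hidden * out                    # ~2.9 GFLOPs
addon_flops = 2 * batch * (hidden * rank + rank * out)   # ~0.015 GFLOPs
print(base_flops / addon_flops)                          # ~187x fewer FLOPs for the add-on
```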

shushengyuan commented 7 months ago

Thank you, it's easy to misunderstand at first glance.

venky1306 commented 3 months ago

Hi @abcdabcd987, does "LoRA with loop" mean that the LoRA matrices for the different LoRA adapters are computed in a loop, one at a time? Thanks

abcdabcd987 commented 2 months ago

> Hi @abcdabcd987, does "LoRA with loop" mean that the LoRA matrices for the different LoRA adapters are computed in a loop, one at a time? Thanks

Correct. See: https://github.com/punica-ai/punica/blob/591b59899f0a20760821785d06b331c8a2e5cb86/benchmarks/bench_backbone_vs_lora.py#L38-L40
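For intuition, here is a rough sketch in plain PyTorch of the difference between the looped and the batched formulation (punica's actual SGMV kernel fuses the per-request gather and matmuls into one GPU kernel; see the linked benchmark file for the real code, the functions below are made-up names for illustration):

```python
import torch

def lora_loop(X, As, Bs):
    """Looped variant: apply each request's LoRA pair one at a time.

    X:  [batch, hidden]  -- one row per request
    As: list of [hidden, rank] tensors, one adapter per request
    Bs: list of [rank, out] tensors, one adapter per request
    """
    outs = []
    for i in range(X.size(0)):
        outs.append(X[i : i + 1] @ As[i] @ Bs[i])  # [1, out]
    return torch.cat(outs, dim=0)                  # [batch, out]

def lora_batched(X, A, B):
    """Batched variant: a single pair of batched matmuls over all requests.

    A: [batch, hidden, rank], B: [batch, rank, out]
    """
    return (X.unsqueeze(1) @ A @ B).squeeze(1)     # [batch, out]
```

The loop launches a pair of tiny matmuls per request, so its cost grows with the number of distinct adapters in the batch; the batched form keeps the whole batch in one launch, which is what makes the LoRA overhead stay small.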