Closed shushengyuan closed 7 months ago
Good question. One thing to clarify here is that in this figure, "LoRA" only computes the LoRA add-on part (`Xi @ Ai @ Bi`), not the base model (`X @ W`). So you should think of it as the overhead of LoRA.
More concretely, `X @ W` would be something like `[32, 4096] @ [4096, 11008]`, whereas LoRA would be something like `[32, 1, 4096] @ [32, 4096, 16] @ [32, 16, 11008]`. The latter has far less computation, since it contracts through the rank (16) instead of the hidden dimension (4096).
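The shape comparison above can be sketched numerically. This is a minimal NumPy illustration (not Punica's actual CUDA implementation); the per-request adapter layout and the rough multiply counts are assumptions based on the shapes quoted in the comment:

```python
import numpy as np

# Illustrative shapes from the discussion: 32 tokens, hidden dim 4096,
# intermediate dim 11008, LoRA rank 16.
b, h, d, r = 32, 4096, 11008, 16

rng = np.random.default_rng(0)
X = rng.standard_normal((b, h), dtype=np.float32)
W = rng.standard_normal((h, d), dtype=np.float32)
# One (A_i, B_i) adapter pair per request, as in the batched-LoRA view.
A = rng.standard_normal((b, h, r), dtype=np.float32)
B = rng.standard_normal((b, r, d), dtype=np.float32)

base = X @ W                        # [32, 11008] -- the backbone matmul
lora = (X[:, None, :] @ A) @ B     # [32, 1, 11008] -- the LoRA add-on
out = base + lora[:, 0, :]

# Rough multiply-add counts: the backbone contracts over h = 4096,
# while LoRA contracts h down to r = 16 and then expands r to d.
base_flops = b * h * d
lora_flops = b * (h * r + r * d)
print(base_flops / lora_flops)      # backbone is well over 100x more work
```

This is why the LoRA add-on shows up as a small overhead on top of the backbone rather than a comparable cost.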
Thank you, it's easy to misunderstand at first glance.
Hi @abcdabcd987, does "LoRA with Loop" mean that the LoRA matrices for the different LoRA adapters are computed in a loop, i.e., one at a time? Thanks
Correct. See: https://github.com/punica-ai/punica/blob/591b59899f0a20760821785d06b331c8a2e5cb86/benchmarks/bench_backbone_vs_lora.py#L38-L40
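For intuition, the loop baseline versus the batched view can be sketched like this (a NumPy toy, not the linked benchmark code; the shapes and per-request adapter layout are assumptions):

```python
import numpy as np

# "LoRA with Loop": each request i has its own adapter (A_i, B_i),
# applied one at a time. Shapes here are small and illustrative.
b, h, d, r = 4, 64, 128, 8
rng = np.random.default_rng(0)
X = rng.standard_normal((b, h), dtype=np.float32)
A = rng.standard_normal((b, h, r), dtype=np.float32)
B = rng.standard_normal((b, r, d), dtype=np.float32)

# Loop: one pair of tiny GEMMs per request -> many small kernel
# launches on a GPU, which is what makes this baseline slow.
loop_out = np.stack([X[i] @ A[i] @ B[i] for i in range(b)])

# Batched form (what a fused kernel like SGMV computes in one launch);
# mathematically identical to the loop.
batched_out = (X[:, None, :] @ A @ B)[:, 0, :]

assert np.allclose(loop_out, batched_out, atol=1e-4)
```

The two produce the same result; the difference is purely in how the work is scheduled on the GPU.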
I notice in your figure that LoRA with SGMV is even faster than the pretrained LLM; it looks like Punica is faster than the LLM with no LoRA. Can you explain the details of the experiment and evaluation?