jcao-ai closed this issue 6 months ago.
I think it is somewhat reasonable.
30 us matches the measurement in this figure: https://github.com/punica-ai/punica/blob/master/assets/backbone-vs-sgmv.png
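For reference, here is a rough pure-PyTorch sketch of what the SGMV path computes (the names, shapes, and per-request loop here are my illustrative assumptions, not punica's actual kernel API): each request in the batch gathers its own LoRA pair, so the marginal cost on top of the backbone matmul is one small low-rank product per request.

```python
import torch

def sgmv_reference(x: torch.Tensor,        # (batch, h_in) input activations
                   A_stack: torch.Tensor,  # (num_loras, rank, h_in)
                   B_stack: torch.Tensor,  # (num_loras, h_out, rank)
                   idx: torch.Tensor):     # (batch,) LoRA index per request
    # Illustrative loop: each request i gathers its own (A, B) pair and
    # applies the low-rank update. The real SGMV kernel fuses this into
    # a single segmented GPU kernel instead of a Python-level loop.
    out = torch.zeros(x.size(0), B_stack.size(1), device=x.device, dtype=x.dtype)
    for i in range(x.size(0)):
        a, b = A_stack[idx[i]], B_stack[idx[i]]
        out[i] = (x[i] @ a.t()) @ b.t()
    return out
```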
Punica enables serving multiple LoRA models at the cost of one LoRA model, not zero. (Although technically, if you are really just serving one LoRA model, you could merge the weight back to the base model, making it zero :)
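For the single-adapter case, the merge is just folding the low-rank product into the dense weight. A minimal sketch (parameter names and the `scaling` factor are assumptions for illustration, not a specific API):

```python
import torch

def merge_lora_into_base(W: torch.Tensor,       # (h_out, h_in) base weight
                         lora_A: torch.Tensor,  # (rank, h_in)
                         lora_B: torch.Tensor,  # (h_out, rank)
                         scaling: float = 1.0) -> torch.Tensor:
    # W' = W + scaling * (B @ A); the merged layer then runs as one plain
    # dense matmul, so the LoRA overhead at inference time is zero.
    return W + scaling * (lora_B @ lora_A)
```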
Hope this clears up your confusion.
Thanks for providing this measurement to cross-check 👍
Cool, it makes sense.
Hi @abcdabcd987 @yzh119, thanks again for this great project.
I observed that the profiled prediction time is about 60% longer than with the bare base model (no LoRA adapters).
Runtime info:

- rank: 32
- lora_modules:

Following is the profiling info; each decoding task is composed of 5 decoding steps.
LoRA Inference: [profiling trace screenshot]
Bare Model Inference: [profiling trace screenshot]
It's about 60% slower when equipped with this LoRA adapter. Kind of curious: is this expected? :)
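For reference, here is a minimal sketch of the kind of micro-benchmark one could use to isolate the per-layer LoRA overhead (the hidden size, rank, dtype, and iteration counts below are arbitrary assumptions, not my actual model config):

```python
import torch

def time_fn(fn, iters=100):
    # CUDA-event timing with warm-up; returns average milliseconds per call.
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    for _ in range(10):  # warm-up
        fn()
    torch.cuda.synchronize()
    start.record()
    for _ in range(iters):
        fn()
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters

h, r = 4096, 32  # made-up hidden size; rank matches the 32 reported above
x = torch.randn(1, h, device="cuda", dtype=torch.float16)
W = torch.randn(h, h, device="cuda", dtype=torch.float16)
A = torch.randn(r, h, device="cuda", dtype=torch.float16)
B = torch.randn(h, r, device="cuda", dtype=torch.float16)

base_ms = time_fn(lambda: x @ W.t())                       # backbone only
lora_ms = time_fn(lambda: x @ W.t() + (x @ A.t()) @ B.t()) # backbone + LoRA
print(f"base: {base_ms:.3f} ms  lora: {lora_ms:.3f} ms  "
      f"overhead: {(lora_ms / base_ms - 1) * 100:.1f}%")
```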