punica-ai / punica

Serving multiple LoRA finetuned LLM as one
https://arxiv.org/abs/2310.18547
Apache License 2.0

Question about performance #27

Closed: jcao-ai closed this issue 6 months ago

jcao-ai commented 6 months ago

Hi guys, @abcdabcd987 @yzh119 Thanks again for this great project.

I observed that the profiled prediction time is about 60% longer than that of the bare base model (without LoRA adapters).

Runtime info:

The following is the profiling info; each decoding task is composed of 5 decoding steps.

LoRA Inference

INFO:root:Time taken is 0.18477 seconds. 1 decoding tasks, 0 prefill tasks, 0 delayed prefill tasks.
INFO:root:Time taken is 0.16798 seconds. 1 decoding tasks, 0 prefill tasks, 0 delayed prefill tasks.
INFO:root:Time taken is 0.16362 seconds. 1 decoding tasks, 0 prefill tasks, 0 delayed prefill tasks.
INFO:root:Time taken is 0.16338 seconds. 1 decoding tasks, 0 prefill tasks, 0 delayed prefill tasks.
INFO:root:Time taken is 0.16323 seconds. 1 decoding tasks, 0 prefill tasks, 0 delayed prefill tasks.

Bare Model Inference

INFO:root:Time taken is 0.12258 seconds. 1 decoding tasks, 0 prefill tasks, 0 delayed prefill tasks.
INFO:root:Time taken is 0.10321 seconds. 1 decoding tasks, 0 prefill tasks, 0 delayed prefill tasks.
INFO:root:Time taken is 0.10180 seconds. 1 decoding tasks, 0 prefill tasks, 0 delayed prefill tasks.
INFO:root:Time taken is 0.10160 seconds. 1 decoding tasks, 0 prefill tasks, 0 delayed prefill tasks.
INFO:root:Time taken is 0.10191 seconds. 1 decoding tasks, 0 prefill tasks, 0 delayed prefill tasks.
INFO:root:Time taken is 0.10187 seconds. 1 decoding tasks, 0 prefill tasks, 0 delayed prefill tasks.
INFO:root:Time taken is 0.10196 seconds. 1 decoding tasks, 0 prefill tasks, 0 delayed prefill tasks.
INFO:root:Time taken is 0.10186 seconds. 1 decoding tasks, 0 prefill tasks, 0 delayed prefill tasks.

It's about 60% slower when equipped with this LoRA adapter. I'm curious whether this is expected? :)
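
(For anyone reproducing these numbers: a minimal sketch of how a window of decode steps can be timed to produce log lines like the above; `model` and `decode_step` are placeholders, not punica's actual API.)

```python
import logging
import time

import torch

logging.basicConfig(level=logging.INFO)

@torch.inference_mode()
def time_decode_steps(model, step_input, num_steps=5):
    """Time a window of decode steps, mirroring the log lines above."""
    torch.cuda.synchronize()  # don't count GPU work queued before the window
    t0 = time.perf_counter()
    for _ in range(num_steps):
        step_input = model.decode_step(step_input)  # placeholder per-token decode call
    torch.cuda.synchronize()  # wait for all kernels launched in the loop
    elapsed = time.perf_counter() - t0
    logging.info("Time taken is %.5f seconds. 1 decoding tasks, "
                 "0 prefill tasks, 0 delayed prefill tasks.", elapsed)
    return elapsed
```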

abcdabcd987 commented 6 months ago

I think it is somewhat reasonable.

The 30 µs overhead matches the measurement in this figure: https://github.com/punica-ai/punica/blob/master/assets/backbone-vs-sgmv.png
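
(For readers: what SGMV adds on top of each backbone projection is a rank-r delta; a rough sketch with illustrative shapes below, not punica's actual kernel API.)

```python
import torch

h, r, n = 4096, 16, 1  # hidden size, LoRA rank, decode tokens in the batch (illustrative)
x = torch.randn(n, h, device="cuda", dtype=torch.float16)
W = torch.randn(h, h, device="cuda", dtype=torch.float16)  # backbone projection weight
A = torch.randn(h, r, device="cuda", dtype=torch.float16)  # LoRA down-projection
B = torch.randn(r, h, device="cuda", dtype=torch.float16)  # LoRA up-projection

y = x @ W          # backbone matmul
y += (x @ A) @ B   # rank-r LoRA delta: the extra work SGMV performs per projection
```

Because r is tiny compared to the hidden size, this delta is cheap but not free, which is the roughly constant per-projection gap the figure shows.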

Punica enables serving multiple LoRA models at the cost of one LoRA model, not zero. Although technically, if you are really serving just one LoRA model, you could merge the weights back into the base model, making the overhead zero. :)
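
For completeness, merging a single adapter is a one-time weight update; a minimal sketch (tensor shapes and the `scaling` factor are illustrative, not punica's API):

```python
import torch

def merge_lora_into_base(W: torch.Tensor, A: torch.Tensor, B: torch.Tensor,
                         scaling: float = 1.0) -> torch.Tensor:
    """Fold one LoRA adapter into the base projection weight.

    W: (out_features, in_features) base weight
    A: (r, in_features)            LoRA down-projection
    B: (out_features, r)           LoRA up-projection
    The merged weight satisfies x @ W_merged.T == x @ W.T + scaling * (x @ A.T) @ B.T,
    so decoding runs at exactly the bare model's cost.
    """
    return W + scaling * (B @ A)
```

The trade-off is that a merged model serves only that one adapter; you give up batching requests for different adapters together, which is the case Punica is built for.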

Hope this clears up the confusion.

Thanks for providing this measurement to cross-check. 👍

jcao-ai commented 6 months ago

Cool, it makes sense.