punica-ai / punica

Serving multiple LoRA finetuned LLM as one
https://arxiv.org/abs/2310.18547
Apache License 2.0

Question about performance #27

Closed: jcao-ai closed this issue 6 months ago

jcao-ai commented 6 months ago

Hi guys, @abcdabcd987 @yzh119 Thanks again for this great project.

I observed that the profiled prediction time is about 60% longer than that of the bare base model (without LoRA adapters).

Runtime info:

The following is the profiling info; each decoding task is composed of 5 decoding steps.

LoRA Inference

INFO:root:Time taken is 0.18477 seconds. 1 decoding tasks, 0 prefill tasks, 0 delayed prefill tasks.
INFO:root:Time taken is 0.16798 seconds. 1 decoding tasks, 0 prefill tasks, 0 delayed prefill tasks.
INFO:root:Time taken is 0.16362 seconds. 1 decoding tasks, 0 prefill tasks, 0 delayed prefill tasks.
INFO:root:Time taken is 0.16338 seconds. 1 decoding tasks, 0 prefill tasks, 0 delayed prefill tasks.
INFO:root:Time taken is 0.16323 seconds. 1 decoding tasks, 0 prefill tasks, 0 delayed prefill tasks.

Bare Model Inference

INFO:root:Time taken is 0.12258 seconds. 1 decoding tasks, 0 prefill tasks, 0 delayed prefill tasks.
INFO:root:Time taken is 0.10321 seconds. 1 decoding tasks, 0 prefill tasks, 0 delayed prefill tasks.
INFO:root:Time taken is 0.10180 seconds. 1 decoding tasks, 0 prefill tasks, 0 delayed prefill tasks.
INFO:root:Time taken is 0.10160 seconds. 1 decoding tasks, 0 prefill tasks, 0 delayed prefill tasks.
INFO:root:Time taken is 0.10191 seconds. 1 decoding tasks, 0 prefill tasks, 0 delayed prefill tasks.
INFO:root:Time taken is 0.10187 seconds. 1 decoding tasks, 0 prefill tasks, 0 delayed prefill tasks.
INFO:root:Time taken is 0.10196 seconds. 1 decoding tasks, 0 prefill tasks, 0 delayed prefill tasks.
INFO:root:Time taken is 0.10186 seconds. 1 decoding tasks, 0 prefill tasks, 0 delayed prefill tasks.

It's about 60% slower when equipped with this LoRA adapter. I'm curious whether this is expected? :)
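
(For anyone reproducing these numbers: a minimal sketch of how a window of decode steps can be timed to produce log lines like the above; `model` and `decode_step` are placeholders, not punica's actual API.)

```python
import logging
import time

import torch

logging.basicConfig(level=logging.INFO)

@torch.inference_mode()
def time_decode_steps(model, step_input, num_steps=5):
    """Time a window of decode steps, mirroring the log lines above."""
    torch.cuda.synchronize()  # don't count GPU work queued before the window
    t0 = time.perf_counter()
    for _ in range(num_steps):
        step_input = model.decode_step(step_input)  # placeholder per-token decode call
    torch.cuda.synchronize()  # wait for all kernels launched in the loop
    elapsed = time.perf_counter() - t0
    logging.info("Time taken is %.5f seconds. 1 decoding tasks, "
                 "0 prefill tasks, 0 delayed prefill tasks.", elapsed)
    return elapsed
```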

abcdabcd987 commented 6 months ago

I think it is somewhat reasonable.

The 30 µs overhead matches the measurement in this figure: https://github.com/punica-ai/punica/blob/master/assets/backbone-vs-sgmv.png
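
(For readers: what SGMV adds on top of each backbone projection is a rank-r delta; a rough sketch with illustrative shapes below, not punica's actual kernel API.)

```python
import torch

h, r, n = 4096, 16, 1  # hidden size, LoRA rank, decode tokens in the batch (illustrative)
x = torch.randn(n, h, device="cuda", dtype=torch.float16)
W = torch.randn(h, h, device="cuda", dtype=torch.float16)  # backbone projection weight
A = torch.randn(h, r, device="cuda", dtype=torch.float16)  # LoRA down-projection
B = torch.randn(r, h, device="cuda", dtype=torch.float16)  # LoRA up-projection

y = x @ W          # backbone matmul
y += (x @ A) @ B   # rank-r LoRA delta: the extra work SGMV performs per projection
```

Because r is tiny compared to the hidden size, this delta is cheap but not free, which is the roughly constant per-projection gap the figure shows.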

Punica enables serving multiple LoRA models at the cost of one LoRA model, not zero. Although technically, if you are really serving just one LoRA model, you could merge the weights back into the base model, making the overhead zero. :)
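
For completeness, merging a single adapter is a one-time weight update; a minimal sketch (tensor shapes and the `scaling` factor are illustrative, not punica's API):

```python
import torch

def merge_lora_into_base(W: torch.Tensor, A: torch.Tensor, B: torch.Tensor,
                         scaling: float = 1.0) -> torch.Tensor:
    """Fold one LoRA adapter into the base projection weight.

    W: (out_features, in_features) base weight
    A: (r, in_features)            LoRA down-projection
    B: (out_features, r)           LoRA up-projection
    The merged weight satisfies x @ W_merged.T == x @ W.T + scaling * (x @ A.T) @ B.T,
    so decoding runs at exactly the bare model's cost.
    """
    return W + scaling * (B @ A)
```

The trade-off is that a merged model serves only that one adapter; you give up batching requests for different adapters together, which is the case Punica is built for.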

Hope this clears up the confusion.

Thanks for providing this measurement to cross-check. 👍

jcao-ai commented 6 months ago

Cool, it makes sense.