punica-ai / punica

Serving multiple LoRA finetuned LLM as one
https://arxiv.org/abs/2310.18547
Apache License 2.0
883 stars 40 forks

The smallest rank supported is 16? #17

Closed jcao-ai closed 7 months ago

jcao-ai commented 7 months ago

Hi, grateful for this nice work.

I'm a bit confused: it seems that only ranks of 16, 32, and 64 are supported right now? ref

abcdabcd987 commented 7 months ago

Correct. Would you mind clarifying your use case? I'm interested to learn.

jcao-ai commented 7 months ago

> Correct. Would you mind clarifying your use case? I'm interested to learn.

Thanks for your quick response. We are actually interested in serving LoRA adapters with varying ranks, from 8 up to 128, at the same time. By the way, your implementation is quite neat and clean.

abcdabcd987 commented 7 months ago

Thanks for the kind words. Kudos to @yzh119

Currently, the rank needs to be either 16 or a multiple of 32, and all models in a batch need to have the same rank.

If I understand correctly, you have two use cases that current Punica doesn't support.

  1. Run with ranks other than 16, 32, 64.
  2. Run different ranks in one batch.

Here's how you can mitigate this today if you don't want to be blocked on me (it's essentially the same solution to both problems):

In the case of other ranks, my current suggestion is to round up the rank to the closest bin (e.g., 12 -> 16, 17 -> 32, 45 -> 64). In the case of running different ranks in one batch, my suggestion is to round up to the biggest rank. I haven't tested r=128, but I'd say if you round up all models to r=64, you'll see negligible overhead. This is my educated guess; I'll run a benchmark soon.

Here's why:

Figure 9 in our paper shows the latency difference across different ranks. The gist is that the latency difference is not that big, especially when a model is used more than once in the batch.

More importantly, Figure 10 shows that the SGMV latency difference gets dwarfed by the other parts of the transformer layer (self-attention and the dense projections of the base model). That's why I'd expect negligible overhead if you round up all models to r=64.
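If you want to try the padding workaround yourself in the meantime, here's a minimal sketch. Note that pad_lora, round_up_rank, and SUPPORTED_RANKS are hypothetical helpers for illustration, not part of Punica's API, and the sketch assumes the usual LoRA factorization delta_W = B @ A with A of shape (r, d_in) and B of shape (d_out, r):

```python
import torch

# Assumption: supported rank bins are 16 or multiples of 32.
SUPPORTED_RANKS = (16, 32, 64, 96, 128)

def round_up_rank(r: int) -> int:
    """Smallest supported rank >= r, e.g. 12 -> 16, 17 -> 32, 45 -> 64."""
    for cand in SUPPORTED_RANKS:
        if cand >= r:
            return cand
    raise ValueError(f"rank {r} exceeds the largest supported bin")

def pad_lora(A: torch.Tensor, B: torch.Tensor, target_rank: int = None):
    """Zero-pad the rank dimension of (A, B) up to a supported rank.

    The padded rows of A and padded columns of B are zero, so
    B_pad @ A_pad == B @ A and the adapter's outputs are unchanged.
    """
    r = A.shape[0]
    target = target_rank if target_rank is not None else round_up_rank(r)
    A_pad = torch.zeros(target, A.shape[1], dtype=A.dtype, device=A.device)
    B_pad = torch.zeros(B.shape[0], target, dtype=B.dtype, device=B.device)
    A_pad[:r] = A
    B_pad[:, :r] = B
    return A_pad, B_pad
```

Because the padded rows and columns are zero, the product B @ A is mathematically unchanged; you only pay for the extra (small) compute of the padded rank.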

My TODO would be:

  1. Do some benchmarks to assure users that this mitigation is not slow.
  2. Improve API so that users don't need to implement this mitigation.

jcao-ai commented 7 months ago

@abcdabcd987 Thanks a lot for your explanation. Padding adapters to the same shape is indeed a workaround. I will give it a try soon.

Furthermore, I have a question about BGMV vs. SGMV. There is another project named S-LoRA, which is based on this project; they seem to stick with the BGMV pattern and report quite good throughput.

I also found some discussion in this repo about this topic.

Which pattern/path do you think is better for production deployment? Thanks again.

abcdabcd987 commented 7 months ago

BGMV was our first attempt at multi-LoRA serving; it assumes each input in the batch is for a different model. Although it was already a very good fit for that particular use case, we identified some limitations of BGMV and some further opportunities:

  1. Using BGMV for prefill is inefficient.
  2. The BGMV speedup primarily comes from utilizing more compute units for batches of small computations. This free lunch has a limit.
  3. I believe there are use cases in the decode stage where multiple inputs map to the same LoRA model. For example, AAABCC is a batch of size 6 but with only 3 LoRA models. When such sharing exists, the speedup comes from one more dimension, i.e., increased arithmetic intensity (Figure 7 in our paper). As such, the free lunch extends to even higher batch sizes (Figures 8 and 9).

That's why we created the SGMV kernel. BGMV semantics are a strict subset of SGMV semantics, and @yzh119 found a few cool tricks such that SGMV runs faster than BGMV even under the same semantics. Even if your use case is 100% BGMV semantics, we'd still recommend using SGMV; just treat it as a faster version of BGMV with a better API. I'll do some benchmarks to clear up the confusion.
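To make the relationship concrete, here's a rough pure-PyTorch sketch of the semantics. sgmv_ref, its segment arguments, and the shapes are illustrative assumptions, not Punica's actual kernel API:

```python
import torch

def sgmv_ref(x, weights, seg_start, seg_end):
    """Reference illustration of SGMV semantics (not the CUDA kernel):
    rows seg_start[i]:seg_end[i] of x are all multiplied by weights[i]."""
    y = torch.zeros(x.shape[0], weights[0].shape[1], dtype=x.dtype)
    for i, w in enumerate(weights):
        y[seg_start[i]:seg_end[i]] = x[seg_start[i]:seg_end[i]] @ w
    return y

# Batch "AAABCC": 6 inputs, 3 LoRA models; inputs sharing a model form one segment.
x = torch.randn(6, 4096)
wA, wB, wC = (torch.randn(4096, 64) for _ in range(3))
y = sgmv_ref(x, [wA, wB, wC], seg_start=[0, 3, 4], seg_end=[3, 4, 6])

# BGMV semantics is the special case where every segment has length 1.
y_bgmv = sgmv_ref(x, [wA, wA, wA, wB, wC, wC],
                  seg_start=[0, 1, 2, 3, 4, 5], seg_end=[1, 2, 3, 4, 5, 6])
```

The segment grouping is what buys the extra arithmetic intensity: a shared weight is loaded once per segment instead of once per input.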

I'm very glad that our idea and early implementation got recognized by the research community. Our early version (BGMV) was open sourced on 9/11. While we kept developing actively, we didn't push new commits to the OSS repo after 9/16. After we submitted the paper and cleaned up the code a little bit, we pushed the latest version (SGMV) on 11/2 (#1).