punica-ai / punica

Serving multiple LoRA finetuned LLM as one
https://arxiv.org/abs/2310.18547
Apache License 2.0

Reasons for switching to CUTLASS-based kernel instead of custom kernel #2

Open Yard1 opened 10 months ago

Yard1 commented 10 months ago

Hey folks, awesome and really impactful work with the repo and the paper.

I was wondering what the reason was for switching from the original BGMV kernel to a CUTLASS-based SGMV one. I understand that one advantage of SGMV is that it doesn't require the LoRA tensors to be in a single contiguous block of memory, but aside from that, are there any performance considerations that made you switch?

I can also see that there is a custom SGMV shrink kernel implementation, but the expand version is WIP. Is that something you are planning to work on in the near future?

Furthermore, do the performance results in the paper concern the CUTLASS kernel or the custom kernel? From the description of the implementation I inferred the latter, but I was confused by the lack of a custom expand kernel in the repo.

Thanks, and great work!

abcdabcd987 commented 10 months ago

Really good questions. And thanks for taking a close look at our code.

The semantics of BGMV and SGMV differ. BGMV was our first attempt [1]. It was designed for the case where every input in the batch is for a different LoRA model. We realized that the performance improvement of BGMV comes solely from utilizing more compute units, so it does not scale well; the "free lunch" has a limit. But if we consider other use cases, where not every input is for a different model, we can extend this free lunch to a much bigger batch size. You can see this difference in the roofline plot in the paper.
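Roughly, the semantic difference can be sketched in plain PyTorch like this (a simplified reference only; `bgmv_ref`, `sgmv_ref`, and the argument layout are illustrative, not the actual punica kernel signatures):

```python
import torch

def bgmv_ref(y, x, W, indices):
    """BGMV reference: each input in the batch may pick a different LoRA weight.

    x: [B, h_in], W: [num_loras, h_in, h_out] (stacked), indices: [B]
    """
    for i in range(x.size(0)):
        y[i] += x[i] @ W[indices[i]]
    return y

def sgmv_ref(y, x, weights, seg_starts, seg_ends):
    """SGMV reference: the batch is split into segments; all inputs in a
    segment share one LoRA weight, so that weight is reused across the segment.

    x: [B, h_in], weights: list of [h_in, h_out] tensors (need not be contiguous)
    """
    for s, (lo, hi) in enumerate(zip(seg_starts, seg_ends)):
        y[lo:hi] += x[lo:hi] @ weights[s]
    return y

# Tiny usage: 4 requests, 2 LoRA models, requests grouped by model.
B, h_in, h_out = 4, 8, 16
x = torch.randn(B, h_in)
W = torch.randn(2, h_in, h_out)
y1 = bgmv_ref(torch.zeros(B, h_out), x, W, torch.tensor([0, 0, 1, 1]))
y2 = sgmv_ref(torch.zeros(B, h_out), x, [W[0], W[1]], [0, 2], [2, 4])
assert torch.allclose(y1, y2)  # same result; SGMV just exploits the grouping
```

The point of the grouping is that one LoRA weight is loaded once and reused for the whole segment, which is where the extra arithmetic intensity comes from.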

You are correct that our BGMV implementation requires a contiguous weight block, but that is not our primary reason for using CUTLASS. We can pass pointers to hand-written kernels as well, just like we do for SGMV shrink. We used CUTLASS simply because we were curious about it and gave it a try. It turned out to work quite well for expand.
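To illustrate the layout point (again a simplified sketch, not the actual punica API): a BGMV-style launch expects one stacked weight tensor indexed per input, while an SGMV-style launch can be handed one weight reference per segment, so each LoRA's weights can stay in their own allocation.

```python
import torch

# BGMV-style layout: all LoRA weights copied/stacked into one contiguous
# tensor, indexed by an integer per input.
lora_0 = torch.randn(8, 16)
lora_1 = torch.randn(8, 16)
stacked = torch.stack([lora_0, lora_1])   # [num_loras, 8, 16], one memory block

# SGMV-style layout: one weight reference (in the kernel, a device pointer)
# per segment; the individual LoRA tensors can live in separate allocations.
per_segment_weights = [lora_0, lora_1]    # no copy into a contiguous buffer
weight_ptrs = [w.data_ptr() for w in per_segment_weights]
```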

However, we were not able to obtain an efficient shrink kernel from CUTLASS. We searched the whole configuration space, and it just does not work. We also briefly tried Triton, but it didn't work either, so we rolled our own CUDA implementation.

For the benchmarks, we use the CUTLASS expand and our hand-written shrink. We will release the hand-written expand soon.

[1] https://le.qun.ch/en/blog/2023/09/11/multi-lora-potentials/

Yard1 commented 10 months ago

Awesome, thank you! Looking forward to custom expand.