punica-ai / punica

Serving multiple LoRA finetuned LLMs as one
https://arxiv.org/abs/2310.18547
Apache License 2.0
883 stars 40 forks

flashinfer shrink vs cutlass #25

Closed: YLGH closed this issue 7 months ago

YLGH commented 7 months ago

Hi, I really enjoyed learning about SGMV.

I was grokking through the code and wanted to check my understanding. It seems that there are two implementations of SGMV: one based on CUTLASS grouped GEMM and a hand-written one (using some utils from flashinfer). Just wondering, how do the two compare in performance?
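For concreteness, this is the semantics I understand both implementations to compute, written as a plain-PyTorch sketch (the cumulative-offset layout of `s` is my reading of the paper, not necessarily punica's exact API):

```python
import torch

def sgmv_reference(y, x, ws, s):
    """Plain-PyTorch sketch of SGMV semantics (not punica's actual API).

    y:  [B, d_out] output, accumulated in place
    x:  [B, d_in]  stacked inputs for the whole batch
    ws: per-adapter weights, ws[i] of shape [d_in, d_out]
    s:  cumulative segment offsets; rows s[i]:s[i+1] use adapter i
    """
    for i in range(len(ws)):
        y[s[i]:s[i+1]] += x[s[i]:s[i+1]] @ ws[i]
    return y
```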

abcdabcd987 commented 7 months ago

Thanks for taking a close look!

We'll deprecate the cutlass implementation in the future. See discussions here: https://github.com/punica-ai/punica/issues/2
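If you want numbers for your own shapes in the meantime, a rough timing harness like this works (`shrink_custom` and `shrink_cutlass` below are hypothetical stand-ins for however the two shrink kernels are bound in your build):

```python
import torch

def bench(fn, *args, warmup=10, iters=100):
    """Average milliseconds per call, timed with CUDA events (a sketch)."""
    for _ in range(warmup):
        fn(*args)
    torch.cuda.synchronize()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        fn(*args)
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters

# Hypothetical stand-ins for the flashinfer-based and cutlass-based shrink
# kernels; substitute whatever bindings your build actually exposes.
# print(bench(shrink_custom, y, x, w_ptr, s))
# print(bench(shrink_cutlass, y, x, w_ptr, s))
```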

YLGH commented 7 months ago

Makes sense, thanks!

So it seems the recommendation would be to use the hand-written version for shrink:

https://github.com/punica-ai/punica/blob/master/csrc/sgmv_flashinfer/sgmv_flashinfer.cuh

and, in the meantime, use the cutlass-based version for expand: https://github.com/punica-ai/punica/blob/master/csrc/sgmv/sgmv_cutlass.cuh#L81C3-L81C3 ?

abcdabcd987 commented 7 months ago

Correct. Once we get time to push out the custom expand kernel, we'll deprecate cutlass. You can use punica.add_lora_sgmv_custom_cutlass() for LoRA for now.
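Roughly, usage looks like the sketch below. The argument order, tensor layouts, and dtypes here are assumptions from reading the code; treat punica/ops as the source of truth.

```python
import torch
import punica

num_loras, rank, h_in, h_out, batch = 4, 16, 4096, 4096, 8
dev, dt = "cuda", torch.float16

x = torch.randn(batch, h_in, device=dev, dtype=dt)   # inputs, rows grouped by adapter
y = torch.zeros(batch, h_out, device=dev, dtype=dt)  # output, accumulated in place

# One (A, B) pair per adapter. The kernels take raw data pointers so that
# each segment of the batch can read a different adapter's weights.
wa = [torch.randn(rank, h_in, device=dev, dtype=dt) for _ in range(num_loras)]
wb = [torch.randn(rank, h_out, device=dev, dtype=dt) for _ in range(num_loras)]
wa_ptr = torch.tensor([t.data_ptr() for t in wa], dtype=torch.int64, device=dev)
wb_ptr = torch.tensor([t.data_ptr() for t in wb], dtype=torch.int64, device=dev)

# Cumulative segment offsets: rows s[i]:s[i+1] of x use adapter i.
s = torch.tensor([0, 2, 4, 6, 8], dtype=torch.int32, device=dev)

# Custom (flashinfer-style) shrink into rank-r space, then cutlass expand:
#   y[s[i]:s[i+1]] += x[s[i]:s[i+1]] @ A_i.T @ B_i   (assumed semantics)
# Trailing args assumed to be layer_idx and lora_rank.
punica.add_lora_sgmv_custom_cutlass(y, x, wa_ptr, wb_ptr, s, 0, rank)
```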

Related: https://github.com/punica-ai/punica/issues/11

YLGH commented 7 months ago

Sounds great, thanks!

jcao-ai commented 7 months ago

@abcdabcd987 Can't wait for the customized version. So far we've been using the current version in production, and performance seems good for multi-LoRA deployment.

abcdabcd987 commented 7 months ago

@jcao-ai Glad that Punica got deployed and serves your use case :)