Closed — YLGH closed this issue 7 months ago
Thanks for taking a close look!
We'll deprecate the cutlass implementation in the future. See discussions here: https://github.com/punica-ai/punica/issues/2
Makes sense, thanks!
So it seems the recommendation would be to use the hand-written version for shrink:
https://github.com/punica-ai/punica/blob/master/csrc/sgmv_flashinfer/sgmv_flashinfer.cuh
and, in the meantime, the cutlass-based version for expand: https://github.com/punica-ai/punica/blob/master/csrc/sgmv/sgmv_cutlass.cuh#L81C3-L81C3 ?
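For anyone checking their understanding of the shrink/expand split discussed above, here is a pure-Python sketch of the SGMV (Segmented Gather Matrix-Vector) semantics. This is a simplified reference, not the actual kernels: the function names, shapes, and segment encoding are all hypothetical, and the real implementations run batched on GPU.

```python
# Pure-Python reference for SGMV semantics (hypothetical names/shapes,
# simplified from the CUDA kernels):
#   x      : (total_tokens, d_in) input activations, as nested lists
#   A[j]   : (d_in, r)   LoRA "shrink" weight for adapter j
#   B[j]   : (r, d_out)  LoRA "expand" weight for adapter j
#   seg    : segment boundaries; rows seg[j]..seg[j+1] use adapter j

def matmul(x, w):
    # naive (m, k) @ (k, n) matrix multiply
    m, k, n = len(x), len(w), len(w[0])
    return [[sum(x[i][t] * w[t][j] for t in range(k)) for j in range(n)]
            for i in range(m)]

def sgmv(x, weights, seg):
    # apply each segment's own weight matrix to its slice of rows
    out = []
    for j in range(len(seg) - 1):
        out.extend(matmul(x[seg[j]:seg[j + 1]], weights[j]))
    return out

def lora_delta(x, A, B, seg):
    # shrink (x @ A[j]) then expand (v @ B[j]), per segment
    return sgmv(sgmv(x, A, seg), B, seg)
```

The point of the segmented formulation is that requests for different LoRA adapters can be batched into one kernel launch instead of one GEMM per adapter.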
Correct. Once we get time to push out the custom expand kernel, we'll deprecate cutlass. You can use punica.add_lora_sgmv_custom_cutlass()
for LoRA for now.
Sounds great, thanks!
@abcdabcd987 Can't wait for the customized version. So far we've been using the current version in production, and performance seems good for multi-LoRA deployment.
@jcao-ai Glad that Punica got deployed and serves your use case :)
Hi, I really enjoyed learning about SGMV.
I was grokking through the code and wanted to check my understanding. It seems that there are two implementations of SGMV: one based on cutlass Grouped GEMM and another hand-written one (using some utils from flashinfer). Just wondering, how do the two compare in performance?