punica-ai / punica

Serving multiple LoRA finetuned LLM as one
https://arxiv.org/abs/2310.18547
Apache License 2.0
883 stars 40 forks

improve bgmv expand kernel performance #47

Closed yyccli closed 3 months ago

yyccli commented 3 months ago

Hi, sorry to bother you again. I'm using bgmv in our LLM serving system since the sgmv kernels aren't ready yet. I did some profiling on the bgmv kernels and found that the expand kernel performs worse than the shrink kernel. For example, take batch_size = 4096, hidden_dim = 4096, and lora_rank = 8. The performance profiled by ncu is shown below:

(ncu profiling screenshot: memory throughput of the expand vs. shrink kernel)

It seems the expand kernel's memory throughput is lower than the shrink kernel's. That makes some sense, because all memory operations in the expand kernel go through global memory. My question is: why doesn't the expand kernel use pipelined asynchronous memory copies from global to shared memory?
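For reference, here is a minimal NumPy sketch of what I understand the two kernels to compute. This is just the semantics, not the CUDA implementation; the function names and the `(num_loras, H, r)` / `(num_loras, r, H)` weight layouts are my assumptions and may differ from punica's actual storage order:

```python
import numpy as np

def bgmv_shrink(x, wa, indices):
    """Project each input row down to rank r with its own LoRA A matrix.

    x: (B, H) inputs; wa: (num_loras, H, r); indices: (B,) adapter ids.
    """
    B, H = x.shape
    r = wa.shape[2]
    v = np.empty((B, r), dtype=x.dtype)
    for i in range(B):
        v[i] = x[i] @ wa[indices[i]]   # (H,) @ (H, r) -> (r,)
    return v

def bgmv_expand(v, wb, y, indices):
    """Project each rank-r vector back up to H and accumulate into y.

    v: (B, r); wb: (num_loras, r, H); y: (B, H) output accumulator.
    """
    for i in range(v.shape[0]):
        y[i] += v[i] @ wb[indices[i]]  # (r,) @ (r, H) -> (H,)
    return y
```

The asymmetry is visible even here: shrink reduces over the large H dimension per output element, while expand reads only r elements of `v` per output element but still has to stream the full (r, H) B matrix and read-modify-write the (B, H) output in global memory.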

yyccli commented 3 months ago

After studying your code, I find that the expand kernel indeed cannot use the pipeline... :cold_sweat: Is there any way to improve it? Or maybe I should just go learn your sgmv kernels first. Thanks for the code, btw.
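For what it's worth, a back-of-envelope arithmetic-intensity estimate for the shapes above suggests the expand kernel is firmly bandwidth-bound, so hiding global-memory latency is the main lever. This assumes fp16 and no cache reuse of the B matrices across the batch (my assumption, since each row can hit a different adapter):

```python
# Rough arithmetic-intensity estimate for the shapes discussed in this thread
# (B=4096, H=4096, r=8, fp16). Purely illustrative back-of-envelope math.
B, H, r, bytes_per_el = 4096, 4096, 8, 2

flops = 2 * B * H * r  # one multiply-add per (batch, hidden, rank) triple
# expand kernel traffic: read v (B*r), stream B matrices (B*r*H, assuming
# no reuse across the batch), and read-modify-write y (2*B*H)
bytes_moved = bytes_per_el * (B * r + B * r * H + 2 * B * H)
intensity = flops / bytes_moved
print(f"{intensity:.2f} FLOP/byte")  # -> 0.80 FLOP/byte
```

At well under 1 FLOP/byte, any GPU with a typical FLOP:bandwidth ratio leaves the compute units idle waiting on global memory, which matches the throughput gap ncu reports.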