Hi, sorry to bother you again.
I'm using bgmv in our llm serving system since the sgmv kernels are not ready. I did some profiling on the bgmv kernels and found that the expand kernel performs worse than the shrink kernel.
For example, take batch_size = 4096, hidden_dim = 4096, and lora_rank = 8. The performance profiled by ncu is shown below:
It seems the expand kernel's memory throughput is lower than the shrink kernel's. That's reasonable, because all memory operations in the expand kernel go through global memory.
My question is: why doesn't the expand kernel use a pipelined asynchronous memory copy from global to shared memory?
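For reference, here is roughly what I mean by a pipelined async copy — just a minimal sketch using libcu++'s `cuda::pipeline` / `cuda::memcpy_async`, not your actual kernel. All the names (`scale_pipelined`, `TILE`, `STAGES`) are made up, and I assume `n` is a multiple of `TILE` and a single-block launch to keep it short:

```cuda
#include <cooperative_groups.h>
#include <cuda/pipeline>

namespace cg = cooperative_groups;

// Toy kernel: stream n floats through shared memory in STAGES-deep
// pipelined tiles, overlapping the cp.async load of tile t + STAGES
// with the compute (here just a scale) on tile t.
// NOTE: illustrative sketch only, not punica's bgmv code.
template <int TILE, int STAGES>
__global__ void scale_pipelined(const float* __restrict__ in,
                                float* __restrict__ out,
                                float alpha, int n) {
  __shared__ float smem[STAGES][TILE];
  auto block = cg::this_thread_block();
  __shared__ cuda::pipeline_shared_state<cuda::thread_scope_block, STAGES> state;
  auto pipe = cuda::make_pipeline(block, &state);

  const int num_tiles = n / TILE;  // assumes n % TILE == 0

  // Prime the pipeline: issue the first STAGES async copies up front.
  for (int t = 0; t < STAGES && t < num_tiles; ++t) {
    pipe.producer_acquire();
    cuda::memcpy_async(block, smem[t % STAGES], in + t * TILE,
                       sizeof(float) * TILE, pipe);
    pipe.producer_commit();
  }

  for (int t = 0; t < num_tiles; ++t) {
    // Wait for tile t to land in shared memory, then compute on it...
    pipe.consumer_wait();
    block.sync();
    for (int i = threadIdx.x; i < TILE; i += blockDim.x)
      out[t * TILE + i] = alpha * smem[t % STAGES][i];
    block.sync();
    pipe.consumer_release();

    // ...while the copy of tile t + STAGES overlaps with that compute.
    const int next = t + STAGES;
    if (next < num_tiles) {
      pipe.producer_acquire();
      cuda::memcpy_async(block, smem[next % STAGES], in + next * TILE,
                         sizeof(float) * TILE, pipe);
      pipe.producer_commit();
    }
  }
}
```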
After studying your code, I find that the expand kernel indeed cannot use the pipeline... :cold_sweat:
Is there any way to improve it...? Or maybe I should just turn to learning your sgmv kernels first.
Thanks for the code, btw!