Hi, sorry to bother you again.
I'm using bgmv in our llm serving system since the sgmv kernels are not ready. I did some profiling on the bgmv kernels and found that the expand kernel performs worse than the shrink kernel.
For example, take batch_size = 4096, hidden_dim = 4096, and lora_rank = 8. The performance profiled by ncu is shown below:
It seems the expand kernel's memory throughput is lower than the shrink kernel's. That's reasonable, because all memory operations in the expand kernel go through global memory.
My question is: why doesn't the expand kernel use a pipelined asynchronous memory copy from global to shared memory?
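For reference, here is roughly what I mean by a pipelined async copy — just a minimal sketch using libcu++'s `cuda::pipeline` / `cuda::memcpy_async`, not your actual kernel. All the names (`scale_pipelined`, `TILE`, `STAGES`) are made up, and I assume `n` is a multiple of `TILE` and a single-block launch to keep it short:

```cuda
#include <cooperative_groups.h>
#include <cuda/pipeline>

namespace cg = cooperative_groups;

// Toy kernel: stream n floats through shared memory in STAGES-deep
// pipelined tiles, overlapping the cp.async load of tile t + STAGES
// with the compute (here just a scale) on tile t.
// NOTE: illustrative sketch only, not punica's bgmv code.
template <int TILE, int STAGES>
__global__ void scale_pipelined(const float* __restrict__ in,
                                float* __restrict__ out,
                                float alpha, int n) {
  __shared__ float smem[STAGES][TILE];
  auto block = cg::this_thread_block();
  __shared__ cuda::pipeline_shared_state<cuda::thread_scope_block, STAGES> state;
  auto pipe = cuda::make_pipeline(block, &state);

  const int num_tiles = n / TILE;  // assumes n % TILE == 0

  // Prime the pipeline: issue the first STAGES async copies up front.
  for (int t = 0; t < STAGES && t < num_tiles; ++t) {
    pipe.producer_acquire();
    cuda::memcpy_async(block, smem[t % STAGES], in + t * TILE,
                       sizeof(float) * TILE, pipe);
    pipe.producer_commit();
  }

  for (int t = 0; t < num_tiles; ++t) {
    // Wait for tile t to land in shared memory, then compute on it...
    pipe.consumer_wait();
    block.sync();
    for (int i = threadIdx.x; i < TILE; i += blockDim.x)
      out[t * TILE + i] = alpha * smem[t % STAGES][i];
    block.sync();
    pipe.consumer_release();

    // ...while the copy of tile t + STAGES overlaps with that compute.
    const int next = t + STAGES;
    if (next < num_tiles) {
      pipe.producer_acquire();
      cuda::memcpy_async(block, smem[next % STAGES], in + next * TILE,
                         sizeof(float) * TILE, pipe);
      pipe.producer_commit();
    }
  }
}
```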
After studying your code, I find that the expand kernel indeed cannot use the pipeline... :cold_sweat:
Is there any way to improve it...? Or maybe I should just turn to learning your sgmv kernels first.
Thanks for the code, btw!