Hello Triton team, I did some quick profiling of the Triton matmul kernel (https://github.com/openai/triton/blob/main/python/triton/ops/matmul.py) using the PyTorch profiler.
I warmed up with tensors of the same size/dtype beforehand, so autotune and compile time are not included in the figure. Even so, the results seem to suggest that the kernel launch/dispatch overhead is several times the actual kernel running time.

1) Is this expected?
2) If so, what is the recommended way to get around this overhead? The two approaches I can think of are AOT compilation and CUDA graphs, but both have limitations. AOT compilation seems to require me to provide the glue code (PyTorch bindings, C wrapper, etc.) for every kernel. So I tried CUDA graphs instead: they do eliminate the overhead shown in the figure, but they require the tensors to stay at the same addresses and shapes, which makes them less feasible in practice. Any suggestions? Thanks!