Open JamesTheZ opened 1 year ago
I just want to measure the best performance of IREE's TensorCore codegen for the MatMul op. Is there a tool to tune and measure the performance of a single op?
There is warp-level tiling in IREE, but I am not clear on what you are looking for. @ThomasRaoux can maybe provide more details.
I am studying the TensorCore GEMM codegen of IREE, and I notice a big performance gap between IREE and cuBLAS. For example, when [M, N, K] is [1024, 512, 1024], I use the following script to run GEMM:
With the Nsight Compute tool, the measured duration is 62 us, while the cuBLAS version takes only 30.5 us. Is this the expected performance? (lowering_config = <tile_sizes = [[128, 128, 32]]> is the block-level tiling configuration, am I right? Are there other tuning factors that could speed up the GEMM codegen in IREE?)
I dug into the IREE passes for TensorCore GEMM codegen and only found block-level tiling. Is there warp-level, or even thread-level, tiling in the GEMM schedule, like in CUTLASS?
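To put the two timings in perspective, here is a minimal back-of-the-envelope sketch in Python. It is not part of IREE or cuBLAS; the helper names are hypothetical. It assumes the usual 2*M*N*K FLOP count for a GEMM, and it assumes that tile_sizes = [[128, 128, 32]] means a 128x128 output tile per thread block with a K step of 32, which is exactly the interpretation being asked about above and is not confirmed here.

```python
import math

def gemm_flops(m: int, n: int, k: int) -> int:
    """FLOP count of an MxNxK GEMM, counting one multiply and one add per MAC."""
    return 2 * m * n * k

def achieved_tflops(m: int, n: int, k: int, duration_us: float) -> float:
    """Achieved throughput in TFLOP/s for a kernel that ran for duration_us microseconds."""
    return gemm_flops(m, n, k) / (duration_us * 1e-6) / 1e12

def block_tile_counts(m: int, n: int, k: int, tile_m: int, tile_n: int, tile_k: int):
    """Number of block-level output tiles and K-loop steps per tile.

    Assumption: (tile_m, tile_n, tile_k) is the per-thread-block tile,
    i.e. the reading of tile_sizes = [[128, 128, 32]] discussed in the post.
    """
    num_blocks = math.ceil(m / tile_m) * math.ceil(n / tile_n)
    k_steps = math.ceil(k / tile_k)
    return num_blocks, k_steps

# Problem size and timings quoted in the post.
M, N, K = 1024, 512, 1024
iree_tflops = achieved_tflops(M, N, K, 62.0)
cublas_tflops = achieved_tflops(M, N, K, 30.5)
num_blocks, k_steps = block_tile_counts(M, N, K, 128, 128, 32)

print(f"IREE:   {iree_tflops:.1f} TFLOP/s")
print(f"cuBLAS: {cublas_tflops:.1f} TFLOP/s")
print(f"block tiles: {num_blocks}, K steps per tile: {k_steps}")
```

Under these assumptions the two timings correspond to roughly 17.3 and 35.2 TFLOP/s, and the 128x128 tile yields only 32 thread blocks for this problem size. If the GPU has more SMs than that, part of the device would sit idle, so the block-level tile size itself may be one of the tuning factors worth trying, alongside any warp-level tiling.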