So, I created two individual IRs :-
The dispatch 3011's compute IR (iree-benchmark-module yields 145 ms):
func.func @dispatch_3011(%4: tensor<640x5760xf16>, %5: tensor<2x5760x16384xf16>, %6: tensor<640xf16>) -> tensor<2x640x16384xf16> {
%cst = arith.constant 0.0 : f16
%7 = tensor.empty() : tensor<2x640x16384xf16>
%8 = linalg.fill ins(%cst : f16) outs(%7 : tensor<2x640x16384xf16>) -> tensor<2x640x16384xf16>
%9 = linalg.generic {indexing_maps = [affine_map<(d0, d1, d2, d3) -> (d1, d3)>,
affine_map<(d0, d1, d2, d3) -> (d0, d3, d2)>,
affine_map<(d0, d1, d2, d3) -> (d0, d1, d2)>],
iterator_types = ["parallel", "parallel", "parallel", "reduction"]}
ins(%4, %5 : tensor<640x5760xf16>, tensor<2x5760x16384xf16>)
outs(%8 : tensor<2x640x16384xf16>) {
^bb0(%in: f16, %in_0: f16, %out: f16):
%11 = arith.mulf %in, %in_0 : f16
%12 = arith.addf %11, %out : f16
linalg.yield %12 : f16
} -> tensor<2x640x16384xf16>
%10 = linalg.generic {indexing_maps = [affine_map<(d0, d1, d2) -> (d0, d1, d2)>,
affine_map<(d0, d1, d2) -> (d1)>,
affine_map<(d0, d1, d2) -> (d0, d1, d2)>],
iterator_types = ["parallel", "parallel", "parallel"]}
ins(%9, %6 : tensor<2x640x16384xf16>, tensor<640xf16>)
outs(%7 : tensor<2x640x16384xf16>) {
^bb0(%in: f16, %in_0: f16, %out: f16):
%11 = arith.addf %in, %in_0 : f16
linalg.yield %11 : f16
} -> tensor<2x640x16384xf16>
return %10 : tensor<2x640x16384xf16>
}
Since the first linalg.generic looks like a batch matmul (it computes C[b, m, n] += A[m, k] * B[b, k, n]), except that the first tensor doesn't have a batch dimension, just to experiment I created an analogous IR with linalg.batch_matmul (iree-benchmark-module yields 424 ms):
func.func @ideal(%0: tensor<2x640x5760xf16>, %1: tensor<2x5760x16384xf16>, %6: tensor<640xf16>) -> tensor<2x640x16384xf16> {
%cst = arith.constant 0.0 : f16
%7 = tensor.empty() : tensor<2x640x16384xf16>
%8 = linalg.fill ins(%cst : f16) outs(%7 : tensor<2x640x16384xf16>) -> tensor<2x640x16384xf16>
%3 = linalg.batch_matmul ins(%0, %1 : tensor<2x640x5760xf16>, tensor<2x5760x16384xf16>)
outs(%8 : tensor<2x640x16384xf16>) -> tensor<2x640x16384xf16>
%10 = linalg.generic {indexing_maps = [affine_map<(d0, d1, d2) -> (d0, d1, d2)>,
affine_map<(d0, d1, d2) -> (d1)>,
affine_map<(d0, d1, d2) -> (d0, d1, d2)>],
iterator_types = ["parallel", "parallel", "parallel"]}
ins(%3, %6 : tensor<2x640x16384xf16>, tensor<640xf16>)
outs(%7 : tensor<2x640x16384xf16>) {
^bb0(%in: f16, %in_0: f16, %out: f16):
%11 = arith.addf %in, %in_0 : f16
linalg.yield %11 : f16
} -> tensor<2x640x16384xf16>
return %10 : tensor<2x640x16384xf16>
}
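For completeness, the relationship between the two variants can also be made explicit by materializing the missing batch dimension before the batch matmul. A minimal sketch, assuming linalg.broadcast is available; the function and value names here are mine, not taken from the dispatch:

// Hypothetical sketch: materialize the missing batch dimension of the
// 640x5760 LHS with linalg.broadcast, then feed linalg.batch_matmul.
func.func @broadcast_then_batch_matmul(%lhs: tensor<640x5760xf16>,
                                       %rhs: tensor<2x5760x16384xf16>) -> tensor<2x640x16384xf16> {
  %cst = arith.constant 0.0 : f16
  %lhs_init = tensor.empty() : tensor<2x640x5760xf16>
  // Broadcast along the new leading (batch) dimension d0.
  %lhs_bcast = linalg.broadcast
      ins(%lhs : tensor<640x5760xf16>)
      outs(%lhs_init : tensor<2x640x5760xf16>)
      dimensions = [0]
  %acc_init = tensor.empty() : tensor<2x640x16384xf16>
  %acc = linalg.fill ins(%cst : f16) outs(%acc_init : tensor<2x640x16384xf16>) -> tensor<2x640x16384xf16>
  %res = linalg.batch_matmul ins(%lhs_bcast, %rhs : tensor<2x640x5760xf16>, tensor<2x5760x16384xf16>)
                             outs(%acc : tensor<2x640x16384xf16>) -> tensor<2x640x16384xf16>
  return %res : tensor<2x640x16384xf16>
}

This makes the broadcast a separate op the compiler can fuse or fold, rather than encoding it implicitly in the indexing maps of a linalg.generic.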
Hey @Abhishek-Varma could you also paste the links for the full model MLIR and their weight-elided version? Also the commands you used to compile them.
The top time-consuming kernels in the Unet command buffer are actually all transposed matmuls:
They are going down the normal SIMT pipeline at the moment; instead they should go down the tensor/matrix core pipeline.
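For reference, a transposed matmul of the kind described here typically shows up in linalg.generic form roughly as follows; the shapes below are made up for illustration and are not taken from the model:

func.func @transposed_matmul(%lhs: tensor<2x640x5760xf16>, %rhs: tensor<2x16384x5760xf16>,
                             %acc: tensor<2x640x16384xf16>) -> tensor<2x640x16384xf16> {
  %0 = linalg.generic {indexing_maps = [affine_map<(d0, d1, d2, d3) -> (d0, d1, d3)>,
                                        affine_map<(d0, d1, d2, d3) -> (d0, d2, d3)>,
                                        affine_map<(d0, d1, d2, d3) -> (d0, d1, d2)>],
                       iterator_types = ["parallel", "parallel", "parallel", "reduction"]}
                       ins(%lhs, %rhs : tensor<2x640x5760xf16>, tensor<2x16384x5760xf16>)
                       outs(%acc : tensor<2x640x16384xf16>) {
  ^bb0(%in: f16, %in_0: f16, %out: f16):
    // Both operands index the reduction dimension d3 last, i.e. the RHS is
    // accessed as B[b, n, k] -- a matmul with a transposed RHS.
    %1 = arith.mulf %in, %in_0 : f16
    %2 = arith.addf %1, %out : f16
    linalg.yield %2 : f16
  } -> tensor<2x640x16384xf16>
  return %0 : tensor<2x640x16384xf16>
}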
This is in progress, trying to get WMMA support on ROCm. For now I don't think there's much we can do directly on that front. My thinking is that in the above, we're still using --iree-global-optimization-convert-conv2d-to-img2col. We should first do some work to phase out that flag and investigate whether we need to transpose the NCHW convolutions in the model, or if we should just codegen them directly.
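For context, a minimal sketch of the kind of NCHW convolution being discussed; the shapes are mine and purely illustrative. The flag above rewrites ops like this into an img2col expansion plus a matmul, whereas the alternative raised here is to codegen the op directly:

func.func @nchw_conv(%input: tensor<2x320x66x66xf16>, %filter: tensor<640x320x3x3xf16>,
                     %acc: tensor<2x640x64x64xf16>) -> tensor<2x640x64x64xf16> {
  // NCHW input, FCHW filter, unit strides and dilations.
  %0 = linalg.conv_2d_nchw_fchw {dilations = dense<1> : tensor<2xi64>,
                                 strides = dense<1> : tensor<2xi64>}
         ins(%input, %filter : tensor<2x320x66x66xf16>, tensor<640x320x3x3xf16>)
         outs(%acc : tensor<2x640x64x64xf16>) -> tensor<2x640x64x64xf16>
  return %0 : tensor<2x640x64x64xf16>
}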
Hi @antiagainst @qedawkins
Following are my updates/observations (most of which are already known to you guys) :-
Regarding MatmulTensorCore :-

1. translation_info = #iree_codegen.translation_info<LLVMGPUMatmulSimt> is being set at LLVMGPUSelectLoweringStrategy.
2. getRocmTargetInfo sets the hasTF32TensorCore parameter of translateInfo as False.
3. In the supportsTensorCore function, due to targetInfo.hasTF32TensorCore == False, it exits early and never gets to apply the MatmulTensorCore* pipeline (see the attribute sketch after this list).
4. I set targetInfo.hasTF32TensorCore = True to see what happens, and it cascadingly leads to lowering errors (haven't looked into those yet).

Regarding --iree-global-optimization-convert-conv2d-to-img2col :-

1. iree-run/benchmark-module for ROCM doesn't seem to run (filed the issue here - refer Section A.1).
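To make the pipeline names concrete, here is a minimal sketch mirroring the attribute syntax quoted above; LLVMGPUMatmulTensorCore is the tensor-core pipeline name in IREE's LLVMGPU backend, and any extra attribute parameters (workgroup size etc.) are elided here:

// What LLVMGPUSelectLoweringStrategy currently picks for these dispatches:
#simt = #iree_codegen.translation_info<LLVMGPUMatmulSimt>
// What we would want once the supportsTensorCore check stops bailing out:
#tensorcore = #iree_codegen.translation_info<LLVMGPUMatmulTensorCore>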
Hi @antiagainst
Here are the IRs :-
This issue looks stale. Let's close it?
Pip version used for fetching Tracy profiles :
I captured an end-to-end Tracy profile along with Unet for ROCM gfx90a.
I will start off with optimising Unet. Currently on gfx90a it takes 6.8 seconds/iteration.
I have uploaded the dispatches here, and following is the screenshot of 10 Unet iterations from iree-tracy-profiler (I'm not sure how to fix the resolution/font size of the window I get from iree-tracy-profiler) :-
I tried looking into the main compute region of dispatch_3011, and this is what it looks like :-
CC: @antiagainst @qedawkins