
SDXL Perf tracker for ROCM #15526

Closed: Abhishek-Varma closed this issue 4 months ago

Abhishek-Varma commented 1 year ago

Pip package versions used for capturing the Tracy profiles:

iree-compiler             20231106.574
iree-runtime              20231106.574

I captured an end-to-end Tracy profile as well as a Unet-only profile for ROCM (gfx90a):

  1. End-to-end Tracy profile - this entails [Clip + Clip2] -> [Unet x 50 iterations] -> [Vae]
  2. Unet Tracy profile - this entails [Unet x 10 iterations]

I will start off by optimising Unet. Currently, on gfx90a it takes 6.8 seconds/iteration.

I have uploaded the dispatches here, and the following is a screenshot of 10 Unet iterations from iree-tracy-profiler (I'm not sure how to fix the resolution/font size of the window I get from iree-tracy-profiler):

Dispatches_Unet_f16_SDXL

I tried looking into the main compute region of dispatch_3011, and this is what it looks like:

%7 = tensor.empty() : tensor<2x640x16384xf16>
%8 = linalg.fill ins(%cst : f16) outs(%7 : tensor<2x640x16384xf16>) -> tensor<2x640x16384xf16>
%9 = linalg.generic {indexing_maps = [affine_map<(d0, d1, d2, d3) -> (d1, d3)>,
                                      affine_map<(d0, d1, d2, d3) -> (d0, d3, d2)>,
                                      affine_map<(d0, d1, d2, d3) -> (d0, d1, d2)>], 
                     iterator_types = ["parallel", "parallel", "parallel", "reduction"]}
          ins(%4, %5 : tensor<640x5760xf16>, tensor<2x5760x16384xf16>)
          outs(%8 : tensor<2x640x16384xf16>) {
      ^bb0(%in: f16, %in_0: f16, %out: f16):
        %11 = arith.mulf %in, %in_0 : f16
        %12 = arith.addf %11, %out : f16
        linalg.yield %12 : f16
      } -> tensor<2x640x16384xf16>
%10 = linalg.generic {indexing_maps = [affine_map<(d0, d1, d2) -> (d0, d1, d2)>,
                                       affine_map<(d0, d1, d2) -> (d1)>,
                                       affine_map<(d0, d1, d2) -> (d0, d1, d2)>],
                      iterator_types = ["parallel", "parallel", "parallel"]}
          ins(%9, %6 : tensor<2x640x16384xf16>, tensor<640xf16>)
          outs(%7 : tensor<2x640x16384xf16>) {
      ^bb0(%in: f16, %in_0: f16, %out: f16):
        %11 = arith.addf %in, %in_0 : f16
        linalg.yield %11 : f16
      } -> tensor<2x640x16384xf16>
return %10 : tensor<2x640x16384xf16>

CC: @antiagainst @qedawkins

Abhishek-Varma commented 1 year ago

So, I created two individual IRs:

  1. The compute IR of dispatch_3011 (iree-benchmark-module yields 145 ms):

    func.func @dispatch_3011(%4: tensor<640x5760xf16>, %5: tensor<2x5760x16384xf16>, %6: tensor<640xf16>) -> tensor<2x640x16384xf16> {
    %cst = arith.constant 0.0 : f16
    %7 = tensor.empty() : tensor<2x640x16384xf16>
    %8 = linalg.fill ins(%cst : f16) outs(%7 : tensor<2x640x16384xf16>) -> tensor<2x640x16384xf16>
    %9 = linalg.generic {indexing_maps = [affine_map<(d0, d1, d2, d3) -> (d1, d3)>,
                                        affine_map<(d0, d1, d2, d3) -> (d0, d3, d2)>,
                                        affine_map<(d0, d1, d2, d3) -> (d0, d1, d2)>], 
                      iterator_types = ["parallel", "parallel", "parallel", "reduction"]}
            ins(%4, %5 : tensor<640x5760xf16>, tensor<2x5760x16384xf16>)
            outs(%8 : tensor<2x640x16384xf16>) {
        ^bb0(%in: f16, %in_0: f16, %out: f16):
          %11 = arith.mulf %in, %in_0 : f16
          %12 = arith.addf %11, %out : f16
          linalg.yield %12 : f16
        } -> tensor<2x640x16384xf16>
    %10 = linalg.generic {indexing_maps = [affine_map<(d0, d1, d2) -> (d0, d1, d2)>,
                                        affine_map<(d0, d1, d2) -> (d1)>,
                                        affine_map<(d0, d1, d2) -> (d0, d1, d2)>],
                        iterator_types = ["parallel", "parallel", "parallel"]}
            ins(%9, %6 : tensor<2x640x16384xf16>, tensor<640xf16>)
            outs(%7 : tensor<2x640x16384xf16>) {
        ^bb0(%in: f16, %in_0: f16, %out: f16):
          %11 = arith.addf %in, %in_0 : f16
          linalg.yield %11 : f16
        } -> tensor<2x640x16384xf16>
    return %10 : tensor<2x640x16384xf16>
    }
  2. Since the first linalg.generic looks like a batch matmul, except that its first operand has no batch dimension, I created an analogous IR with linalg.batch_matmul just to experiment (iree-benchmark-module yields 424 ms); a sketch relating the two forms follows after this snippet:

    func.func @ideal(%0: tensor<2x640x5760xf16>, %1: tensor<2x5760x16384xf16>, %6: tensor<640xf16>) -> tensor<2x640x16384xf16> {
    %cst = arith.constant 0.0 : f16
    %7 = tensor.empty() : tensor<2x640x16384xf16>
    %8 = linalg.fill ins(%cst : f16) outs(%7 : tensor<2x640x16384xf16>) -> tensor<2x640x16384xf16>
    
    %3 = linalg.batch_matmul ins(%0, %1 : tensor<2x640x5760xf16>, tensor<2x5760x16384xf16>)
      outs(%8 : tensor<2x640x16384xf16>) -> tensor<2x640x16384xf16>
    
    %10 = linalg.generic {indexing_maps = [affine_map<(d0, d1, d2) -> (d0, d1, d2)>,
                                        affine_map<(d0, d1, d2) -> (d1)>,
                                        affine_map<(d0, d1, d2) -> (d0, d1, d2)>],
                        iterator_types = ["parallel", "parallel", "parallel"]}
            ins(%3, %6 : tensor<2x640x16384xf16>, tensor<640xf16>)
            outs(%7 : tensor<2x640x16384xf16>) {
        ^bb0(%in: f16, %in_0: f16, %out: f16):
          %11 = arith.addf %in, %in_0 : f16
          linalg.yield %11 : f16
        } -> tensor<2x640x16384xf16>
    return %10 : tensor<2x640x16384xf16>
    }
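
For reference, here is a minimal sketch (mine, not taken from the model) of how the two forms relate: materializing the missing batch dimension of the 2-D LHS with linalg.broadcast produces exactly the 2x640x5760 operand that the linalg.batch_matmul in @ideal consumes.

    // Illustrative only: broadcast the dispatch's 2-D LHS along a new leading
    // batch dimension so it matches the 3-D LHS used by @ideal above.
    func.func @broadcast_lhs(%lhs: tensor<640x5760xf16>) -> tensor<2x640x5760xf16> {
      %init = tensor.empty() : tensor<2x640x5760xf16>
      %bcast = linalg.broadcast
                 ins(%lhs : tensor<640x5760xf16>)
                 outs(%init : tensor<2x640x5760xf16>)
                 dimensions = [0]
      return %bcast : tensor<2x640x5760xf16>
    }
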
antiagainst commented 1 year ago

Hey @Abhishek-Varma could you also paste the links for the full model MLIR and its weight-elided version? Also, the commands you used to compile them.

antiagainst commented 1 year ago

The top time-consuming kernels in the Unet command buffer are actually all transposed matmuls:


They are going down the normal SIMT pipeline at the moment; instead, they should go down the tensor/matrix core pipeline.
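
To make the pattern concrete, here is a minimal sketch (shapes and names are illustrative, not taken from the model) of a transposed matmul expressed as a linalg.generic, where the second operand is indexed as [n, k] rather than [k, n]:

    // Illustrative only: C[m, n] += A[m, k] * B[n, k], i.e. a matmul whose
    // second operand is read transposed.
    func.func @transposed_matmul(%a: tensor<1024x512xf16>, %b: tensor<2048x512xf16>) -> tensor<1024x2048xf16> {
      %cst = arith.constant 0.0 : f16
      %init = tensor.empty() : tensor<1024x2048xf16>
      %fill = linalg.fill ins(%cst : f16) outs(%init : tensor<1024x2048xf16>) -> tensor<1024x2048xf16>
      %0 = linalg.generic {indexing_maps = [affine_map<(d0, d1, d2) -> (d0, d2)>,
                                            affine_map<(d0, d1, d2) -> (d1, d2)>,
                                            affine_map<(d0, d1, d2) -> (d0, d1)>],
                           iterator_types = ["parallel", "parallel", "reduction"]}
              ins(%a, %b : tensor<1024x512xf16>, tensor<2048x512xf16>)
              outs(%fill : tensor<1024x2048xf16>) {
          ^bb0(%in: f16, %in_0: f16, %out: f16):
            %1 = arith.mulf %in, %in_0 : f16
            %2 = arith.addf %1, %out : f16
            linalg.yield %2 : f16
          } -> tensor<1024x2048xf16>
      return %0 : tensor<1024x2048xf16>
    }

The same computation can also be written with the named op linalg.matmul_transpose_b.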

qedawkins commented 1 year ago

The top time-consuming kernels in the Unet command buffer are actually all transposed matmuls:

They are going down the normal SIMT pipeline at the moment; instead, they should go down the tensor/matrix core pipeline.

This is in progress; I'm trying to get WMMA support on ROCm, so for now I don't think there's much we can do directly on that front. My thinking is that in the above we're still using --iree-global-optimization-convert-conv2d-to-img2col. We should first do some work to phase out that flag and investigate whether we need to transpose the NCHW convolutions in the model, or whether we should just codegen them directly.
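
For context, a minimal sketch (shapes made up by me) of the kind of NCHW convolution in question: the img2col flag rewrites ops like this into an im2col expansion followed by a matmul, while the alternative discussed here is to codegen the convolution directly.

    // Illustrative only: an NCHW-input / FCHW-filter convolution with unit
    // stride and dilation, the form the flag currently converts to img2col.
    func.func @nchw_conv(%input: tensor<2x64x66x66xf16>, %filter: tensor<128x64x3x3xf16>) -> tensor<2x128x64x64xf16> {
      %cst = arith.constant 0.0 : f16
      %init = tensor.empty() : tensor<2x128x64x64xf16>
      %fill = linalg.fill ins(%cst : f16) outs(%init : tensor<2x128x64x64xf16>) -> tensor<2x128x64x64xf16>
      %0 = linalg.conv_2d_nchw_fchw {dilations = dense<1> : tensor<2xi64>, strides = dense<1> : tensor<2xi64>}
             ins(%input, %filter : tensor<2x64x66x66xf16>, tensor<128x64x3x3xf16>)
             outs(%fill : tensor<2x128x64x64xf16>) -> tensor<2x128x64x64xf16>
      return %0 : tensor<2x128x64x64xf16>
    }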

Abhishek-Varma commented 1 year ago

Hi @antiagainst @qedawkins

Here are my updates/observations (most of which are already known to you).

Regarding MatmulTensorCore:

  1. I tried looking into why SIMT is being chosen: translation_info = #iree_codegen.translation_info<LLVMGPUMatmulSimt> is set by LLVMGPUSelectLoweringStrategy (a sketch of this annotation follows after this list).
  2. getRocmTargetInfo sets the hasTF32TensorCore field of targetInfo to false.
  3. Because of that, supportsTensorCore exits early (targetInfo.hasTF32TensorCore == false) and the MatmulTensorCore* pipelines never get applied.
  4. I tried setting targetInfo.hasTF32TensorCore = true to see what happens, and it leads to cascading lowering errors (I haven't looked into those yet).
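
For reference, a small illustrative fragment of the annotations involved. The first attribute is the one quoted in item 1; LLVMGPUMatmulTensorCore is my assumption for the name of the tensor-core pipeline we would want selected instead, and the exact attribute syntax can differ across IREE versions (this fragment needs iree-opt to parse, since the attribute lives in the iree_codegen dialect).

    // Selected today on the transposed-matmul dispatches (quoted from item 1 above).
    #selected_today = #iree_codegen.translation_info<LLVMGPUMatmulSimt>
    // Assumed name of the tensor-core pipeline we would want instead.
    #wanted_instead = #iree_codegen.translation_info<LLVMGPUMatmulTensorCore>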

Regarding --iree-global-optimization-convert-conv2d-to-img2col:

  1. I made a short script to print Unet's output for a single iteration, since iree-run-module/iree-benchmark-module for ROCM doesn't seem to run (I filed the issue here - refer to Section A.1).
  2. I ran that for both sets of vmfbs (one compiled with the flag present, the other without it). Although there was no compilation error this time (unlike the previous case), the outputs don't match, so perhaps we really do need the flag to be present?
Abhishek-Varma commented 1 year ago

Hey @Abhishek-Varma could you also paste the links for the full model MLIR and its weight-elided version? Also, the commands you used to compile them.

Hi @antiagainst

Here are the IRs:

  1. Unet full model
  2. Unet elided
ScottTodd commented 4 months ago

This issue looks stale. Let's close it?