iree-org / iree

A retargetable MLIR-based machine learning compiler and runtime toolkit.
http://iree.dev/
Apache License 2.0

The parallel dimensions are not collapsed in unpack + elementwise fusion #17594

Open hanhanW opened 1 month ago

hanhanW commented 1 month ago

The snippet below is dumped from the UNet model. There are four parallel loops in the element-wise op, and they can be collapsed down to two loops; we keep those two loops so that they stay aligned with the unpack op. Originally I thought this was a missing pattern at the flow level. However, it is also codegen's responsibility to collapse the parallel loops, because the unpack op could be an unset_encoding that only gets materialized in codegen. Flow does not know how to collapse dims when there is an unset_encoding op, so the CPU backend needs to handle it.

This form is very bad for codegen because the outer two dimensions are tied to the tensor.unpack op, so they must be aligned with its inner tile sizes. That forces a bad tiling config (e.g., [2, 64, 1, 16]) on the generic op, which leads to a significant amount of IR after vector lowering; e.g., there are 19K lines of IR at the LLVM dialect level in this case.

If we can collapse the inner two dimensions, codegen becomes much easier; the [2, 64] tile sizes would be chosen in this example.

hal.executable public @run_forward$async_dispatch_33 {
  hal.executable.variant public @embedded_elf_x86_64 target(<"llvm-cpu", "embedded-elf-x86_64", {cpu = "znver4", cpu_features = "+mmx,+popcnt,+sse,+sse2,+sse3,+ssse3,+sse4.1,+sse4.2,+avx,+avx2,+sse4a,+fma,+avx512f,+bmi,+bmi2,+aes,+pclmul,+avx512vl,+avx512bw,+avx512dq,+avx512cd,+avx512vbmi,+avx512ifma,+avx512vpopcntdq,+avx512vbmi2,+gfni,+vpclmulqdq,+avx512vnni,+avx512bitalg,+avx512bf16,+adx,+clflushopt,+clwb,+clzero,+cx16,+cx8,+f16c,+fsgsbase,+crc32,+invpcid,+rdpru,+sahf,+lzcnt,+movbe,+mwaitx,+x87,+pku,+evex512,+prfchw,+rdpid,+rdrnd,+rdseed,+sha,+shstk,+vaes,+wbnoinvd,+xsave,+xsavec,+xsaveopt,+xsaves,+fxsr", data_layout = "e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-i128:128-f80:128-n8:16:32:64-S128", native_vector_size = 64 : i64, target_triple = "x86_64-unknown-unknown-eabi-elf"}>) {
    hal.executable.export public @run_forward$async_dispatch_33_unpack_generic_2x320x128x128_f32xf32xf32xf32xf16 ordinal(0) layout(#hal.pipeline.layout<push_constants = 7, sets = [<0, bindings = [<0, storage_buffer, ReadOnly>, <1, storage_buffer, ReadOnly>, <2, storage_buffer>]>]>) attributes {hal.interface.bindings = [#hal.interface.binding<0, 0>, #hal.interface.binding<0, 1>, #hal.interface.binding<0, 2>]} {
    ^bb0(%arg0: !hal.device):
      %x, %y, %z = flow.dispatch.workgroup_count_from_slice 
      hal.return %x, %y, %z : index, index, index
    }
    builtin.module {
      func.func @run_forward$async_dispatch_33_unpack_generic_2x320x128x128_f32xf32xf32xf32xf16() {
        %c32_i64 = arith.constant 32 : i64
        %0 = hal.interface.constant.load[0] : i32
        %1 = hal.interface.constant.load[1] : i32
        %2 = hal.interface.constant.load[2] : i32
        %3 = hal.interface.constant.load[3] : i32
        %4 = hal.interface.constant.load[4] : i32
        %5 = hal.interface.constant.load[5] : i32
        %6 = hal.interface.constant.load[6] : i32
        %7 = arith.index_castui %0 : i32 to index
        %8 = arith.index_castui %1 : i32 to index
        %9 = arith.extui %2 : i32 to i64
        %10 = arith.extui %3 : i32 to i64
        %11 = arith.shli %10, %c32_i64 : i64
        %12 = arith.ori %9, %11 : i64
        %13 = arith.index_castui %12 {stream.alignment = 64 : index, stream.values = [14604672 : index, 15427712 : index, 4466777792 : index, 4467602112 : index, 4468426432 : index]} : i64 to index
        %14 = arith.extui %4 : i32 to i64
        %15 = arith.extui %5 : i32 to i64
        %16 = arith.shli %15, %c32_i64 : i64
        %17 = arith.ori %14, %16 : i64
        %18 = arith.index_castui %17 {stream.alignment = 64 : index, stream.values = [14605952 : index, 15428992 : index, 4466779072 : index, 4467603392 : index, 4468427712 : index]} : i64 to index
        %19 = arith.index_castui %6 : i32 to index
        %20 = hal.interface.binding.subspan set(0) binding(0) type(storage_buffer) alignment(64) offset(%7) flags(ReadOnly) : !flow.dispatch.tensor<readonly:tensor<1x5x2x64xf32>>
        %21 = hal.interface.binding.subspan set(0) binding(0) type(storage_buffer) alignment(64) offset(%8) flags(ReadOnly) : !flow.dispatch.tensor<readonly:tensor<2x320x128x128xf32>>
        %22 = hal.interface.binding.subspan set(0) binding(1) type(storage_buffer) alignment(64) offset(%13) flags(ReadOnly) : !flow.dispatch.tensor<readonly:tensor<320xf32>>
        %23 = hal.interface.binding.subspan set(0) binding(1) type(storage_buffer) alignment(64) offset(%18) flags(ReadOnly) : !flow.dispatch.tensor<readonly:tensor<320xf32>>
        %24 = hal.interface.binding.subspan set(0) binding(2) type(storage_buffer) alignment(64) offset(%19) : !flow.dispatch.tensor<writeonly:tensor<2x320x128x128xf16>>
        %25 = flow.dispatch.tensor.load %20, offsets = [0, 0, 0, 0], sizes = [1, 5, 2, 64], strides = [1, 1, 1, 1] : !flow.dispatch.tensor<readonly:tensor<1x5x2x64xf32>> -> tensor<1x5x2x64xf32>
        %26 = flow.dispatch.tensor.load %21, offsets = [0, 0, 0, 0], sizes = [2, 320, 128, 128], strides = [1, 1, 1, 1] : !flow.dispatch.tensor<readonly:tensor<2x320x128x128xf32>> -> tensor<2x320x128x128xf32>
        %27 = flow.dispatch.tensor.load %22, offsets = [0], sizes = [320], strides = [1] : !flow.dispatch.tensor<readonly:tensor<320xf32>> -> tensor<320xf32>
        %28 = flow.dispatch.tensor.load %23, offsets = [0], sizes = [320], strides = [1] : !flow.dispatch.tensor<readonly:tensor<320xf32>> -> tensor<320xf32>
        %29 = tensor.empty() : tensor<2x320x128x128xf16>
        %30 = tensor.empty() : tensor<2x320xf32>
        %unpack = tensor.unpack %25 outer_dims_perm = [0, 1] inner_dims_pos = [0, 1] inner_tiles = [2, 64] into %30 : tensor<1x5x2x64xf32> -> tensor<2x320xf32>
        %31 = linalg.generic {indexing_maps = [affine_map<(d0, d1, d2, d3) -> (d0, d1, d2, d3)>,
                                               affine_map<(d0, d1, d2, d3) -> (d1)>,
                                               affine_map<(d0, d1, d2, d3) -> (d0, d1)>,
                                               affine_map<(d0, d1, d2, d3) -> (d1)>,
                                               affine_map<(d0, d1, d2, d3) -> (d0, d1, d2, d3)>],
                              iterator_types = ["parallel", "parallel", "parallel", "parallel"]}
                             ins(%26, %27, %unpack, %28 : tensor<2x320x128x128xf32>, tensor<320xf32>, tensor<2x320xf32>, tensor<320xf32>)
                             outs(%29 : tensor<2x320x128x128xf16>) {
        ^bb0(%in: f32, %in_0: f32, %in_1: f32, %in_2: f32, %out: f16):
          %32 = arith.addf %in_1, %in_2 : f32
          %33 = arith.addf %in, %in_0 : f32
          %34 = arith.truncf %32 : f32 to f16
          %35 = arith.truncf %33 : f32 to f16
          %36 = arith.addf %35, %34 : f16
          linalg.yield %36 : f16
        } -> tensor<2x320x128x128xf16>
        flow.dispatch.tensor.store %31, %24, offsets = [0, 0, 0, 0], sizes = [2, 320, 128, 128], strides = [1, 1, 1, 1] : tensor<2x320x128x128xf16> -> !flow.dispatch.tensor<writeonly:tensor<2x320x128x128xf16>>
        return
      }
    }
  }
}
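
For reference, here is a hand-written sketch (not compiler output) of roughly what the generic op could look like if the inner two parallel dims were collapsed; the tensor.collapse_shape ops, the 16384 size, and the reuse of the SSA names from the dump above are illustrative only:

// Collapse d2 and d3 of the generic into a single parallel dimension. The 4-D
// operands are reshaped to 2x320x16384 so the generic iterates over three loops,
// and the unpack-tied dims keep their [2, 64] alignment.
%collapsed_in = tensor.collapse_shape %26 [[0], [1], [2, 3]] : tensor<2x320x128x128xf32> into tensor<2x320x16384xf32>
%collapsed_init = tensor.empty() : tensor<2x320x16384xf16>
%collapsed_out = linalg.generic {indexing_maps = [affine_map<(d0, d1, d2) -> (d0, d1, d2)>,
                                                  affine_map<(d0, d1, d2) -> (d1)>,
                                                  affine_map<(d0, d1, d2) -> (d0, d1)>,
                                                  affine_map<(d0, d1, d2) -> (d1)>,
                                                  affine_map<(d0, d1, d2) -> (d0, d1, d2)>],
                                 iterator_types = ["parallel", "parallel", "parallel"]}
                                ins(%collapsed_in, %27, %unpack, %28 : tensor<2x320x16384xf32>, tensor<320xf32>, tensor<2x320xf32>, tensor<320xf32>)
                                outs(%collapsed_init : tensor<2x320x16384xf16>) {
^bb0(%in: f32, %in_0: f32, %in_1: f32, %in_2: f32, %out: f16):
  %a = arith.addf %in_1, %in_2 : f32
  %b = arith.addf %in, %in_0 : f32
  %c = arith.truncf %a : f32 to f16
  %d = arith.truncf %b : f32 to f16
  %e = arith.addf %d, %c : f16
  linalg.yield %e : f16
} -> tensor<2x320x16384xf16>
// The result would then be expanded back to 2x320x128x128 (or stored through a
// correspondingly collapsed !flow.dispatch.tensor view) before the store.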

In the previous sprint I also found that the inner reduction dims are not collapsed into a single reduction loop. We can implement that at the codegen level as well. What we want is a pass that collapses the innermost loops as much as possible. @MaheshRavishankar what do you think?
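
As a minimal hand-written illustration of that idea (not taken from the model; the shapes and function names here are made up), this is the kind of rewrite such a pass would perform for the reduction case, collapsing two innermost reduction loops into one:

// Before: two innermost reduction loops (d1, d2) accumulating into the init.
func.func @two_reductions(%arg0: tensor<4x8x16xf32>, %arg1: tensor<4xf32>) -> tensor<4xf32> {
  %0 = linalg.generic {indexing_maps = [affine_map<(d0, d1, d2) -> (d0, d1, d2)>,
                                        affine_map<(d0, d1, d2) -> (d0)>],
                       iterator_types = ["parallel", "reduction", "reduction"]}
                      ins(%arg0 : tensor<4x8x16xf32>) outs(%arg1 : tensor<4xf32>) {
  ^bb0(%in: f32, %out: f32):
    %add = arith.addf %in, %out : f32
    linalg.yield %add : f32
  } -> tensor<4xf32>
  return %0 : tensor<4xf32>
}

// After: the source is reshaped and the two reduction dims become a single one.
func.func @one_reduction(%arg0: tensor<4x8x16xf32>, %arg1: tensor<4xf32>) -> tensor<4xf32> {
  %collapsed = tensor.collapse_shape %arg0 [[0], [1, 2]] : tensor<4x8x16xf32> into tensor<4x128xf32>
  %0 = linalg.generic {indexing_maps = [affine_map<(d0, d1) -> (d0, d1)>,
                                        affine_map<(d0, d1) -> (d0)>],
                       iterator_types = ["parallel", "reduction"]}
                      ins(%collapsed : tensor<4x128xf32>) outs(%arg1 : tensor<4xf32>) {
  ^bb0(%in: f32, %out: f32):
    %add = arith.addf %in, %out : f32
    linalg.yield %add : f32
  } -> tensor<4xf32>
  return %0 : tensor<4xf32>
}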

hanhanW commented 1 month ago

Actually this can happen at the flow level. I was wrong about the tensor types: the result type of unset_encoding does not carry encodings, so it should work for both of the cases below.

%0 = tensor.unpack .... -> tensor<2x320xf32>
%1 = linalg.generic ... ins(%0...

%0 = iree_linalg_ext.unset_encoding .... -> tensor<2x320xf32>
%1 = linalg.generic ... ins(%0...

@IanWood1 is this something you can pick up? We can improve CollapseDimensions to support this case.

IanWood1 commented 1 month ago

Yes, I can pick this up

MaheshRavishankar commented 1 month ago

I looked at this more. I am not sure you can collapse the iterations here:

 %31 = linalg.generic {indexing_maps = [affine_map<(d0, d1, d2, d3) -> (d0, d1, d2, d3)>,
                                               affine_map<(d0, d1, d2, d3) -> (d1)>,
                                               affine_map<(d0, d1, d2, d3) -> (d0, d1)>,
                                               affine_map<(d0, d1, d2, d3) -> (d1)>,
                                               affine_map<(d0, d1, d2, d3) -> (d0, d1, d2, d3)>],
                              iterator_types = ["parallel", "parallel", "parallel", "parallel"]}
                             ins(%26, %27, %unpack, %28 : tensor<2x320x128x128xf32>, tensor<320xf32>, tensor<2x320xf32>, tensor<320xf32>)
                             outs(%29 : tensor<2x320x128x128xf16>) {
        ^bb0(%in: f32, %in_0: f32, %in_1: f32, %in_2: f32, %out: f16):
          %32 = arith.addf %in_1, %in_2 : f32
          %33 = arith.addf %in, %in_0 : f32
          %34 = arith.truncf %32 : f32 to f16
          %35 = arith.truncf %33 : f32 to f16
          %36 = arith.addf %35, %34 : f16
          linalg.yield %36 : f16
        } -> tensor<2x320x128x128xf16>

Basically you are saying you want to collapse the (d0, d1) dimensions. But since d1 is used by itself, you would have to use mods to recover the value of d1 from the collapsed dimension, and that would have unintended, cascading consequences further down. None of these dimensions of the iteration space can be collapsed...
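
For concreteness, a hand-written illustration of that problem (not from the dump above): if (d0, d1) were collapsed into a single dimension c of size 2 * 320 = 640, the operands that use d1 by itself would need delinearizing indexing maps, for example:

// Hypothetical maps after collapsing (d0, d1) into c:
affine_map<(c, d2, d3) -> (c, d2, d3)>   // the 2x320x128x128 operands, now 640x128x128
affine_map<(c, d2, d3) -> (c mod 320)>   // the 320-element bias operands
affine_map<(c, d2, d3) -> (c)>           // the unpack result, now a 640-element tensor
// Recovering d1 as (c mod 320) (and d0 as (c floordiv 320)) is what causes the
// cascading problems downstream.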

hanhanW commented 1 month ago

I want to collapse d2 and d3, so the maps become

(d0, d1, d2) -> (d0, d1, d2),
(d0, d1, d2) -> (d1),
(d0, d1, d2) -> (d0, d1),
(d0, d1, d2) -> (d1),
(d0, d1, d2) -> (d0, d1, d2)

I think it is doable?

MaheshRavishankar commented 1 month ago

I want to collapse d2 and d3, so the maps become

(d0, d1, d2) -> (d0, d1, d2),
(d0, d1, d2) -> (d1),
(d0, d1, d2) -> (d0, d1),
(d0, d1, d2) -> (d1),
(d0, d1, d2) -> (d0, d1, d2)

I think it is doable?

Oh yeah, that should be doable... Sorry for the digression.

Currently the CollapseDimensions pass only handles dispatches with a single operation. I tried a "very general approach" which turned out to be too difficult, but maybe something more local can be done.

@IanWood1 let's chat next week. I can give you more context.

hanhanW commented 3 weeks ago

Hey @IanWood1, what is the status of this issue?

IanWood1 commented 1 week ago

I was looking into the CI failure https://github.com/iree-org/iree/actions/runs/9665068317/job/26661780089?pr=17725. The fault originated in https://github.com/llvm/llvm-project/blob/6b1c51bc052ae974e89e623b3d143d010fd09222/mlir/lib/Dialect/Vector/Transforms/VectorDistribute.cpp#L1696. Running this MLIR through iree-codegen-vector-reduction-to-gpu reproduces the problem.

The pass is trying to sink the trailing scf.for out of the WarpExecuteOnLane0Op (the dump of the vector.warp_execute_on_lane_0 immediately before the failure is here). The failure is because the vector, vector<1xf32>, is not distributable among the threads in the warp. This is the IR when running IREE before the changes: https://gist.github.com/IanWood1/0070caf994707fbf84a08ff04ec43cb5.

As far as I can tell, this isn't a correctness issue with my changes, but since I'm not familiar with this area I can't be certain whether it is an issue with my implementation or not. It seems like the 1-element vector should be convertible to a scalar or simply not distributed. I was able to compile successfully when disabling distribution of the 1-element vector (but I can't speak to correctness).

hanhanW commented 1 week ago

Can you share the executable source artifact? I.e., the dispatch in the output of --iree-hal-dump-executable-sources-to=/tmp/dump.

IanWood1 commented 1 week ago

https://gist.github.com/IanWood1/827aee042ff6968070d2d0c6beccae4a

hanhanW commented 1 week ago

https://gist.github.com/IanWood1/827aee042ff6968070d2d0c6beccae4a

It looks complicated... @MaheshRavishankar do we expect to have an elementwise -> reduction -> elementwise -> reduction -> elementwise dispatch?

@Groverkss would you be able to help fix it?

To repro: run iree-compile --output-format=vm-bytecode --compile-from=executable-sources ~/repro.mlir -o /tmp/a.vmfb.

MaheshRavishankar commented 1 week ago

https://gist.github.com/IanWood1/827aee042ff6968070d2d0c6beccae4a

It looks complicated... @MaheshRavishankar do we expect to have an elementwise -> reduction -> elementwise -> reduction -> elementwise dispatch?

@Groverkss would you be able to help fix it?

To repro: run iree-compile --output-format=vm-bytecode --compile-from=executable-sources ~/repro.mlir -o /tmp/a.vmfb.

Yeah, this is expected. It is basically softmax fusion.

hanhanW commented 1 week ago

Okay... I just found that it is enabled under some flags. I'm seeing that the benchmark is using additional flags (e.g., --iree-flow-enable-aggressive-fusion), which fall under the experimental benchmark suites. I don't think it should be a blocker for @IanWood1's PR, because it is experimental and it is blocking other work.

Is there a way to unblock this? I've been wanting it for three weeks. Can we make it happen by default, and add a flag to disable it in the benchmark?

hanhanW commented 1 week ago

https://gist.github.com/IanWood1/827aee042ff6968070d2d0c6beccae4a

To repro: run iree-compile --output-format=vm-bytecode --compile-from=executable-sources ~/repro.mlir -o /tmp/a.vmfb.

@bangtianliu Mahesh mentioned that you could help with this. Can you take a look?