iree-org / iree

Generic vectorization doesn't handle dynamic vector extracts #17266

Open · kuhar opened 6 months ago

kuhar commented 6 months ago

I tried to implement i4 -> f16 conversion using a lookup table based on a dynamic vector extract:

func.func @main(%30 : tensor<?x64x128xf16>) -> tensor<?x28672xf16> {
    %c0 = arith.constant 0 : index
    %c32_i64 = arith.constant 32 : i64
    %cst = arith.constant 0.000000e+00 : f16
    %dim = tensor.dim %30, %c0 : tensor<?x64x128xf16>
    %27 = util.unfoldable_constant dense<2> : tensor<28672x64x128xi4>
    %28 = util.unfoldable_constant dense<1.4> : tensor<28672x64xf16>
    %29 = util.unfoldable_constant dense<2.4> : tensor<28672x64xf16>
    %31 = tensor.empty(%dim) : tensor<?x28672xf16>
    %32 = tensor.empty() : tensor<28672x64x128xf16>
    %nums = arith.constant dense<[0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 11.0, 12.0, 13.0, 14.0, 15.0]> : vector<16xf16>
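    // Dequantize %27: result = (lut[code] - %29) * %28, where the LUT read is a dynamic vector.extract from %nums.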
    %33 = linalg.generic {indexing_maps = [affine_map<(d0, d1, d2) -> (d0, d1, d2)>, affine_map<(d0, d1, d2) -> (d0, d1)>, affine_map<(d0, d1, d2) -> (d0, d1)>, affine_map<(d0, d1, d2) -> (d0, d1, d2)>], iterator_types = ["parallel", "parallel", "parallel"]} ins(%27, %28, %29 : tensor<28672x64x128xi4>, tensor<28672x64xf16>, tensor<28672x64xf16>) outs(%32 : tensor<28672x64x128xf16>) {
    ^bb0(%in: i4, %in_0: f16, %in_1: f16, %out: f16):
        %idx = arith.index_cast %in : i4 to index
        %fp = vector.extract %nums[%idx] : f16 from vector<16xf16>
        %38 = arith.subf %fp, %in_1 : f16
        %39 = arith.mulf %38, %in_0 : f16
        linalg.yield %39 : f16
    } -> tensor<28672x64x128xf16>
    %34 = linalg.fill ins(%cst : f16) outs(%31 : tensor<?x28672xf16>) -> tensor<?x28672xf16>
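    // Contraction reducing over d2 and d3: (?x64x128) x (28672x64x128) -> (?x28672).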
    %35 = linalg.generic {indexing_maps = [affine_map<(d0, d1, d2, d3) -> (d0, d2, d3)>, affine_map<(d0, d1, d2, d3) -> (d1, d2, d3)>, affine_map<(d0, d1, d2, d3) -> (d0, d1)>], iterator_types = ["parallel", "parallel", "reduction", "reduction"]} ins(%30, %33 : tensor<?x64x128xf16>, tensor<28672x64x128xf16>) outs(%34 : tensor<?x28672xf16>) {
    ^bb0(%in: f16, %in_0: f16, %out: f16):
        %36 = arith.mulf %in, %in_0 : f16
        %37 = arith.addf %36, %out : f16
        linalg.yield %37 : f16
    } -> tensor<?x28672xf16>
    return %35 : tensor<?x28672xf16>
}

Compile command:

./tools/iree-compile input.mlir \
  --iree-hal-target-backends=rocm --iree-rocm-target-chip=gfx1100 \
  --iree-preprocessing-pass-pipeline="builtin.module(util.func(iree-preprocessing-pad-to-intrinsics))" \
  --iree-codegen-llvmgpu-use-vector-distribution \
  --iree-llvmgpu-enable-prefetch \
  --iree-stream-resource-max-allocation-size=4294967296 \
  --mlir-disable-threading --verify=true

The first linalg.generic doesn't get vectorized as I'd expect in iree-codegen-generic-vectorization.
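
Isolated, the op that fails to vectorize is a vector.extract whose position is an SSA value rather than a constant attribute. A minimal standalone sketch (the function name and signature here are made up for illustration):

    func.func @dynamic_lut(%table: vector<16xf16>, %i: index) -> f16 {
      // The position %i is a runtime value; a static position would look like %table[3].
      %v = vector.extract %table[%i] : f16 from vector<16xf16>
      return %v : f16
    }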

This lowers fine if the lookup is replaced with:

        %36 = arith.extui %in : i4 to i32
        %37 = arith.uitofp %36 : i32 to f16
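
For contrast, that replacement is purely elementwise, so it vectorizes directly into vector-typed arith ops. A hedged sketch of the vectorized form (the width of 8 and the name %vin are illustrative):

    // %vin: a vector of i4 values; any width works the same way.
    %v32 = arith.extui %vin : vector<8xi4> to vector<8xi32>
    %vfp = arith.uitofp %v32 : vector<8xi32> to vector<8xf16>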

hanhanW commented 6 months ago

Have you tried tensor constants and a tensor.extract op? We are able to vectorize tensor.extract using vector.gather/vector.transfer_read ops. To repro:

iree-opt --pass-pipeline="builtin.module(func.func(iree-codegen-generic-vectorization{enable-vector-masking=false use-configured-vector-sizes=false}))" ~/z.mlir

func.func @main(%30 : tensor<?x64x128xf16>) -> tensor<?x28672xf16> {
    %c0 = arith.constant 0 : index
    %c32_i64 = arith.constant 32 : i64
    %cst = arith.constant 0.000000e+00 : f16
    %dim = tensor.dim %30, %c0 : tensor<?x64x128xf16>
    %27 = util.unfoldable_constant dense<2> : tensor<28672x64x128xi4>
    %28 = util.unfoldable_constant dense<1.4> : tensor<28672x64xf16>
    %29 = util.unfoldable_constant dense<2.4> : tensor<28672x64xf16>
    %31 = tensor.empty(%dim) : tensor<?x28672xf16>
    %32 = tensor.empty() : tensor<28672x64x128xf16>
    %nums = arith.constant dense<[0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 11.0, 12.0, 13.0, 14.0, 15.0]> : tensor<16xf16>
    %33 = linalg.generic {indexing_maps = [affine_map<(d0, d1, d2) -> (d0, d1, d2)>, affine_map<(d0, d1, d2) -> (d0, d1)>, affine_map<(d0, d1, d2) -> (d0, d1)>, affine_map<(d0, d1, d2) -> (d0, d1, d2)>], iterator_types = ["parallel", "parallel", "parallel"]} ins(%27, %28, %29 : tensor<28672x64x128xi4>, tensor<28672x64xf16>, tensor<28672x64xf16>) outs(%32 : tensor<28672x64x128xf16>) {
    ^bb0(%in: i4, %in_0: f16, %in_1: f16, %out: f16):
        %idx = arith.index_cast %in : i4 to index
        %fp = tensor.extract %nums[%idx] : tensor<16xf16>
        %38 = arith.subf %fp, %in_1 : f16
        %39 = arith.mulf %38, %in_0 : f16
        linalg.yield %39 : f16
    } -> tensor<28672x64x128xf16>
    %34 = linalg.fill ins(%cst : f16) outs(%31 : tensor<?x28672xf16>) -> tensor<?x28672xf16>
    %35 = linalg.generic {indexing_maps = [affine_map<(d0, d1, d2, d3) -> (d0, d2, d3)>, affine_map<(d0, d1, d2, d3) -> (d1, d2, d3)>, affine_map<(d0, d1, d2, d3) -> (d0, d1)>], iterator_types = ["parallel", "parallel", "reduction", "reduction"]} ins(%30, %33 : tensor<?x64x128xf16>, tensor<28672x64x128xf16>) outs(%34 : tensor<?x28672xf16>) {
    ^bb0(%in: f16, %in_0: f16, %out: f16):
        %36 = arith.mulf %in, %in_0 : f16
        %37 = arith.addf %36, %out : f16
        linalg.yield %37 : f16
    } -> tensor<?x28672xf16>
    return %35 : tensor<?x28672xf16>
}

The vector sizes should look much better after tiling.
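
For reference, the vectorized lookup should come out as (roughly) a vector.gather from the table. This is a hedged sketch rather than actual pass output; the width of 8 and the names %q/%idxs are made up:

    // %q is assumed to hold a vector of i4 codes read from %27.
    %c0 = arith.constant 0 : index
    %mask = arith.constant dense<true> : vector<8xi1>
    %pass = arith.constant dense<0.0> : vector<8xf16>
    %idxs = arith.index_cast %q : vector<8xi4> to vector<8xindex>
    %fp = vector.gather %nums[%c0] [%idxs], %mask, %pass
        : tensor<16xf16>, vector<8xindex>, vector<8xi1>, vector<8xf16> into vector<8xf16>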

kuhar commented 6 months ago

Thanks for the suggestion, @hanhanW. This makes more progress but fails to distribute:

dequant_lut.mlir:21:11: error: 'func.func' op failed to distribute
    %35 = linalg.generic {indexing_maps = [affine_map<(d0, d1, d2, d3) -> (d0, d2, d3)>, affine_map<(d0, d1, d2, d3) -> (d1, d2, d3)>, affine_map<(d0, d1, d2, d3) -> (d0, d1)>], iterator_types = ["parallel", "parallel", "reduction", "reduction"]} ins(%30, %33 : tensor<?x64x128xf16>, tensor<28672x64x128xf16>) outs(%34 : tensor<?x28672xf16>) {
          ^

I'd need to check with your flags; I'm not sure whether I can set them in iree-compile. This is low priority, BTW: I opened this issue to have a repro in case we need it in the future.

hanhanW commented 6 months ago

I see. Feel free to reach out if there are any vectorization issues or questions; I'm happy to help.

The flags are available with iree-compile. You'll need to add the options along these lines:

https://github.com/iree-org/iree/blob/efed94f052b3a7e5d078812c379b2a0901502f66/compiler/src/iree/compiler/Codegen/LLVMCPU/Passes.cpp#L397-L405