iree-org / iree

A retargetable MLIR-based machine learning compiler and runtime toolkit.
http://iree.dev/

[LLVMCPU] Bad codegen for 1D layernorm on riscv64 #16001

Open ElEHsiang opened 6 months ago

ElEHsiang commented 6 months ago

What happened?

I benchmarked layernorm based on tests/e2e/regression/layernorm.mlir, but changed the input dimension to 1D. The generated code contains a lot of vslidedown + vfmv + fmadd sequences, which hurt performance. This comes from the lowering policy for vector.contract in LLVMCPUVectorLoweringPass: if I change the VectorContractLowering option from OuterProduct to ParallelArith, the codegen can simply use vfmadd, and it is about 250% faster.

I am concerned that changing the policy directly would impact other test cases. Any suggestions on how to optimize this?
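
For reference, the experiment is roughly the following change on the C++ side. This is only a sketch: the exact place where the option is set inside LLVMCPUVectorLoweringPass, and the surrounding code, may differ.

#include "mlir/Dialect/Vector/Transforms/VectorRewritePatterns.h"
#include "mlir/IR/PatternMatch.h"

using namespace mlir;

// Sketch: pick the vector.contract lowering strategy when the contract
// lowering patterns are populated. The pass currently uses OuterProduct;
// the experiment replaces it with ParallelArith.
static void addContractLoweringPatterns(RewritePatternSet &patterns) {
  vector::VectorTransformsOptions options;
  options.setVectorTransformsOptions(
      // was: vector::VectorContractLowering::OuterProduct
      vector::VectorContractLowering::ParallelArith);
  vector::populateVectorContractLoweringPatterns(patterns, options);
}

With ParallelArith the 32-wide parallel dimension stays vectorized and the reduction turns into a plain vector.fma, as shown in the lowered MLIR further below.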

The MLIR I used:

func.func @layernorm(%input: tensor<1x409600xf32>) -> tensor<1x409600xf32> {
  %c409600 = util.unfoldable_constant dense<409600.0> : tensor<1x1xf32>
  %sum = tosa.reduce_sum %input {axis = 1 : i32} : (tensor<1x409600xf32>) -> tensor<1x1xf32>
  %r409600 = tosa.reciprocal %c409600 : (tensor<1x1xf32>) -> tensor<1x1xf32>
  %mean = tosa.mul %sum, %r409600 {shift = 0 : i8} : (tensor<1x1xf32>, tensor<1x1xf32>) -> tensor<1x1xf32>
  %x_sub_mean = tosa.sub %input, %mean : (tensor<1x409600xf32>, tensor<1x1xf32>) -> tensor<1x409600xf32>
  %square = tosa.mul %x_sub_mean, %x_sub_mean {shift = 0 : i8} : (tensor<1x409600xf32>, tensor<1x409600xf32>) -> tensor<1x409600xf32>
  %square_sum = tosa.reduce_sum %square {axis = 1 : i32} : (tensor<1x409600xf32>) -> tensor<1x1xf32>
  %variance = tosa.mul %square_sum, %r409600 {shift = 0 : i8} : (tensor<1x1xf32>, tensor<1x1xf32>) -> tensor<1x1xf32>
  %epsilon = util.unfoldable_constant dense<9.99999996E-13> : tensor<1x1xf32>
  %var_eps = tosa.add %variance, %epsilon : (tensor<1x1xf32>, tensor<1x1xf32>) -> tensor<1x1xf32>
  %rsigma = tosa.rsqrt %var_eps : (tensor<1x1xf32>) -> tensor<1x1xf32>
  %norm = tosa.mul %x_sub_mean, %rsigma {shift = 0 : i8} : (tensor<1x409600xf32>, tensor<1x1xf32>) -> tensor<1x409600xf32>
  return %norm : tensor<1x409600xf32>
}

The generated code contains many instances of this pattern:

vfmv.f.s        fa4, v8
fmadd.s fa5, fa5, fa5, fa4
fsw     fa5, -204(s0)
flw     fa5, -92(a2)
vsetivli        zero, 1, e32, m1, ta, ma
vslidedown.vi   v10, v8, 1
vfmv.f.s        fa4, v10
fmadd.s fa5, fa5, fa5, fa4
fsw     fa5, -208(s0)

This is caused by the lowering of vector.contract:

#map = affine_map<(d0, d1) -> (d0, d1)>
#map1 = affine_map<(d0, d1) -> (d1)>
...
  %7 = vector.contract {indexing_maps = [#map, #map, #map1], iterator_types = ["reduction", "parallel"], kind = #vector.kind<add>} %6, %6, %arg1 : vector<1x32xf32>, vector<1x32xf32> into vector<32xf32>

MLIR after LLVMCPUVectorLoweringPass with VectorContractLowering::OuterProduct:

%11 = memref.load %expand_shape[%arg0, %c0] : memref<12x32xf32, strided<[32, 1], offset: 16>, #hal.descriptor_type<storage_buffer>>
%12 = vector.insert %11, %cst [0] : f32 into vector<1xf32>
%13 = vector.extract %arg1[0] : f32 from vector<32xf32>
%14 = arith.mulf %10, %12 : vector<1xf32>
%15 = vector.reduction <add>, %14, %13 : vector<1xf32> into f32
%16 = vector.insert %15, %cst_1 [0] : f32 into vector<32xf32>
/* repeat this pattern */

MLIR after LLVMCPUVectorLoweringPass with VectorContractLowering::ParallelArith:

%9 = vector.load %expand_shape[%arg0, %c0] : memref<12x32xf32, strided<[32, 1], offset: 16>, #hal.descriptor_type<storage_buffer>>, vector<32xf32>
%10 = vector.fma %9, %9, %arg1 : vector<32xf32>

Steps to reproduce your issue

iree-compile command

../iree-build/install/bin/iree-compile \
    --iree-hal-target-backends=llvm-cpu \
    --iree-input-type=tosa \
    --iree-llvmcpu-target-triple=riscv64 \
    --iree-llvmcpu-target-cpu=generic-rv64 \
    --iree-llvmcpu-target-abi=lp64d \
    --iree-llvmcpu-target-cpu-features="+m,+a,+f,+d,+zvl512b,+v" \
    --riscv-v-fixed-length-vector-lmul-max=8 \
    layernorm.mlir \
    -o layernorm.vmfb

What component(s) does this issue relate to?

MLIR

Version information

commit: 5cd1510a78e08ca16b8df2e3241a4c2d777ed653

Additional context

No response

rednoah91 commented 6 months ago

Hi @dcaballe, would you be interested in taking a look at this?

dcaballe commented 5 months ago

(Sorry, catching up after break)

I think the problem could be related to the fact that we are reducing the outer dimension here:

  %7 = vector.contract {indexing_maps = [#map, #map, #map1], iterator_types = ["reduction", "parallel"], kind = #vector.kind<add>} %6, %6, %arg1 : vector<1x32xf32>, vector<1x32xf32> into vector<32xf32>

and the outer-product strategy expects a very specific "matmul-like" contraction. If the contraction doesn't align with that, it will be converted to that "matmul-like" form by changing the layout of the inputs, and we may end up vectorizing a dimension that we shouldn't (see the scalar code above).

I would need to think a bit more about this, but perhaps we should use a different strategy only when the contraction op is not suitable to be represented as an outer product.
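
A rough sketch of what that could look like, using a hypothetical helper rather than actual IREE code: keep OuterProduct as the default, but fall back to ParallelArith when every reduction dimension of the contraction has unit extent, as in the vector<1x32xf32> case above, where the outer-product form degenerates into per-element scalar FMAs.

#include "mlir/Dialect/Vector/IR/VectorOps.h"
#include "mlir/Dialect/Vector/Transforms/VectorRewritePatterns.h"
#include "llvm/ADT/SmallVector.h"

using namespace mlir;

// Hypothetical heuristic: only reductions with non-unit extent benefit from
// the outer-product form; otherwise the parallel dims carry all of the work
// and map cleanly onto vector.fma.
static vector::VectorContractLowering
pickContractLowering(vector::ContractionOp op) {
  SmallVector<int64_t> bounds;
  op.getIterationBounds(bounds);
  ArrayRef<Attribute> iterators = op.getIteratorTypes().getValue();
  for (size_t i = 0, e = bounds.size(); i < e; ++i) {
    if (vector::isReductionIterator(iterators[i]) && bounds[i] != 1)
      return vector::VectorContractLowering::OuterProduct;
  }
  // All reduction dims are unit-sized (e.g. the 1x32 contract above):
  // lower to elementwise arith + vector.fma instead.
  return vector::VectorContractLowering::ParallelArith;
}

Since populateVectorContractLoweringPatterns takes a single VectorTransformsOptions for all ops, acting on a per-op decision like this would need a filtered pattern or a separate rewrite, which is part of what would need more thought.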