hanhanW opened 9 months ago
Note: the flattening is needed for LHS packing as well.
I will use the three cases below to drive the optimization work.
func.func @pack_i8(%source: tensor<?x?xi8>) -> tensor<?x?x16x2xi8> {
  %c0 = arith.constant 0 : index
  %c1 = arith.constant 1 : index
  %d0 = tensor.dim %source, %c0 : tensor<?x?xi8>
  %d1 = tensor.dim %source, %c1 : tensor<?x?xi8>
  %c16 = arith.constant 16 : index
  %c2 = arith.constant 2 : index
  %tiled_d0 = arith.ceildivui %d0, %c2 : index
  %tiled_d1 = arith.ceildivui %d1, %c16 : index
  %zero = arith.constant 0 : i8
  %init_pack = tensor.empty(%tiled_d1, %tiled_d0) : tensor<?x?x16x2xi8>
  %pack = tensor.pack %source
      padding_value(%zero: i8)
      outer_dims_perm = [1, 0] inner_dims_pos = [1, 0] inner_tiles = [16, 2]
      into %init_pack : tensor<?x?xi8> -> tensor<?x?x16x2xi8>
  return %pack : tensor<?x?x16x2xi8>
}
func.func @pack_bf16(%source: tensor<?x?xbf16>) -> tensor<?x?x16x2xbf16> {
  %c0 = arith.constant 0 : index
  %c1 = arith.constant 1 : index
  %d0 = tensor.dim %source, %c0 : tensor<?x?xbf16>
  %d1 = tensor.dim %source, %c1 : tensor<?x?xbf16>
  %c16 = arith.constant 16 : index
  %c2 = arith.constant 2 : index
  %tiled_d0 = arith.ceildivui %d0, %c2 : index
  %tiled_d1 = arith.ceildivui %d1, %c16 : index
  %zero = arith.constant 0.000000e+00 : bf16
  %init_pack = tensor.empty(%tiled_d1, %tiled_d0) : tensor<?x?x16x2xbf16>
  %pack = tensor.pack %source
      padding_value(%zero: bf16)
      outer_dims_perm = [1, 0] inner_dims_pos = [1, 0] inner_tiles = [16, 2]
      into %init_pack : tensor<?x?xbf16> -> tensor<?x?x16x2xbf16>
  return %pack : tensor<?x?x16x2xbf16>
}
// i4 cannot be a function input/output type, so the source is truncated to i4
// and the result is extended back to i8.
func.func @pack_i4(%source: tensor<?x?x?xi8>) -> tensor<?x?x?x32x8xi8> {
  %source_i4 = arith.trunci %source : tensor<?x?x?xi8> to tensor<?x?x?xi4>
  %c0 = arith.constant 0 : index
  %c1 = arith.constant 1 : index
  %c2 = arith.constant 2 : index
  %d0 = tensor.dim %source_i4, %c0 : tensor<?x?x?xi4>
  %d1 = tensor.dim %source_i4, %c1 : tensor<?x?x?xi4>
  %d2 = tensor.dim %source_i4, %c2 : tensor<?x?x?xi4>
  %c32 = arith.constant 32 : index
  %c8 = arith.constant 8 : index
  %tiled_d0 = arith.ceildivui %d0, %c32 : index
  %tiled_d2 = arith.ceildivui %d2, %c8 : index
  %zero = arith.constant 0 : i4
  %init_pack = tensor.empty(%d1, %tiled_d0, %tiled_d2) : tensor<?x?x?x32x8xi4>
  %pack = tensor.pack %source_i4
      padding_value(%zero: i4)
      outer_dims_perm = [1, 0, 2] inner_dims_pos = [0, 2] inner_tiles = [32, 8]
      into %init_pack : tensor<?x?x?xi4> -> tensor<?x?x?x32x8xi4>
  %res = arith.extsi %pack : tensor<?x?x?x32x8xi4> to tensor<?x?x?x32x8xi8>
  return %res : tensor<?x?x?x32x8xi8>
}
Didn't we already have a pass to make the innermost dimension larger?
Yes, we have. The patterns make the innermost dimension as large as possible, i.e., they flatten it into a big 1-D vector. Flattening to 1-D vectors seems to cause a huge compile-time issue (https://github.com/openxla/iree/pull/16239). I will need some time to investigate it, and I think we want some control here.
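For illustration, here is roughly what that full flattening looks like, with made-up static shapes and a made-up function name (none of this is taken from the actual repro): every contiguous trailing dim gets collapsed, so the whole tile is written through one wide 1-D transfer.

```mlir
// Minimal sketch with hypothetical static shapes: collapsing all dims turns
// the write of a 64x16x2xi8 tile into a single 2048-element 1-D transfer.
// The bigger the tile, the wider this vector gets, which is presumably where
// the compile-time blowup comes from.
func.func @flatten_everything(%v: vector<64x16x2xi8>, %buf: memref<64x16x2xi8>) {
  %c0 = arith.constant 0 : index
  // Collapse the whole buffer into a 1-D memref.
  %flat_buf = memref.collapse_shape %buf [[0, 1, 2]]
      : memref<64x16x2xi8> into memref<2048xi8>
  // Matching 1-D view of the value being written.
  %flat_v = vector.shape_cast %v : vector<64x16x2xi8> to vector<2048xi8>
  vector.transfer_write %flat_v, %flat_buf[%c0] {in_bounds = [true]}
      : vector<2048xi8>, memref<2048xi8>
  return
}
```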
That's surprising as we should effectively be unrolling less... It's only one model so maybe there's a collateral effect happening...
https://github.com/iree-org/iree/pull/16456 should address the issue. I'll revisit how to land the PR.
VectorTransferLowering generates inefficient vector.store ops because the innermost dim is 2xi8, so the transfer gets fully unrolled.
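A minimal sketch of what that unrolled output looks like, with made-up static shapes and names (the real IR in the repro has more surrounding structure): each row of the 16x2xi8 tile becomes its own 2-byte store.

```mlir
// Minimal sketch with hypothetical shapes: full unrolling emits one narrow
// vector<2xi8> store per outer element of the tile instead of a single wide
// store.
func.func @unrolled_tile_store(%tile: vector<16x2xi8>, %buf: memref<16x2xi8>) {
  %c0 = arith.constant 0 : index
  %c1 = arith.constant 1 : index
  %row0 = vector.extract %tile[0] : vector<2xi8> from vector<16x2xi8>
  vector.store %row0, %buf[%c0, %c0] : memref<16x2xi8>, vector<2xi8>
  %row1 = vector.extract %tile[1] : vector<2xi8> from vector<16x2xi8>
  vector.store %row1, %buf[%c1, %c0] : memref<16x2xi8>, vector<2xi8>
  // ... and so on for the remaining 14 rows of the tile ...
  return
}
```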
The potential fix is to flatten the innermost dims (with memref.collapse_shape) up to the vector length. We have some support upstream, but we need to add control over the patterns; otherwise they generate a big 1-D vector.
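A minimal sketch of the intended shape of the fix, again with made-up static shapes: collapse only the trailing dims, capped at the target vector length, so each tile is stored with one 32-byte vector instead of sixteen 2-byte stores or one huge 1-D vector.

```mlir
// Minimal sketch with hypothetical shapes: collapsing just the inner 16x2
// dims gives a 32-byte store, which fits comfortably in a single SIMD
// register on the cascadelake target used in the repro command below.
func.func @collapsed_tile_store(%tile: vector<16x2xi8>, %buf: memref<16x2xi8>) {
  %c0 = arith.constant 0 : index
  %flat_buf = memref.collapse_shape %buf [[0, 1]]
      : memref<16x2xi8> into memref<32xi8>
  %flat_tile = vector.shape_cast %tile : vector<16x2xi8> to vector<32xi8>
  vector.store %flat_tile, %flat_buf[%c0] : memref<32xi8>, vector<32xi8>
  return
}
```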
To repro, run:
iree-compile --output-format=vm-bytecode --iree-hal-target-backends=llvm-cpu --iree-llvmcpu-target-cpu=cascadelake --iree-llvmcpu-target-triple=x86_64-unknown-linux-gnu ~/repro.mlir -o /tmp/a.vmfb