Open · kuhar opened this issue 6 months ago
Have you tried tensor constants and the tensor.extract op? We are able to vectorize tensor.extract using vector.gather/vector.transfer_read ops. To repro:
```shell
iree-opt --pass-pipeline="builtin.module(func.func(iree-codegen-generic-vectorization{enable-vector-masking=false use-configured-vector-sizes=false}))" ~/z.mlir
```
```mlir
func.func @main(%30 : tensor<?x64x128xf16>) -> tensor<?x28672xf16> {
  %c0 = arith.constant 0 : index
  %c32_i64 = arith.constant 32 : i64
  %cst = arith.constant 0.000000e+00 : f16
  %dim = tensor.dim %30, %c0 : tensor<?x64x128xf16>
  %27 = util.unfoldable_constant dense<2> : tensor<28672x64x128xi4>
  %28 = util.unfoldable_constant dense<1.4> : tensor<28672x64xf16>
  %29 = util.unfoldable_constant dense<2.4> : tensor<28672x64xf16>
  %31 = tensor.empty(%dim) : tensor<?x28672xf16>
  %32 = tensor.empty() : tensor<28672x64x128xf16>
  %nums = arith.constant dense<[0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 11.0, 12.0, 13.0, 14.0, 15.0]> : tensor<16xf16>
  %33 = linalg.generic {indexing_maps = [affine_map<(d0, d1, d2) -> (d0, d1, d2)>, affine_map<(d0, d1, d2) -> (d0, d1)>, affine_map<(d0, d1, d2) -> (d0, d1)>, affine_map<(d0, d1, d2) -> (d0, d1, d2)>], iterator_types = ["parallel", "parallel", "parallel"]} ins(%27, %28, %29 : tensor<28672x64x128xi4>, tensor<28672x64xf16>, tensor<28672x64xf16>) outs(%32 : tensor<28672x64x128xf16>) {
  ^bb0(%in: i4, %in_0: f16, %in_1: f16, %out: f16):
    %idx = arith.index_cast %in : i4 to index
    %fp = tensor.extract %nums[%idx] : tensor<16xf16>
    %38 = arith.subf %fp, %in_1 : f16
    %39 = arith.mulf %38, %in_0 : f16
    linalg.yield %39 : f16
  } -> tensor<28672x64x128xf16>
  %34 = linalg.fill ins(%cst : f16) outs(%31 : tensor<?x28672xf16>) -> tensor<?x28672xf16>
  %35 = linalg.generic {indexing_maps = [affine_map<(d0, d1, d2, d3) -> (d0, d2, d3)>, affine_map<(d0, d1, d2, d3) -> (d1, d2, d3)>, affine_map<(d0, d1, d2, d3) -> (d0, d1)>], iterator_types = ["parallel", "parallel", "reduction", "reduction"]} ins(%30, %33 : tensor<?x64x128xf16>, tensor<28672x64x128xf16>) outs(%34 : tensor<?x28672xf16>) {
  ^bb0(%in: f16, %in_0: f16, %out: f16):
    %36 = arith.mulf %in, %in_0 : f16
    %37 = arith.addf %36, %out : f16
    linalg.yield %37 : f16
  } -> tensor<?x28672xf16>
  return %35 : tensor<?x28672xf16>
}
```
The sizes should look much better after tiling.
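For illustration, vectorizing the scalar tensor.extract produces a masked gather roughly like this (a hand-written sketch assuming a vector width of 8 and an already-vectorized operand %in_vec; not actual compiler output):
```mlir
// Sketch only: the per-element table lookup becomes a vector.gather
// over the 16-entry constant table %nums.
%c0 = arith.constant 0 : index
%mask = arith.constant dense<true> : vector<8xi1>
%pass = arith.constant dense<0.000000e+00> : vector<8xf16>
%idxs = arith.index_cast %in_vec : vector<8xi4> to vector<8xindex>
%vals = vector.gather %nums[%c0][%idxs], %mask, %pass
    : tensor<16xf16>, vector<8xindex>, vector<8xi1>, vector<8xf16> into vector<8xf16>
```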
Thanks for the suggestion, @hanhanW. This makes more progress but fails to distribute:
```
dequant_lut.mlir:21:11: error: 'func.func' op failed to distribute
%35 = linalg.generic {indexing_maps = [affine_map<(d0, d1, d2, d3) -> (d0, d2, d3)>, affine_map<(d0, d1, d2, d3) -> (d1, d2, d3)>, affine_map<(d0, d1, d2, d3) -> (d0, d1)>], iterator_types = ["parallel", "parallel", "reduction", "reduction"]} ins(%30, %33 : tensor<?x64x128xf16>, tensor<28672x64x128xf16>) outs(%34 : tensor<?x28672xf16>) {
          ^
```
I'd need to check with your flags -- I'm not sure I can set them in iree-compile. This is low-priority, BTW; I opened this issue to have a repro in case we need it in the future.
I see. Feel free to reach out if there are any vectorization issues/questions; I'm happy to help.
The flags are available in iree-compile. You'll need to add options like:
I tried to implement the i4 -> f16 conversion with a lookup table based on a dynamic vector extract:
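The original snippet is elided here, but the shape of the kernel is clear from context; a minimal reconstruction (illustrative only, using the dynamic-position form of vector.extract) looks like:
```mlir
// Illustrative reconstruction, not the original IR: dequantize i4 values
// by indexing a 16-entry lookup table held in a vector constant.
func.func @dequant_lut(%quant : tensor<28672x64x128xi4>,
                       %init : tensor<28672x64x128xf16>) -> tensor<28672x64x128xf16> {
  %lut = arith.constant dense<[0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0,
                               9.0, 10.0, 11.0, 12.0, 13.0, 14.0, 15.0]> : vector<16xf16>
  %0 = linalg.generic {
      indexing_maps = [affine_map<(d0, d1, d2) -> (d0, d1, d2)>,
                       affine_map<(d0, d1, d2) -> (d0, d1, d2)>],
      iterator_types = ["parallel", "parallel", "parallel"]}
      ins(%quant : tensor<28672x64x128xi4>) outs(%init : tensor<28672x64x128xf16>) {
  ^bb0(%in: i4, %out: f16):
    %idx = arith.index_cast %in : i4 to index
    // The dynamic vector.extract is the lookup that doesn't vectorize.
    %fp = vector.extract %lut[%idx] : f16 from vector<16xf16>
    linalg.yield %fp : f16
  } -> tensor<28672x64x128xf16>
  return %0 : tensor<28672x64x128xf16>
}
```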
Compile command:
The first linalg.generic doesn't get expanded like I'd expect in iree-codegen-generic-vectorization. This lowers fine if the lookup is replaced with:
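The replacement snippet is also elided; one stand-in that would lower fine, assuming the elided code did something similar, is a direct arithmetic conversion in the generic body. Since the table maps each i4 value i to float(i), this is semantically equivalent:
```mlir
// Assumed replacement for the two lookup lines in the kernel body:
// convert the i4 value directly instead of going through the table.
%ext = arith.extui %in : i4 to i32
%fp = arith.uitofp %ext : i32 to f16
```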