nod-ai / iree-amd-aie

IREE plugin repository for the AMD AIE accelerator
Apache License 2.0

Control of loop unrolling #873

Closed (newling closed 1 week ago)

newling commented 2 weeks ago

We're currently calling peano's version of opt here: https://github.com/nod-ai/iree-amd-aie/blob/f6482ae5dac14b6d116331df2a4b69b28c12559c/compiler/plugins/target/AMD-AIE/iree-amd-aie/Target/XCLBinGen.cpp#L1058

As you can see there, opt is called with the flags -O2 --inline-threshold=10.

With these flags, loops are unrolled very enthusiastically. This is fine for some workloads, but I think there are matmuls (@Abhishek-Varma @jtuyls) for which it would be good to not unroll quite so much.

Below are some data points on how much unrolling happens, and on which flags we have at our disposal to control it.

Example 1.

conv.ll: This is for the current vectorized convolution, which runs on a single column (4 AIE cores). It has 3 nested loops with trip counts 4, 3, and 3, inside which there is a matmul.

less conv.ll | grep "acc32.mac.conf" | wc -l

9: (2 per core, because of ping-pong, and 1 for the func decl)

Running opt with the current flags:

./bin/opt -inline-threshold=10   -O2  -S  conv.ll  | grep "acc32.mac.conf" | wc -l

289: (72 per core. 72 = 2*3*3*4, so this is full loop unrolling).
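The arithmetic behind these counts can be sketched as a small shell helper. The name per_core is hypothetical, and it assumes, as in the counts above, that exactly one matching line is the function declaration and the rest are split evenly across cores.

```shell
# per_core TOTAL NUM_CORES
# Hypothetical helper: matches per core, assuming exactly one of the
# TOTAL matching lines is the function declaration.
per_core() {
  echo $(( ($1 - 1) / $2 ))
}

per_core 9 4     # before opt: prints 2 (ping-pong)
per_core 289 4   # after opt with the current flags: prints 72 (= 2*3*3*4)
```

The division is integer division, so a total that does not fit the "one declaration plus an even split" assumption is silently rounded down.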

There are a few flags to control unrolling in llvm, see https://llvm.org/doxygen/LoopUnrollPass_8cpp.html#ab5709dc220a64908090b46d1d1f6309b

For example, setting unroll-threshold to a low value greatly reduces unrolling, while a higher value leaves the full unrolling in place:

./bin/opt -inline-threshold=10 --unroll-threshold=190   -O2  -S  conv.ll  | grep "acc32.mac.conf" | wc -l

289

./bin/opt -inline-threshold=10 --unroll-threshold=0   -O2  -S  conv.ll  | grep "acc32.mac.conf" | wc -l

33

Others that might be relevant are unroll-count, unroll-max-count etc.
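To compare several values of one of these flags, the manual runs above could be wrapped in a loop. This is only a sketch: count_matches and sweep_flag are hypothetical names, the other flags (-inline-threshold=10 -O2 -S) are copied from the commands above, and grep -c is used in place of grep | wc -l.

```shell
# count_matches OPT_BIN FLAG VALUE PATTERN INPUT
# Hypothetical helper: run opt with one unroll flag set and count
# the output lines that match PATTERN.
count_matches() {
  "$1" -inline-threshold=10 "--$2=$3" -O2 -S "$5" | grep -c "$4" || true
}

# sweep_flag OPT_BIN FLAG PATTERN INPUT VALUE...
# Print one "flag=value<TAB>count" line per VALUE.
sweep_flag() {
  sweep_opt=$1; sweep_name=$2; sweep_pat=$3; sweep_in=$4
  shift 4
  for sweep_val in "$@"; do
    printf '%s=%s\t%s\n' "$sweep_name" "$sweep_val" \
      "$(count_matches "$sweep_opt" "$sweep_name" "$sweep_val" "$sweep_pat" "$sweep_in")"
  done
}

# Example invocations (not run here; they need peano's opt and conv.ll):
#   sweep_flag ./bin/opt unroll-threshold "acc32.mac.conf" conv.ll 0 190
#   sweep_flag ./bin/opt unroll-max-count "acc32.mac.conf" conv.ll 1 2 4
```

The same loop works for unroll-count or unroll-max-count by changing the second argument; how those flags interact with -O2 on this target would still need to be checked empirically.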

newling commented 1 week ago

Example 2.

Matmul m=n=k=1024 with input bf16 (output f32).

Looking at the generated input.ll, we see that this example uses 4 cores:

grep -r "define void @core" input.ll  | wc -l

There are 8 calls per core to the (now outlined) matmul function:

 grep -r "matmul_0_outlined" input.ll  | wc -l

33 (and 33 = num_cores * matmuls_per_core + 1, so matmuls_per_core = 8)

 ${peano_opt}  -inline-threshold=10 --unroll-threshold=0   -O2  -S  input.ll  | grep "matmul_0_outlined" | wc -l

65 (16 per core, twice as many as before opt)

 ${peano_opt}  -inline-threshold=10 --unroll-threshold=256   -O2  -S  input.ll  | grep "matmul_0_outlined" | wc -l

65

 ${peano_opt}  -inline-threshold=10 --unroll-threshold=512   -O2  -S  input.ll  | grep "matmul_0_outlined" | wc -l

257

 ${peano_opt}  -inline-threshold=10 --unroll-threshold=654321   -O2  -S  input.ll  | grep "matmul_0_outlined" | wc -l

257
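The counts in this example can be turned into an effective unroll factor relative to the input. In this sketch, unroll_factor is a hypothetical helper that assumes one matching line in each count is not a call site (the "+ 1" in the formula above) and that the number of cores is unchanged by opt.

```shell
# unroll_factor BEFORE AFTER
# Hypothetical helper: ratio of call sites after opt to call sites
# before opt, dropping the one non-call line from each count.
unroll_factor() {
  echo $(( ($2 - 1) / ($1 - 1) ))
}

unroll_factor 33 65    # unroll-threshold=0 or 256: prints 2
unroll_factor 33 257   # unroll-threshold=512 or 654321: prints 8
```

Read this way, the measurements above show a 2x factor up to a threshold of 256 and an 8x factor from 512 upwards, so the change happens somewhere between those two values.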