newling closed 1 week ago
Matmul with m=n=k=1024, bf16 inputs and f32 output.
Looking at the generated input.ll, we see 4 cores in this example:
grep -r "define void @core" input.ll | wc -l
There are 8 calls per core to the (now outlined) matmul function:
grep -r "matmul_0_outlined" input.ll | wc -l
33 (and 33 = num_cores * matmuls_per_core + 1, so matmuls_per_core = 8)
${peano_opt} -inline-threshold=10 --unroll-threshold=0 -O2 -S input.ll | grep "matmul_0_outlined" | wc -l
65 (16 per core, twice as many as before opt)
${peano_opt} -inline-threshold=10 --unroll-threshold=256 -O2 -S input.ll | grep "matmul_0_outlined" | wc -l
65
${peano_opt} -inline-threshold=10 --unroll-threshold=512 -O2 -S input.ll | grep "matmul_0_outlined" | wc -l
257
${peano_opt} -inline-threshold=10 --unroll-threshold=654321 -O2 -S input.ll | grep "matmul_0_outlined" | wc -l
257
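The four measurements above can be collected in one pass. This is a minimal sketch, assuming `peano_opt` points at the Peano `opt` binary as in the commands above; the function name `sweep_unroll_threshold` and the default of 4 cores are illustrative choices, not part of the toolchain.

```shell
# Sketch: count matmul call sites for several --unroll-threshold values.
# Assumes ${peano_opt} is set to the Peano opt binary.
sweep_unroll_threshold() {
  local ll_file=$1 num_cores=${2:-4}
  local threshold lines
  for threshold in 0 256 512 654321; do
    # grep -c counts matching lines, like `grep ... | wc -l`.
    lines=$("${peano_opt}" -inline-threshold=10 --unroll-threshold="${threshold}" \
              -O2 -S "${ll_file}" | grep -c "matmul_0_outlined")
    # One matching line is the function declaration; the rest are calls.
    echo "threshold=${threshold} lines=${lines} per_core=$(( (lines - 1) / num_cores ))"
  done
}

# Usage: sweep_unroll_threshold input.ll 4
```

With the numbers reported above, the `per_core` column would read 16, 16, 64, 64.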
We're currently calling peano's version of `opt` here: https://github.com/nod-ai/iree-amd-aie/blob/f6482ae5dac14b6d116331df2a4b69b28c12559c/compiler/plugins/target/AMD-AIE/iree-amd-aie/Target/XCLBinGen.cpp#L1058
As you can see there, `opt` is called with the flags `-O2 --inline-threshold=10`.
With these flags, loops are unrolled very enthusiastically. This is fine for some workloads, but I think there are matmuls (@Abhishek-Varma @jtuyls) for which it would be good to not unroll quite so much.
Below are some data points on how much unrolling happens, and what flags we have at our disposal to control unrolling.
Example 1.
conv.ll This is for the current vectorized convolution which runs on a single column (4 AIE cores). It has 3 nested loops of counts 4, 3, and 3, inside which there is a matmul.
Counting matmul occurrences before `opt` gives 9 (2 per core, because of ping-pong, and 1 for the func decl).
Running `opt` with the current flags gives 289 (72 per core; 72 = 2 * 3 * 3 * 4, so this is full loop unrolling).
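The arithmetic behind that count can be checked directly in the shell. This is a small sketch that uses only the numbers quoted for conv.ll; the variable names are illustrative.

```shell
# Numbers taken from the description of conv.ll.
num_cores=4
calls_in_loop_body=2        # ping-pong: 2 matmul calls per core before opt
loop_trip_counts="4 3 3"    # the three nested loops

# Full unrolling multiplies the calls in the body by every trip count.
calls_per_core=${calls_in_loop_body}
for trips in ${loop_trip_counts}; do
  calls_per_core=$(( calls_per_core * trips ))
done

# +1 for the function declaration line that grep also matches.
expected_lines=$(( num_cores * calls_per_core + 1 ))
echo "calls_per_core=${calls_per_core} expected_lines=${expected_lines}"
# prints: calls_per_core=72 expected_lines=289
```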
There are a few flags to control unrolling in LLVM; see https://llvm.org/doxygen/LoopUnrollPass_8cpp.html#ab5709dc220a64908090b46d1d1f6309b
For example, using `unroll-threshold` with a low threshold can suppress most of the unrolling: the count drops from 289 to 33.
Others that might be relevant are `unroll-count`, `unroll-max-count`, etc.
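These can be tried in the same way as `unroll-threshold`. Below is a sketch, assuming `peano_opt` points at the Peano `opt` binary and that it accepts the same loop-unroll options as upstream LLVM. Going by the upstream option descriptions, `unroll-count` forces a given unroll count and `unroll-max-count` caps the count for partial and runtime unrolling. The helper name `count_calls_with_flags` is hypothetical, the values are arbitrary, and their effect on these files has not been measured here.

```shell
# Sketch: count lines mentioning a function after running opt with extra flags.
# Assumes ${peano_opt} is set to the Peano opt binary.
count_calls_with_flags() {
  local ll_file=$1 pattern=$2
  shift 2
  # Remaining arguments are passed through to opt unchanged.
  "${peano_opt}" -inline-threshold=10 -O2 "$@" -S "${ll_file}" | grep -c "${pattern}"
}

# Usage (values are illustrative only):
#   count_calls_with_flags input.ll matmul_0_outlined --unroll-max-count=2
#   count_calls_with_flags input.ll matmul_0_outlined --unroll-count=1
```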