nod-ai / iree-amd-aie

IREE plugin repository for the AMD AIE accelerator
Apache License 2.0

Benchmark: matmul ukernel vs direct codegen #415

Open newling opened 3 weeks ago

newling commented 3 weeks ago

Results from the end-to-end script, run locally (nuc50).

Direct codegen

Benchmark summary
=================
m: 2048
k: 4096
n: 2048
number of operations (2*m*n*k): 3.43597e+10
execution times for all 10 runs [s]:
    0.0586853 0.0539645 0.0539398 0.0539536 0.0540862 0.0540921 0.054008 0.0540939 0.0540261 0.0540442
teraops/second over all 10 runs:
    0.585491 0.63671 0.637002 0.636839 0.635277 0.635208 0.636197 0.635187 0.635984 0.635771
mean time over runs: 0.0544894 [s]
minimum time over runs: 0.0539398 [s]
max teraops/second: 0.637002 [teraops/second]

Using ukernel

Benchmark summary
=================
m: 2048
k: 4096
n: 2048
number of operations (2*M*N*K): 3.43597e+10
execution times for all 10 runs [s]:
    0.0219439 0.0181417 0.0181282 0.0177506 0.0187402 0.0183464 0.0187545 0.0181528 0.0177086 0.0180813
teraops/second over all 10 runs:
    1.5658 1.89397 1.89537 1.93569 1.83348 1.87283 1.83208 1.89281 1.94029 1.90029
mean time over runs: 0.0185748 [s]
minimum time over runs: 0.0177086 [s]
max teraops/second: 1.94029 [teraops/second]
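
The summary figures in both runs can be re-derived from the raw per-run times. A minimal sketch in Python (this is not the benchmarking script used in this issue; the helper name `summarize` is made up for illustration):

```python
# Raw per-run times [s], copied from the two benchmark summaries.
dcg_times = [0.0586853, 0.0539645, 0.0539398, 0.0539536, 0.0540862,
             0.0540921, 0.054008, 0.0540939, 0.0540261, 0.0540442]
ukernel_times = [0.0219439, 0.0181417, 0.0181282, 0.0177506, 0.0187402,
                 0.0183464, 0.0187545, 0.0181528, 0.0177086, 0.0180813]

m, k, n = 2048, 4096, 2048
ops = 2 * m * n * k  # one multiply and one add per (m, n, k) triple


def summarize(times):
    """Return (mean time [s], min time [s], max teraops/second)."""
    mean_t = sum(times) / len(times)
    min_t = min(times)
    # The fastest run gives the highest throughput.
    return mean_t, min_t, ops / min_t / 1e12


dcg_mean, dcg_min, dcg_peak = summarize(dcg_times)
uk_mean, uk_min, uk_peak = summarize(ukernel_times)

print(f"dcg:     mean {dcg_mean:.7f} s, min {dcg_min:.7f} s, {dcg_peak:.3f} teraops/s")
print(f"ukernel: mean {uk_mean:.7f} s, min {uk_min:.7f} s, {uk_peak:.3f} teraops/s")
print(f"speedup: {dcg_mean / uk_mean:.2f}x (mean), {dcg_min / uk_min:.2f}x (min)")
```

On these numbers the end-to-end speedup comes out at roughly 2.9x on mean time and roughly 3.0x on minimum time. The first run is noticeably slower than the rest in both experiments, so the minimum is probably the more representative figure.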

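So the ukernel approach is currently about 3x faster end to end (mean time 0.0186 s vs 0.0545 s). This is a lower bound on the core speedup, though (i.e. the ukernel itself is probably more than 3x faster than the direct-codegen core code). Consider: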

total-time-ukernel = time-in-ukernel + other-time
total-time-dcg = time-in-dcg + other-time

where other-time is the same in the two experiments, because only the core instruction memory differs (DMA data movement is identical). We observed that

total-time-ukernel / total-time-dcg = 1/3

so that

time-in-ukernel / time-in-dcg = 1/3 - 2/3 * (other-time / time-in-dcg) < 1/3

as other-time is the same in both approaches (data movement between DDR <-> memtile <-> core is identical).
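
For reference, the algebra between the observed ratio and the inequality, written out step by step. The single-letter symbols are shorthand introduced here, not part of the original notation: $u$ = time-in-ukernel, $d$ = time-in-dcg, $o$ = other-time.

```latex
\begin{align*}
\frac{u + o}{d + o} &= \frac{1}{3} \\
3u + 3o &= d + o \\
3u &= d - 2o \\
\frac{u}{d} &= \frac{1}{3} - \frac{2}{3}\cdot\frac{o}{d} \;<\; \frac{1}{3}
  \quad \text{whenever } o > 0 .
\end{align*}
```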

I think the theoretical max on this Phoenix machine is 4 teraops/second, so the ukernel approach (1.94 teraops/second) is at roughly 50% of the theoretical max.

Two extremes:

1) All time in ukernel, i.e. other-time = 0. Then time-in-dcg = 3 * time-in-ukernel.
2) 50% of time in ukernel (i.e. the ukernel itself is 100% efficient), i.e. other-time = time-in-ukernel. Then time-in-dcg = 5 * time-in-ukernel.

So the ukernel itself is between 3x and 5x faster than the direct-codegen (dcg) core code.
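
The two extremes are endpoints of one formula. A small sketch, assuming the model above (same other-time in both runs); the function name `core_speedup` and the parameter `other_fraction` (the share of the ukernel run's total time that is other-time) are illustrative and not from the benchmark script:

```python
def core_speedup(total_ratio, other_fraction):
    """Return time-in-dcg / time-in-ukernel.

    total_ratio:    total-time-dcg / total-time-ukernel (about 3 here).
    other_fraction: other-time / total-time-ukernel, in [0, 1).
    """
    if not 0.0 <= other_fraction < 1.0:
        raise ValueError("other_fraction must be in [0, 1)")
    # Normalise total-time-ukernel to 1:
    #   time-in-ukernel = 1 - other_fraction
    #   time-in-dcg     = total_ratio - other_fraction
    return (total_ratio - other_fraction) / (1.0 - other_fraction)


# Sweep from "all time in ukernel" to "half the time in ukernel".
for f in (0.0, 0.1, 0.25, 0.5):
    print(f"other-time share {f:.2f}: core speedup {core_speedup(3.0, f):.2f}x")
```

With `total_ratio = 3`, `other_fraction = 0` reproduces the 3x extreme and `other_fraction = 0.5` reproduces the 5x extreme. Where the real value falls between them depends on how much of the run is DMA-bound, which these measurements alone do not reveal.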