Benchmark summary
=================
m: 2048
k: 4096
n: 2048
number of operations (2*m*n*k): 3.43597e+10
execution times for all 10 runs [s]:
0.0586853 0.0539645 0.0539398 0.0539536 0.0540862 0.0540921 0.054008 0.0540939 0.0540261 0.0540442
teraops/second over all 10 runs:
0.585491 0.63671 0.637002 0.636839 0.635277 0.635208 0.636197 0.635187 0.635984 0.635771
mean time over runs: 0.0544894 [s]
minimum time over runs: 0.0539398 [s]
max teraops/second: 0.637002 [teraops/second]
Using ukernel
Benchmark summary
=================
m: 2048
k: 4096
n: 2048
number of operations (2*M*N*K): 3.43597e+10
execution times for all 10 runs [s]:
0.0219439 0.0181417 0.0181282 0.0177506 0.0187402 0.0183464 0.0187545 0.0181528 0.0177086 0.0180813
teraops/second over all 10 runs:
1.5658 1.89397 1.89537 1.93569 1.83348 1.87283 1.83208 1.89281 1.94029 1.90029
mean time over runs: 0.0185748 [s]
minimum time over runs: 0.0177086 [s]
max teraops/second: 1.94029 [teraops/second]
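The figures in the two summaries can be cross-checked with a few lines of Python. This is only a sketch: the times are copied by hand from the logs above, the first summary is taken to be direct codegen, and the names `runs` and `teraops_per_second` are illustrative, not part of the benchmark script.

```python
# Problem size reported in both summaries.
m, k, n = 2048, 4096, 2048
ops = 2 * m * n * k  # one multiply and one add per (m, n, k) triple

# Minimum and mean times [s], copied by hand from the two summaries above.
runs = {
    "direct codegen": {"min": 0.0539398, "mean": 0.0544894},
    "ukernel": {"min": 0.0177086, "mean": 0.0185748},
}


def teraops_per_second(seconds):
    """Throughput in teraops/second for one run of the matmul."""
    return ops / seconds / 1e12


for name, t in runs.items():
    print(f"{name}: max {teraops_per_second(t['min']):.3f} teraops/second")

# End-to-end speedup of ukernel over direct codegen.
speedup_min = runs["direct codegen"]["min"] / runs["ukernel"]["min"]
speedup_mean = runs["direct codegen"]["mean"] / runs["ukernel"]["mean"]
print(f"speedup: {speedup_min:.2f}x (min time), {speedup_mean:.2f}x (mean time)")
```

Using the minimum time gives the "max teraops/second" line of each summary. The mean time is slightly more pessimistic because the first run of each benchmark is slower than the rest.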
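So the ukernel approach is currently about 3x faster end to end (3.05x on minimum time, 2.93x on mean time). This is a lower bound on the kernel speedup, i.e. the core ukernel is probably more than 3x faster. Consider:
total-time-dcg = time-in-dcg + other-time
total-time-ukernel = time-in-ukernel + other-time
where other-time is the same in both approaches, because the data movement between DDR <-> memtile <-> core is identical. We observed that total-time-dcg = 3 total-time-ukernel, so that
time-in-dcg = 3 time-in-ukernel + 2 other-time
I think on this phoenix machine the theoretical max is 4 teraops/second, so the ukernel approach is at roughly 50% of the theoretical max (1.94 / 4).
Two extremes:
1) All time in ukernel, i.e. other-time = 0. Then time-in-dcg = 3 time-in-ukernel.
2) 50% of time in ukernel (i.e. the ukernel itself is 100% efficient), i.e. other-time = time-in-ukernel. Then time-in-dcg = 5 time-in-ukernel.
So the ukernel itself is between 3x and 5x faster than dcg.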
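The 3x to 5x range can be checked numerically. The sketch below assumes a simple additive model (total time = time in the compute kernel + other-time, with other-time identical in both runs) and the observed 3x end-to-end ratio. The function `kernel_speedup` and its parameter names are hypothetical, introduced only for this illustration.

```python
def kernel_speedup(end_to_end_ratio, other_fraction):
    """Speedup of the core ukernel over the core dcg kernel.

    end_to_end_ratio: total-time-dcg / total-time-ukernel (about 3 here).
    other_fraction: share of the ukernel run's total time that is spent
        outside the kernel (other-time), with 0 <= other_fraction < 1.
    """
    if not 0.0 <= other_fraction < 1.0:
        raise ValueError("other_fraction must be in [0, 1)")
    # Normalize the total ukernel time to 1; other-time is shared by both runs.
    time_in_ukernel = 1.0 - other_fraction
    time_in_dcg = end_to_end_ratio - other_fraction
    return time_in_dcg / time_in_ukernel


# Extreme 1: all time in the ukernel (other-time = 0).
print(kernel_speedup(3.0, 0.0))
# Extreme 2: half of the time in the ukernel (other-time = time-in-ukernel).
print(kernel_speedup(3.0, 0.5))
```

Any other-time share between the two extremes gives a kernel speedup between these two values, which is why the end-to-end 3x is only a lower bound.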
Setup for both runs above: end-to-end script, run locally (nuc50). The first summary is direct codegen (dcg), the second uses the ukernel. Only the instruction memory differs between the two runs.