The HPC toolbox: fused matrix multiplication, convolution, data-parallel strided tensor primitives, OpenMP facilities, SIMD, JIT Assembler, CPU detection, state-of-the-art vectorized BLAS for floats and integers
Iteration code size comparison #1
Closed
mratsim closed 6 years ago
As of https://github.com/numforge/laser/blob/04a675950b651535dc5b6cdd2a62706755742270/benchmarks/loop_iteration/iter_bench.nim
These counts are approximate: the start and stop points are accurate to within ±10 instructions.
Global ref iter ~710 instructions
Global TRIOT ~1950 instructions
Per tensor ref iter ~830 instructions
Fused per tensor ref iter ~600 instructions
Note that GCC auto-unrolls some loops, so some of these counts include unrolled loop bodies.