mratsim / laser

The HPC toolbox: fused matrix multiplication, convolution, data-parallel strided tensor primitives, OpenMP facilities, SIMD, JIT Assembler, CPU detection, state-of-the-art vectorized BLAS for floats and integers
Apache License 2.0

Iteration code size comparison #1

Closed mratsim closed 6 years ago

mratsim commented 6 years ago

As of https://github.com/numforge/laser/blob/04a675950b651535dc5b6cdd2a62706755742270/benchmarks/loop_iteration/iter_bench.nim

These counts are approximate: the start and stop points of each measurement are accurate to within ±10 instructions.

Global ref iter: ~710 instructions (screenshot 2018-10-17_11-52-03)

Global TRIOT: ~1950 instructions (screenshot 2018-10-17_11-53-42)

Per tensor ref iter: ~830 instructions (screenshot 2018-10-17_11-55-58)

Fused per tensor ref iter: ~600 instructions (screenshot 2018-10-17_11-59-04)

Note that GCC auto-unrolls some of the loops, so the counts above include the unrolled loop bodies.
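For readers unfamiliar with what these loops do, here is a minimal sketch of an odometer-style strided iteration, the general kind of loop whose generated code size is compared above. It is written in C for illustration only and is not the code from `iter_bench.nim`; the `StridedView` type, `MAX_RANK`, and `strided_sum` are hypothetical names, and the exact schemes benchmarked (global, per tensor, fused, TRIOT) may differ from this.

```c
#include <stddef.h>

#define MAX_RANK 4  /* assumption: small fixed maximum rank for the sketch */

/* Hypothetical minimal strided view; not the laser tensor type. */
typedef struct {
    const float *data;
    size_t rank;
    size_t shape[MAX_RANK];
    ptrdiff_t strides[MAX_RANK];  /* strides counted in elements, not bytes */
} StridedView;

/* Sum every element by advancing a coordinate counter like an odometer. */
static float strided_sum(const StridedView *v) {
    size_t coord[MAX_RANK] = {0};
    size_t total = 1;
    for (size_t k = 0; k < v->rank; ++k) total *= v->shape[k];

    ptrdiff_t offset = 0;
    float acc = 0.0f;
    for (size_t i = 0; i < total; ++i) {
        acc += v->data[offset];
        /* Advance: bump the innermost dimension; when it overflows,
           rewind it to zero and carry into the next outer dimension. */
        for (size_t k = v->rank; k-- > 0;) {
            if (coord[k] + 1 < v->shape[k]) {
                coord[k] += 1;
                offset += v->strides[k];
                break;
            }
            offset -= (ptrdiff_t)coord[k] * v->strides[k];
            coord[k] = 0;
        }
    }
    return acc;
}
```

The coordinate bookkeeping in the inner `for` is what every strided scheme has to pay for in some form, and how a compiler inlines, duplicates, or unrolls it is one plausible source of differences in instruction count like those listed above.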