Warmup: 0.9007 s, result 224 (displayed to avoid compiler optimizing warmup away)
Max reduction - prod impl - float32
Collected 1000 samples in 0.250 seconds
Average time: 0.248 ms
Stddev time: 0.641 ms
Min time: 0.149 ms
Max time: 8.449 ms
Theoretical perf: 40287.484 MFLOP/s
Display sum of samples sums to make sure it's not optimized away
0.9999996423721313
Reduction - 1 accumulator - simple iter - float32
Collected 1000 samples in 18.544 seconds
Average time: 18.543 ms
Stddev time: 0.234 ms
Min time: 18.470 ms
Max time: 25.110 ms
Theoretical perf: 539.277 MFLOP/s
Display sum of samples sums to make sure it's not optimized away
0.9999996423721313
Reduction - 1 accumulator - macro iter - float32
Collected 1000 samples in 18.603 seconds
Average time: 18.602 ms
Stddev time: 0.037 ms
Min time: 18.472 ms
Max time: 18.687 ms
Theoretical perf: 537.569 MFLOP/s
Display sum of samples sums to make sure it's not optimized away
0.9999996423721313
Reduction - 2 accumulators - simple iter - float32
Collected 1000 samples in 10.287 seconds
Average time: 10.286 ms
Stddev time: 0.046 ms
Min time: 10.212 ms
Max time: 10.451 ms
Theoretical perf: 972.164 MFLOP/s
Display sum of samples sums to make sure it's not optimized away
0.9999996423721313
Reduction - 3 accumulators - simple iter - float32
Collected 1000 samples in 7.722 seconds
Average time: 7.721 ms
Stddev time: 0.094 ms
Min time: 7.574 ms
Max time: 8.015 ms
Theoretical perf: 1295.233 MFLOP/s
Display sum of samples sums to make sure it's not optimized away
0.9999996423721313
Reduction - 4 accumulators - simple iter - float32
Collected 1000 samples in 6.062 seconds
Average time: 6.061 ms
Stddev time: 0.055 ms
Min time: 5.965 ms
Max time: 6.221 ms
Theoretical perf: 1649.943 MFLOP/s
Display sum of samples sums to make sure it's not optimized away
0.9999994039535522
Reduction - 5 accumulators - simple iter - float32
Collected 1000 samples in 5.506 seconds
Average time: 5.505 ms
Stddev time: 0.058 ms
Min time: 5.395 ms
Max time: 5.796 ms
Theoretical perf: 1816.395 MFLOP/s
Display sum of samples sums to make sure it's not optimized away
0.9999996423721313
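This creates a proper API for the reduction primitives min/max/sum, to address #36.
It's about 80x faster than the naive reduction on my 18-core machine (average time 0.248 ms for the prod impl vs. 18.543 ms for the 1-accumulator version in the benchmark output above).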
mm_max_ps intrinsics (see #38)
Benchmark: https://github.com/numforge/laser/blob/f4930cb03f9bf8ec4180f8c34d7b12552c3ebb08/benchmarks/fp_reduction_latency/reduction_max_bench.nim
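
For readers who want to see why the "N accumulators" rows scale the way they do, here is a minimal sketch of the idea in plain C (not the Nim code from the linked benchmark, and not the production implementation). The function names `max_naive` and `max_4acc` are hypothetical. With a single accumulator every comparison depends on the previous one, so the loop is bound by instruction latency; with several independent accumulators the CPU can overlap those dependency chains.

```c
#include <assert.h>
#include <stddef.h>

/* Naive max reduction: a single accumulator, so each comparison
 * has to wait for the result of the previous one. */
float max_naive(const float *data, size_t n) {
    assert(n > 0);
    float acc = data[0];
    for (size_t i = 1; i < n; ++i) {
        if (data[i] > acc) acc = data[i];
    }
    return acc;
}

/* Max reduction with 4 independent accumulators (illustrative sketch).
 * The four dependency chains do not depend on each other, so they can
 * be executed in an overlapped fashion by the hardware. */
float max_4acc(const float *data, size_t n) {
    assert(n > 0);
    float a0 = data[0], a1 = data[0], a2 = data[0], a3 = data[0];
    size_t i = 0;

    /* Main loop: process 4 elements per iteration, one per accumulator. */
    for (; i + 4 <= n; i += 4) {
        if (data[i]     > a0) a0 = data[i];
        if (data[i + 1] > a1) a1 = data[i + 1];
        if (data[i + 2] > a2) a2 = data[i + 2];
        if (data[i + 3] > a3) a3 = data[i + 3];
    }

    /* Tail: fewer than 4 elements left, fold them into a0. */
    for (; i < n; ++i) {
        if (data[i] > a0) a0 = data[i];
    }

    /* Combine the partial maxima. */
    float m01 = a0 > a1 ? a0 : a1;
    float m23 = a2 > a3 ? a2 : a3;
    return m01 > m23 ? m01 : m23;
}
```

Both functions return the same value for any non-empty input, because max is associative and commutative; only the order of comparisons changes. A vectorized or multithreaded version would go further than this sketch.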