Add float32 implementation of min/max/sum

This creates a proper API for reduction primitives min/max/sum to address #36.

It's up to 80x faster than the naive reduction on my 18-core machine (about 75x on average time in the run below, 0.248 ms vs. 18.543 ms):

- about 3.5x comes from using multiple accumulators
- about 18x comes from multithreading
- a non-negligible part comes from the `_mm_max_ps` intrinsic (see #38)
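
To illustrate the first and last points, here is a minimal C sketch of a max reduction with several independent accumulators, plus an SSE variant built on `_mm_max_ps`. It is not the code from this PR: the function names `max_4acc` and `max_sse_2acc`, the accumulator counts, and the use of `-FLT_MAX` as the identity value are assumptions made for the example. Multithreading is left out.

```c
#include <float.h>
#include <stddef.h>

#if defined(__SSE__)
#include <xmmintrin.h>
#endif

/* Hypothetical helper (not laser's API): max of n floats using 4
   independent scalar accumulators, so consecutive comparisons do not
   form one long dependency chain. Returns -FLT_MAX for empty input. */
static float max_4acc(const float *x, size_t n) {
    float a0 = -FLT_MAX, a1 = -FLT_MAX, a2 = -FLT_MAX, a3 = -FLT_MAX;
    size_t i = 0;
    /* Main loop: each accumulator only depends on its own previous value. */
    for (; i + 4 <= n; i += 4) {
        if (x[i]     > a0) a0 = x[i];
        if (x[i + 1] > a1) a1 = x[i + 1];
        if (x[i + 2] > a2) a2 = x[i + 2];
        if (x[i + 3] > a3) a3 = x[i + 3];
    }
    /* Tail: fewer than 4 elements left. */
    for (; i < n; ++i) {
        if (x[i] > a0) a0 = x[i];
    }
    /* Combine the partial results. */
    float m01 = a0 > a1 ? a0 : a1;
    float m23 = a2 > a3 ? a2 : a3;
    return m01 > m23 ? m01 : m23;
}

#if defined(__SSE__)
/* Hypothetical helper: same idea with 2 SSE accumulators of 4 lanes each.
   _mm_max_ps computes the lane-wise maximum of two 4-float vectors. */
static float max_sse_2acc(const float *x, size_t n) {
    __m128 v0 = _mm_set1_ps(-FLT_MAX);
    __m128 v1 = _mm_set1_ps(-FLT_MAX);
    size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        v0 = _mm_max_ps(v0, _mm_loadu_ps(x + i));      /* unaligned loads */
        v1 = _mm_max_ps(v1, _mm_loadu_ps(x + i + 4));
    }
    /* Horizontal reduction of the combined vector. */
    float lanes[4];
    _mm_storeu_ps(lanes, _mm_max_ps(v0, v1));
    float m = lanes[0];
    for (int k = 1; k < 4; ++k) {
        if (lanes[k] > m) m = lanes[k];
    }
    /* Tail: fewer than 8 elements left. */
    for (; i < n; ++i) {
        if (x[i] > m) m = x[i];
    }
    return m;
}
#endif
```

Calling `max_4acc(data, len)` or, on x86 with SSE enabled, `max_sse_2acc(data, len)` returns the largest element. Neither sketch handles NaN specially, so they should only be compared on NaN-free input.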

https://github.com/numforge/laser/blob/f4930cb03f9bf8ec4180f8c34d7b12552c3ebb08/benchmarks/fp_reduction_latency/reduction_max_bench.nim

Warmup: 0.9007 s, result 224 (displayed to avoid compiler optimizing warmup away)

Max reduction - prod impl - float32
Collected 1000 samples in 0.250 seconds
Average time: 0.248 ms
Stddev  time: 0.641 ms
Min     time: 0.149 ms
Max     time: 8.449 ms
Theoretical perf: 40287.484 MFLOP/s

Display sum of samples sums to make sure it's not optimized away
0.9999996423721313

Reduction - 1 accumulator - simple iter - float32
Collected 1000 samples in 18.544 seconds
Average time: 18.543 ms
Stddev  time: 0.234 ms
Min     time: 18.470 ms
Max     time: 25.110 ms
Theoretical perf: 539.277 MFLOP/s

Display sum of samples sums to make sure it's not optimized away
0.9999996423721313

Reduction - 1 accumulator - macro iter - float32
Collected 1000 samples in 18.603 seconds
Average time: 18.602 ms
Stddev  time: 0.037 ms
Min     time: 18.472 ms
Max     time: 18.687 ms
Theoretical perf: 537.569 MFLOP/s

Display sum of samples sums to make sure it's not optimized away
0.9999996423721313

Reduction - 2 accumulators - simple iter - float32
Collected 1000 samples in 10.287 seconds
Average time: 10.286 ms
Stddev  time: 0.046 ms
Min     time: 10.212 ms
Max     time: 10.451 ms
Theoretical perf: 972.164 MFLOP/s

Display sum of samples sums to make sure it's not optimized away
0.9999996423721313

Reduction - 3 accumulators - simple iter - float32
Collected 1000 samples in 7.722 seconds
Average time: 7.721 ms
Stddev  time: 0.094 ms
Min     time: 7.574 ms
Max     time: 8.015 ms
Theoretical perf: 1295.233 MFLOP/s

Display sum of samples sums to make sure it's not optimized away
0.9999996423721313

Reduction - 4 accumulators - simple iter - float32
Collected 1000 samples in 6.062 seconds
Average time: 6.061 ms
Stddev  time: 0.055 ms
Min     time: 5.965 ms
Max     time: 6.221 ms
Theoretical perf: 1649.943 MFLOP/s

Display sum of samples sums to make sure it's not optimized away
0.9999994039535522

Reduction - 5 accumulators - simple iter - float32
Collected 1000 samples in 5.506 seconds
Average time: 5.505 ms
Stddev  time: 0.058 ms
Min     time: 5.395 ms
Max     time: 5.796 ms
Theoretical perf: 1816.395 MFLOP/s

Display sum of samples sums to make sure it's not optimized away
0.9999996423721313
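
For reference, the speedups quoted in the description can be recomputed from the average times above. The following C sketch does only that arithmetic; the helper names are hypothetical, and comparing average time rather than minimum time is an assumption.

```c
#include <stdio.h>

/* Speedup of a variant relative to a baseline, both given as times in ms. */
static double speedup(double baseline_ms, double variant_ms) {
    return baseline_ms / variant_ms;
}

/* Average times (ms) copied from the benchmark output above:
   simple iter with 1, 2, 3, 4 and 5 accumulators. */
static const double avg_ms_by_accumulators[5] = {
    18.543, 10.286, 7.721, 6.061, 5.505
};

/* Average time (ms) of the "prod impl" run. */
static const double avg_ms_prod = 0.248;

/* Prints each variant's speedup over the 1-accumulator baseline. */
static void print_speedups(void) {
    const double baseline = avg_ms_by_accumulators[0];
    for (int k = 0; k < 5; ++k) {
        printf("%d accumulator(s): %.2fx\n",
               k + 1, speedup(baseline, avg_ms_by_accumulators[k]));
    }
    printf("prod impl: %.1fx\n", speedup(baseline, avg_ms_prod));
}
```

On these numbers the single-threaded gain from 5 accumulators is a little under 3.4x, and the production implementation comes out at roughly 75x over the 1-accumulator baseline.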

Add float32 implementation of min/max/sum #39