Floating minmax: fix negative zero handling and dedicated test coverage for arrays of +0.0 and -0.0 only

AlexGuteniev commented 1 week ago

Initially I thought that it could be fixed by using careful minmax implementation, that selects correctly either the first or the last value when the comparands are equivalent.

I've learned the behavior of [v]{min|max}{s|p}{s|d} instructions (thanks @statementreply and @Alcaro for enlightening me on that), figured out that it was possible to control which of the equivalent values is the result, also I've reported the compiler bug DevCom-10686775, and found a reliable workaround for it.

Unfortunately, the control over a single minmax instruction result is not enough. The whole value-based appoach does not work with vectorization. Efficient vectorization requires vertical comparisons (same elements on different vector values) to be performed first, and horiziontal comparisons (different elements on the same vector value) to be performed last. With index-based approach, changed order is fine, as we're looking for smallest/greatest index.

As a result, we have to resort to using minmax_element approach for floating minmax, unless /fp:fast is specified. Should be not a big loss though -- the benchmark results in #4659 shows that smaller types benefit from minmax approach a lot, but floats not a lot. Definitely still way faster than scalar.

/fp:fast is still fine, as the compiler takes advantage of not distinguishing +0.0 and -0,0 and is able to emit vectorized minmax itself (see related issue #4453)

I decided to keep comparisons reordering for floats in -- this seems to improve the handing of NAN values, which is decided to be unsupported, but why won't keep something that accidentally does things better.

⏱️ Benchmark results

C:\Project\STL>out\bench\benchmark-minmax_element.exe --benchmark_min_time=1s --benchmark_filter=(float^|double)
2024-06-22T13:46:09+03:00
Running out\bench\benchmark-minmax_element.exe
Run on (12 X 2496 MHz CPU s)
CPU Caches:
  L1 Data 48 KiB (x6)
  L1 Instruction 32 KiB (x6)
  L2 Unified 1280 KiB (x6)
  L3 Unified 12288 KiB (x1)

Benchmark	main	fix	fix + `/fp:fast`
bm<float, 8021, Op::Min>	1184 ns	1207 ns	1221 ns
bm<float, 8021, Op::Max>	1210 ns	1212 ns	1208 ns
bm<float, 8021, Op::Both>	1357 ns	1379 ns	1362 ns
bm<float, 8021, Op::Min_val>	891 ns	1211 ns	891 ns
bm<float, 8021, Op::Max_val>	915 ns	1222 ns	883 ns
bm<float, 8021, Op::Both_val>	955 ns	1378 ns	940 ns
bm<double, 8021, Op::Min>	2246 ns	2393 ns	2352 ns
bm<double, 8021, Op::Max>	2365 ns	2393 ns	2361 ns
bm<double, 8021, Op::Both>	2719 ns	2753 ns	2727 ns
bm<double, 8021, Op::Min_val>	1880 ns	2365 ns	1849 ns
bm<double, 8021, Op::Max_val>	1877 ns	2358 ns	1868 ns
bm<double, 8021, Op::Both_val>	1933 ns	2688 ns	1913 ns

The reodreding of _mm[256]_{min|max}_p{s|d} args seems a bit unfavorable for performance, but not very much, at least the results difference is within variation.

AlexGuteniev commented 1 week ago

The test now expectedly fails for minmax. It is broken in multiple places due to -0.0 / +0.0 distinction was missed, maybe even in the test itself.

The _element one is fine though,

AlexGuteniev commented 1 week ago

Fixing the behavior is blocked on the compiler bug DevCom-10686775

StephanTLavavej commented 1 week ago

:warning: @AlexGuteniev mentioned that there will be a stealth merge conflict with my #4741, where vector_algorithms.cpp is testing _M_IX86_FP.

microsoft / STL

Floating minmax: fix negative zero handling and dedicated test coverage for arrays of +0.0 and -0.0 only #4734

⏱️ Benchmark results