microsoft / STL

MSVC's implementation of the C++ Standard Library.
Other
9.88k stars 1.45k forks source link

Floating minmax: fix negative zero handling and dedicated test coverage for arrays of +0.0 and -0.0 only #4734

Open AlexGuteniev opened 1 week ago

AlexGuteniev commented 1 week ago

Initially I thought that it could be fixed by using careful minmax implementation, that selects correctly either the first or the last value when the comparands are equivalent.

I've learned the behavior of [v]{min|max}{s|p}{s|d} instructions (thanks @statementreply and @Alcaro for enlightening me on that), figured out that it was possible to control which of the equivalent values is the result, also I've reported the compiler bug DevCom-10686775, and found a reliable workaround for it.

Unfortunately, the control over a single minmax instruction result is not enough. The whole value-based appoach does not work with vectorization. Efficient vectorization requires vertical comparisons (same elements on different vector values) to be performed first, and horiziontal comparisons (different elements on the same vector value) to be performed last. With index-based approach, changed order is fine, as we're looking for smallest/greatest index.

As a result, we have to resort to using minmax_element approach for floating minmax, unless /fp:fast is specified. Should be not a big loss though -- the benchmark results in #4659 shows that smaller types benefit from minmax approach a lot, but floats not a lot. Definitely still way faster than scalar.

/fp:fast is still fine, as the compiler takes advantage of not distinguishing +0.0 and -0,0 and is able to emit vectorized minmax itself (see related issue #4453)

I decided to keep comparisons reordering for floats in -- this seems to improve the handing of NAN values, which is decided to be unsupported, but why won't keep something that accidentally does things better.


⏱️ Benchmark results

C:\Project\STL>out\bench\benchmark-minmax_element.exe --benchmark_min_time=1s --benchmark_filter=(float^|double)
2024-06-22T13:46:09+03:00
Running out\bench\benchmark-minmax_element.exe
Run on (12 X 2496 MHz CPU s)
CPU Caches:
  L1 Data 48 KiB (x6)
  L1 Instruction 32 KiB (x6)
  L2 Unified 1280 KiB (x6)
  L3 Unified 12288 KiB (x1)
Benchmark main fix fix + /fp:fast
bm<float, 8021, Op::Min> 1184 ns 1207 ns 1221 ns
bm<float, 8021, Op::Max> 1210 ns 1212 ns 1208 ns
bm<float, 8021, Op::Both> 1357 ns 1379 ns 1362 ns
bm<float, 8021, Op::Min_val> 891 ns 1211 ns 891 ns
bm<float, 8021, Op::Max_val> 915 ns 1222 ns 883 ns
bm<float, 8021, Op::Both_val> 955 ns 1378 ns 940 ns
bm<double, 8021, Op::Min> 2246 ns 2393 ns 2352 ns
bm<double, 8021, Op::Max> 2365 ns 2393 ns 2361 ns
bm<double, 8021, Op::Both> 2719 ns 2753 ns 2727 ns
bm<double, 8021, Op::Min_val> 1880 ns 2365 ns 1849 ns
bm<double, 8021, Op::Max_val> 1877 ns 2358 ns 1868 ns
bm<double, 8021, Op::Both_val> 1933 ns 2688 ns 1913 ns

The reodreding of _mm[256]_{min|max}_p{s|d} args seems a bit unfavorable for performance, but not very much, at least the results difference is within variation.

AlexGuteniev commented 1 week ago

The test now expectedly fails for minmax. It is broken in multiple places due to -0.0 / +0.0 distinction was missed, maybe even in the test itself.

The _element one is fine though,

AlexGuteniev commented 1 week ago

Fixing the behavior is blocked on the compiler bug DevCom-10686775

StephanTLavavej commented 1 week ago

:warning: @AlexGuteniev mentioned that there will be a stealth merge conflict with my #4741, where vector_algorithms.cpp is testing _M_IX86_FP.