Open AlexGuteniev opened 1 week ago
The test now expectedly fails for `minmax`. It is broken in multiple places because the `-0.0` / `+0.0` distinction was missed, maybe even in the test itself. The `_element` one is fine, though.
Fixing the behavior is blocked on the compiler bug DevCom-10686775
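To make the `-0.0` / `+0.0` requirement concrete, here is a minimal sketch using the standard `std::minmax` initializer-list overload, which returns the leftmost smallest and the rightmost largest value. The helper name `zero_signs` is made up for illustration and is not part of the test suite.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <utility>

// Hypothetical helper: reports the sign bits of the {min, max} pair that
// std::minmax returns for two values that compare equal (e.g. two zeros).
std::pair<bool, bool> zero_signs(double a, double b) {
    assert(a == b); // the interesting case: equivalent but distinguishable values
    const auto [lo, hi] = std::minmax({a, b});
    // lo must be a copy of the leftmost element, hi a copy of the rightmost one.
    return {std::signbit(lo), std::signbit(hi)};
}
```

For `zero_signs(+0.0, -0.0)` the minimum should be the positive zero and the maximum the negative zero; swapping the arguments swaps the signs. A vectorized implementation has to preserve exactly this.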
:warning: @AlexGuteniev mentioned that there will be a stealth merge conflict with my #4741, where `vector_algorithms.cpp` is testing `_M_IX86_FP`.
Initially I thought that it could be fixed by using a careful `minmax` implementation that correctly selects either the first or the last value when the comparands are equivalent.
I've learned the behavior of the `[v]{min|max}{s|p}{s|d}` instructions (thanks @statementreply and @Alcaro for enlightening me on that) and figured out that it is possible to control which of the equivalent values is the result. I've also reported the compiler bug DevCom-10686775 and found a reliable workaround for it.

Unfortunately, control over the result of a single min/max instruction is not enough. The whole value-based approach does not work with vectorization. Efficient vectorization requires vertical comparisons (same elements of different vector values) to be performed first, and horizontal comparisons (different elements of the same vector value) to be performed last. By the time the horizontal step runs, a value-based accumulator no longer knows which of two equivalent values came first in the input. With the index-based approach, the changed order is fine, as we're looking for the smallest/greatest index.
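The vertical-then-horizontal problem can be sketched with a scalar model of four lanes. This is an illustration only, not the code in `vector_algorithms.cpp`: `lanes`, `min_lane`, `value_based_min`, and `index_based_min` are made-up names, and `min_lane` only models the documented rule that the min instructions return the second operand when neither operand compares less.

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <cstddef>

constexpr std::size_t lanes = 4; // assumed vector width for the model

// Scalar model of one lane of a min instruction: second operand wins on ties.
double min_lane(double first, double second) {
    return first < second ? first : second;
}

// Value-based: vertical mins across chunks first, horizontal reduction last.
double value_based_min(const double* data, std::size_t n) {
    assert(n >= lanes && n % lanes == 0);
    std::array<double, lanes> acc{};
    for (std::size_t lane = 0; lane < lanes; ++lane) acc[lane] = data[lane];
    for (std::size_t base = lanes; base < n; base += lanes)
        for (std::size_t lane = 0; lane < lanes; ++lane)
            acc[lane] = min_lane(data[base + lane], acc[lane]); // tie keeps earlier chunk
    double result = acc[0];
    for (std::size_t lane = 1; lane < lanes; ++lane)
        result = min_lane(acc[lane], result); // tie keeps lower lane
    return result;
}

// Index-based: same comparison order, but ties are resolved by position.
std::size_t index_based_min(const double* data, std::size_t n) {
    assert(n >= lanes && n % lanes == 0);
    std::array<std::size_t, lanes> idx{};
    for (std::size_t lane = 0; lane < lanes; ++lane) idx[lane] = lane;
    for (std::size_t base = lanes; base < n; base += lanes)
        for (std::size_t lane = 0; lane < lanes; ++lane)
            if (data[base + lane] < data[idx[lane]]) idx[lane] = base + lane;
    std::size_t best = idx[0];
    for (std::size_t lane = 1; lane < lanes; ++lane) {
        const std::size_t cand = idx[lane];
        const bool smaller = data[cand] < data[best];
        const bool equivalent = !smaller && !(data[best] < data[cand]);
        if (smaller || (equivalent && cand < best)) best = cand;
    }
    return best;
}
```

For the input `{1, +0.0, 1, 1, -0.0, 1, 1, 1}` the leftmost smallest element is the `+0.0` at index 1. In this model the value-based reduction ends up with `-0.0`, because lane 0 wins the horizontal tie even though its zero came from the second chunk. Flipping the horizontal tie rule only moves the failure to a different input. The index-based reduction returns index 1.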
As a result, we have to resort to using the `minmax_element` approach for floating-point `minmax`, unless `/fp:fast` is specified. This should not be a big loss though: the benchmark results in #4659 show that smaller types benefit from the `minmax` approach a lot, but floats benefit much less. It is definitely still way faster than scalar.

`/fp:fast` is still fine, as the compiler takes advantage of not distinguishing `+0.0` and `-0.0` and is able to emit vectorized `minmax` itself (see related issue #4453).

I decided to keep the comparisons reordering for floats in. It seems to improve the handling of NaN values, which were decided to be unsupported, but why not keep something that accidentally does things better?
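Why operand order matters for NaN can be sketched with the same kind of scalar model: the min instructions return the second operand whenever the comparison is false, which includes the case where either operand is NaN. The function names below are hypothetical, and the sketch does not claim which order the PR actually uses.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>

// Scalar model of one lane of a min instruction: second operand wins
// when neither operand compares less (ties, or a NaN on either side).
double min_lane(double first, double second) {
    return first < second ? first : second;
}

// Running result passed as the FIRST operand: a NaN element replaces it.
double reduce_acc_first(const double* data, std::size_t n) {
    assert(n > 0);
    double acc = data[0];
    for (std::size_t i = 1; i < n; ++i) acc = min_lane(acc, data[i]);
    return acc;
}

// Running result passed as the SECOND operand: a NaN element is skipped.
double reduce_acc_second(const double* data, std::size_t n) {
    assert(n > 0);
    double acc = data[0];
    for (std::size_t i = 1; i < n; ++i) acc = min_lane(data[i], acc);
    return acc;
}
```

For `{1.0, NaN, 2.0}` the first variant loses the `1.0` and yields `2.0`, while the second yields `1.0`. Neither order makes NaN inputs supported (a NaN in the first position behaves differently again), but it shows how the order of arguments changes what happens to them.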
⏱️ Benchmark results
/fp:fast
The reordering of the `_mm[256]_{min|max}_p{s|d}` args seems a bit unfavorable for performance, but not by much; the difference in results is within the run-to-run variation.