Closed AreaZR closed 4 months ago
exclusive or doesn't set any flags, unlike sub.
This is not correct. On x86 and x86-64 xor does set flags.
Still, this is indeed a bit faster, because xor
is commutative, the compiler reorders args, and the constant becomes immediate, which saves an instruction.
The previous attempt was rejected due to little gain for more complexity, it was suggested that the compiler should implement this optimization instead.
If we are doing this, should also change the occurrences in vector_algorithms.cpp
. And the remaining occurrences in __msvc_bit_utils.hpp
. Some may be spelled as _Digits - 1 - _Result
or something.
The compiler doing this was actually rejected for multiple reasons.
It also is because this only works if the compiler knows bitscan returns a number between 0 and 31
The relevant previous discussion is https://github.com/microsoft/STL/pull/3925#discussion_r1350990772
Clang can do this optimization, so in theory, it's possible. https://godbolt.org/z/hc8hrj8eo
But in practice, the definition of _BitScanReverse is more complex than __builtin_clz - latter has undefined behavior if input is zero, former returns an unspecified value that's not guaranteed to be in the range 0 to 31 (this optimization is only valid if rhs <= lhs; 31-32 != 31^32).
It is possible to model the builtin in a way that allows this optimization, but it's more complicated than simply 'the return value is always 0 to 31, and if it isn't, you dun goofed and it's not a compiler bug'.
Long-term, teaching the compiler to optimize that is the better solution. But I don't know how long term; it may be long enough that it's wiser to install a short-term solution as well (with a TRANSITION comment).
Previous discussion was related specifically to vector_algoritmgs.cpp
https://github.com/microsoft/STL/pull/3925#discussion_r1350990772
The optimization in that place has less benefits and adds more confusion, so the outcome in general may be reconsidered.
Thanks for looking into this. We talked about this in the weekly maintainer meeting, and have decided that the potential savings of less than a cycle are outweighed by the decrease in code clarity. We believe that the compiler should ultimately be responsible for emitting ideal codegen here, and DevCom-10487304 is still active, so we're going to close this PR without merging.
For subtracting from 63 or 31 alone, exclusive or is faster than sub because exclusive or doesn't set any flags, unlike sub.