microsoft / STL

MSVC's implementation of the C++ Standard Library.
Other
10.07k stars 1.48k forks source link

Prefer xor to sub for counting bits #4693

Closed AreaZR closed 4 months ago

AreaZR commented 4 months ago

For subtracting from 63 or 31 alone, exclusive or is faster than sub because exclusive or doesn't set any flags, unlike sub.

AlexGuteniev commented 4 months ago

exclusive or doesn't set any flags, unlike sub.

This is not correct. On x86 and x86-64 xor does set flags.

Still, this is indeed a bit faster, because xor is commutative, the compiler reorders args, and the constant becomes immediate, which saves an instruction.

The previous attempt was rejected due to little gain for more complexity, it was suggested that the compiler should implement this optimization instead.

If we are doing this, should also change the occurrences in vector_algorithms.cpp. And the remaining occurrences in __msvc_bit_utils.hpp. Some may be spelled as _Digits - 1 - _Result or something.

AreaZR commented 4 months ago

The compiler doing this was actually rejected for multiple reasons.

AreaZR commented 4 months ago

It also is because this only works if the compiler knows bitscan returns a number between 0 and 31

Alcaro commented 4 months ago

The relevant previous discussion is https://github.com/microsoft/STL/pull/3925#discussion_r1350990772

Clang can do this optimization, so in theory, it's possible. https://godbolt.org/z/hc8hrj8eo

But in practice, the definition of _BitScanReverse is more complex than __builtin_clz - latter has undefined behavior if input is zero, former returns an unspecified value that's not guaranteed to be in the range 0 to 31 (this optimization is only valid if rhs <= lhs; 31-32 != 31^32).

It is possible to model the builtin in a way that allows this optimization, but it's more complicated than simply 'the return value is always 0 to 31, and if it isn't, you dun goofed and it's not a compiler bug'.

Long-term, teaching the compiler to optimize that is the better solution. But I don't know how long term; it may be long enough that it's wiser to install a short-term solution as well (with a TRANSITION comment).

AlexGuteniev commented 4 months ago

Previous discussion was related specifically to vector_algoritmgs.cpp https://github.com/microsoft/STL/pull/3925#discussion_r1350990772

The optimization in that place has less benefits and adds more confusion, so the outcome in general may be reconsidered.

StephanTLavavej commented 4 months ago

Thanks for looking into this. We talked about this in the weekly maintainer meeting, and have decided that the potential savings of less than a cycle are outweighed by the decrease in code clarity. We believe that the compiler should ultimately be responsible for emitting ideal codegen here, and DevCom-10487304 is still active, so we're going to close this PR without merging.