`<xutility>`: Use addition and multiplication overflow check MSVC intrinsics like `_add_overflow_i8`, `_mul_overflow_i16`, and `_mul_full_overflow_i8`

AlexGuteniev commented 4 months ago

Inspired by #4776

There are Clang intrinsics used in _Mul_overflow and _Add_overflow

MSVC has its own _mul_full_overflow_u8 and others. Using them would certainly result in a better codegen than the generic implementation.

Caveats:

Only for x86 / x64, some are only for x64
Not a complete set of all operations possible
Not constexpr, unlike the Clang one
Need to include <intrin.h>

These might warrant the need of another internal header for _Mul_overflow and _Add_overflow, used by <ranges>, <mdspan>, and <numeric> after #4776.

AlexGuteniev commented 4 months ago

Another caveat is maybe DevCom-10668881 -- need a good coverage

StephanTLavavej commented 4 months ago

Yeah, these were implemented on 2023-04-17 by internal MSVC-PR-456118 "Add overflow detection functions for addition, subtraction and multiplication on x86 and x64".

We'd eventually want to move them to intrin0.inl.h for the STL's use, if we want to go in this direction. We'd need to work around the known x86 codegen bug.

I'd like to understand the potential performance impact of this change before deciding whether to invest the contributor/maintainer effort in making/reviewing such a change, but if it's significant then this could be a reasonable thing to do.

AlexGuteniev commented 4 months ago

I'd like to understand the potential performance impact of this change before deciding whether to invest the contributor/maintainer effort in making/reviewing such a change, but if it's significant then this could be a reasonable thing to do.

The changes of _Add_overflow and _Mul_overflow themselves performance are expected to be significant.

For _Add_overflow the addition itself is one of the cheapest operation. The current implementation adds combination of six conditions, that would definitely make it order of magnitudes slower. I expect from the intrinsic to only expose overflow flag, but stay the same instruction.

For _Mul_overflow it is even more clear. The current implementation uses division, which order of magnitude slower than multiplication. Again, I expect only multiplication and flag or result checking from the instruction.

The usages of these are, as I mentioned <ranges>, <mdspan>, and the proposed usage in lcm.

All potentially-runtime <mdspan> usages guarded by #if _CONTAINER_DEBUG_LEVEL > 0, if I'm not missing anything. For <ranges>, some usages are guarded fully, but there is at least one usage where #if _ITERATOR_DEBUG_LEVEL != 0 only guards the overflow check, but not the operation. I don't know if that optimizes away or doesn't.

The proposed runtime lcm usage is guarded by _DEBUG.

So it seems that currently improving _Add_overflow and _Mul_overflow would benefit only debugging modes, except for the mentioned <ranges> occurrence..

We could simplify the implementation in lcm and possibly elsewhere by making codepath the same for debug and non-debug, and guarding only the check (like in that <ranges> occurence), but taking into account the complexity of the intrinsics usage, the total simplification is likely to be negative.

AlexGuteniev commented 3 months ago

We'd need to work around the known x86 codegen bug.

After thinking about possible workaround I conclude that the best would be not to use the broken variant/platform combinations. Any workaround that doesn't too much interfere the optimization might be itself optimized away.

It is is only x86 and not x64, this isn't too much impact.

microsoft / STL

`<xutility>`: Use addition and multiplication overflow check MSVC intrinsics like `_add_overflow_i8`, `_mul_overflow_i16`, and `_mul_full_overflow_i8` #4780