Open AlexGuteniev opened 4 months ago
Another caveat is maybe DevCom-10668881 -- need a good coverage
Yeah, these were implemented on 2023-04-17 by internal MSVC-PR-456118 "Add overflow detection functions for addition, subtraction and multiplication on x86 and x64".
We'd eventually want to move them to intrin0.inl.h
for the STL's use, if we want to go in this direction. We'd need to work around the known x86 codegen bug.
I'd like to understand the potential performance impact of this change before deciding whether to invest the contributor/maintainer effort in making/reviewing such a change, but if it's significant then this could be a reasonable thing to do.
I'd like to understand the potential performance impact of this change before deciding whether to invest the contributor/maintainer effort in making/reviewing such a change, but if it's significant then this could be a reasonable thing to do.
The changes of _Add_overflow
and _Mul_overflow
themselves performance are expected to be significant.
For _Add_overflow
the addition itself is one of the cheapest operation. The current implementation adds combination of six conditions, that would definitely make it order of magnitudes slower. I expect from the intrinsic to only expose overflow flag, but stay the same instruction.
For _Mul_overflow
it is even more clear. The current implementation uses division, which order of magnitude slower than multiplication. Again, I expect only multiplication and flag or result checking from the instruction.
The usages of these are, as I mentioned <ranges>
, <mdspan>
, and the proposed usage in lcm
.
All potentially-runtime <mdspan>
usages guarded by #if _CONTAINER_DEBUG_LEVEL > 0
, if I'm not missing anything. For <ranges>
, some usages are guarded fully, but there is at least one usage where #if _ITERATOR_DEBUG_LEVEL != 0
only guards the overflow check, but not the operation. I don't know if that optimizes away or doesn't.
The proposed runtime lcm
usage is guarded by _DEBUG
.
So it seems that currently improving _Add_overflow
and _Mul_overflow
would benefit only debugging modes, except for the mentioned <ranges>
occurrence..
We could simplify the implementation in lcm
and possibly elsewhere by making codepath the same for debug and non-debug, and guarding only the check (like in that <ranges>
occurence), but taking into account the complexity of the intrinsics usage, the total simplification is likely to be negative.
We'd need to work around the known x86 codegen bug.
After thinking about possible workaround I conclude that the best would be not to use the broken variant/platform combinations. Any workaround that doesn't too much interfere the optimization might be itself optimized away.
It is is only x86 and not x64, this isn't too much impact.
Inspired by #4776
There are Clang intrinsics used in
_Mul_overflow
and_Add_overflow
MSVC has its own
_mul_full_overflow_u8
and others. Using them would certainly result in a better codegen than the generic implementation.Caveats:
constexpr
, unlike the Clang one<intrin.h>
These might warrant the need of another internal header for
_Mul_overflow
and_Add_overflow
, used by<ranges>
,<mdspan>
, and<numeric>
after #4776.