Consider emulating a 192-bit integer using a 128-bit integer and a 64-bit integer. In the code sample below, this emulated integer is used to compute the dot product of two uint64_t vectors of length N.
// function to compute dot product of two vectors
using u128 = unsigned __int128;
const int N = 2048;
uint64_t a[N], b[N];
u128 sum = 0;
uint64_t overflow = 0;
for (int i = 0; i < N; ++i) {
    u128 prod = (u128) a[i] * (u128) b[i];
    sum += prod;
    // gcc branches, clang just uses: adc overflow, 0
    overflow += sum < prod;
}
To detect overflow of the 128-bit sum and propagate the carry into overflow, a single adc instruction is enough. This idiom compiles well when the loop stays rolled (no unrolling).
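The idiom can be written as a minimal, self-contained sketch, assuming a GCC or Clang target that provides unsigned __int128. The struct u192 and the function dot_u192 are names made up here for illustration; they do not appear in the original sample.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

using u128 = unsigned __int128;

// Hypothetical helper type: a 192-bit value split into a 128-bit low part
// and a 64-bit high part that counts carries out of the low part.
struct u192 {
    u128 lo;
    uint64_t hi;
};

// Dot product of two uint64_t arrays, accumulated in 192 bits.
inline u192 dot_u192(const uint64_t* a, const uint64_t* b, std::size_t n) {
    assert(n == 0 || (a != nullptr && b != nullptr));
    u192 acc{0, 0};
    for (std::size_t i = 0; i < n; ++i) {
        u128 prod = (u128)a[i] * b[i];
        acc.lo += prod;           // wraps modulo 2^128
        acc.hi += acc.lo < prod;  // the wrapped sum is smaller than the addend
                                  // exactly when a carry out of 128 bits occurred
    }
    return acc;
}
```

The comparison acc.lo < prod is the source-level form of the carry flag, which is why a compiler can lower it to adc hi, 0.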
Extended Description
The carry check compiles to a single adc when loop unrolling is disabled:
clang++ -O3 -Wall -Wextra -march=broadwell -fno-unroll-loops
But when loops are unrolled, this efficient assembly degrades to
mov; setb; movzx; add;
instead of just
adc reg, 0
clang++ -O3 -Wall -Wextra -march=broadwell # -fno-unroll-loops is absent
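One way to experiment with the rolled form without passing -fno-unroll-loops for the whole translation unit is Clang's per-loop pragma. The sketch below assumes that #pragma clang loop unroll(disable) affects this loop the same way the flag does; that has not been checked against the generated assembly here. The function name dot_rolled and its out-parameter are illustrative only.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

using u128 = unsigned __int128;

// Returns the low 128 bits of the dot product and stores the number of
// carries out of those 128 bits (bits 128..191) in *overflow.
inline u128 dot_rolled(const uint64_t* a, const uint64_t* b, std::size_t n,
                       uint64_t* overflow) {
    assert(overflow != nullptr);
    u128 sum = 0;
    uint64_t carry = 0;
#if defined(__clang__)
// Assumption: asks Clang not to unroll the loop that follows.
#pragma clang loop unroll(disable)
#endif
    for (std::size_t i = 0; i < n; ++i) {
        u128 prod = (u128)a[i] * b[i];
        sum += prod;
        carry += sum < prod;  // same carry idiom as in the report
    }
    *overflow = carry;
    return sum;
}
```

The pragma is guarded so that other compilers simply ignore it; the arithmetic result is the same with or without it.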
For complete source code, here is the godbolt link: https://godbolt.org/z/tT7Z2H
The source of this discussion is the Stack Overflow Q&A: https://stackoverflow.com/q/59575408/8199790