Open bmaurer opened 1 month ago
I believe the optimal output for the subtraction version is:
void sub2(unsigned char *n)
{
    asm(
        "cmpb $255, %0\n"
        "sbbb $0, %0\n"
        : "+r"(*n)
        :
        : "cc"
    );
}
I couldn't get either compiler close to this.
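To make the flag trick explicit, here is a minimal portable sketch of what the `cmpb`/`sbbb` pair computes. It assumes the subtraction version means "decrement unless the byte is 255"; `sub_branchy`, `sub_carry`, and `check_sub_versions` are hypothetical names used only for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Assumed source-level form: decrement unless the counter is pinned at 255.
void sub_branchy(std::uint8_t *n) {
    if (*n != 255) {
        --*n;
    }
}

// Branch-free form mirroring the asm: `cmpb $255, x` sets the carry flag
// exactly when x < 255 (unsigned), and `sbbb $0, x` subtracts that carry.
void sub_carry(std::uint8_t *n) {
    std::uint8_t borrow = (*n < 255) ? 1 : 0;
    *n = static_cast<std::uint8_t>(*n - borrow);
}

// Hypothetical self-check: walks all 256 byte values, asserts that both
// versions agree, and returns how many inputs were checked.
int check_sub_versions() {
    int checked = 0;
    for (int v = 0; v < 256; ++v) {
        std::uint8_t a = static_cast<std::uint8_t>(v);
        std::uint8_t b = a;
        sub_branchy(&a);
        sub_carry(&b);
        assert(a == b);
        ++checked;
    }
    return checked;
}
```

Whether a compiler turns `sub_carry` into the two-instruction sequence is a separate question; the sketch only pins down the arithmetic the asm relies on.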
I ran some tests.
No transform to a saturating operation is done in the middle-end. Reference: https://alive2.llvm.org/ce/z/ApY5rr
Even if we replace the binary operation with a saturating one, the backend has no proper logic to handle it.
Optimized type-legalized selection DAG: %bb.0 'src:entry'
SelectionDAG has 12 nodes:
t0: ch,glue = EntryToken
t2: i64,ch = CopyFromReg t0, Register:i64 %0
t5: i8,ch = load<(load (s8) from %ir.x)> t0, t2, undef:i64
t13: i8,i8 = uaddo t5, Constant:i8<1>
t9: i8 = select t13:1, Constant:i8<-1>, t13
t10: ch = store<(store (s8) into %ir.x)> t5:1, t9, t2, undef:i64
t12: ch = X86ISD::RET_GLUE t10, TargetConstant:i32<0>
...
Optimized legalized selection DAG: %bb.0 'src:entry'
SelectionDAG has 15 nodes:
t0: ch,glue = EntryToken
t2: i64,ch = CopyFromReg t0, Register:i64 %0
t5: i8,ch = load<(load (s8) from %ir.x)> t0, t2, undef:i64
t16: i8,i32 = X86ISD::ADD t5, Constant:i8<1>
t19: i32 = any_extend t16
t20: i32 = X86ISD::CMOV t19, Constant:i32<255>, TargetConstant:i8<4>, t16:1
t21: i8 = truncate t20
t10: ch = store<(store (s8) into %ir.x)> t5:1, t21, t2, undef:i64
t12: ch = X86ISD::RET_GLUE t10, TargetConstant:i32<0>
Optimized lowered selection DAG: %bb.0 'tgt:entry'
SelectionDAG has 10 nodes:
t0: ch,glue = EntryToken
t2: i64,ch = CopyFromReg t0, Register:i64 %0
t5: i8,ch = load<(load (s8) from %ir.x)> t0, t2, undef:i64
t7: i8 = uaddsat t5, Constant:i8<1>
t8: ch = store<(store (s8) into %ir.x)> t5:1, t7, t2, undef:i64
t10: ch = X86ISD::RET_GLUE t8, TargetConstant:i32<0>
...
Optimized legalized selection DAG: %bb.0 'tgt:entry'
SelectionDAG has 15 nodes:
t0: ch,glue = EntryToken
t2: i64,ch = CopyFromReg t0, Register:i64 %0
t5: i8,ch = load<(load (s8) from %ir.x)> t0, t2, undef:i64
t15: i8,i32 = X86ISD::ADD t5, Constant:i8<1>
t18: i32 = any_extend t15
t19: i32 = X86ISD::CMOV t18, Constant:i32<255>, TargetConstant:i8<4>, t15:1
t20: i8 = truncate t19
t8: ch = store<(store (s8) into %ir.x)> t5:1, t20, t2, undef:i64
t10: ch = X86ISD::RET_GLUE t8, TargetConstant:i32<0>
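In both legalized DAGs the i8 result is any_extended to i32, selected against 255 by a CMOV, and truncated back to i8. My guess is that this truncate is where the saturation ends up, and that this case needs to be handled.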
A related missed optimization:
#include <cstdint>
void foo(uint8_t& x) {
    if (x != 255) {
        x++;
    }
}
void foox(uint8_t& x) {
    uint8_t tmp = x + 1;
    if (tmp != 0) {
        x = tmp;
    }
}
Clang misses that it can save an instruction by transforming the first version into the second.
foo(unsigned char&): # @foo(unsigned char&)
        movzx eax, byte ptr [rdi]
        cmp al, -1
        je .LBB0_2
        inc al
        mov byte ptr [rdi], al
.LBB0_2:
        ret
foox(unsigned char&): # @foox(unsigned char&)
        movzx eax, byte ptr [rdi]
        inc al
        je .LBB1_2
        mov byte ptr [rdi], al
.LBB1_2:
        ret
Here is another way to write a saturating unsigned add, which is not optimized either:
#include <stdint.h>
void add3(uint8_t* x) {
    uint8_t res;
    uint8_t carry = __builtin_add_overflow(*x, 1, &res) ? 1 : 0;
    *x = carry ? 255 : res;
}
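As a sanity check on the semantics, the overflow-builtin form can be compared against a plain widen-and-clamp reference. This is a minimal sketch assuming a GCC or Clang toolchain (for `__builtin_add_overflow`); `add_clamp` and `check_add3_saturates` are hypothetical names introduced here.

```cpp
#include <cassert>
#include <cstdint>

// Same logic as add3 in the snippet above; relies on the GCC/Clang builtin,
// which reports overflow when the exact sum does not fit in the result type.
void add3(std::uint8_t* x) {
    std::uint8_t res;
    bool carry = __builtin_add_overflow(*x, 1, &res);
    *x = carry ? static_cast<std::uint8_t>(255) : res;
}

// Hypothetical reference: widen, add, then clamp to the 8-bit maximum.
void add_clamp(std::uint8_t* x) {
    unsigned wide = static_cast<unsigned>(*x) + 1u;
    *x = static_cast<std::uint8_t>(wide > 255u ? 255u : wide);
}

// Hypothetical self-check: asserts that both versions agree on all 256
// byte values and returns how many inputs were checked.
int check_add3_saturates() {
    int checked = 0;
    for (int v = 0; v < 256; ++v) {
        std::uint8_t a = static_cast<std::uint8_t>(v);
        std::uint8_t b = a;
        add3(&a);
        add_clamp(&b);
        assert(a == b);
        ++checked;
    }
    return checked;
}
```

Both functions describe the same saturating add, so any difference in generated code comes from how the compiler recognizes the pattern, not from the semantics.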
The folly F14 hashtable has an 8-bit counter that "saturates" at 255.
https://github.com/facebook/folly/blob/c8b8d4cac3b7cf049b007fa08e12061c5b239a5e/folly/container/detail/F14Table.h#L572-L582
The optimal code for the incr case is:
Neither Clang nor GCC does this for add1. GCC can do it for add2, but Clang cannot. I'm not sure what the optimal output is for the subtraction case, but all of the compilers end up using multiple registers.
Full repro: https://godbolt.org/z/PqP7vcjsz
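For the increment case, one carry-based way to express the saturation in portable code is sketched below. It assumes the increment means "add one unless the byte is already 255"; `incr_branchy`, `incr_carry`, and `check_incr_versions` are hypothetical names, and this is an illustration under those assumptions, not necessarily the sequence the post has in mind.

```cpp
#include <cassert>
#include <cstdint>

// Assumed source-level form: increment unless the counter is already 255.
void incr_branchy(std::uint8_t *n) {
    if (*n != 255) {
        ++*n;
    }
}

// Carry-based form: add 1 with 8-bit wraparound, then subtract the carry-out.
// The carry is 1 only when the value wrapped from 255 to 0, so the
// subtraction restores 255 in exactly that case.
void incr_carry(std::uint8_t *n) {
    std::uint8_t sum = static_cast<std::uint8_t>(*n + 1);
    std::uint8_t carry = (sum == 0) ? 1 : 0;
    *n = static_cast<std::uint8_t>(sum - carry);
}

// Hypothetical self-check: asserts that both versions agree on all 256
// byte values and returns how many inputs were checked.
int check_incr_versions() {
    int checked = 0;
    for (int v = 0; v < 256; ++v) {
        std::uint8_t a = static_cast<std::uint8_t>(v);
        std::uint8_t b = a;
        incr_branchy(&a);
        incr_carry(&b);
        assert(a == b);
        ++checked;
    }
    return checked;
}
```

On x86 this shape corresponds to an add followed by a subtract-with-borrow of zero, which needs no scratch register when applied directly to the memory operand.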