sqrmax opened 3 years ago
Review: https://reviews.llvm.org/D111530
I'd like to fix this issue if there are no objections.
The LLVM middle-end generates a comparison of j
with 1<<60 for the loop's condition, and the following DAG node is created for it:
t19: i1 = setcc t7, Constant:i128<1152921504606846976>, setult:ch
Then it is combined into the following nodes:
t33: i128 = srl t7, Constant:i64<60>
t40: i1 = setcc t33, Constant:i128<0>, setne:ch
These are legalized into:
t45: i64 = srl t41, Constant:i64<60>
t44: i64 = shl t42, Constant:i64<4>
t46: i64 = or t45, t44
t47: i64 = srl t42, Constant:i64<60>
t49: i64 = or t46, t47
t48: i8 = setcc t49, Constant:i64<0>, setne:ch
And finally combined into:
t53: i64 = fshl t42, t41, Constant:i64<4>
t47: i64 = srl t42, Constant:i64<60>
t49: i64 = or t53, t47
t48: i8 = setcc t49, Constant:i64<0>, setne:ch
I see several ways to get rid of the FSHL:
1) Hook into DAGTypeLegalizer::ExpandShiftByConstant, check that the SRL has only one use and that it's a SetCC Eq/Ne 0, and then skip the unnecessary shifts;
2) Explicitly match the expanded SRL and combine it into (setcc (or (srl a), b, C), 0, setne);
3) Add several peepholes to DAGCombiner to combine iteratively:
a) (or (fshl a, b, C1), (srl a, C2)) => (or (rotl a), (srl b)) // profitable regardless of the setcc optimization, because rotations are usually faster than funnel shifts
b) (setcc (or (rotl a), b), 0, setne) => (setcc (or a, b), 0, setne)
Personally, I lean toward the third approach, as it would work for more than just the single setcc case, but I'm not sure whether the patterns it could combine arise frequently enough to make that approach worth implementing.
Note that the issue is not limited to 128-bit variables: the same suboptimal code is emitted for a 64-bit variable on 32-bit architectures (AArch32 in particular).
Current Codegen: https://godbolt.org/z/bWvGK5vhT
assigned to @nerh
Current Codegen:
slow128(unsigned __int128): # @slow128(unsigned __int128)
movq %rdi, %rax
.LBB0_1: # =>This Inner Loop Header: Depth=1
addq $1, %rax
adcq $0, %rsi
movq %rax, %rcx
orq %rsi, %rcx
shrdq $60, %rsi, %rcx
jne .LBB0_1
retq
or (with slowSHLD):
slow128(unsigned __int128): # @slow128(unsigned __int128)
movq %rdi, %rax
.LBB0_1: # =>This Inner Loop Header: Depth=1
addq $1, %rax
adcq $0, %rsi
movq %rax, %rdx
movq %rsi, %rcx
shrq $60, %rdx
rolq $4, %rcx
orq %rcx, %rdx
jne .LBB0_1
retq
@llvm/issue-subscribers-backend-x86
I have one more SDAG patch in the works for this. AFAICT, we've already fixed the motivating example on targets other than x86. But x86 has legal funnel-shift instructions for scalar types (shrd/shld), so we need one more combine for setcc like this: https://alive2.llvm.org/ce/z/wsbSDw
// fshl (or F1, Y), F1, C ==/!= 0 --> or (shl Y, C), F1 ==/!= 0
// fshl F0, (or Y, F0), C ==/!= 0 --> or (srl Y, BW-C), F0 ==/!= 0
It's a minor diff for most x86 targets - just trading a funnel shift for a shift. For vectors, it would eliminate a shift. We could add that in IR too, but it would be too early to help this specific problem.
There are 2 (patterns) * 2 (commutes) * 2 (left/right) * 2 (predicates) = 16 tests needed.
@sqrmax @nerh @rotateright Can we resolve this now?
The motivating example was fixed, but there's still an issue with comparing funnel shifts against zero when the shift operands are 4 (or more) times wider than the target machine's registers (i.e. 128-bit funnel shifts on an ARM target; see https://github.com/llvm/llvm-project/blob/main/llvm/test/CodeGen/ARM/icmp-shift-opt.ll#L139).
OK, cheers. I'm making this a generic codegen bug as it's not just x86.
Is there a source example of how we would get the "4 or more" pattern to the backend? InstCombine canonicalizes the IR in the remaining ARM test ( https://github.com/llvm/llvm-project/blob/main/llvm/test/CodeGen/ARM/icmp-shift-opt.ll#L139 ) to use a mask instead of a shift: https://alive2.llvm.org/ce/z/xP-ZEN
And the backend seems to get the ideal asm once that is done: https://godbolt.org/z/ovMjn6en8
So if there's still some case that is escaping optimization, we could try the shift->and transform in SDAG.
If you replace the left shift with a right shift in the example, then InstCombine will replace the shift with a comparison against a constant: https://alive2.llvm.org/ce/z/WMQWq4
And DAGCombiner will replace it back with a shift: https://godbolt.org/z/b4Wfdjj8f
There are no funnel shifts on ARM, so DAGCombiner doesn't fold the corresponding subtrees into FSHL/FSHR, and as a result we can't use the existing optimizations to eliminate the unnecessary shifts.
Extended Description
See the attached code, compiled with -O2. The condition inside the while () test is unnecessarily evaluated with a 128-bit shift instruction,
even though a 64-bit shift instruction suffices. However, changing || to | in the logical condition yields the more efficient code below.
Found with clang-10 on Ubuntu 20.04 LTS; verified for clang 10, 11, and trunk using godbolt. Note that gcc -O2 handles both of these cases, emitting the more efficient code.