mingmingl-llvm opened 10 months ago
https://reviews.llvm.org/D155688 canonicalizes add + gep to gep + gep (a benign canonicalization on its own), but its interaction with subsequent passes (SimplifyCFG, which converts if-else into conditional moves, and the code generation pipeline) produces less efficient code.
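For concreteness, here is a minimal sketch of the direction of that canonicalization (illustrative IR, not the exact output of the patch), showing where the inbounds flag is lost:

```llvm
; Before: one gep whose index is an add.
%idx = add nsw i64 %x, %y
%p   = getelementptr inbounds float, ptr %base, i64 %idx
%v   = load float, ptr %p

; After an add+gep -> gep+gep split: two geps, with 'inbounds'
; dropped, because %base + %x on its own is not known to stay
; within the allocated object.
%g1 = getelementptr float, ptr %base, i64 %x
%p2 = getelementptr float, ptr %g1, i64 %y
%v2 = load float, ptr %p2
```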
I was doing some initial investigation into some degradations we were seeing on downstream ARM (and other, non-upstream) targets and found this issue. We have discovered similar performance problems with the linked commit co-authored by @paulwalker-arm.
The review states in a comment that 'inbounds preservation is incorrect', but it is this inbounds flag that passes like LoopFlatten (integral to the matrix-related loops in the CoreMark benchmark) rely on to identify possible UB and loosen restrictions on optimization. Concretely, a basic NxN matrix-add loop will no longer flatten.
Flattening tries to turn the NxN nested for-loops into a single [0, N*N) loop. If N*N can overflow, the optimization can't be performed. To resolve this in some cases, the pass looks at the geps feeding loads/stores and checks whether such an overflow would invoke UB. If it would, N*N can be assumed not to overflow (doing so would be UB), and flattening can continue.
This UB check hinges on the inbounds flag, which the transformation removes.
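A sketch of the access pattern in question (illustrative IR; assuming %a is the matrix base, %i/%j the induction variables, and %n holds N):

```llvm
; Inner-loop body of a row-major NxN matrix add, before canonicalization:
%mul = mul nuw nsw i64 %i, %n        ; linearized index i*N
%idx = add nuw nsw i64 %mul, %j      ; i*N + j
%p   = getelementptr inbounds i32, ptr %a, i64 %idx
%v   = load i32, ptr %p
; Because the gep is 'inbounds', an access with i*N + j outside the
; object would be UB, so LoopFlatten may conclude that N*N cannot
; overflow and rewrite the nest as one [0, N*N) loop. Once the
; add+gep -> gep+gep split drops 'inbounds', that reasoning is gone.
```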
The change itself has no justification in the commit log or the review, so why was it done? I wouldn't say that the removal of 'inbounds' from a gep is in any way a benign change if optimizations like this are denied. Would it make sense to do the gep + gep transformation only when the inbounds flag is not present?
Edit: Apologies for the misclicked 'completed'!
> I was doing some initial investigation into some degradations we were seeing on downstream ARM (and other, non-upstream) targets and found this issue. We have discovered similar performance problems with the linked commit co-authored by @paulwalker-arm.
To clarify, the patch went through some changes during the review and became more general.
> The change itself has no justification in the commit log or the review, so why was it done? I wouldn't say that the removal of 'inbounds' from a gep is in any way a benign change if optimizations like this are denied. Would it make sense to do the gep + gep transformation only when the inbounds flag is not present?
As mentioned in https://reviews.llvm.org/D155688, we saw benefits on some benchmarks, particularly 549.fotonik3d in SPEC2017.
Also, see the review comment by @nikic that encouraged a more general approach that leads to a canonical representation:
> Alternatively, it would also be possible to canonicalize gep(p, add(x, y)) to gep(gep(p, x), y) in general, in which case existing GEP reassociation support in LICM will take care of the rest. Arguably this is cleaner (as we should have a canonical form between these two possibilities), but it's more likely to cause fallout.
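As a rough illustration of what that reassociation buys (hypothetical IR; %x is invariant in the inner loop, %y is the inner induction variable):

```llvm
; gep(p, add(x, y)) canonicalized to gep(gep(p, x), y): the outer
; gep no longer depends on %y, so LICM can hoist it.

; preheader (hoisted by LICM):
%row = getelementptr float, ptr %p, i64 %x

; inner loop body:
%elt = getelementptr float, ptr %row, i64 %y
%v   = load float, ptr %elt
```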
When some concerns were expressed, we did offer to revert, but at that time there were no strong reasons to do so.
I reran 549.fotonik3d_r and still see benefits from https://reviews.llvm.org/D155688 on an AArch64 machine. I can also see that there were subsequent patches moving in this direction, such as https://github.com/llvm/llvm-project/pull/69581 (canonicalize sext-add + gep -> gep + gep).
EDIT: Corrected the link to the subsequent patch.
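A sketch of that sext-add + gep direction (illustrative IR, not the exact tests from the pull request); note the sign extension can only be split because the add carries nsw:

```llvm
; Before: sext of an nsw add feeding a single gep.
%add = add nsw i32 %x, %y
%idx = sext i32 %add to i64
%p   = getelementptr inbounds i8, ptr %base, i64 %idx

; After: nsw lets the sext distribute over the add, giving
; gep + gep (again without inbounds).
%x64 = sext i32 %x to i64
%y64 = sext i32 %y to i64
%g1  = getelementptr i8, ptr %base, i64 %x64
%p2  = getelementptr i8, ptr %g1, i64 %y64
```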
Does https://github.com/llvm/llvm-project/pull/78576 not address the loop flattening regressions?
> Does #78576 not address the loop flattening regressions?
Perhaps it does. Our team is lagging a bit on pulling upstream into our code base, so I'll definitely look into this then. Thanks for bringing it to my attention!
https://gcc.godbolt.org/z/YjbY7snas shows the reduced C++ source code, IR, and generated code for x86, and https://gcc.godbolt.org/z/W69boErh1 is the same test case for aarch64. For both x86 and aarch64, clang 16.0.0 produces more efficient code than trunk. The C++ source code has an outer loop (`do {...} while`) and an inner loop (unrolled into two basic blocks).

For x86, `LBB0_1` is the first unrolled basic block in the inner loop. Clang 16.0.0 generates 14 instructions while trunk generates 15. Besides, the code sequence from trunk is less efficient: `lea` instructions of the form `[base + index*scale + displacement]` have higher latency and lower throughput than `add` on some x86 processors (see Agner Fog's instruction tables).

For aarch64, in clang 16.0.0, `ip += select(is_literal, next_literal_tag + 1, tag_type + 1)` sinks from `LBB0_3` (the second basic block of the unrolled inner loop) into `LBB0_5` (the exiting block of the outer loop).

The test case is reduced from a function in an open-source compression/decompression library.