mingmingl-llvm opened 10 months ago
https://reviews.llvm.org/D155688 canonicalizes add + gep to gep + gep (a benign canonicalization on its own), but its interaction with subsequent passes (SimplifyCFG, which converts if-else into conditional moves, and the code generation pipeline) produces less efficient code.
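For concreteness, here is a minimal sketch of the direction of that canonicalization (illustrative IR, not the exact output of the patch), showing where the inbounds flag is lost:

```llvm
; Before: one gep whose index is an add.
%idx = add nsw i64 %x, %y
%p   = getelementptr inbounds float, ptr %base, i64 %idx
%v   = load float, ptr %p

; After an add+gep -> gep+gep split: two geps, with 'inbounds'
; dropped, because %base + %x on its own is not known to stay
; within the allocated object.
%g1 = getelementptr float, ptr %base, i64 %x
%p2 = getelementptr float, ptr %g1, i64 %y
%v2 = load float, ptr %p2
```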
I was doing some initial investigation into some degradations we were seeing on downstream ARM (and other, non-upstream) targets and found this issue. We have discovered similar performance problems with the linked commit co-authored by @paulwalker-arm.
The review states in a comment that 'inbounds preservation is incorrect', but it is this inbounds flag that passes like LoopFlatten (integral to the matrix-related loops in the CoreMark benchmark) rely on to identify possible UB and loosen restrictions on optimization. Concretely, a basic NxN matrix-add loop will no longer flatten.
Flattening tries to turn the NxN nested for-loops into a single [0, N*N) loop. If N*N can overflow, the optimization can't be performed. To resolve this in some cases, the pass looks at the geps feeding loads/stores and checks whether such an overflow would invoke UB. If it would, N*N can be assumed not to overflow (doing so would be UB), and flattening can continue.
This UB check hinges on the inbounds flag, which the transformation removes.
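A sketch of the access pattern in question (illustrative IR; assuming %a is the matrix base, %i/%j the induction variables, and %n holds N):

```llvm
; Inner-loop body of a row-major NxN matrix add, before canonicalization:
%mul = mul nuw nsw i64 %i, %n        ; linearized index i*N
%idx = add nuw nsw i64 %mul, %j      ; i*N + j
%p   = getelementptr inbounds i32, ptr %a, i64 %idx
%v   = load i32, ptr %p
; Because the gep is 'inbounds', an access with i*N + j outside the
; object would be UB, so LoopFlatten may conclude that N*N cannot
; overflow and rewrite the nest as one [0, N*N) loop. Once the
; add+gep -> gep+gep split drops 'inbounds', that reasoning is gone.
```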
The change itself has no justification in the commit log or the review, so why was it done? I wouldn't say that the removal of 'inbounds' from a gep is in any way a benign change if optimizations like this are denied. Would it make sense to do the gep + gep transformation only when the inbounds flag is not present?
Edit: Apologies for the misclicked 'completed'!
> I was doing some initial investigation into some degradations we were seeing on downstream ARM (and other, non-upstream) targets and found this issue. We have discovered similar performance problems with the linked commit co-authored by @paulwalker-arm.
To clarify, the patch went through some changes during the review and became more general.
> The change itself has no justification in the commit log or the review, so why was it done? I wouldn't say that the removal of 'inbounds' from a gep is in any way a benign change if optimizations like this are denied. Would it make sense to do the gep + gep transformation only when the inbounds flag is not present?
As mentioned in https://reviews.llvm.org/D155688, we saw benefits on some benchmarks, particularly 549.fotonik3d in SPEC2017.
Also, see the review comment by @nikic that encouraged a more general approach that leads to a canonical representation:
> Alternatively, it would also be possible to canonicalize gep(p, add(x, y)) to gep(gep(p, x), y) in general, in which case existing GEP reassociation support in LICM will take care of the rest. Arguably this is cleaner (as we should have a canonical form between these two possibilities), but it's more likely to cause fallout.
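As a rough illustration of what that reassociation buys (hypothetical IR; %x is invariant in the inner loop, %y is the inner induction variable):

```llvm
; gep(p, add(x, y)) canonicalized to gep(gep(p, x), y): the outer
; gep no longer depends on %y, so LICM can hoist it.

; preheader (hoisted by LICM):
%row = getelementptr float, ptr %p, i64 %x

; inner loop body:
%elt = getelementptr float, ptr %row, i64 %y
%v   = load float, ptr %elt
```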
When some concerns were expressed, we did offer to revert, but at that time there were no strong reasons to do so.
I reran 549.fotonik3d_r and still see benefits from https://reviews.llvm.org/D155688 on an AArch64 machine. I can also see that there were subsequent patches moving in this direction, such as https://github.com/llvm/llvm-project/pull/69581 (canonicalize sext-add + gep -> gep + gep).
EDIT: Corrected the link to the subsequent patch.
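A sketch of that sext-add + gep direction (illustrative IR, not the exact tests from the pull request); note the sign extension can only be split because the add carries nsw:

```llvm
; Before: sext of an nsw add feeding a single gep.
%add = add nsw i32 %x, %y
%idx = sext i32 %add to i64
%p   = getelementptr inbounds i8, ptr %base, i64 %idx

; After: nsw lets the sext distribute over the add, giving
; gep + gep (again without inbounds).
%x64 = sext i32 %x to i64
%y64 = sext i32 %y to i64
%g1  = getelementptr i8, ptr %base, i64 %x64
%p2  = getelementptr i8, ptr %g1, i64 %y64
```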
Does https://github.com/llvm/llvm-project/pull/78576 not address the loop flattening regressions?
> Does #78576 not address the loop flattening regressions?
Perhaps it does. Our team is lagging a bit on pulling upstream into our code base, so I'll definitely look into this then. Thanks for bringing it to my attention!
https://gcc.godbolt.org/z/YjbY7snas shows the reduced C++ source code, IR, and generated code for x86, and https://gcc.godbolt.org/z/W69boErh1 is the same test case for aarch64. For both x86 and aarch64, clang 16.0.0 produces more efficient code than trunk. The C++ source code has an outer loop (`do {...} while`) and an inner loop (unrolled into two basic blocks).

For x86, `LBB0_1` is the first unrolled basic block in the inner loop. Clang 16.0.0 generates 14 instructions while trunk generates 15. Besides, the code sequence from trunk is less efficient: `lea` instructions of the form `[base + index*scale + displacement]` have higher latency and lower throughput than `add` on some x86 processors (see Agner Fog's instruction tables).

For aarch64, in clang 16.0.0, `ip += select(is_literal, next_literal_tag + 1, tag_type + 1)` sinks from `LBB0_3` (the second basic block of the unrolled inner loop) into `LBB0_5` (the exiting block of the outer loop).

The test case is reduced from a function in an open-source compression/decompression library.