[LSR] LSR is producing bad code

JonPsson commented 7 years ago


Bugzilla Link	32548
Version	trunk
OS	Linux
Attachments	reduced test case, LSR debug output, An even smaller test case with 3+1 stores that LSR fails with
CC	@atrick,@hfinkel,@JonPsson,@qcolombet,@uweigand

Extended Description

In the spirit of healthy competition between gcc and llvm, I have found a a regression on a benchmark. At least one of the problems is that LSR is not producing good code (while gcc indeed does).

On SystemZ, some memory instructions only support immediate offset of 12 bits. This is the reason for the hook isFoldableMemAccessOffset(), which adds extra cost if the offset does not fit.

This hook isn't enough to help this test case. LSR produces big offsets even as extra costs are added for them. I tried adding experimental big costs, and even tried incrementing NumRegs in LSR when such an offset was present, but this did not help. I suspect that either the right formula isn't generated in the first place, or there is some other heuristic that unfortunately takes precedence somehow.

The cost on SystemZ for a too big offset for a memory access is one extra instruction to add the offset.

I would appreciate any help on this,

Jonas

Run with: bin/llc -mtriple=s390x-linux-gnu -mcpu=z13

atrick commented 7 years ago

LSR should be able to choose a solution involving multiple phis across a set of uses. It probably won't be able to formulate a solution for a single use involving multiple phis if that's what you're after.

JonPsson commented 7 years ago

I notice that this code to check profitable chains does not check if the constant is foldable and probably should:

// Incrementing by zero or some constant is neutral. We assume constants can // be folded into an addressing mode or an add's immediate operand. if (isa(Inc.IncExpr)) { ++NumConstIncrements; continue; }

I tried your suggestion with

diff --git a/lib/Transforms/Scalar/LoopStrengthReduce.cpp b/lib/Transforms/Scalar/LoopStrengthReduce.cpp index b027278..bc836be 100644 --- a/lib/Transforms/Scalar/LoopStrengthReduce.cpp +++ b/lib/Transforms/Scalar/LoopStrengthReduce.cpp @@ -2742,9 +2742,14 @@ isProfitableChain(IVChain &Chain, SmallPtrSetImpl<Instruction*> &Users,

 // Incrementing by zero or some constant is neutral. We assume constants can
 // be folded into an addressing mode or an add's immediate operand.

if (isa(Inc.IncExpr)) {
++NumConstIncrements;
continue;
if (const SCEVConstant *IncConst = dyn_cast(Inc.IncExpr)) {
if (IncConst != nullptr &&
((!isa(Inc.UserInst) && !isa(Inc.UserInst)) ||
TTI.isFoldableMemAccessOffset(Inc.UserInst,
IncConst->getAPInt().getSExtValue()))) {
++NumConstIncrements;
continue;
} }

if (Inc.IncExpr == LastIncExpr) @@ -2998,7 +3003,9 @@ static bool canFoldIVIncExpr(const SCEV IncExpr, Instruction UserInst, if (!IncConst || !isAddressUse(UserInst, Operand)) return false;
if (IncConst->getAPInt().getMinSignedBits() > 64)
if (IncConst->getAPInt().getMinSignedBits() > 64 ||
!TTI.isFoldableMemAccessOffset(UserInst,
IncConst->getAPInt().getSExtValue())) return false;

MemAccessTy AccessTy = getAccessType(UserInst); @@ -3007,6 +3014,8 @@ static bool canFoldIVIncExpr(const SCEV IncExpr, Instruction UserInst, IncOffset, /HaseBaseReg=/false)) return false;

, but while some decisions changed regarding chains, I did not see any improvements in instruction counts of innermost loops (actually getting somewhat worse overall). Perhaps this just isn't good enough, since per the comment this is something a smart scheduler ideally should handle. Right now, IIUC the offsets are just added to the current previous address.

I don't know why you think LSR won't create phis. I meant that it will not make use of an extra address register, so that in the case where there are very varying offsets it would not try to pick just the best one, but actually use two PHIs if there is an extra reg estimated to be available. Maybe this could be added as an extra check for the immediate offsets, so that it may do this split of the found solution in those cases? But I have been told gcc actually can select a subset of possible PHIs and so use multiple registers in cases like this...

That looks like what you want from your description. So why isn't LSR choosing those formula?

In this test case, there are 3 fixups with huge negative offsets, and 1 fixup with huge positive offset. It would have been best if LSR picked to offset the IV with a negative offsets, so that the group of 3 would lie closer the IV. What instead happens is that the IV starts with the positive offset, and then there are 3 x (load iv; 64bit subtract) in the loop :-(

The only difference in the formulas seems to be the "imm cost", which I am not sure exactly what is, and how it relates to foldable immediate offsets..(?):

New best at 4 instructions 1 reg, with addrec cost 1, plus 3 base adds, plus 31 imm cost, plus 1 setup cost. Regs: {(-1615888 + %dstGrid),+,160}<%for.body>

New best at 4 instructions 1 reg, with addrec cost 1, plus 3 base adds, plus 1 setup cost. Regs: {(1600280 + %dstGrid),+,160}<%for.body>

(Again, using two PHIs instead of choosing just one of these would have been ideal.)

You will be able to tune your target cost model when https://reviews.llvm.org/D30561 lands. Thanks - I guess I should start to experiment with this. It seems that the immediate offsets are not important enough in the current function.

qcolombet commented 7 years ago

You will be able to tune your target cost model when https://reviews.llvm.org/D30561 lands.

atrick commented 7 years ago

Make sure LSR is not "chaining" the stores before solving the formulae. I don't think this is your problem because I don't see "Final Chain:" in the debug output. I just wanted make you aware of it because I notice that this code to check profitable chains does not check if the constant is foldable and probably should:

// Incrementing by zero or some constant is neutral. We assume constants can
// be folded into an addressing mode or an add's immediate operand.
if (isa<SCEVConstant>(Inc.IncExpr)) {
  ++NumConstIncrements;
  continue;
}

I don't know why you think LSR won't create phis. Looking at the debug output I see that different uses have possible solutions with different recurrences:

LSR Use: Kind=Address of double in addrspace(0), Offsets={0}, widest fixup type: double reg({(1600280 + %dstGrid),+,160}<%for.body>) LSR Use: Kind=Address of double in addrspace(0), Offsets={0}, widest fixup type: double reg({(-1615888 + %dstGrid),+,160}<%for.body>)

That looks like what you want from your description. So why isn't LSR choosing those formula? I know it likes to minimize "addrecs", but "79 imm cost" seems extreme.

JonPsson commented 7 years ago

I have experimented for a while with this now, and would like some advice regarding legality checks:

Ideally, it would be nice to split a PHI node and make a new PHI node with a different starting offset from the base pointer, while using the same iv-increment (step). In this way, groups of memory accesses could be formed so that they all can fold the immediate offset.

So, I wonder exactly what checks I must do before making a new IV for a memory access? I don't see LSR calling hasNoSelfWrap(), although this is what I had expected while processing SCEVs for the IVUsers.

In my simple experiment, I take cases of a loop invariant base pointer, add/sub an immediate offset in the preheader, and then add immediate offsets to that iv for the depending memory accesses. I am not sure how the overflow issue should be checked, but I guess that let's say in a C program there was an unsigned index variable, then it must have a preserved overflow behaviour, right?

Thanks for any advice,

Jonas

llvm / llvm-project

[LSR] LSR is producing bad code #31895

Extended Description