llvm / llvm-project

The LLVM Project is a collection of modular and reusable compiler and toolchain technologies.
http://llvm.org

Optimization arm_ldst_opt inserts newly generated instruction vldmia at incorrect position #15196

Closed llvmbot closed 11 years ago

llvmbot commented 11 years ago
Bugzilla Link 14824
Resolution FIXED
Resolved on Apr 18, 2013 20:50
Version trunk
OS Linux
Attachments The small test case verifying the bug in arm_ldst_opt, The assembly code for test case ldst_opt_bug.ll
Reporter LLVM Bugzilla Contributor
CC @jmolloy,@kaomoneus

Extended Description

Hi,

Optimization arm_ldst_opt inserts newly generated instruction vldmia at incorrect position.

For the attached small test case ldst_opt_bug.ll, using this command line,

llc -mcpu=cortex-a9 -mattr=+neon,+neonfp ldst_opt_bug.ll

we see the following incorrect instruction sequence in ldst_opt_bug.s:

    vldr    s0, [r0, #432]
    vldr    s5, [r1, #412]
    vldr    s14, [r1, #440]
    vldr    s10, [r0, #440]
    vldmia  r3, {s0, s1, s2, s3}
    add.w   r3, r2, #432
    vldr    s7, [r1, #420]
That is, the instruction "vldmia r3, {s0, s1, s2, s3}" overwrites s0, which was just loaded by "vldr s0, [r0, #432]". This is incorrect according to the original LLVM IR semantics in ldst_opt_bug.ll.

In optimization arm_ldst_opt, before generating instruction vldmia, we have the following IR,

    (1) %S0 = VLDRS %R0, 102, pred:14, pred:%noreg, %Q0; mem:LD4[%arrayidx67+24]
    (2) %S1 = VLDRS %R0, 103, pred:14, pred:%noreg, %Q0<imp-use,kill>, %Q0; mem:LD4[%arrayidx67+28]
    (3) %S11 = VLDRS %R0, 111, pred:14, pred:%noreg, %Q2<imp-use,kill>, %Q2; mem:LD4[%arrayidx67+60]
    (4) %S10 = VLDRS %R0, 110, pred:14, pred:%noreg, %Q2<imp-use,kill>, %Q2; mem:LD4[%arrayidx67+56]
    (5) %S1 = VLDRS %R0, 109, pred:14, pred:%noreg, %Q0<imp-use,kill>, %Q0; mem:LD4[%arrayidx67+52]
    (6) %S0 = VLDRS %R0, 108, pred:14, pred:%noreg, %Q0<imp-use,kill>, %Q0; mem:LD4[%arrayidx67+48]
    (7) %S3 = VLDRS %R0, 105, pred:14, pred:%noreg, %Q0<imp-use,kill>, %Q0; mem:LD4[%arrayidx67+36]
    (8) %S2 = VLDRS %R0, 104, pred:14, pred:%noreg, %Q0<imp-use,kill>, %Q0; mem:LD4[%arrayidx67+32]
    (9) %S7 = VLDRS %R1, 105, pred:14, pred:%noreg, %Q1<imp-use,kill>, %Q1; mem:LD4[%arrayidx64+36]

The optimization tries to hoist instructions (7) and (8) so they can be merged with (1) and (2) into a vldm, because the four loads access sequential memory at offsets 102, 103, 104, 105. This intention of the optimization itself is correct.

After hoisting, the algorithm first generates an internal instruction sequence,

(1) (2) (7) (8)

The problem is that the newly generated vldm is incorrectly inserted after instruction (8). This violates the data dependence with instruction (6).

The source code introducing the bug has to do with the function ARMLoadStoreOpt::MergeOpsUpdate:

    // Try to do the merge.
    MachineBasicBlock::iterator Loc = memOps[insertAfter].MBBI;
    ++Loc;
    if (!MergeOps(MBB, Loc, Offset, Base, BaseKill, Opcode,
                  Pred, PredReg, Scratch, dl, Regs, ImpDefs))
      return;

When Loc is (8), ++Loc is (9).

llvmbot commented 11 years ago

Fixed in 179751 http://llvm.org/viewvc/llvm-project?view=revision&revision=179751

kaomoneus commented 11 years ago

Hi all! The link to thread with patch: http://lists.cs.uiuc.edu/pipermail/llvm-commits/Week-of-Mon-20130311/167936.html

kaomoneus commented 11 years ago

I proposed a new fix to llvm-commits. It is still an improvement of the one proposed by Hao, but it also tracks conflicts between different register classes, like S5 vs D2. IMHO, the load/store optimizer has to be fixed: it ought to track that kind of conflict. Otherwise there must be some guarantee that all input basic blocks are clean, in which case the load/store optimizer's requirements should include a pass that cleans the code. I think it is easier to embed the conflict tracking.

jmolloy commented 11 years ago

However, for whatever reason, if we feed in an LLVM IR sequence containing a WAW dependence, the ld/st optimization can fail because the queue used by the current algorithm does not handle WAW dependences.

While I agree, the pass already checks for volatility of memory accesses and does not attempt to optimize them. I can't think of a way of having legitimate input code with a write-after-write dependence between non-volatile accesses.

So potentially the load-store optimization pass does not need to be fixed up at all, and the change in the instruction selector could suffice.

llvmbot commented 11 years ago

I think they should be treated as two separate issues in ISel and ld/st optimization.

Yes, maybe improving ISel can remove the dead code that shows up as the WAW dependence before entering the ld/st optimization. However, for whatever reason, if we feed in an LLVM IR sequence containing a WAW dependence, the ld/st optimization can fail because the queue used by the current algorithm does not handle WAW dependences.

llvmbot commented 11 years ago

A quick fix: remove the dead instructions from MemOps and ignore them in ARMLoadStoreOptimization.

llvmbot commented 11 years ago

Yes, but look at the produced assembler source with the passes disabled. I don't have my notes with me, but I seem to recall that data is loaded into a series of "S" registers, then moved from the "S" registers into a non-overlapping "q" register; then the upper "d" slot of the register is stored, but the lower "d" slot is then overwritten. Similarly, the original "S" register is also overwritten before being stored or copied to any register other than the one mentioned previously. I'll post again when I get in tomorrow morning.

By the way, what is the source that you are working from to generate this IR? As my colleague Jakob observed, passing a vector of elements that have illegal "machine" type, as i64 is for ARM, is likely to be hitting areas of code that aren't exercised very often by typical use of the tool chain.

I use the source in the attachment added by Jiangning.

I focused on ARMLoadStoreOptimization and am not so familiar with the other code (I'm a newbie in LLVM). There is a quick way to fix this bug: just don't optimize the dead instructions in the pass (the patch is in the attachment). It's the easiest way, but not the best way. If you can solve this problem in the instruction selector, that would be even better.

llvmbot commented 11 years ago

Yes, but look at the produced assembler source with the passes disabled. I don't have my notes with me, but I seem to recall that data is loaded into a series of "S" registers, then moved from the "S" registers into a non-overlapping "q" register; then the upper "d" slot of the register is stored, but the lower "d" slot is then overwritten. Similarly, the original "S" register is also overwritten before being stored or copied to any register other than the one mentioned previously. I'll post again when I get in tomorrow morning.

By the way, what is the source that you are working from to generate this IR? As my colleague Jakob observed, passing a vector of elements that have illegal "machine" type, as i64 is for ARM, is likely to be hitting areas of code that aren't exercised very often by typical use of the tool chain.

llvmbot commented 11 years ago

Based on my investigation, the pass merges several vldr instructions into a vldmia, but inserts the vldmia in the wrong place, which violates a Write After Write (WAW) dependence.

More details in ARMLoadStoreOptimization:

This pass has two steps:

Step 1: Select sequential instructions with the same opcode and base register, reorder them in ascending offset order, and add them to a queue named MemOps.

Step 2: Merge several load/store instructions into one load/store-multiple instruction, and insert the new merged instruction after the last of the instructions being merged.

For example, if the original code is as follows:

    (1) %S0 = VLDRS %R0, 102
    (2) %S1 = VLDRS %R0, 103
    (3) %S11 = VLDRS %R0, 111
    (4) %S10 = VLDRS %R0, 110
    (5) %S1 = VLDRS %R0, 109
    (6) %S0 = VLDRS %R0, 108
    (7) %S3 = VLDRS %R0, 105
    (8) %S2 = VLDRS %R0, 104
    (9) %S7 = VLDRS %R1, 105

In step 1 they are reordered into the MemOps queue:

    (1) %S0 = VLDRS %R0, 102
    (2) %S1 = VLDRS %R0, 103
    (8) %S2 = VLDRS %R0, 104
    (7) %S3 = VLDRS %R0, 105
    (6) %S0 = VLDRS %R0, 108
    (5) %S1 = VLDRS %R0, 109
    (4) %S10 = VLDRS %R0, 110
    (3) %S11 = VLDRS %R0, 111

In step 2, (1)(2)(8)(7) are merged into a VLDMIA instruction, which is inserted after the old position of (8):

    (3) %S11 = VLDRS %R0, 111
    (4) %S10 = VLDRS %R0, 110
    (5) %S1 = VLDRS %R0, 109
    (6) %S0 = VLDRS %R0, 108
    ( ) VLDMIA R3, {S0, S1, S2, S3} ; R3 records the start address
    (9) %S7 = VLDRS %R1, 105

It is obvious that the VLDMIA overwrites registers S0 and S1.

llvmbot commented 11 years ago

Comments extracted from radar 12973840:

1/22/13 3:08 PM Joel Jones: Neither of the ARM load/store optimizers (ARMLoadStoreOpt and ARMPreAllocLoadStoreOpt) is directly responsible for the problem with overwriting (by loading) a register before it is written/stored. This was verified by adding disabling code to ARMTargetMachine.cpp and by putting a debug message as the first line of the runOnMachineFunction for both of the load/store optimizers. When the optimizers were disabled, the debug message was not in the output. When the optimizers were enabled, both debug messages were there. In both cases, the resulting .s files were different and the register overwriting behavior was exhibited by both. The formation of the vldmia was a red herring.

1/30/13 9:45 AM Joel Jones: Actually, this problem seems to be in the instruction selector. After discussion with Jakob, we theorized that the "<8 x i64>" may be causing problems on ARM, since i64 isn't a legal type (because the GPRs can't handle them). My plan is to make sure that the loads that don't have corresponding stores in the output from instruction selection are actually stored in the input IR.

Hi Joel, I am working on this bug, but I doubt the result of disabling the optimizers in ARMTargetMachine.cpp. I tried disabling the ARMLoadStoreOptimization pass:

    bool ARMPassConfig::addPreSched2() {
      // FIXME: temporarily disabling load / store optimization pass for Thumb1.
      if (getOptLevel() != CodeGenOpt::None) {
        if (!getARMSubtarget().isThumb1Only()) {
          //addPass(createARMLoadStoreOptimizationPass());
          printAndVerify("After ARM load / store optimizer");
        }

And the register overwriting behavior disappeared. The generated code was:

    vldr s0, [r0, #408]
    ...
    vldr s0, [r0, #432]

With the ARMLoadStoreOptimization pass enabled (addPass not commented out), the generated code was:

    vldr s0, [r0, #432]
    ...
    vldmia r3, {s0, s1, s2, s3} ; r3 is #408

The ARMLoadStoreOptimization pass changed the code and made the generated .s incorrect: s0 and s1 are wrongly overwritten.

llvmbot commented 11 years ago

Comments extracted from radar 12973840:

1/22/13 3:08 PM Joel Jones: Neither of the ARM load/store optimizers (ARMLoadStoreOpt and ARMPreAllocLoadStoreOpt) is directly responsible for the problem with overwriting (by loading) a register before it is written/stored. This was verified by adding disabling code to ARMTargetMachine.cpp and by putting a debug message as the first line of the runOnMachineFunction for both of the load/store optimizers. When the optimizers were disabled, the debug message was not in the output. When the optimizers were enabled, both debug messages were there. In both cases, the resulting .s files were different and the register overwriting behavior was exhibited by both. The formation of the vldmia was a red herring.

1/30/13 9:45 AM Joel Jones: Actually, this problem seems to be in the instruction selector. After discussion with Jakob, we theorized that the "<8 x i64>" may be causing problems on ARM, since i64 isn't a legal type (because the GPRs can't handle them). My plan is to make sure that the loads that don't have corresponding stores in the output from instruction selection are actually stored in the input IR.