llvm / llvm-project

The LLVM Project is a collection of modular and reusable compiler and toolchain technologies.
http://llvm.org
Other
28.03k stars 11.57k forks source link

instruction scheduler is overly pessimistic about instruction availability #11961

Open hfinkel opened 12 years ago

hfinkel commented 12 years ago
Bugzilla Link 11589
Version trunk
OS All
Attachments Test case
CC @atrick,@efriedma-quic

Extended Description

The scoreboard hazard detector that I've added for the PPC 440 is not detecting hazards as it should (which certainly could be my fault somehow, but...). For example, it will produce a schedule that looks like...

SU(28): 0x127969b0: f64,ch = LFD 0x12793aa0, 0x1277b4f0, 0x127965b0<Mem:LD8%scevgep100> [ORD=41] [ID=28] SU(46): 0x12796ab0: f64 = FADD 0x127969b0, 0x127968b0 [ORD=42] [ID=46] SU(27): 0x12796cb0: ch = STFD 0x12796ab0, 0x12793aa0, 0x1277b3f0, 0x127969b0:1<Mem:ST8%scevgep103> [ORD=46] [ID=27] SU(26): 0x127970b0: f64,ch = LFD 0x127941a0, 0x1277b4f0, 0x12796cb0<Mem:LD8%scevgep94(tbaa=!"double")> [ORD=50] [ID=26] SU(47): 0x127972c0: f64 = FADD 0x127970b0, 0x127968b0 [ORD=51] [ID=47] SU(25): 0x127974c0: ch = STFD 0x127972c0, 0x127941a0, 0x1277b3f0, 0x127970b0:1<Mem:ST8%scevgep97(tbaa=!"double")> [ORD=55] [ID=25]

in other words, it produces a set of load, add, store triples, non-interleaved, in order. The problem is that the result of the load is not immediately available, and either is the result of the add. The loads are covered by the itinerary:

InstrItinData<LdStLFD , [InstrStage<1, [IFTH1, IFTH2]>, InstrStage<1, [PDCD1, PDCD2]>, InstrStage<1, [DISS1, DISS2]>, InstrStage<1, [LRACC]>, InstrStage<1, [AGEN]>, InstrStage<1, [CRD]>, InstrStage<2, [LWB]>], [9, 5, 5], [NoBypass, GPR_Bypass, GPR_Bypass]>,

the add is covered by the itinerary:

InstrItinData<FPGeneral , [InstrStage<1, [IFTH1, IFTH2]>, InstrStage<1, [PDCD1, PDCD2]>, InstrStage<1, [DISS1, DISS2]>, InstrStage<1, [FRACC]>, InstrStage<1, [FEXE1]>, InstrStage<1, [FEXE2]>, InstrStage<1, [FEXE3]>, InstrStage<1, [FEXE4]>, InstrStage<1, [FEXE5]>, InstrStage<1, [FEXE6]>, InstrStage<1, [FWB]>], [10, 4, 4], [FPR_Bypass, FPR_Bypass, FPR_Bypass]>,

the store is covered by:

InstrItinData<LdStUX , [InstrStage<1, [IFTH1, IFTH2]>, InstrStage<1, [PDCD1, PDCD2]>, InstrStage<1, [DISS1, DISS2]>, InstrStage<1, [LRACC]>, InstrStage<1, [AGEN]>, InstrStage<1, [CRD]>, InstrStage<1, [LWB]>], [8, 5, 5], [NoBypass, GPR_Bypass, GPR_Bypass]>,

So, say that the load dispatches in cycle 1. According to the itinerary, the result of the load is not available until cycle 9. The add dispatches in the same cycle, or one cycle later. In the best case, it dispatches one cycle later (in cycle 2). It expects to read its inputs 4 cycles later in cycle number 6. The input, however, will not be available until cycle 9 yielding a 3 cycle stall. As far as I can tell by looking at the debug output, no hazard was reported by the scoreboard detector. Is this a bug or am I doing something wrong?

I've attached a small test case, run with llc -mcpu=440

hfinkel commented 12 years ago

I am working on a patch to SelectionDAGBuilder to correct this. I'll submit it for review soon.

efriedma-quic commented 12 years ago

So the last operand of the load is the result of the store. Because the value type is MVT::Other, ScheduleDAGSDNodes::AddSchedEdges() considers this to be part of the critical chain. Why are the loads and stores (into independent arrays) connected in this way?

That's a known missed optimization.

hfinkel commented 12 years ago

It seems that the answer to that is no; that piece of code is not executed. The dependencies seem to be added because the SDNodes are like:

SU(6): 0x228add0: ch = STFD 0x228abd0, 0x228a7d0, 0x228d3a0, 0x228a9d0:1<Mem:ST8%arrayidx5(tbaa=!"double")> [ORD=6] [ID=6]

SU(15): 0x228ec70: i32 = LA 0x228ea70, 0x228e970 [ID=15]

SU(5): 0x228b1d0: f64,ch = LFD 0x228d4a0, 0x228ec70, 0x228add0<Mem:LD8%arrayidx.1> [ORD=9] [ID=5]

So the last operand of the load is the result of the store. Because the value type is MVT::Other, ScheduleDAGSDNodes::AddSchedEdges() considers this to be part of the critical chain. Why are the loads and stores (into independent arrays) connected in this way?

hfinkel commented 12 years ago

Is this caused by the TODO in the following comment in CodeGen/ScheduleDAGInstrs.cpp (BuildSchedGraph)?

// Add chain dependencies.
// Chain dependencies used to enforce memory order should have
// latency of 0 (except for true dependency of Store followed by
// aliased Load... we estimate that with a single cycle of latency
// assuming the hardware will bypass)
// Note that isStoreToStackSlot and isLoadFromStackSLot are not usable
// after stack slots are lowered to actual addresses.
// TODO: Use an AliasAnalysis and do real alias-analysis queries, and
// produce more precise dependence information.
hfinkel commented 12 years ago

Looking into this more, it seems that the problem is that the scheduler never has more than one instruction from which to choose.

The instruction stream is essentially a series of: %inc15 = or i32 %i.013, 1 %arrayidx.1 = getelementptr inbounds [16000 x double] @​Y, i32 0, i32 %inc15 %1 = load double %arrayidx.1, align 8, !tbaa !​0 %add.1 = fadd double %1, 1.000000e+00 %arrayidx5.1 = getelementptr inbounds [16000 x double] @​X, i32 0, i32 %inc15 store double %add.1, double %arrayidx5.1, align 8, !tbaa !​0 %inc.116 = or i32 %i.013, 2 %arrayidx.2 = getelementptr inbounds [16000 x double] @​Y, i32 0, i32 %inc.116 %2 = load double %arrayidx.2, align 16, !tbaa !​0 %add.2 = fadd double %2, 1.000000e+00 %arrayidx5.2 = getelementptr inbounds [16000 x double] @​X, i32 0, i32 %inc.116 store double %add.2, double %arrayidx5.2, align 16, !tbaa !​0

And during scheduling we have: Examining Available: Examining Available: Height 39: SU(9): 0x34c4db0: ch = STFD 0x34c49b0, 0x34c6ad0, 0x34a2bf0, 0x34c48b0:1<Mem:ST8%scevgep18(tbaa=!"double")> [ORD=126] [ID=9]

*** Scheduling [39]: SU(9): 0x34c4db0: ch = STFD 0x34c49b0, 0x34c6ad0, 0x34a2bf0, 0x34c48b0:1<Mem:ST8%scevgep18(tbaa=!"double")> [ORD=126] [ID=9] ... Examining Available: Height 42: SU(55): 0x34c49b0: f64 = FADD 0x34c48b0, 0x34bb4f0 [ORD=122] [ID=55]

*** Scheduling [42]: SU(55): 0x34c49b0: f64 = FADD 0x34c48b0, 0x34bb4f0 [ORD=122] [ID=55] ... Examining Available: Height 48: SU(10): 0x34c48b0: f64,ch = LFD 0x34c6ad0, 0x34a2cf0, 0x34c46b0<Mem:LD8%scevgep21(tbaa=!"double")> [ORD=121] [ID=10]

*** Scheduling [48]: SU(10): 0x34c48b0: f64,ch = LFD 0x34c6ad0, 0x34a2cf0, 0x34c46b0<Mem:LD8%scevgep21(tbaa=!"double")> [ORD=121] [ID=10] ...

Is this an alias-analysis problem?