llvm / llvm-project

The LLVM Project is a collection of modular and reusable compiler and toolchain technologies.
http://llvm.org

[llvm-mca] bottleneck-analysis result conflict with timeline result on aarch64 #46724

Open llvmbot opened 4 years ago

llvmbot commented 4 years ago
Bugzilla Link: 47380
Version: trunk
OS: Linux
Reporter: LLVM Bugzilla Contributor
CC: @adibiagio, @LebedevRI, @RKSimon

Extended Description

LLVM (http://llvm.org/):
  LLVM version 12.0.0git
  DEBUG build with assertions.
  Default target: aarch64-unknown-linux-gnu
  Host CPU: tsv110

bottleneck-analysis

command:

../bin/llvm-mca -mcpu=tsv110 -dispatch=4 -mtriple=aarch64-unkonw-linux-gnu -iterations=10 --bottleneck-analysis float_test.s

result:

Critical sequence based on the simulation:

              Instruction                   Dependency Information
 +----< 0.    adrp x6, #0
 +----> 1.    add x7, x6, #2864             ## REGISTER dependency: x6
 |      2.    fmov d6, #1.00000000
 +----> 3.    ldp d0, d4, [x7, #296]        ## REGISTER dependency: x7
 |      4.    ldr d7, [x7, #312]
 |      5.    ldr d19, [x7, #320]
 +----> 6.    fmul d16, d2, d0              ## REGISTER dependency: d0
 +----> 7.    frinta d5, d16                ## REGISTER dependency: d16
 +----> 8.    fcvtas x8, d16                ## RESOURCE interference: A57UnitX [ probability: 10% ]
 |      9.    fsub d17, d16, d5
 +----> 10.   and x9, x8, #0x1f             ## REGISTER dependency: x8
 |      11.   fmadd d18, d4, d17, d7
 |      12.   fmadd d20, d19, d17, d6
 |      13.   fmul d21, d17, d17
 +----> 14.   ldr x10, [x7, x9, lsl #3]     ## REGISTER dependency: x9
 |      15.   add x11, x10, x8, lsl #47
 |      16.   fmov d23, x11
 |      17.   fmadd d22, d18, d21, d20
 |      18.   fmul d24, d22, d23
 |      19.   fcvt s0, d24
 |      20.   ldr x14, [x1, #4056]
 |      21.   ldr x2, [sp, #24]
 +----> 22.   ldr x15, [x14]                ## RESOURCE interference: A57UnitL [ probability: 20% ]
 +----> 23.   eor x15, x2, x15              ## REGISTER dependency: x15

timeline

command:

../bin/llvm-mca -mcpu=tsv110 -dispatch=4 -mtriple=aarch64-unkonw-linux-gnu -iterations=1 --timeline float_test.s

result:

Timeline view:
                    0123456789          0123456789          012
Index     0123456789          0123456789          0123456789

[0,0]     DeER .    .    .    .    .    .    .    .    .    . .   adrp x6, #0
[0,1]     D=eER.    .    .    .    .    .    .    .    .    . .   add x7, x6, #2864
[0,2]     DeeeER    .    .    .    .    .    .    .    .    . .   fmov d6, #1.00000000
[0,3]     .D=eeeeeER.    .    .    .    .    .    .    .    . .   ldp d0, d4, [x7, #296]
[0,4]     .D==eeeeeER    .    .    .    .    .    .    .    . .   ldr d7, [x7, #312]
[0,5]     .D===eeeeeER   .    .    .    .    .    .    .    . .   ldr d19, [x7, #320]
[0,6]     . D=====eeeeeER.    .    .    .    .    .    .    . .   fmul d16, d2, d0
[0,7]     . D==========eeeeeER.    .    .    .    .    .    . .   frinta d5, d16
[0,8]     . D==========eeeeeeeeeeER.    .    .    .    .    . .   fcvtas x8, d16
[0,9]     . D===============eeeeeER.    .    .    .    .    . .   fsub d17, d16, d5
[0,10]    .  D===================eER    .    .    .    .    . .   and x9, x8, #0x1f
[0,11]    .  D===================eeeeeeeeeER .    .    .    . .   fmadd d18, d4, d17, d7
[0,12]    .  D===================eeeeeeeeeER .    .    .    . .   fmadd d20, d19, d17, d6
[0,13]    .  D====================eeeeeE---R .    .    .    . .   fmul d21, d17, d17
[0,14]    .   D===================eeeeE----R .    .    .    . .   ldr x10, [x7, x9, lsl #3]
[0,15]    .   D=======================eeE--R .    .    .    . .   add x11, x10, x8, lsl #47
[0,16]    .   D=========================eeeeeER   .    .    . .   fmov d23, x11
[0,17]    .   D===========================eeeeeeeeeER  .    . .   fmadd d22, d18, d21, d20
[0,18]    .    D===================================eeeeeER  . .   fmul d24, d22, d23
[0,19]    .    D========================================eeeeeER   fcvt s0, d24
[0,20]    .    .DeeeeE----------------------------------------R   ldr x14, [x1, #4056]
[0,21]    .    .D=eeeeE---------------------------------------R   ldr x2, [sp, #24]
[0,22]    .    .D====eeeeE------------------------------------R   ldr x15, [x14]
[0,23]    .    .D========eE-----------------------------------R   eor x15, x2, x15

Expected: the floating-point instructions below should be part of the critical sequence, because of FSU resource pressure and register dependencies:

 |      17.   fmadd d22, d18, d21, d20
 |      18.   fmul d24, d22, d23
 |      19.   fcvt s0, d24

adibiagio commented 3 years ago

The bottleneck analysis is not a "critical-path" analysis. The analysis is conducted at simulation time; it is purely based on the observation of so-called "pressure increase" events, usually generated by a Scheduler component. Pressure events are generated only in two situations:

  1. Hardware pipeline utilisation could be increased if instructions weren't subject to data dependencies.
  2. There are instructions ready to execute. However, the pipelines are fully booked, and the number of instructions dispatched during that cycle was greater than the number of instructions issued on the underlying pipes.

Essentially: point 1. is about data dependencies limiting the issue throughput. Point 2. is instead about pipeline resources being unavailable, and an issue rate that is too low even though the instructions are free of data dependencies.
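To make the two cases concrete, here is a minimal Python sketch of how a single simulated cycle could be classified. It is a toy model under stated assumptions, not llvm-mca's implementation: `CycleSnapshot`, its fields, and `classify_pressure` are hypothetical names invented for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CycleSnapshot:
    """Hypothetical per-cycle scheduler summary (not an llvm-mca type)."""
    dispatched: int        # instructions dispatched to the scheduler this cycle
    issued: int            # instructions issued to the pipelines this cycle
    waiting_on_deps: int   # buffered instructions blocked by data dependencies
    ready_not_issued: int  # ready instructions that found no free pipeline
    free_pipes: int        # pipelines left idle this cycle


def classify_pressure(c: CycleSnapshot) -> Optional[str]:
    """Return the kind of pressure increase seen in this cycle, or None."""
    # Point 2: ready work exists, the pipes are fully booked, and more
    # instructions entered the scheduler than left it.
    if c.ready_not_issued > 0 and c.free_pipes == 0 and c.dispatched > c.issued:
        return "resource"
    # Point 1: pipes sat idle while instructions waited on operands, so
    # utilisation would have been higher without those dependencies.
    if c.free_pipes > 0 and c.waiting_on_deps > 0:
        return "data-dependency"
    # Dependencies may exist, but they did not limit issue throughput.
    return None
```

Note the last case: a cycle with a waiting instruction but no idle pipe returns `None`, which mirrors the idea that a dependency only counts when it limits the issue throughput.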

Therefore, not all data dependencies are necessarily seen as "problematic" for the purpose of this analysis. Only those that limit the issue throughput are problematic.

Back to your example: it may be that those dependencies are not problematic during the first ten iterations of the loop. Those may introduce problems if the number of iterations is increased.

Your timeline only shows that there are data dependencies, nothing more. The effects of those dependencies on the throughput may only become apparent if you increase the number of iterations to something more than 10. During that short simulation, the scheduler was probably still able to extract enough ILP and feed the underlying pipes. Over time, problematic dependencies would induce an increase in back-pressure on the scheduler buffers, eventually leading to compulsory stalls. It may be that 10 iterations weren't enough to reach that critical point.
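The delayed onset of stalls can be illustrated with a rough Python sketch of a scheduler buffer that fills slightly faster than it drains. Everything here is an assumption made for illustration: the function name is hypothetical, and the issue rate (3) and buffer size (32) are made-up numbers, not tsv110 parameters. Only the dispatch width of 4 comes from the `-dispatch=4` option in the report.

```python
from typing import Optional


def first_stall_cycle(dispatch_width: int, issue_rate: int,
                      buffer_size: int, cycles: int) -> Optional[int]:
    """Return the first cycle in which dispatch stalls, or None if it never does.

    Toy model: each cycle the scheduler issues up to `issue_rate` buffered
    instructions (the rate the dependencies allow), then tries to accept a
    full dispatch group of `dispatch_width` instructions.
    """
    occupancy = 0
    for cycle in range(cycles):
        # Drain what the data dependencies allow this cycle.
        occupancy -= min(occupancy, issue_rate)
        # A full dispatch group needs this many free buffer entries.
        if buffer_size - occupancy < dispatch_width:
            return cycle
        occupancy += dispatch_width
    return None


# A short run never sees the stall; a longer one does.
short_run = first_stall_cycle(4, 3, 32, cycles=20)
long_run = first_stall_cycle(4, 3, 32, cycles=100)
```

With these made-up numbers the buffer gains one entry per cycle, so the short run ends before any stall while the longer run hits one. The model counts cycles rather than loop iterations, but the point carries over: a simulation that stops early can miss back-pressure that is still building.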

Generally speaking, when doing bottleneck analysis (or throughput analysis in general), it is strongly advised to use a large number of iterations. If possible, I recommend sticking with the default (i.e. 100 iterations) unless there are compelling reasons for doing otherwise.
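As a small illustration, the Python sketch below builds the bottleneck-analysis command line and leaves out `-iterations` unless a value is passed, so the default applies. The helper name `mca_command` is hypothetical; the file name, CPU and options are the ones used in the report, with the triple spelled as in the reported default target.

```python
import shlex
from typing import List, Optional


def mca_command(asm_file: str, mcpu: str, mtriple: str,
                iterations: Optional[int] = None,
                binary: str = "llvm-mca") -> List[str]:
    """Build an llvm-mca bottleneck-analysis command as an argument list."""
    cmd = [binary, f"-mcpu={mcpu}", f"-mtriple={mtriple}", "--bottleneck-analysis"]
    # Only override the iteration count when explicitly requested;
    # otherwise llvm-mca uses its default.
    if iterations is not None:
        cmd.append(f"-iterations={iterations}")
    cmd.append(asm_file)
    return cmd


default_cmd = mca_command("float_test.s", "tsv110", "aarch64-unknown-linux-gnu")
print(shlex.join(default_cmd))
```

The list can be passed to `subprocess.run` as-is if `llvm-mca` is on the `PATH`. Comparing the report from this command against one with a small `iterations` value is one way to check whether a short run ended before the bottleneck showed up.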