[llvm-mca] bottleneck-analysis result conflict with timeline result on aarch64


Bugzilla Link	47380
Version	trunk
OS	Linux
Reporter	LLVM Bugzilla Contributor
CC	@adibiagio,@LebedevRI,@RKSimon

Extended Description

LLVM (http://llvm.org/):
  LLVM version 12.0.0git
  DEBUG build with assertions.
  Default target: aarch64-unknown-linux-gnu
  Host CPU: tsv110

bottleneck-analysis

command:

../bin/llvm-mca -mcpu=tsv110 -dispatch=4 -mtriple=aarch64-unkonw-linux-gnu -iterations=10 --bottleneck-analysis float_test.s

result:

Critical sequence based on the simulation:

          Instruction                                 Dependency Information

+----< 0.  adrp x6, #0
+----> 1.  add x7, x6, #2864          ## REGISTER dependency: x6
|      2.  fmov d6, #1.00000000
+----> 3.  ldp d0, d4, [x7, #296]     ## REGISTER dependency: x7
|      4.  ldr d7, [x7, #312]
|      5.  ldr d19, [x7, #320]
+----> 6.  fmul d16, d2, d0           ## REGISTER dependency: d0
+----> 7.  frinta d5, d16             ## REGISTER dependency: d16
+----> 8.  fcvtas x8, d16             ## RESOURCE interference: A57UnitX [ probability: 10% ]
|      9.  fsub d17, d16, d5
+----> 10. and x9, x8, #0x1f          ## REGISTER dependency: x8
|      11. fmadd d18, d4, d17, d7
|      12. fmadd d20, d19, d17, d6
|      13. fmul d21, d17, d17
+----> 14. ldr x10, [x7, x9, lsl #3]  ## REGISTER dependency: x9
|      15. add x11, x10, x8, lsl #47
|      16. fmov d23, x11
|      17. fmadd d22, d18, d21, d20
|      18. fmul d24, d22, d23
|      19. fcvt s0, d24
|      20. ldr x14, [x1, #4056]
|      21. ldr x2, [sp, #24]
+----> 22. ldr x15, [x14]             ## RESOURCE interference: A57UnitL [ probability: 20% ]
+----> 23. eor x15, x2, x15           ## REGISTER dependency: x15

timeline

command:

../bin/llvm-mca -mcpu=tsv110 -dispatch=4 -mtriple=aarch64-unkonw-linux-gnu -iterations=1 --timeline float_test.s

result:

Timeline view:
0123456789 0123456789 012
Index 0123456789 0123456789 0123456789

[0,0] DeER . . . . . . . . . . . adrp x6, #0
[0,1] D=eER. . . . . . . . . . . add x7, x6, #2864
[0,2] DeeeER . . . . . . . . . . fmov d6, #1.00000000
[0,3] .D=eeeeeER. . . . . . . . . . ldp d0, d4, [x7, #296]
[0,4] .D==eeeeeER . . . . . . . . . ldr d7, [x7, #312]
[0,5] .D===eeeeeER . . . . . . . . . ldr d19, [x7, #320]
[0,6] . D=====eeeeeER. . . . . . . . . fmul d16, d2, d0
[0,7] . D==========eeeeeER. . . . . . . . frinta d5, d16
[0,8] . D==========eeeeeeeeeeER. . . . . . . fcvtas x8, d16
[0,9] . D==============eeeeeER. . . . . . . fsub d17, d16, d5
[0,10] . D===================eER . . . . . . and x9, x8, #0x1f
[0,11] . D===================eeeeeeeeeER . . . . . fmadd d18, d4, d17, d7
[0,12] . D===================eeeeeeeeeER . . . . . fmadd d20, d19, d17, d6
[0,13] . D===================eeeeeE---R . . . . . fmul d21, d17, d17
[0,14] . D===================eeeeE----R . . . . . ldr x10, [x7, x9, lsl #3]
[0,15] . D=======================eeE--R . . . . . add x11, x10, x8, lsl #47
[0,16] . D========================eeeeeER . . . . fmov d23, x11
[0,17] . D==========================eeeeeeeeeER . . . fmadd d22, d18, d21, d20
[0,18] . D===================================eeeeeER . . fmul d24, d22, d23
[0,19] . D========================================eeeeeER fcvt s0, d24
[0,20] . .DeeeeE----------------------------------------R ldr x14, [x1, #4056]
[0,21] . .D=eeeeE---------------------------------------R ldr x2, [sp, #24]
[0,22] . .D====eeeeE------------------------------------R ldr x15, [x14]
[0,23] . .D========eE-----------------------------------R eor x15, x2, x15

The floating-point instructions below should appear in the critical sequence, given their FSU usage and register dependencies:

|      17. fmadd d22, d18, d21, d20
|      18. fmul d24, d22, d23
|      19. fcvt s0, d24

The bottleneck analysis is not a "critical-path" analysis. The analysis is conducted at simulation time; it is purely based on the observation of so-called "pressure increase" events, usually generated by a Scheduler component. Pressure events are generated only in two situations:

1. Hardware pipeline utilisation could be increased if instructions weren't subject to data dependencies.
2. Instructions are ready to execute, but the pipelines are fully booked, and the number of instructions dispatched during that cycle was larger than the number of instructions issued to the underlying pipes.

Essentially: point 1 is about data dependencies limiting the issue throughput. Point 2 is instead about pipeline resources being unavailable, and an issue rate that is too low (even though the instructions are free of data dependencies).

Therefore, not all data dependencies are necessarily seen as "problematic" for the purpose of this analysis. Only those that limit the issue throughput are problematic.
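To make the two situations concrete, here is a toy per-cycle classifier. It is only a sketch of the idea described above, not llvm-mca's actual implementation: `CycleSnapshot`, its fields, and `classify` are all hypothetical names invented for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class CycleSnapshot:
    """Hypothetical summary of one simulated cycle (not an llvm-mca type)."""
    dispatched: int        # instructions dispatched to the scheduler this cycle
    issued: int            # instructions issued to the pipelines this cycle
    ready_not_issued: int  # ready instructions left waiting for a busy pipeline
    waiting_on_deps: int   # instructions still waiting for their operands
    free_pipes: int        # pipelines that had nothing to execute this cycle


def classify(cycle: CycleSnapshot) -> List[str]:
    """Return the pressure events this toy model would record for a cycle."""
    events = []
    # Situation 1: some pipelines sat idle while instructions waited on operands,
    # so utilisation could have been higher without the data dependencies.
    if cycle.free_pipes > 0 and cycle.waiting_on_deps > 0:
        events.append("data-dependency pressure")
    # Situation 2: instructions were ready, every pipeline was busy, and more
    # instructions were dispatched than issued during the cycle.
    if (cycle.ready_not_issued > 0 and cycle.free_pipes == 0
            and cycle.dispatched > cycle.issued):
        events.append("resource pressure")
    return events


# Dependencies exist, but every pipeline is busy and the issue rate keeps up:
# no pressure event is recorded for this cycle.
harmless = CycleSnapshot(dispatched=4, issued=4, ready_not_issued=0,
                         waiting_on_deps=2, free_pipes=0)
print(classify(harmless))
```

The last snapshot is the case that matters here: instructions are waiting on operands, yet the toy model records nothing, because those dependencies did not reduce the issue throughput in that cycle.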

Back to your example: it may be that those dependencies are not problematic during the first ten iterations of the loop. They may become problematic if the number of iterations is increased.

Your timeline only shows that there are data dependencies. Nothing more. The effects of those dependencies on the throughput may only become apparent if you increase the number of iterations to something more than 10. During that short simulation, the scheduler was probably still able to extract enough ILP and feed the underlying pipes. Over time, problematic dependencies would induce an increase in back-pressure on the scheduler buffers, eventually leading to compulsory stalls. It may be that 10 iterations weren't enough to reach that critical point.
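The build-up of back-pressure can be illustrated with a deliberately simplified queue model. Everything in it is an assumption made for the sake of the example: the function name, the buffer size of 64, and the fixed issue rate of 3 per cycle are invented and do not come from the tsv110 scheduling model. Only the 24 instructions per iteration and the dispatch width of 4 mirror the report.

```python
def dispatch_stall_cycles(iterations, insts_per_iter=24, dispatch_width=4,
                          issue_per_cycle=3, buffer_size=64):
    """Count cycles in which dispatch is cut short by a full scheduler buffer.

    Toy model: each cycle the scheduler first issues up to `issue_per_cycle`
    buffered instructions, then accepts up to `dispatch_width` new ones,
    limited by the free buffer slots.
    """
    remaining = iterations * insts_per_iter  # instructions not yet dispatched
    occupancy = 0                            # instructions waiting in the buffer
    stalls = 0
    while remaining > 0 or occupancy > 0:
        occupancy -= min(occupancy, issue_per_cycle)
        wanted = min(dispatch_width, remaining)
        accepted = min(wanted, buffer_size - occupancy)
        if accepted < wanted:
            stalls += 1                      # back-pressure: dispatch was throttled
        occupancy += accepted
        remaining -= accepted
    return stalls


for n in (1, 10, 100):
    print(n, dispatch_stall_cycles(n))
```

With these made-up parameters the buffer gains one entry per cycle, so the 10-iteration run finishes dispatching before the buffer fills and reports no stalls, while the 100-iteration run does hit the limit. The real simulator is far more detailed, but the shape of the effect is the same: a short run can end before the problematic dependencies have had time to show up as stalls.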

Generally speaking, when doing bottleneck analysis (or throughput analysis in general), it is strongly advised to use a large number of iterations. If possible, I recommend sticking with the default (i.e. 100 iterations) unless there are compelling reasons for doing otherwise.
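As a starting point, the bottleneck analysis from the report could be re-run at the default iteration count with a small helper like the one below. The helper `mca_command` and its defaults are hypothetical. The flags are the ones used in the report, with `-dispatch=4` omitted for brevity. The script assumes `llvm-mca` is on `PATH` and that `float_test.s` is in the current directory, and it only prints the command if either is missing.

```python
import os
import shutil
import subprocess


def mca_command(source, iterations=100, mcpu="tsv110",
                mtriple="aarch64-unknown-linux-gnu", binary="llvm-mca"):
    """Build an llvm-mca bottleneck-analysis command line (hypothetical helper)."""
    return [
        binary,
        f"-mcpu={mcpu}",
        f"-mtriple={mtriple}",
        f"-iterations={iterations}",
        "-bottleneck-analysis",
        source,
    ]


if __name__ == "__main__":
    cmd = mca_command("float_test.s")
    if shutil.which(cmd[0]) and os.path.exists(cmd[-1]):
        # Run the analysis only when both the tool and the input are available.
        subprocess.run(cmd, check=True)
    else:
        # Otherwise just show the command that would be run.
        print(" ".join(cmd))
```

Comparing the output of `mca_command("float_test.s", iterations=10)` with the default 100-iteration run is one way to check whether the floating-point chain enters the critical sequence once the simulation runs long enough.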
