llvm / llvm-project

The LLVM Project is a collection of modular and reusable compiler and toolchain technologies.
http://llvm.org

[MCA] Inaccuracy in small snippet #99395

Open boomanaiden154 opened 4 months ago

boomanaiden154 commented 4 months ago

Take the small snippet:

```asm
incq %r15
addq $0x4, %r13
cmpq $0x3f, %r15
```

Running this through MCA on `skylake`/`skylake-avx512` produces the following:

```
Iterations:        100
Instructions:      300
Total Cycles:      104
Total uOps:        300

Dispatch Width:    6
uOps Per Cycle:    2.88
IPC:               2.88
Block RThroughput: 0.8

Instruction Info:
[1]: #uOps
[2]: Latency
[3]: RThroughput
[4]: MayLoad
[5]: MayStore
[6]: HasSideEffects (U)

[1]    [2]    [3]    [4]    [5]    [6]    Instructions:
 1      1     0.25                        incq  %r15
 1      1     0.25                        addq  $4, %r13
 1      1     0.25                        cmpq  $63, %r15

Resources:
[0]   - SKXDivider
[1]   - SKXFPDivider
[2]   - SKXPort0
[3]   - SKXPort1
[4]   - SKXPort2
[5]   - SKXPort3
[6]   - SKXPort4
[7]   - SKXPort5
[8]   - SKXPort6
[9]   - SKXPort7

Resource pressure per iteration:
[0]    [1]    [2]    [3]    [4]    [5]    [6]    [7]    [8]    [9]
 -      -     0.75   0.75    -      -      -     0.75   0.75    -

Resource pressure by instruction:
[0]    [1]    [2]    [3]    [4]    [5]    [6]    [7]    [8]    [9]    Instructions:
 -      -     0.24   0.25    -      -      -     0.26   0.25    -     incq  %r15
 -      -     0.25   0.25    -      -      -     0.25   0.25    -     addq  $4, %r13
 -      -     0.26   0.25    -      -      -     0.24   0.25    -     cmpq  $63, %r15
```

However, running this within `llvm-exegesis` (`llvm-exegesis -snippets-file=/tmp/test.s --mode=latency`) produces the following:

```
---
mode:            latency
key:
  instructions:
    - 'INC64r R15 R15'
    - 'ADD64ri8 R13 R13 i_0x4'
    - 'CMP64ri8 R15 i_0x3f'
  config:          ''
  register_initial_values:
    - 'R15=0x123456'
    - 'R13=0x123456'
cpu_name:        skylake-avx512
llvm_triple:     x86_64-pc-linux-gnu
min_instructions: 10000
measurements:
  - { key: latency, value: 0.4234, per_snippet_value: 1.26995, validation_counters: {} }
error:           ''
info:            ''
assembled_snippet: 4157415549BF563412000000000049BD563412000000000049FFC74983C5044983FF3F49FFC74983C5044983FF3F49FFC74983C5044983FF3F49FFC74983C5044983FF3F415D415FC3
...
```

The predicted throughput from `llvm-mca` is almost 40% less than the experimental value. UICA seems to agree with the experimental value, predicting 1.25 cycles/iteration as the reciprocal throughput.
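For concreteness, the gap works out as follows. This is just a quick arithmetic check using only the values reported in the two tools' output above:

```python
# Numbers taken verbatim from the llvm-mca and llvm-exegesis reports above.
mca_block_rthroughput = 0.8      # llvm-mca "Block RThroughput"
mca_cycles_per_iter = 104 / 100  # llvm-mca Total Cycles / Iterations
measured = 1.26995               # llvm-exegesis per_snippet_value (cycles/iteration)
uica = 1.25                      # UICA's predicted cycles/iteration

# Block RThroughput underestimates the measurement by ~37%, i.e. "almost 40%".
print(round((measured - mca_block_rthroughput) / measured, 2))  # 0.37

# The simulated cycles-per-iteration figure, by contrast, is close to 1.0.
print(round(mca_cycles_per_iter, 2))  # 1.04
```

Note also that the measured per-snippet value is roughly three times the per-instruction `value` (0.4234 × 3 ≈ 1.27), since the snippet contains three instructions.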

llvmbot commented 4 months ago

@llvm/issue-subscribers-tools-llvm-mca

Author: Aiden Grossman (boomanaiden154)

topperc commented 4 months ago

MCA is reporting iterations/cycle of nearly 1.0 right? 100 iterations / 104 cycles.

Naively that seems right to me. The incq on each iteration is dependent on the previous one. The addq on each iteration is dependent on the previous one. The cmpq depends on the inc. Nothing depends on the cmpq. There are 4 ALUs available to each operation.

On the first cycle the ALUs can do one incq and one addq. On the second cycle the ALUs can do the incq and addq of the second iteration and the cmpq from the first iteration. On the third cycle the ALUs can do the incq and addq of the third iteration and the cmpq from the second iteration, etc. At the very end we need to do one cmpq by itself.

The iterations per cycle for that should be very nearly 1.
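The hand analysis above can be sketched with a small greedy scheduler. This is a minimal model, not llvm-mca's actual pipeline: it assumes latency-1 uops and a shared pool of four interchangeable ALU ports, with no front-end modeled:

```python
from collections import defaultdict

def simulate(iterations, num_ports=4, latency=1):
    """Greedy model: a uop executes in the first cycle where its inputs
    are ready and one of the pooled, interchangeable ports is free."""
    used = defaultdict(int)  # cycle -> number of ports already taken

    def issue(earliest):
        c = earliest
        while used[c] >= num_ports:
            c += 1
        used[c] += 1
        return c  # cycle in which the uop executes

    inc_res = add_res = 0  # cycle at which each chain's value is available
    last = 0
    for _ in range(iterations):
        e_inc = issue(inc_res); inc_res = e_inc + latency  # incq: serial chain
        e_add = issue(add_res); add_res = e_add + latency  # addq: serial chain
        e_cmp = issue(inc_res)                             # cmpq: waits on incq
        last = max(e_inc, e_add, e_cmp)
    return last + 1  # total cycles (cycles are numbered from 0)

print(simulate(100))  # 101 -> ~1 iteration/cycle, matching the analysis
```

Under this pooled-port assumption only three uops compete for four ports in any cycle, so nothing ever stalls on a port and throughput is limited purely by the incq dependency chain.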

boomanaiden154 commented 4 months ago

> MCA is reporting iterations/cycle of nearly 1.0 right? 100 iterations / 104 cycles.

Yes. I misunderstood what the reciprocal throughput field was representing.

> Naively that seems right to me. The incq on each iteration is dependent on the previous one. The addq on each iteration is dependent on the previous one. The cmpq depends on the inc. Nothing depends on the cmpq. There are 4 ALUs available to each operation.
>
> On the first cycle the ALUs can do one incq and one addq. On the second cycle the ALUs can do the incq and addq of the second iteration and the cmpq from the first iteration. On the third cycle the ALUs can do the incq and addq of the third iteration and the cmpq from the second iteration, etc. At the very end we need to do one cmpq by itself.
>
> The iterations per cycle for that should be very nearly 1.

Right. That all makes sense to me. Looking at all the scheduling information for these instructions, it seems correct to me. CMP64ri8 seems to use the default scheduling class (https://gist.github.com/boomanaiden154/6417e88d67a0facf7995447be74cf7bc), which seems odd to me, but other than that, everything looks good.

However, the benchmark clearly shows 1.25 cycles/iteration, and UICA supports that. I still haven't figured out why UICA is reporting numbers that are so different.

boomanaiden154 commented 4 months ago

I spoke with Andreas Abel about this issue, and the main bottleneck is non-optimal port assignment by the renamer. Looking at the UICA trace, in roughly every iteration there is an instruction waiting to be dispatched because it gets assigned to the same port as another uop and their dispatch cycles would otherwise overlap. This bumps the reciprocal throughput up to 1.25 cycles per iteration.
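That effect can be illustrated with a toy port-binding simulator. This is not Skylake's actual renamer heuristic: the `unlucky` assignment below is a hypothetical collision pattern, chosen so that every fourth cmpq is bound to the same port as the next incq, which is enough to reproduce the observed 1.25 cycles/iteration:

```python
def simulate_bound(iterations, assign):
    """assign(kind, i) -> port chosen at 'rename' time for uop `kind` of
    iteration i. Each port runs at most one ready uop per cycle (oldest
    first); results have a one-cycle latency. Purely illustrative."""
    uops = []  # (port, list of producer uop ids)
    for i in range(iterations):
        inc = len(uops)
        uops.append((assign('inc', i), [inc - 3] if i else []))  # incq chain
        uops.append((assign('add', i), [inc - 2] if i else []))  # addq chain
        uops.append((assign('cmp', i), [inc]))                   # cmpq <- incq

    done = [None] * len(uops)  # cycle in which each uop executed
    remaining = set(range(len(uops)))
    cycle = 0
    while remaining:
        taken_ports = set()
        for u in sorted(remaining):  # oldest uop gets its port first
            port, deps = uops[u]
            if port in taken_ports:
                continue  # a bound-port collision: this uop slips a cycle
            if all(done[d] is not None and done[d] < cycle for d in deps):
                done[u] = cycle
                taken_ports.add(port)
        remaining = {u for u in remaining if done[u] is None}
        cycle += 1
    return cycle

# Ideal binding (round-robin over four ports): no collisions, ~1 cycle/iter.
ideal = lambda kind, i: (3 * i + ('inc', 'add', 'cmp').index(kind)) % 4
# Hypothetical unlucky binding: every 4th cmpq shares a port with the next
# incq, so the incq chain loses one cycle per four iterations.
unlucky = lambda kind, i: {'inc': 0, 'add': 1,
                           'cmp': 0 if i % 4 == 3 else 2}[kind]

print(simulate_bound(100, ideal) / 100)    # 1.01
print(simulate_bound(100, unlucky) / 100)  # 1.25
```

The contrast with the `ideal` binding shows why a pooled-port view of the machine lands near 1.0 cycles/iteration while an early, occasionally unlucky binding lands at 1.25.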

Given that llvm-mca only models instruction dispatch rather than predecode/uop issue, I don't think this is a trivial issue to fix.