Author: Aiden Grossman (boomanaiden154)
MCA is reporting iterations/cycle of nearly 1.0, right? 100 iterations / 104 cycles.
Naively that seems right to me. The incq on each iteration depends on the previous iteration's incq, and the addq on each iteration depends on the previous iteration's addq. The cmpq depends on the incq. Nothing depends on the cmpq. There are 4 ALUs available to each operation.
On the first cycle the ALUs can do one incq and one addq. On the second cycle the ALUs can do the incq and addq of the second iteration plus the cmpq from the first iteration. On the third cycle the ALUs can do the incq and addq of the third iteration plus the cmpq from the second iteration, and so on. At the very end we need to do one cmpq by itself.
The iterations per cycle for that should be very nearly 1.
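Concretely, the ideal steady-state schedule would look something like this (a sketch of the dependency-limited case, not actual MCA output; the handful of extra cycles in the 104-cycle figure are presumably pipeline fill/drain):

```
cycle 1:   incq(1),   addq(1)
cycle 2:   incq(2),   addq(2),   cmpq(1)
cycle 3:   incq(3),   addq(3),   cmpq(2)
...
cycle 100: incq(100), addq(100), cmpq(99)
cycle 101: cmpq(100)
```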
> MCA is reporting iterations/cycle of nearly 1.0, right? 100 iterations / 104 cycles.
Yes. I misunderstood what the reciprocal throughput field was representing.
> Naively that seems right to me. The incq on each iteration depends on the previous iteration's incq, and the addq on each iteration depends on the previous iteration's addq. The cmpq depends on the incq. Nothing depends on the cmpq. There are 4 ALUs available to each operation.
> On the first cycle the ALUs can do one incq and one addq. On the second cycle the ALUs can do the incq and addq of the second iteration plus the cmpq from the first iteration. On the third cycle the ALUs can do the incq and addq of the third iteration plus the cmpq from the second iteration, and so on. At the very end we need to do one cmpq by itself.
> The iterations per cycle for that should be very nearly 1.
Right. That all makes sense to me. Looking at the scheduling information for these instructions, it all seems correct to me. CMP64ri8 seems to use the default scheduling class (https://gist.github.com/boomanaiden154/6417e88d67a0facf7995447be74cf7bc), which seems odd to me, but other than that everything looks good.
However, the benchmark clearly shows 1.25 cycles/iteration, and UICA agrees with that. I still haven't figured out why the numbers UICA reports are so different from what MCA predicts.
I spoke with Andreas Abel about this issue, and the main bottleneck is non-optimal port assignment by the renamer. Looking at the UICA trace, roughly once per iteration a uop has to wait to be dispatched because the renamer assigned it to the same port as another uop whose dispatch cycle it would otherwise overlap with. This bumps the reciprocal throughput up to 1.25 cycles per iteration.
Given that llvm-mca only models instruction dispatch rather than predecode/uop issue (which is where the renamer's port assignment happens), I don't think this is a trivial issue to fix.
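For anyone trying to reproduce the UICA numbers and trace referred to above, the invocation would be roughly the following (flags from memory, so treat them as an assumption rather than the exact command that was run):

```
# Assumed uiCA invocation; exact flags may differ from what was actually used.
./uiCA.py /tmp/test.asm -arch SKL
# Adding "-trace trace.html" additionally emits the per-uop dispatch trace (also an assumption).
```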
Take the small snippet:
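(The snippet itself did not survive in this copy of the issue. Based on the dependency pattern discussed above -- a loop-carried incq, a loop-carried addq, and a cmpq that consumes the incq result -- it was presumably something along these lines; the register choices here are my own guess:)

```
# Hypothetical reconstruction of the missing snippet -- only the dependency
# pattern is taken from the discussion; the registers are illustrative.
incq %rax          # loop-carried dependency through %rax
addq %rcx, %rdx    # loop-carried dependency through %rdx
cmpq %rbx, %rax    # depends on the incq; nothing consumes its flags here
```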
Running this through MCA on skylake/skylake-avx512 and comparing against llvm-exegesis (llvm-exegesis -snippets-file=/tmp/test.s --mode=latency), the predicted throughput from llvm-mca is almost 40% less than the experimental value. UICA seems to agree with the experimental value, predicting 1.25 cycles/iteration as the reciprocal throughput.
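For reference, the two runs being compared were presumably invoked roughly as follows; the llvm-exegesis flags are quoted from the issue, while the llvm-mca flags are my assumption:

```
# llvm-mca run (flags assumed; -mcpu selects skylake or skylake-avx512)
llvm-mca -mcpu=skylake -iterations=100 /tmp/test.s

# llvm-exegesis run, as quoted above
llvm-exegesis -snippets-file=/tmp/test.s --mode=latency
```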