Closed: RKSimon closed this pull request 5 months ago
@llvm/pr-subscribers-backend-x86
@llvm/pr-subscribers-llvm-transforms
Author: Simon Pilgrim (RKSimon)
I didn't really get why it makes sense to multiply the uop fetch rate by the mispredict penalty.
Also, while znver4 might in theory support 9 uops per cycle, isn't it limited by the 6-wide renamer in practice?
> isn't it limited by the 6-wide renamer in practice?
I am not sure why you mention the renamer. It has a role to play in throughput, but the limit is not strictly the renamer's capacity, right? The renamer's effective width will vary case by case, but the theoretical limit is 9 uops. Also, renaming can be split across the integer and FP (vector) domains. So I don't think we should restrict this value to the renamer's capability.
@RKSimon I think we should add a tuning flag controlling whether a subtarget is willing to use LoopMicroOpBufferSize for unrolling decisions. I agree that the metric you are proposing serves the purpose, but the term LoopMicroOpBufferSize itself is misleading and not representative.
We shouldn't need a TLI/TTI control for this: either removing the LoopMicroOpBufferSize entry (see znver1/2) or explicitly setting it to 0 has a similar effect. But I'm not certain whether we want to do this for znver3/4; I don't have access to hardware to test it.
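For context, this mirrors (approximately) how the generic `BasicTTIImplBase::getUnrollingPreferences` consumes the scheduler model entry. The standalone sketch below illustrates that behaviour; it is not LLVM's actual code, and the struct/function names here are illustrative only:

```cpp
#include <cstdio>

// Standalone sketch (not LLVM's actual code) of why dropping the
// LoopMicroOpBufferSize entry and setting it to 0 behave alike: a
// zero/missing value leaves partial and runtime unrolling disabled.
struct UnrollingPreferences {
  bool Partial = false;
  bool Runtime = false;
  unsigned PartialThreshold = 0;
};

void applyLoopMicroOpBufferSize(unsigned LoopMicroOpBufferSize,
                                UnrollingPreferences &UP) {
  if (LoopMicroOpBufferSize == 0)
    return; // No entry (znver1/2): partial unrolling stays off.
  UP.Partial = UP.Runtime = true;
  UP.PartialThreshold = LoopMicroOpBufferSize;
}

int main() {
  UnrollingPreferences Zn2, Zn3;
  applyLoopMicroOpBufferSize(0, Zn2);   // znver1/2: entry removed
  applyLoopMicroOpBufferSize(512, Zn3); // znver3/4: current value
  std::printf("znver2: partial=%d threshold=%u\n", Zn2.Partial,
              Zn2.PartialThreshold);
  std::printf("znver3: partial=%d threshold=%u\n", Zn3.Partial,
              Zn3.PartialThreshold);
  return 0;
}
```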
Would people prefer we just drop the LoopMicroOpBufferSize entry from the znver3/4 models (same as znver1/2)? This prevents most loop unrolling, and we then rely on the CPU op cache's higher decode rate to get higher performance (but we end up testing every loop).
I have at least one counterexample that gains from the LoopMicroOpBufferSize setting we have for znver3/4. Let us go with your derivation of the LoopMicroOpBufferSize metric based on the misprediction penalty.
matrix_vector_mul.zip
With -unroll-runtime, we can see the attached example gaining. So let us go with the metric @RKSimon suggested.
Thanks @ganeshgit - are you happy to accept this patch as it is then?
@RKSimon Sure, thanks a lot! LGTM!
The znver3/4 scheduler models previously associated LoopMicroOpBufferSize with the maximum size of their op caches; when this led to quadratic complexity issues, the value was reduced to 512 uops, based mainly on compilation time and not on its effectiveness on runtime performance.
From a runtime performance POV, a large LoopMicroOpBufferSize leads to a higher number of loop unrolls, meaning the CPU has to rely on the frontend decode rate (4 ins/cy max) for much longer to fill the op cache before looping begins and we can make use of the faster op cache rate (8/9 ops/cy).
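To make that tradeoff concrete, here is a toy calculation using only the numbers above (4 ins/cy decode vs 8 ops/cy from the op cache on znver3), assuming one op per instruction for simplicity; an illustration, not a cycle-accurate model:

```cpp
#include <cstdio>

// Toy model of the cold-start cost described above: the first pass
// over an unrolled loop body is limited by the decoders (4 ins/cy);
// only once the body is op-cache resident do later passes run at the
// op cache rate (8 ops/cy on znver3). Assumes 1 uop per instruction.
int main() {
  const unsigned DecodeRate = 4, OpCacheRate = 8;
  const unsigned Bodies[] = {64, 512}; // small vs heavily unrolled body
  for (unsigned BodyOps : Bodies) {
    unsigned FirstPassCycles = BodyOps / DecodeRate;  // fills op cache
    unsigned WarmPassCycles = BodyOps / OpCacheRate;  // later passes
    std::printf("body=%u ops: first pass ~%ucy, warm pass ~%ucy\n",
                BodyOps, FirstPassCycles, WarmPassCycles);
  }
  return 0;
}
```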
This patch proposes we instead cap LoopMicroOpBufferSize based on the maximum rate from the op cache (znver3 = 8 ops/cy, znver4 = 9 ops/cy) multiplied by the branch misprediction penalty from the op cache (~12cy), as an estimate of the useful number of ops we can unroll a loop by before mispredictions are likely to cause stalls. This isn't a perfect metric, but it tries to be closer to the spirit of how the compiler uses LoopMicroOpBufferSize, rather than to the size of a similarly named buffer in the CPU.
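For concreteness, a minimal sketch of the proposed arithmetic; the rates and the ~12cy penalty are the figures quoted above, and the computed caps are simply their product, not necessarily the exact values committed:

```cpp
#include <cstdio>

// Minimal sketch of the proposed metric: cap LoopMicroOpBufferSize at
// (op cache dispatch rate) * (op cache mispredict penalty), i.e. the
// rough number of ops the op cache can deliver before a likely branch
// mispredict stalls the pipeline. Inputs are the figures from the PR
// text; the resulting caps are illustrative.
static unsigned loopMicroOpBufferCap(unsigned OpCacheOpsPerCycle,
                                     unsigned MispredictPenaltyCycles) {
  return OpCacheOpsPerCycle * MispredictPenaltyCycles;
}

int main() {
  std::printf("znver3 cap: %u uops\n", loopMicroOpBufferCap(8, 12)); // 96
  std::printf("znver4 cap: %u uops\n", loopMicroOpBufferCap(9, 12)); // 108
  return 0;
}
```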