Open · ryao opened this issue 1 year ago
The problem is mischaracterized (and not specific to Zen 3). The issue is that movb $30, %cl is not a dependency-breaking instruction (AMD and recent Intels do not rename low 8-bit partial registers separately, and since the instruction keeps bits 8-31 of %ecx unmodified, this creates a dependency on the previous value of %ecx). Since the movb is heading a critical dependency chain, the issue is very noticeable.
You'll see the same speedup if you insert a dependency-breaking xor %ecx, %ecx before the offending movb.
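As a minimal sketch of what this means in practice (only the movb $30, %cl is taken from the report; the subb consumer and its operands are hypothetical):

    # Before: movb writes only %cl, so it has to merge with the old value of
    # %ecx, and the whole chain behind it waits on whatever last wrote %ecx.
    movb    $30, %cl
    subb    %al, %cl              # hypothetical consumer on the critical chain

    # After: the xor zeroing idiom is dependency-breaking, so the movb (and
    # its consumers) no longer wait on the previous producer of %ecx.
    xorl    %ecx, %ecx
    movb    $30, %cl
    subb    %al, %cl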
@amonakov You are right. Inserting xor %ecx, %ecx before movb also lets us save 1 byte of space versus using movl.
Would the solution be for LLVM to emit an xor before emitting movb whenever the previous value of the wider register is not needed?
Also, would it be expected that llvm-mca does not report a dependency issue here?
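For reference, the encodings behind those size numbers (standard x86-64 encodings for these exact operands):

    movl    $30, %ecx             # b9 1e 00 00 00  -> 5 bytes
    movb    $30, %cl              # b1 1e           -> 2 bytes (3 bytes smaller, but false dependency)
    xorl    %ecx, %ecx            # 31 c9           -> 2 bytes
    movb    $30, %cl              # b1 1e           -> 2 bytes (xor + movb = 4 bytes, 1 byte smaller than movl)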
Out of curiosity, I tested movw+subw. Performance was identical to movb+subb.
"Also, would it be expected that llvm-mca does not report a dependency issue here?"
That's not expected. On x86 we model partial register updates, and there are tests for it. If you think there is a problem with it, then I suggest filing a separate mca bug.
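For anyone who wants to double-check locally, an invocation along these lines (the input file name is a placeholder) prints llvm-mca's view of the snippet, including a dependency timeline:

    $ llvm-mca -mcpu=znver3 -timeline -iterations=1 search.s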
"Slightly smaller code size is not worth it when it kills performance on a popular AMD64 processor family."
So does it happen on Intel?
"AMD and recent Intels do not rename low 8-bit partial registers separately"
Is that a bug in silicon? The popcnt false dependency was fixed by Intel on newer CPUs (not fixed in clang yet, see https://github.com/llvm/llvm-project/issues/33216). Or did Intel make its silicon behave more like AMD's?
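For comparison, the usual workaround for that popcnt erratum is the same zeroing idiom discussed above (sketch only; the register choice is arbitrary):

    xorl    %eax, %eax            # break the false dependency on the destination
    popcntl %ecx, %eax            # affected Intel cores otherwise wait on the old %eax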
Lately, I have been trying to micro-optimize a binary search function that operates on 4 KB arrays. Some experiments with loop unrolling yielded three alternative implementations.
https://gcc.godbolt.org/z/79rq7sqcf
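(The actual v1/v2/v3 sources are at the link above. Purely as a hypothetical sketch of the kind of routine involved, and not the code from the link: a 4 KB array of uint32_t holds 1024 entries, so the search needs exactly 10 halving steps, which is what makes full unrolling attractive.)

    #include <stdint.h>
    #include <stddef.h>

    /* Hypothetical sketch only: searches a sorted 1024-entry (4 KB) array.
     * The loop runs exactly 10 times, so unrolled variants can replace it
     * with straight-line (or switch-dispatched) copies of the body. */
    static size_t search_u32(const uint32_t *a, uint32_t key)
    {
        size_t lo = 0, n = 1024;
        while (n > 1) {
            size_t half = n / 2;
            if (a[lo + half] <= key)      /* the upper half can still hold key */
                lo += half;
            n -= half;
        }
        return a[lo] == key ? lo : (size_t)-1;
    }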
They all should perform similarly, yet on my Ryzen 7 5800X, they do not:
I apply the following patch to the generated assembly, then assemble and link the result:
Now, when I rerun the micro-benchmark, all 3 perform similarly:
If I change the patch to only change subb to subl, the performance remains unchanged.

If I compile with GCC 12.2, I see an even better execution time on the last one:
That is unsurprising, considering that GCC emits fewer instructions for the switch statement body in v3, while it emits many more in v1 and v2. Here is the first case statement from GCC for v3:
And the first case statement from LLVM for v3:
In any case, passing -mtune=znver3 does not stop LLVM from using movb and subb. Whatever optimization pass is opportunistically lowering operations to byte operations should be made to stop doing that on AMD64. Slightly smaller code size is not worth it when it kills performance on a popular AMD64 processor family. In this case, using movb saves 3 bytes, while subb and subl are the same number of bytes.
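For concreteness, this is the kind of invocation the tuning remark refers to (the exact flags beyond -O2 and -mtune=znver3 are my guess; file names are placeholders):

    $ clang -O2 -mtune=znver3 -S search.c -o search.s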