Open 39a49f0c-871e-4c51-9f00-984198e47929 opened 6 years ago
-force-precise-rotation-cost makes situation worse.
As it should be in theory.
Hello all, just FYI I have tested the same test-case on an aarch64 host and in this case -force-precise-rotation-cost makes situation worse. See below:
-O3: 16.6949 -O3 -fuse-profile=test.profdata: 16.7143 -O3 -fuse-profile=test.profdata -mllvm -force-precise-rotation-cost: 16.7963
$ perf stat -e branches:u ./test.fdo 4995000000000
Performance counter stats for './test.fdo':
1,280,525,995 branches:u
3.458138487 seconds time elapsed
$ perf stat -e branches:u ./test.fdo.rot 4995000000000
Performance counter stats for './test.fdo.rot':
1,280,525,975 branches:u
Actually the number of branches and number of taken branches can be computed out from source code. The tiny difference is in a[i]==1, but the probability is only 1/10M, same as loop end condition.
You should look at #of retired branch instructions that are taken.
Loop rotation does change the BB layout of hot code. But I can't understand how it affect performance.
fdo without rotation 0.06 : 400b70: mov 0x4(%rax,%rbx,4),%edi a[i] 0.22 : 400b74: mov %rdx,%rbx 0.00 : 400b77: cmp $0x1,%edi 0.00 : 400b7a: je 400c22 <main+0x122> a[i] == 1? 0.15 : 400b80: jg 400c0c <main+0x10c> a[i] > 1? 0.10 : 400b86: movq %rcx,%xmm4 0.00 : 400b8b: pslldq $0x8,%xmm4 0.00 : 400b90: pxor %xmm7,%xmm7 0.00 : 400b94: mov $0x3e8,%edi 0.08 : 400b99: pxor %xmm6,%xmm6 0.00 : 400b9d: nopl (%rax) 4.71 : 400ba0: movdqa %xmm4,%xmm0 ... // hot() 0.16 : 400bf2: add $0xfffffff8,%edi 0.00 : 400bf5: jne 400ba0 <main+0xa0> 0.55 : 400bf7: paddq %xmm7,%xmm6 0.66 : 400bfb: pshufd $0x4e,%xmm6,%xmm0 0.22 : 400c00: paddq %xmm6,%xmm0 0.09 : 400c04: movq %xmm0,%rdx 0.10 : 400c09: add %rdx,%rsi 0.08 : 400c0c: lea 0x1(%rbx),%rdx 0.00 : 400c10: cmp $0x989680,%rdx 0.00 : 400c17: jne 400b70 <main+0x70> // loop end?
fdo with rotation 0.10 : 400b70: jg 400bfc <main+0xfc> // a[i] > 1? 0.10 : 400b76: movq %rcx,%xmm4 0.00 : 400b7b: pslldq $0x8,%xmm4 0.02 : 400b80: pxor %xmm7,%xmm7 0.01 : 400b84: mov $0x3e8,%edi 0.10 : 400b89: pxor %xmm6,%xmm6 0.00 : 400b8d: nopl (%rax) 4.70 : 400b90: movdqa %xmm4,%xmm0 ... // hot() 0.44 : 400be2: add $0xfffffff8,%edi 0.00 : 400be5: jne 400b90 <main+0x90> 0.48 : 400be7: paddq %xmm7,%xmm6 0.68 : 400beb: pshufd $0x4e,%xmm6,%xmm0 0.23 : 400bf0: paddq %xmm6,%xmm0 0.08 : 400bf4: movq %xmm0,%rdx 0.10 : 400bf9: add %rdx,%rsi 0.12 : 400bfc: lea 0x1(%rbx),%rdx 0.00 : 400c00: cmp $0x989680,%rdx 0.00 : 400c07: je 400ca2 <main+0x1a2> // loop end? 0.07 : 400c0d: mov 0x4(%rax,%rbx,4),%edi 0.12 : 400c11: mov %rdx,%rbx 0.00 : 400c14: cmp $0x1,%edi 0.00 : 400c17: jne 400b70 <main+0x70> // a[i] == 1?
The rotate version moves the block containing comparing equal to 1 to the bottom of hot path, but it doesn't change the number of taken branches, the number of branch instructions or any other number of instructions. perf shows rotate version has less front end stalls
$ perf stat ./test.fdo.rot 4995000000000
Performance counter stats for './test.fdo.rot':
3051.571863 task-clock (msec) # 1.000 CPUs utilized
6 context-switches # 0.002 K/sec
0 cpu-migrations # 0.000 K/sec
9,875 page-faults # 0.003 M/sec
9,623,276,881 cycles # 3.154 GHz (83.34%)
2,064,152,736 stalled-cycles-frontend # 21.45% frontend cycles idle (83.36%)
134,561,132 stalled-cycles-backend # 1.40% backend cycles idle (66.70%)
27,705,343,976 instructions # 2.88 insn per cycle
# 0.07 stalled cycles per insn (83.35%)
1,282,561,995 branches # 420.296 M/sec (83.35%)
10,020,162 branch-misses # 0.78% of all branches (83.27%)
3.052426172 seconds time elapsed
$ perf stat ./test.fdo 4995000000000
Performance counter stats for './test.fdo':
3259.637733 task-clock (msec) # 1.000 CPUs utilized
5 context-switches # 0.002 K/sec
0 cpu-migrations # 0.000 K/sec
9,876 page-faults # 0.003 M/sec
9,898,963,712 cycles # 3.037 GHz (83.31%)
2,323,725,288 stalled-cycles-frontend # 23.47% frontend cycles idle (83.32%)
136,139,176 stalled-cycles-backend # 1.38% backend cycles idle (66.62%)
27,685,299,856 instructions # 2.80 insn per cycle
# 0.08 stalled cycles per insn (83.32%)
1,282,753,069 branches # 393.526 M/sec (83.42%)
10,027,361 branch-misses # 0.78% of all branches (83.37%)
3.260484104 seconds time elapsed
Could anybody help to explain the performance difference?
Carrot, please examine this case to verify if loop rotation is the cause of the problem. You can also test with your rotation patch.
I can reproduce the performance improvement with -mllvm -force-precise-rotation-cost option:
step:
clang++ -O3 -fprofile-generate test.cpp ./a.out llvm_profdata merge -o test.profdata default_5714532104544469349_0.profraw
// with new rotation mclang++ -O3 -mllvm -force-precise-rotation-cost -fprofile-use=./test.profdata -o test.ir.rot test.cpp
// without mclang++ -O3 -fprofile-use=./test.profdata -o test.ir test.cpp
time ./test.ir
4995000000000
real 0m2.656s user 0m2.648s sys 0m0.004s
time ./test.ir.rot 4995000000000
real 0m2.467s user 0m2.464s sys 0m0.000s
Extended Description
I've been testing PGO with the code from the thread: http://lists.llvm.org/pipermail/llvm-dev/2016-May/099395.html
(Code is attached).
And it appeared that using PGO (either -fprofile-instr or -fprofile) lead to performance degradation. See the results of
{ for i in
seq 1 100; do time ./test > /dev/null ; done ; } 2>> perf.log
below:clang -O3: 3.12253 clang -O3 + -fprofile-use: 3.14982 clang -O3 + -fprofile-use + -mllvm -force-precise-rotation-cost : 3.14189 gcc -O3: 3.38019 gcc -O3 + -fprofile-use: 2.937
clang/llvm were built from release_60 branch, gcc version was 7.3.0.