PGO leads to performance degradation

39a49f0c-871e-4c51-9f00-984198e47929 commented 6 years ago


Bugzilla Link	36303
Version	6.0
OS	All
Attachments	Code sample
CC	@weiguozhi,@vns-mn,@Theodor,@vleschuk

Extended Description

I've been testing PGO with the code from the thread: http://lists.llvm.org/pipermail/llvm-dev/2016-May/099395.html

(Code is attached).

And it appeared that using PGO (either -fprofile-instr or -fprofile) lead to performance degradation. See the results of { for i inseq 1 100; do time ./test > /dev/null ; done ; } 2>> perf.log below:

clang -O3: 3.12253 clang -O3 + -fprofile-use: 3.14982 clang -O3 + -fprofile-use + -mllvm -force-precise-rotation-cost : 3.14189 gcc -O3: 3.38019 gcc -O3 + -fprofile-use: 2.937

clang/llvm were built from release_60 branch, gcc version was 7.3.0.

39a49f0c-871e-4c51-9f00-984198e47929 commented 6 years ago

-force-precise-rotation-cost makes situation worse.

As it should be in theory.

39a49f0c-871e-4c51-9f00-984198e47929 commented 6 years ago

Hello all, just FYI I have tested the same test-case on an aarch64 host and in this case -force-precise-rotation-cost makes situation worse. See below:

-O3: 16.6949 -O3 -fuse-profile=test.profdata: 16.7143 -O3 -fuse-profile=test.profdata -mllvm -force-precise-rotation-cost: 16.7963

weiguozhi commented 6 years ago

$ perf stat -e branches:u ./test.fdo 4995000000000

Performance counter stats for './test.fdo':

 1,280,525,995      branches:u                                                  

   3.458138487 seconds time elapsed

$ perf stat -e branches:u ./test.fdo.rot 4995000000000

Performance counter stats for './test.fdo.rot':

 1,280,525,975      branches:u

Actually the number of branches and number of taken branches can be computed out from source code. The tiny difference is in a[i]==1, but the probability is only 1/10M, same as loop end condition.

david-xl commented 6 years ago

You should look at #of retired branch instructions that are taken.

weiguozhi commented 6 years ago

Loop rotation does change the BB layout of hot code. But I can't understand how it affect performance.

fdo without rotation 0.06 : 400b70: mov 0x4(%rax,%rbx,4),%edi a[i] 0.22 : 400b74: mov %rdx,%rbx 0.00 : 400b77: cmp $0x1,%edi 0.00 : 400b7a: je 400c22 <main+0x122> a[i] == 1? 0.15 : 400b80: jg 400c0c <main+0x10c> a[i] > 1? 0.10 : 400b86: movq %rcx,%xmm4 0.00 : 400b8b: pslldq $0x8,%xmm4 0.00 : 400b90: pxor %xmm7,%xmm7 0.00 : 400b94: mov $0x3e8,%edi 0.08 : 400b99: pxor %xmm6,%xmm6 0.00 : 400b9d: nopl (%rax) 4.71 : 400ba0: movdqa %xmm4,%xmm0 ... // hot() 0.16 : 400bf2: add $0xfffffff8,%edi 0.00 : 400bf5: jne 400ba0 <main+0xa0> 0.55 : 400bf7: paddq %xmm7,%xmm6 0.66 : 400bfb: pshufd $0x4e,%xmm6,%xmm0 0.22 : 400c00: paddq %xmm6,%xmm0 0.09 : 400c04: movq %xmm0,%rdx 0.10 : 400c09: add %rdx,%rsi 0.08 : 400c0c: lea 0x1(%rbx),%rdx 0.00 : 400c10: cmp $0x989680,%rdx 0.00 : 400c17: jne 400b70 <main+0x70> // loop end?

fdo with rotation 0.10 : 400b70: jg 400bfc <main+0xfc> // a[i] > 1? 0.10 : 400b76: movq %rcx,%xmm4 0.00 : 400b7b: pslldq $0x8,%xmm4 0.02 : 400b80: pxor %xmm7,%xmm7 0.01 : 400b84: mov $0x3e8,%edi 0.10 : 400b89: pxor %xmm6,%xmm6 0.00 : 400b8d: nopl (%rax) 4.70 : 400b90: movdqa %xmm4,%xmm0 ... // hot() 0.44 : 400be2: add $0xfffffff8,%edi 0.00 : 400be5: jne 400b90 <main+0x90> 0.48 : 400be7: paddq %xmm7,%xmm6 0.68 : 400beb: pshufd $0x4e,%xmm6,%xmm0 0.23 : 400bf0: paddq %xmm6,%xmm0 0.08 : 400bf4: movq %xmm0,%rdx 0.10 : 400bf9: add %rdx,%rsi 0.12 : 400bfc: lea 0x1(%rbx),%rdx 0.00 : 400c00: cmp $0x989680,%rdx 0.00 : 400c07: je 400ca2 <main+0x1a2> // loop end? 0.07 : 400c0d: mov 0x4(%rax,%rbx,4),%edi 0.12 : 400c11: mov %rdx,%rbx 0.00 : 400c14: cmp $0x1,%edi 0.00 : 400c17: jne 400b70 <main+0x70> // a[i] == 1?

The rotate version moves the block containing comparing equal to 1 to the bottom of hot path, but it doesn't change the number of taken branches, the number of branch instructions or any other number of instructions. perf shows rotate version has less front end stalls

$ perf stat ./test.fdo.rot 4995000000000

Performance counter stats for './test.fdo.rot':

   3051.571863      task-clock (msec)         #    1.000 CPUs utilized          
             6      context-switches          #    0.002 K/sec                  
             0      cpu-migrations            #    0.000 K/sec                  
         9,875      page-faults               #    0.003 M/sec                  
 9,623,276,881      cycles                    #    3.154 GHz                      (83.34%)
 2,064,152,736      stalled-cycles-frontend   #   21.45% frontend cycles idle     (83.36%)
   134,561,132      stalled-cycles-backend    #    1.40% backend cycles idle      (66.70%)
27,705,343,976      instructions              #    2.88  insn per cycle         
                                              #    0.07  stalled cycles per insn  (83.35%)
 1,282,561,995      branches                  #  420.296 M/sec                    (83.35%)
    10,020,162      branch-misses             #    0.78% of all branches          (83.27%)

   3.052426172 seconds time elapsed

$ perf stat ./test.fdo 4995000000000

Performance counter stats for './test.fdo':

   3259.637733      task-clock (msec)         #    1.000 CPUs utilized          
             5      context-switches          #    0.002 K/sec                  
             0      cpu-migrations            #    0.000 K/sec                  
         9,876      page-faults               #    0.003 M/sec                  
 9,898,963,712      cycles                    #    3.037 GHz                      (83.31%)
 2,323,725,288      stalled-cycles-frontend   #   23.47% frontend cycles idle     (83.32%)
   136,139,176      stalled-cycles-backend    #    1.38% backend cycles idle      (66.62%)
27,685,299,856      instructions              #    2.80  insn per cycle         
                                              #    0.08  stalled cycles per insn  (83.32%)
 1,282,753,069      branches                  #  393.526 M/sec                    (83.42%)
    10,027,361      branch-misses             #    0.78% of all branches          (83.37%)

   3.260484104 seconds time elapsed

Could anybody help to explain the performance difference?

david-xl commented 6 years ago

Carrot, please examine this case to verify if loop rotation is the cause of the problem. You can also test with your rotation patch.

david-xl commented 6 years ago

I can reproduce the performance improvement with -mllvm -force-precise-rotation-cost option:

step:

clang++ -O3 -fprofile-generate test.cpp ./a.out llvm_profdata merge -o test.profdata default_5714532104544469349_0.profraw

// with new rotation mclang++ -O3 -mllvm -force-precise-rotation-cost -fprofile-use=./test.profdata -o test.ir.rot test.cpp

// without mclang++ -O3 -fprofile-use=./test.profdata -o test.ir test.cpp

time ./test.ir

4995000000000

real 0m2.656s user 0m2.648s sys 0m0.004s

time ./test.ir.rot 4995000000000

real 0m2.467s user 0m2.464s sys 0m0.000s

llvm / llvm-project

PGO leads to performance degradation #35651

Extended Description