markhun / 2023W-EFFPROG

Compiler optimization #2

Open johannesfelzmann opened 6 months ago

johannesfelzmann commented 6 months ago

Note: not all optimizations are controlled directly by a flag (more on that later).
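For the optimizations that are flag-controlled, gcc can list their state at a given level with `gcc -Q --help=optimizers -O2`. The sketch below diffs two such listings. The helper name `enabled_flags` is hypothetical, and the two sample listings are hand-written excerpts for illustration, not output captured from the gcc used for these measurements.

```python
def enabled_flags(listing: str) -> set[str]:
    """Return the -f flags marked [enabled] in a `gcc -Q --help=optimizers` listing."""
    flags = set()
    for line in listing.splitlines():
        parts = line.split()
        # Expected line shape: "-fsome-flag   [enabled]" or "-fsome-flag   [disabled]".
        if len(parts) == 2 and parts[0].startswith("-f") and parts[1] == "[enabled]":
            flags.add(parts[0])
    return flags


# Hand-written excerpts (assumption: real listings are much longer).
SAMPLE_O1 = """
  -fdce                       \t[enabled]
  -fgcse                      \t[disabled]
  -ftree-vrp                  \t[disabled]
"""
SAMPLE_O2 = """
  -fdce                       \t[enabled]
  -fgcse                      \t[enabled]
  -ftree-vrp                  \t[enabled]
"""

# Flags that are on in the second listing but not in the first.
added_by_o2 = enabled_flags(SAMPLE_O2) - enabled_flags(SAMPLE_O1)
print(sorted(added_by_o2))
```

With real listings in place of the samples, the same set difference shows which flags a level adds. It cannot show passes that have no flag of their own.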

For now, all measurements use gcc.

Stats are taken at the last commit on the main branch: f09e7abb9db914613d52914f904b646d25ca3c15

Profile with:

profile: magichex
    perf stat -e cycles:u -e instructions:u -e branches:u -e branch-misses:u -e L1-dcache-load-misses:u $(BIN_DIR)/magichex 4 3 14 33 30 34 39 6 24 20
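Each configuration in this thread is measured once. A possible refinement is to let perf repeat the run with its `-r` option, which reports the mean and variation across runs. The sketch below only builds the command line; `perf_command` is a hypothetical helper, and the binary path is taken from the perf output in this thread.

```python
import shlex

# Events and program arguments copied from the Makefile target above.
EVENTS = ["cycles:u", "instructions:u", "branches:u",
          "branch-misses:u", "L1-dcache-load-misses:u"]
ARGS = "4 3 14 33 30 34 39 6 24 20".split()


def perf_command(binary: str, repeat: int = 1) -> list[str]:
    """Build the perf stat argv; repeat > 1 adds `-r N` for repeated runs."""
    cmd = ["perf", "stat"]
    if repeat > 1:
        cmd += ["-r", str(repeat)]
    for event in EVENTS:
        cmd += ["-e", event]
    return cmd + [binary] + ARGS


cmd = perf_command("./bin/magichex", repeat=3)
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run` on a machine where perf is installed.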
johannesfelzmann commented 6 months ago

Using -Wall -O0:

Most optimizations are completely disabled at -O0.

Reduce compilation time and make debugging produce the expected results. This is the default.

===============================================================================================

40 solution(s), 15808871 leafs visited

Performance counter stats for './bin/magichex 4 3 14 33 30 34 39 6 24 20':

  817376542210      cycles:u                                                           
 2373885593213      instructions:u                   #    2.90  insn per cycle         
  334975956080      branches:u                                                         
    2513459182      branch-misses:u                  #    0.75% of all branches        
        341609      L1-dcache-load-misses                                              

 174.538960494 seconds time elapsed

 174.223831000 seconds user
   0.000000000 seconds sys
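The derived figures that perf prints (insn per cycle, branch miss rate) can be recomputed from the raw counters, which is handy when comparing runs in a script. A minimal sketch, using the -O0 numbers above; `parse_counters` is a hypothetical helper and the regular expression assumes the plain-text layout shown in this thread.

```python
import re

# Counter lines copied from the -O0 run above.
PERF_OUTPUT = """\
  817376542210      cycles:u
 2373885593213      instructions:u                   #    2.90  insn per cycle
  334975956080      branches:u
    2513459182      branch-misses:u                  #    0.75% of all branches
        341609      L1-dcache-load-misses
"""


def parse_counters(text: str) -> dict[str, int]:
    """Map event name (without the ':u' modifier) to its count."""
    counters = {}
    for line in text.splitlines():
        # [\w-]+ stops at ':' so "cycles:u" is stored as "cycles".
        match = re.match(r"\s*(\d+)\s+([\w-]+)", line)
        if match:
            counters[match.group(2)] = int(match.group(1))
    return counters


c = parse_counters(PERF_OUTPUT)
ipc = c["instructions"] / c["cycles"]
miss_rate = 100 * c["branch-misses"] / c["branches"]
print(f"{ipc:.2f} insn per cycle, {miss_rate:.2f}% branch misses")
```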
johannesfelzmann commented 6 months ago

Using -O1:

Optimize. Optimizing compilation takes somewhat more time, and a lot more memory for a large function.

With -O, the compiler tries to reduce code size and execution time, without performing any optimizations that take a great deal of compilation time.

-O turns on the following optimization flags:

-fauto-inc-dec -fbranch-count-reg -fcombine-stack-adjustments -fcompare-elim -fcprop-registers -fdce -fdefer-pop -fdelayed-branch -fdse -fforward-propagate -fguess-branch-probability -fif-conversion -fif-conversion2 -finline-functions-called-once -fipa-modref -fipa-profile -fipa-pure-const -fipa-reference -fipa-reference-addressable -fmerge-constants -fmove-loop-invariants -fmove-loop-stores -fomit-frame-pointer -freorder-blocks -fshrink-wrap -fshrink-wrap-separate -fsplit-wide-types -fssa-backprop -fssa-phiopt -ftree-bit-ccp -ftree-ccp -ftree-ch -ftree-coalesce-vars -ftree-copy-prop -ftree-dce -ftree-dominator-opts -ftree-dse -ftree-forwprop -ftree-fre -ftree-phiprop -ftree-pta -ftree-scev-cprop -ftree-sink -ftree-slsr -ftree-sra -ftree-ter -funit-at-a-time

===============================================================================================

40 solution(s), 15808871 leafs visited

Performance counter stats for './bin/magichex 4 3 14 33 30 34 39 6 24 20':

  325283970374      cycles:u                                                           
 1031621271606      instructions:u                   #    3.17  insn per cycle         
  312237946193      branches:u                                                         
    2010472438      branch-misses:u                  #    0.64% of all branches        
        130579      L1-dcache-load-misses                                              

  69.461152837 seconds time elapsed

  69.348018000 seconds user
   0.000000000 seconds sys
johannesfelzmann commented 6 months ago

Using -O2:

Optimize even more. GCC performs nearly all supported optimizations that do not involve a space-speed tradeoff. As compared to -O, this option increases both compilation time and the performance of the generated code.

-O2 turns on all optimization flags specified by -O1. It also turns on the following optimization flags:

-falign-functions -falign-jumps -falign-labels -falign-loops -fcaller-saves -fcode-hoisting -fcrossjumping -fcse-follow-jumps -fcse-skip-blocks -fdelete-null-pointer-checks -fdevirtualize -fdevirtualize-speculatively -fexpensive-optimizations -ffinite-loops -fgcse -fgcse-lm -fhoist-adjacent-loads -finline-functions -finline-small-functions -findirect-inlining -fipa-bit-cp -fipa-cp -fipa-icf -fipa-ra -fipa-sra -fipa-vrp -fisolate-erroneous-paths-dereference -flra-remat -foptimize-sibling-calls -foptimize-strlen -fpartial-inlining -fpeephole2 -freorder-blocks-algorithm=stc -freorder-blocks-and-partition -freorder-functions -frerun-cse-after-loop -fschedule-insns -fschedule-insns2 -fsched-interblock -fsched-spec -fstore-merging -fstrict-aliasing -fthread-jumps -ftree-builtin-call-dce -ftree-loop-vectorize -ftree-pre -ftree-slp-vectorize -ftree-switch-conversion -ftree-tail-merge -ftree-vrp -fvect-cost-model=very-cheap

Please note the warning under -fgcse about invoking -O2 on programs that use computed gotos.

===============================================================================================

40 solution(s), 15808871 leafs visited

Performance counter stats for './bin/magichex 4 3 14 33 30 34 39 6 24 20':

  247816256820      cycles:u                                                           
  829295125025      instructions:u                   #    3.35  insn per cycle         
  237230072600      branches:u                                                         
    1886587120      branch-misses:u                  #    0.80% of all branches        
         86474      L1-dcache-load-misses                                              

  52.834204198 seconds time elapsed

  52.834134000 seconds user
   0.000000000 seconds sys
johannesfelzmann commented 6 months ago

Using -O3:

Optimize yet more. -O3 turns on all optimizations specified by -O2 and also turns on the following optimization flags:

-fgcse-after-reload -fipa-cp-clone -floop-interchange -floop-unroll-and-jam -fpeel-loops -fpredictive-commoning -fsplit-loops -fsplit-paths -ftree-loop-distribution -ftree-partial-pre -funswitch-loops -fvect-cost-model=dynamic -fversion-loops-for-strides

===============================================================================================

40 solution(s), 15808871 leafs visited

Performance counter stats for './bin/magichex 4 3 14 33 30 34 39 6 24 20':

  241075933903      cycles:u                                                           
  783341289802      instructions:u                   #    3.25  insn per cycle         
  223305947371      branches:u                                                         
    2035596752      branch-misses:u                  #    0.91% of all branches        
         97152      L1-dcache-load-misses                                              

  51.382573401 seconds time elapsed

  51.382597000 seconds user
   0.000000000 seconds sys
johannesfelzmann commented 6 months ago

Using -O4:

GCC does not document a separate -O4 level; it treats levels above 3 like -O3.

40 solution(s), 15808871 leafs visited

Performance counter stats for './bin/magichex 4 3 14 33 30 34 39 6 24 20':

  241567620308      cycles:u                                                           
  783341289830      instructions:u                   #    3.24  insn per cycle         
  223305947399      branches:u                                                         
    2071550789      branch-misses:u                  #    0.93% of all branches        
        116037      L1-dcache-load-misses                                              

  51.502712639 seconds time elapsed

  51.502766000 seconds user
   0.000000000 seconds sys
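A quick way to judge whether two builds probably execute the same code is to compare their user-space instruction and branch counts, which vary far less between runs than cycles do. The sketch below does this for the -O3 and -O4 runs, with numbers copied from this thread. It is only a plausibility check; comparing the binaries themselves would be the conclusive test.

```python
# Counters copied from the -O3 and -O4 runs in this thread.
O3 = {"cycles": 241075933903, "instructions": 783341289802, "branches": 223305947371}
O4 = {"cycles": 241567620308, "instructions": 783341289830, "branches": 223305947399}


def relative_diff(a: int, b: int) -> float:
    """Absolute difference of a and b as a fraction of a."""
    return abs(a - b) / a


for name in O3:
    diff = O4[name] - O3[name]
    print(f"{name:>12}: diff {diff:>10}  relative {relative_diff(O3[name], O4[name]):.2e}")
```

Instruction and branch counts differ by a few dozen events out of hundreds of billions, while cycles differ by about 0.2%, which is in the range one would expect from run-to-run noise.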
johannesfelzmann commented 6 months ago

Using -Os:

Optimize for size. -Os enables all -O2 optimizations except those that often increase code size:

-falign-functions -falign-jumps -falign-labels -falign-loops -fprefetch-loop-arrays -freorder-blocks-algorithm=stc

It also enables -finline-functions, causes the compiler to tune for code size rather than execution speed, and performs further optimizations designed to reduce code size.

===============================================================================================

40 solution(s), 15808871 leafs visited

Performance counter stats for './bin/magichex 4 3 14 33 30 34 39 6 24 20':

  435653242040      cycles:u                                                           
 1077864954278      instructions:u                   #    2.47  insn per cycle         
  361459767992      branches:u                                                         
    2503122768      branch-misses:u                  #    0.69% of all branches        
        275109      L1-dcache-load-misses                                              

  92.988360805 seconds time elapsed

  92.860797000 seconds user
   0.000000000 seconds sys
johannesfelzmann commented 6 months ago

Using -Ofast:

Disregard strict standards compliance. -Ofast enables all -O3 optimizations. It also enables optimizations that are not valid for all standard-compliant programs. It turns on -ffast-math, -fallow-store-data-races and the Fortran-specific -fstack-arrays, unless -fmax-stack-var-size is specified, and -fno-protect-parens. It turns off -fsemantic-interposition.

===============================================================================================

40 solution(s), 15808871 leafs visited

Performance counter stats for './bin/magichex 4 3 14 33 30 34 39 6 24 20':

  241244790946      cycles:u                                                           
  783347728964      instructions:u                   #    3.25  insn per cycle         
  223306442705      branches:u                                                         
    2006716844      branch-misses:u                  #    0.90% of all branches        
        101913      L1-dcache-load-misses                                              

  51.498639996 seconds time elapsed

  51.423188000 seconds user
   0.000000000 seconds sys
johannesfelzmann commented 6 months ago

Using clang instead of gcc, without any flags:

===============================================================================================

40 solution(s), 15808871 leafs visited

Performance counter stats for './bin/magichex 4 3 14 33 30 34 39 6 24 20':

  312892299451      cycles:u                                                           
 1090702502642      instructions:u                   #    3.49  insn per cycle         
  321766720583      branches:u                                                         
    1801181908      branch-misses:u                  #    0.56% of all branches        
        129846      L1-dcache-load-misses                                              

  66.706721299 seconds time elapsed

  66.706517000 seconds user
   0.000000000 seconds sys
johannesfelzmann commented 6 months ago

Using -Og:

Optimize debugging experience. -Og should be the optimization level of choice for the standard edit-compile-debug cycle, offering a reasonable level of optimization while maintaining fast compilation and a good debugging experience. It is a better choice than -O0 for producing debuggable code because some compiler passes that collect debug information are disabled at -O0.

Like -O0, -Og completely disables a number of optimization passes so that individual options controlling them have no effect. Otherwise -Og enables all -O1 optimization flags except for those that may interfere with debugging:

-fbranch-count-reg -fdelayed-branch -fdse -fif-conversion -fif-conversion2 -finline-functions-called-once -fmove-loop-invariants -fmove-loop-stores -fssa-phiopt -ftree-bit-ccp -ftree-dse -ftree-pta -ftree-sra

===============================================================================================

40 solution(s), 15808871 leafs visited

Performance counter stats for './bin/magichex 4 3 14 33 30 34 39 6 24 20':

  314952645563      cycles:u                                                           
 1090702502777      instructions:u                   #    3.46  insn per cycle         
  321766720718      branches:u                                                         
    1875384085      branch-misses:u                  #    0.58% of all branches        
        103970      L1-dcache-load-misses                                              

  67.126468533 seconds time elapsed

  67.122475000 seconds user
   0.003999000 seconds sys
johannesfelzmann commented 6 months ago

Using -Wall -ggdb -O3 -fno-omit-frame-pointer:

===============================================================================================

40 solution(s), 15808871 leafs visited

Performance counter stats for './bin/magichex 4 3 14 33 30 34 39 6 24 20':

  234104166901      cycles:u                                                           
  784431849336      instructions:u                   #    3.35  insn per cycle         
  223305947014      branches:u                                                         
    1877173310      branch-misses:u                  #    0.84% of all branches        
        108678      L1-dcache-load-misses                                              

  50.035881801 seconds time elapsed

  49.897948000 seconds user
   0.000000000 seconds sys


Using -Wall -ggdb -Ofast -fno-omit-frame-pointer:

===============================================================================================

40 solution(s), 15808871 leafs visited

Performance counter stats for './bin/magichex 4 3 14 33 30 34 39 6 24 20':

  234725740517      cycles:u                                                           
  784438288607      instructions:u                   #    3.34  insn per cycle         
  223306442457      branches:u                                                         
    1875301947      branch-misses:u                  #    0.84% of all branches        
        122519      L1-dcache-load-misses                                              

  50.195525351 seconds time elapsed

  50.038269000 seconds user
   0.000000000 seconds sys
johannesfelzmann commented 6 months ago

Using -funroll-loops:

Unroll loops whose number of iterations can be determined at compile time or upon entry to the loop. -funroll-loops implies -frerun-cse-after-loop, -fweb and -frename-registers. It also turns on complete loop peeling (i.e. complete removal of loops with a small constant number of iterations). This option makes code larger, and may or may not make it run faster.

Enabled by -fprofile-use and -fauto-profile.
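As a conceptual illustration of what unrolling means, the sketch below writes the transformation out by hand: the loop body is replicated four times, and a remainder loop handles the leftover iterations. This is source-level Python for illustration only; gcc applies the transformation to its internal representation of the C code, and no speedup is implied for Python. The function names are hypothetical.

```python
def sum_rolled(values: list[int]) -> int:
    """Plain loop: one loop test per element."""
    total = 0
    for v in values:
        total += v
    return total


def sum_unrolled_by_4(values: list[int]) -> int:
    """Same result, with the body replicated four times per iteration."""
    total = 0
    n = len(values)
    main_end = n - n % 4  # largest multiple of 4 that is <= n
    # Main loop: one loop test per four additions.
    for i in range(0, main_end, 4):
        total += values[i] + values[i + 1] + values[i + 2] + values[i + 3]
    # Remainder loop: the 0 to 3 elements that did not fit.
    for i in range(main_end, n):
        total += values[i]
    return total


data = list(range(1, 11))
print(sum_rolled(data), sum_unrolled_by_4(data))
```

Fewer loop tests per unit of work is consistent with the lower branch count in the measurement that follows.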

===============================================================================================

40 solution(s), 15808871 leafs visited

Performance counter stats for './bin/magichex 4 3 14 33 30 34 39 6 24 20':

  196167778936      cycles:u                                                           
  687413029725      instructions:u                   #    3.50  insn per cycle         
  183449809317      branches:u                                                         
    1208025256      branch-misses:u                  #    0.66% of all branches        
         97585      L1-dcache-load-misses                                              

  41.821511466 seconds time elapsed

  41.818861000 seconds user
   0.000000000 seconds sys
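To see where the gain comes from, the counters can be compared against the plain -O3 run. The sketch below computes the percentage change per counter. It assumes the -funroll-loops build was made on top of -O3, which this thread does not state explicitly; the numbers themselves are copied from the two runs.

```python
# Counters copied from the -O3 run and the -funroll-loops run in this thread.
# Assumption: -funroll-loops was added on top of -O3.
BASELINE_O3 = {
    "cycles": 241075933903,
    "instructions": 783341289802,
    "branches": 223305947371,
    "branch-misses": 2035596752,
}
UNROLL = {
    "cycles": 196167778936,
    "instructions": 687413029725,
    "branches": 183449809317,
    "branch-misses": 1208025256,
}


def percent_change(before: int, after: int) -> float:
    """Change from before to after, in percent (negative means fewer)."""
    return 100 * (after - before) / before


change = {name: percent_change(BASELINE_O3[name], UNROLL[name]) for name in BASELINE_O3}
for name, pct in change.items():
    print(f"{name:>14}: {pct:+.1f}%")
```

All four counters drop, with branch misses falling the most in relative terms.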
johannesfelzmann commented 6 months ago

Using -funroll-all-loops:

Unroll all loops, even if their number of iterations is uncertain when the loop is entered. This usually makes programs run more slowly. -funroll-all-loops implies the same options as -funroll-loops.

===============================================================================================

40 solution(s), 15808871 leafs visited

Performance counter stats for './bin/magichex 4 3 14 33 30 34 39 6 24 20':

  202095759166      cycles:u                                                           
  687413030034      instructions:u                   #    3.40  insn per cycle         
  183449809627      branches:u                                                         
    1261570116      branch-misses:u                  #    0.69% of all branches        
         90878      L1-dcache-load-misses                                              

  43.078263749 seconds time elapsed

  43.074216000 seconds user
   0.004000000 seconds sys
johannesfelzmann commented 6 months ago

Using -fsplit-loops -funroll-loops:

===============================================================================================

40 solution(s), 15808871 leafs visited

Performance counter stats for './bin/magichex 4 3 14 33 30 34 39 6 24 20':

  195567015001      cycles:u                                                           
  687419468908      instructions:u                   #    3.52  insn per cycle         
  183450304683      branches:u                                                         
    1206416344      branch-misses:u                  #    0.66% of all branches        
         99870      L1-dcache-load-misses                                              

  41.757088412 seconds time elapsed

  41.685353000 seconds user
   0.000000000 seconds sys
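To summarize the runs so far, the sketch below ranks every configuration by elapsed time and reports the speedup relative to -O0. Elapsed times are copied from this thread and rounded to two decimals; labels follow the comment headings. Each configuration was measured once, so differences of around a percent may well be noise.

```python
# Elapsed seconds per configuration, rounded from the perf output in this thread.
ELAPSED = {
    "-O0": 174.54,
    "-O1": 69.46,
    "-O2": 52.83,
    "-O3": 51.38,
    "-O4": 51.50,
    "-Os": 92.99,
    "-Ofast": 51.50,
    "clang (no flags)": 66.71,
    "-Og": 67.13,
    "-Wall -ggdb -O3 -fno-omit-frame-pointer": 50.04,
    "-Wall -ggdb -Ofast -fno-omit-frame-pointer": 50.20,
    "-funroll-loops": 41.82,
    "-funroll-all-loops": 43.08,
    "-fsplit-loops -funroll-loops": 41.76,
}

baseline = ELAPSED["-O0"]
# Fastest configuration first.
ranking = sorted(ELAPSED.items(), key=lambda item: item[1])

for label, seconds in ranking:
    print(f"{seconds:7.2f} s  {baseline / seconds:5.2f}x  {label}")
```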