johannesfelzmann opened 6 months ago
Using -Wall -O0:
Most optimizations are completely disabled at -O0.
Reduce compilation time and make debugging produce the expected results. This is the default.
===============================================================================================
40 solution(s), 15808871 leafs visited
Performance counter stats for './bin/magichex 4 3 14 33 30 34 39 6 24 20':
817376542210 cycles:u
2373885593213 instructions:u # 2.90 insn per cycle
334975956080 branches:u
2513459182 branch-misses:u # 0.75% of all branches
341609 L1-dcache-load-misses
174.538960494 seconds time elapsed
174.223831000 seconds user
0.000000000 seconds sys
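The derived figures that perf prints next to the raw counters (insn per cycle, branch-miss percentage) can be recomputed by hand, which helps when comparing the runs below. A minimal sketch using the -O0 counters above; the variable names are illustrative only:

```python
# Raw user-space counters copied from the -O0 run above.
cycles = 817_376_542_210
instructions = 2_373_885_593_213
branches = 334_975_956_080
branch_misses = 2_513_459_182

# Instructions per cycle: how much work is retired per clock on average.
ipc = instructions / cycles

# Share of executed branches that were mispredicted, in percent.
branch_miss_pct = 100 * branch_misses / branches

print(f"{ipc:.2f} insn per cycle")
print(f"{branch_miss_pct:.2f}% of all branches")
```

A higher optimization level can lower the instruction count without raising insn per cycle, so both numbers are worth reading together with the elapsed time.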
Using -O1:
Optimize. Optimizing compilation takes somewhat more time, and a lot more memory for a large function.
With -O, the compiler tries to reduce code size and execution time, without performing any optimizations that take a great deal of compilation time.
-O turns on the following optimization flags:
-fauto-inc-dec -fbranch-count-reg -fcombine-stack-adjustments -fcompare-elim -fcprop-registers -fdce -fdefer-pop -fdelayed-branch -fdse -fforward-propagate -fguess-branch-probability -fif-conversion -fif-conversion2 -finline-functions-called-once -fipa-modref -fipa-profile -fipa-pure-const -fipa-reference -fipa-reference-addressable -fmerge-constants -fmove-loop-invariants -fmove-loop-stores -fomit-frame-pointer -freorder-blocks -fshrink-wrap -fshrink-wrap-separate -fsplit-wide-types -fssa-backprop -fssa-phiopt -ftree-bit-ccp -ftree-ccp -ftree-ch -ftree-coalesce-vars -ftree-copy-prop -ftree-dce -ftree-dominator-opts -ftree-dse -ftree-forwprop -ftree-fre -ftree-phiprop -ftree-pta -ftree-scev-cprop -ftree-sink -ftree-slsr -ftree-sra -ftree-ter -funit-at-a-time
===============================================================================================
40 solution(s), 15808871 leafs visited
Performance counter stats for './bin/magichex 4 3 14 33 30 34 39 6 24 20':
325283970374 cycles:u
1031621271606 instructions:u # 3.17 insn per cycle
312237946193 branches:u
2010472438 branch-misses:u # 0.64% of all branches
130579 L1-dcache-load-misses
69.461152837 seconds time elapsed
69.348018000 seconds user
0.000000000 seconds sys
Using -O2:
Optimize even more. GCC performs nearly all supported optimizations that do not involve a space-speed tradeoff. As compared to -O, this option increases both compilation time and the performance of the generated code.
-O2 turns on all optimization flags specified by -O1. It also turns on the following optimization flags:
-falign-functions -falign-jumps -falign-labels -falign-loops -fcaller-saves -fcode-hoisting -fcrossjumping -fcse-follow-jumps -fcse-skip-blocks -fdelete-null-pointer-checks -fdevirtualize -fdevirtualize-speculatively -fexpensive-optimizations -ffinite-loops -fgcse -fgcse-lm -fhoist-adjacent-loads -finline-functions -finline-small-functions -findirect-inlining -fipa-bit-cp -fipa-cp -fipa-icf -fipa-ra -fipa-sra -fipa-vrp -fisolate-erroneous-paths-dereference -flra-remat -foptimize-sibling-calls -foptimize-strlen -fpartial-inlining -fpeephole2 -freorder-blocks-algorithm=stc -freorder-blocks-and-partition -freorder-functions -frerun-cse-after-loop -fschedule-insns -fschedule-insns2 -fsched-interblock -fsched-spec -fstore-merging -fstrict-aliasing -fthread-jumps -ftree-builtin-call-dce -ftree-loop-vectorize -ftree-pre -ftree-slp-vectorize -ftree-switch-conversion -ftree-tail-merge -ftree-vrp -fvect-cost-model=very-cheap
Please note the warning under -fgcse about invoking -O2 on programs that use computed gotos.
===============================================================================================
40 solution(s), 15808871 leafs visited
Performance counter stats for './bin/magichex 4 3 14 33 30 34 39 6 24 20':
247816256820 cycles:u
829295125025 instructions:u # 3.35 insn per cycle
237230072600 branches:u
1886587120 branch-misses:u # 0.80% of all branches
86474 L1-dcache-load-misses
52.834204198 seconds time elapsed
52.834134000 seconds user
0.000000000 seconds sys
Using -O3:
Optimize yet more. -O3 turns on all optimizations specified by -O2 and also turns on the following optimization flags:
-fgcse-after-reload -fipa-cp-clone -floop-interchange -floop-unroll-and-jam -fpeel-loops -fpredictive-commoning -fsplit-loops -fsplit-paths -ftree-loop-distribution -ftree-partial-pre -funswitch-loops -fvect-cost-model=dynamic -fversion-loops-for-strides
===============================================================================================
40 solution(s), 15808871 leafs visited
Performance counter stats for './bin/magichex 4 3 14 33 30 34 39 6 24 20':
241075933903 cycles:u
783341289802 instructions:u # 3.25 insn per cycle
223305947371 branches:u
2035596752 branch-misses:u # 0.91% of all branches
97152 L1-dcache-load-misses
51.382573401 seconds time elapsed
51.382597000 seconds user
0.000000000 seconds sys
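Using -O4:
GCC accepts -O4 but treats it the same as -O3, so this run is expected to match the -O3 run; the instruction and branch counts below are nearly identical to those above.
===============================================================================================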
40 solution(s), 15808871 leafs visited
Performance counter stats for './bin/magichex 4 3 14 33 30 34 39 6 24 20':
241567620308 cycles:u
783341289830 instructions:u # 3.24 insn per cycle
223305947399 branches:u
2071550789 branch-misses:u # 0.93% of all branches
116037 L1-dcache-load-misses
51.502712639 seconds time elapsed
51.502766000 seconds user
0.000000000 seconds sys
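One way to check whether -O4 enables anything beyond -O3 on a given GCC is to compare the output of `gcc -Q --help=optimizers` for both levels. The sketch below assumes `gcc` is on the PATH; the helper names `parse_enabled` and `enabled_flags` are made up for this example. It only compares flags reported as `[enabled]` and ignores options that take a value, such as -fvect-cost-model=.

```python
import shutil
import subprocess


def parse_enabled(help_output: str) -> set[str]:
    """Return the flags marked [enabled] in `gcc -Q --help=optimizers` output."""
    enabled = set()
    for line in help_output.splitlines():
        parts = line.split()
        if len(parts) >= 2 and parts[0].startswith("-") and parts[-1] == "[enabled]":
            enabled.add(parts[0])
    return enabled


def enabled_flags(level: str, compiler: str = "gcc") -> set[str]:
    """Ask the compiler which optimizer flags a given -O level enables."""
    result = subprocess.run(
        [compiler, "-Q", level, "--help=optimizers"],
        capture_output=True, text=True, check=True,
    )
    return parse_enabled(result.stdout)


if shutil.which("gcc"):
    try:
        o3, o4 = enabled_flags("-O3"), enabled_flags("-O4")
        # Symmetric difference: flags enabled at one level but not the other.
        print("flags that differ between -O3 and -O4:", sorted(o3 ^ o4) or "none")
    except subprocess.CalledProcessError as err:
        print("compiler rejected the query:", err)
else:
    print("gcc not found on PATH; skipping the comparison")
```

The same comparison works for any pair of levels, for example -O2 against -O3, to see which flags account for a measured difference.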
Using -Os:
Optimize for size. -Os enables all -O2 optimizations except those that often increase code size:
-falign-functions -falign-jumps -falign-labels -falign-loops -fprefetch-loop-arrays -freorder-blocks-algorithm=stc
It also enables -finline-functions, causes the compiler to tune for code size rather than execution speed, and performs further optimizations designed to reduce code size.
===============================================================================================
40 solution(s), 15808871 leafs visited
Performance counter stats for './bin/magichex 4 3 14 33 30 34 39 6 24 20':
435653242040 cycles:u
1077864954278 instructions:u # 2.47 insn per cycle
361459767992 branches:u
2503122768 branch-misses:u # 0.69% of all branches
275109 L1-dcache-load-misses
92.988360805 seconds time elapsed
92.860797000 seconds user
0.000000000 seconds sys
Using -Ofast:
Disregard strict standards compliance. -Ofast enables all -O3 optimizations. It also enables optimizations that are not valid for all standard-compliant programs. It turns on -ffast-math, -fallow-store-data-races and the Fortran-specific -fstack-arrays, unless -fmax-stack-var-size is specified, and -fno-protect-parens. It turns off -fsemantic-interposition.
===============================================================================================
40 solution(s), 15808871 leafs visited
Performance counter stats for './bin/magichex 4 3 14 33 30 34 39 6 24 20':
241244790946 cycles:u
783347728964 instructions:u # 3.25 insn per cycle
223306442705 branches:u
2006716844 branch-misses:u # 0.90% of all branches
101913 L1-dcache-load-misses
51.498639996 seconds time elapsed
51.423188000 seconds user
0.000000000 seconds sys
Using compiler clang without any flags:
===============================================================================================
40 solution(s), 15808871 leafs visited
Performance counter stats for './bin/magichex 4 3 14 33 30 34 39 6 24 20':
312892299451 cycles:u
1090702502642 instructions:u # 3.49 insn per cycle
321766720583 branches:u
1801181908 branch-misses:u # 0.56% of all branches
129846 L1-dcache-load-misses
66.706721299 seconds time elapsed
66.706517000 seconds user
0.000000000 seconds sys
Using -Og:
Optimize debugging experience. -Og should be the optimization level of choice for the standard edit-compile-debug cycle, offering a reasonable level of optimization while maintaining fast compilation and a good debugging experience. It is a better choice than -O0 for producing debuggable code because some compiler passes that collect debug information are disabled at -O0.
Like -O0, -Og completely disables a number of optimization passes so that individual options controlling them have no effect. Otherwise -Og enables all -O1 optimization flags except for those that may interfere with debugging:
-fbranch-count-reg -fdelayed-branch -fdse -fif-conversion -fif-conversion2 -finline-functions-called-once -fmove-loop-invariants -fmove-loop-stores -fssa-phiopt -ftree-bit-ccp -ftree-dse -ftree-pta -ftree-sra
===============================================================================================
40 solution(s), 15808871 leafs visited
Performance counter stats for './bin/magichex 4 3 14 33 30 34 39 6 24 20':
314952645563 cycles:u
1090702502777 instructions:u # 3.46 insn per cycle
321766720718 branches:u
1875384085 branch-misses:u # 0.58% of all branches
103970 L1-dcache-load-misses
67.126468533 seconds time elapsed
67.122475000 seconds user
0.003999000 seconds sys
Using -Wall -ggdb -O3 -fno-omit-frame-pointer:
===============================================================================================
40 solution(s), 15808871 leafs visited
Performance counter stats for './bin/magichex 4 3 14 33 30 34 39 6 24 20':
234104166901 cycles:u
784431849336 instructions:u # 3.35 insn per cycle
223305947014 branches:u
1877173310 branch-misses:u # 0.84% of all branches
108678 L1-dcache-load-misses
50.035881801 seconds time elapsed
49.897948000 seconds user
0.000000000 seconds sys
Using -Wall -ggdb -Ofast -fno-omit-frame-pointer:
===============================================================================================
40 solution(s), 15808871 leafs visited
Performance counter stats for './bin/magichex 4 3 14 33 30 34 39 6 24 20':
234725740517 cycles:u
784438288607 instructions:u # 3.34 insn per cycle
223306442457 branches:u
1875301947 branch-misses:u # 0.84% of all branches
122519 L1-dcache-load-misses
50.195525351 seconds time elapsed
50.038269000 seconds user
0.000000000 seconds sys
Using -funroll-loops:
Unroll loops whose number of iterations can be determined at compile time or upon entry to the loop. -funroll-loops implies -frerun-cse-after-loop, -fweb and -frename-registers. It also turns on complete loop peeling (i.e. complete removal of loops with a small constant number of iterations). This option makes code larger, and may or may not make it run faster.
Enabled by -fprofile-use and -fauto-profile.
===============================================================================================
40 solution(s), 15808871 leafs visited
Performance counter stats for './bin/magichex 4 3 14 33 30 34 39 6 24 20':
196167778936 cycles:u
687413029725 instructions:u # 3.50 insn per cycle
183449809317 branches:u
1208025256 branch-misses:u # 0.66% of all branches
97585 L1-dcache-load-misses
41.821511466 seconds time elapsed
41.818861000 seconds user
0.000000000 seconds sys
Using -funroll-all-loops:
Unroll all loops, even if their number of iterations is uncertain when the loop is entered. This usually makes programs run more slowly. -funroll-all-loops implies the same options as -funroll-loops.
===============================================================================================
40 solution(s), 15808871 leafs visited
Performance counter stats for './bin/magichex 4 3 14 33 30 34 39 6 24 20':
202095759166 cycles:u
687413030034 instructions:u # 3.40 insn per cycle
183449809627 branches:u
1261570116 branch-misses:u # 0.69% of all branches
90878 L1-dcache-load-misses
43.078263749 seconds time elapsed
43.074216000 seconds user
0.004000000 seconds sys
Using -fsplit-loops -funroll-loops:
===============================================================================================
40 solution(s), 15808871 leafs visited
Performance counter stats for './bin/magichex 4 3 14 33 30 34 39 6 24 20':
195567015001 cycles:u
687419468908 instructions:u # 3.52 insn per cycle
183450304683 branches:u
1206416344 branch-misses:u # 0.66% of all branches
99870 L1-dcache-load-misses
41.757088412 seconds time elapsed
41.685353000 seconds user
0.000000000 seconds sys
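To compare the runs at a glance, the elapsed times can be turned into speedup factors relative to -O0. A small sketch using the elapsed times reported above; the labels are the headings as written in this issue (which does not say what base level the -funroll runs were built with), and the clang run is left out so that only gcc builds are compared:

```python
# Elapsed wall-clock seconds, copied from the perf output of each gcc run above.
elapsed = {
    "-Wall -O0": 174.538960494,
    "-O1": 69.461152837,
    "-O2": 52.834204198,
    "-O3": 51.382573401,
    "-O4": 51.502712639,
    "-Os": 92.988360805,
    "-Ofast": 51.498639996,
    "-Og": 67.126468533,
    "-Wall -ggdb -O3 -fno-omit-frame-pointer": 50.035881801,
    "-Wall -ggdb -Ofast -fno-omit-frame-pointer": 50.195525351,
    "-funroll-loops": 41.821511466,
    "-funroll-all-loops": 43.078263749,
    "-fsplit-loops -funroll-loops": 41.757088412,
}

# Speedup factor of each build relative to the unoptimized baseline.
baseline = elapsed["-Wall -O0"]
speedup = {flags: baseline / secs for flags, secs in elapsed.items()}

# Print the builds from fastest to slowest.
for flags, factor in sorted(speedup.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{flags:<44} {elapsed[flags]:8.2f} s  {factor:5.2f}x")

fastest = min(elapsed, key=elapsed.get)
print("fastest build:", fastest)
```

These are single runs, so differences of a fraction of a second (for example -O3 against -O4 or -Ofast) are within what run-to-run noise could plausibly explain.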
Attention: not all optimizations are controlled directly by a flag (more on that later).
For now, using gcc.
Main branch last commit for stats: f09e7abb9db914613d52914f904b646d25ca3c15
Profile with: