The README's Performance section contains the following narrative:
"
-O3 includes optimizations that are expensive in terms of compile time and memory usage. Including vectorization, loop unrolling, and predicting branches.
"
The table below shows the single-threaded benchmark results (tokens per second, as reported by `run.c`) that I measured in the following environment:
- CPU: Intel(R) 16-Core HT Xeon(R) CPU E5-2686 v4 @ 2.30GHz
- MEM: 128GB
- Compiler: gcc (GCC) 7.3.1 20180712 (Red Hat 7.3.1-17)
| Model | llama2.c (`-O3`) | llama2.c (`-Ofast -march=native`) |
|---|---|---|
| stories15M.bin | 55.394990 | 148.192771 |
| stories42M.bin | 19.951490 | 48.201989 |
| stories110M.bin | 7.673327 | 18.418202 |
| llama2 7B | 0.126841 | 0.304579 |
| llama2 7B w/ int8 quantization | 0.363301 | 0.241617 |
It seems `-O3` doesn't apply auto-vectorization to `matmul` (the most compute-heavy operation), most likely because vectorizing its floating-point reduction requires reassociating the additions, which GCC only permits under `-ffast-math` (implied by `-Ofast`). You can copy the `matmul` implementation into https://godbolt.org/ to verify that only `-Ofast -march=native` emits AVX instructions such as `vaddps`.
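For reference, here is a minimal sketch of that loop, modeled on the `matmul()` in `run.c` (the exact signature and the W (d,n) @ x (n,) row-major layout are assumptions based on the upstream code, not a verbatim copy):

```c
// Sketch of the compute-heavy routine, modeled on matmul() in llama2.c's
// run.c (signature and layout assumed from upstream).
// W is (d, n) row-major, x is (n,), xout is (d,).
void matmul(float* xout, float* x, float* w, int n, int d) {
    for (int i = 0; i < d; i++) {
        float val = 0.0f;
        // Floating-point reduction: under plain -O3 GCC keeps the additions
        // in source order and leaves this loop scalar; -Ofast implies
        // -ffast-math, which allows the reassociation needed to accumulate
        // in packed registers, and -march=native lets it emit AVX
        // instructions (vaddps / vfmadd...) instead of scalar SSE.
        for (int j = 0; j < n; j++) {
            val += w[i * n + j] * x[j];
        }
        xout[i] = val;
    }
}
```

Pasting this sketch into Compiler Explorer and switching between `-O3` and `-Ofast -march=native` should reproduce the observation above: only the latter vectorizes the inner loop.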