karpathy / llama2.c

Inference Llama 2 in one file of pure C

-O3 does not apply auto-vectorization on X86-64 CPU #448

Open neoremind opened 1 year ago

neoremind commented 1 year ago

In the README, the Performance section states: "-O3 includes optimizations that are expensive in terms of compile time and memory usage. Including vectorization, loop unrolling, and predicting branches."

The table below shows the single-threaded benchmark results (tok/s) I measured on the following environment.

CPU: Intel(R) Xeon(R) E5-2686 v4 @ 2.30GHz (16 cores, HT)
MEM: 128 GB
Compiler: gcc (GCC) 7.3.1 20180712 (Red Hat 7.3.1-17)
| Model | llama2.c (-O3) | llama2.c (-Ofast -march=native) |
|---|---|---|
| stories15M.bin | 55.394990 | 148.192771 |
| stories42M.bin | 19.951490 | 48.201989 |
| stories110M.bin | 7.673327 | 18.418202 |
| llama2 7B | 0.126841 | 0.304579 |
| llama2 7B w/ int8 quantization | 0.363301 | 0.241617 |

It seems -O3 does not apply auto-vectorization to matmul (the most compute-heavy operation). Copying the matmul implementation into https://godbolt.org/ confirms that only -Ofast -march=native emits AVX instructions such as vaddps.
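
For reference, the matmul kernel in run.c is essentially the nested loop below (a simplified sketch; the version in the repo also carries an OpenMP pragma on the outer loop). The inner dot-product loop is what the compiler needs to auto-vectorize:

```c
// Sketch of the matmul kernel from run.c (simplified).
// W (d,n) @ x (n,) -> xout (d,)
void matmul(float* xout, float* x, float* w, int n, int d) {
    for (int i = 0; i < d; i++) {
        float val = 0.0f;
        for (int j = 0; j < n; j++) {
            // inner dot product: the loop that should become AVX vaddps/vmulps
            val += w[i * n + j] * x[j];
        }
        xout[i] = val;
    }
}
```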

jameswdelancey commented 6 months ago

You also need to add -march=native along with -O3 or -Ofast.
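
For example, something like the following (a sketch, assuming gcc on an x86-64 host). -O3 by itself targets the baseline x86-64 ISA, which only guarantees SSE2, so AVX is not emitted without a -march/-mavx flag:

```
# baseline x86-64 target: no AVX even at -O3
gcc -O3 -o run run.c -lm

# let the compiler use the host CPU's instruction set (e.g. AVX/AVX2)
gcc -O3 -march=native -o run run.c -lm
gcc -Ofast -march=native -o run run.c -lm
```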