karpathy / llama2.c

Inference Llama 2 in one file of pure C

-O3 does not apply auto-vectorization on X86-64 CPU #448

Open neoremind opened 1 year ago

neoremind commented 1 year ago

In the README, the Performance section states: "-O3 includes optimizations that are expensive in terms of compile time and memory usage. Including vectorization, loop unrolling, and predicting branches."

The table below shows the single-threaded benchmark results (tok/s) I measured on the following environment.

CPU: Intel(R) Xeon(R) E5-2686 v4 @ 2.30GHz (16 cores, HT)
MEM: 128 GB
Compiler: gcc (GCC) 7.3.1 20180712 (Red Hat 7.3.1-17)
| Model | llama2.c (-O3) | llama2.c (-Ofast -march=native) |
|---|---|---|
| stories15M.bin | 55.394990 | 148.192771 |
| stories42M.bin | 19.951490 | 48.201989 |
| stories110M.bin | 7.673327 | 18.418202 |
| llama2 7B | 0.126841 | 0.304579 |
| llama2 7B w/ int8 quantization | 0.363301 | 0.241617 |

It seems -O3 does not apply auto-vectorization to matmul (the most compute-heavy operation). Copying the matmul implementation into https://godbolt.org/ confirms that only -Ofast -march=native emits AVX instructions such as vaddps.
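
For reference, the matmul kernel in run.c is essentially the nested loop below (a simplified sketch; the version in the repo also carries an OpenMP pragma on the outer loop). The inner dot-product loop is what the compiler needs to auto-vectorize:

```c
// Sketch of the matmul kernel from run.c (simplified).
// W (d,n) @ x (n,) -> xout (d,)
void matmul(float* xout, float* x, float* w, int n, int d) {
    for (int i = 0; i < d; i++) {
        float val = 0.0f;
        for (int j = 0; j < n; j++) {
            // inner dot product: the loop that should become AVX vaddps/vmulps
            val += w[i * n + j] * x[j];
        }
        xout[i] = val;
    }
}
```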

jameswdelancey commented 6 months ago

You also need to add -march=native along with -O3 or -Ofast.
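
For example, something like the following (a sketch, assuming gcc on an x86-64 host). -O3 by itself targets the baseline x86-64 ISA, which only guarantees SSE2, so AVX is not emitted without a -march/-mavx flag:

```
# baseline x86-64 target: no AVX even at -O3
gcc -O3 -o run run.c -lm

# let the compiler use the host CPU's instruction set (e.g. AVX/AVX2)
gcc -O3 -march=native -o run run.c -lm
gcc -Ofast -march=native -o run run.c -lm
```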