I was surprised by the lack of simple examples showing how to use AVX and AVX2 intrinsics. There doesn't seem to be a definitive book or even tutorial on the subject.
O Internet, if I am wrong, please correct me! I've learned that the best way to get information on the internet is not to ask a question, but to post the wrong answer.
So I'm setting myself homework problems and solving them as I go, producing a set of examples. This will be a long and painful process with many trips to the debugger and disassembler.
I will strive to keep the problems simple and the examples short, but I can't promise that a simple problem won't have a complex solution.
I hope I end up with a collection of self-contained code snippets that's useful to others. However, please remember that I am a beginner in the use of CPU vector instructions. I'm not claiming this code is exemplary.
Unfortunately, most of the resources I have found are old. Others are raw reference material like Intel's instruction guides.
- https://software.intel.com/en-us/articles/introduction-to-intel-advanced-vector-extensions/
- https://software.intel.com/en-us/articles/benefits-of-intel-avx-for-small-matrices/
- https://thinkingandcomputing.com/posts/using-avx-instructions-in-matrix-multiplication.html
- https://www.codeproject.com/Articles/874396/Crunching-Numbers-with-AVX-and-AVX
- https://software.intel.com/en-us/node/523876
- https://www.cs.fsu.edu/~engelen/courses/HPC-adv/MMXandSSEexamples.txt
- http://sci.tuomastonteri.fi/programming/sse
- http://stackoverflow.com/questions/13577226/intel-sse-and-avx-examples-and-tutorials
- http://supercomputingblog.com/optimization/getting-started-with-sse-programming/
- https://felix.abecassis.me/2011/09/cpp-getting-started-with-sse/
- http://www.walkingrandomly.com/?p=3378
- https://software.intel.com/sites/landingpage/IntrinsicsGuide/
Most CPUs sold in the last 3-4 years should support AVX2. To find out, run
cat /proc/cpuinfo | grep avx2
on Linux, or
sysctl -a | grep AVX2
on Mac. Sorry Windows friends, I don't write programs on Windows but I would certainly appreciate a pull request that explains what to do.
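If you'd rather ask from inside a program, here is a minimal sketch. It assumes gcc or clang on x86, because `__builtin_cpu_init` and `__builtin_cpu_supports` are compiler builtins rather than standard C. The function name `has_avx2` is my own invention.

```c
/* Returns 1 if the CPU we are running on reports AVX2 support, else 0.
   Assumption: gcc or clang targeting x86; these builtins are compiler
   extensions and will not exist on other compilers. */
int has_avx2(void)
{
    __builtin_cpu_init();  /* harmless if the runtime already initialised it */
    return __builtin_cpu_supports("avx2") ? 1 : 0;
}
```

Call it once at startup and fall back to a scalar code path when it returns 0.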
It seems that the right header to include is immintrin.h, which then goes and includes individual headers like avx2intrin.h. On my installation, I get errors about unknown types if I include avx2intrin.h directly, as well as an error message saying "Never use \<avx2intrin.h> directly; include \<immintrin.h> instead." Okay, seems pretty clear.
The two platform combinations I develop on are clang+Mac and gcc+Linux. On both of these, -mavx2 does the right thing: the code compiles without errors and the expected AVX2 instructions show up in the output assembly.
A complete compilation command looks like
gcc foo.c -mavx2
See examples/00-compile.c for a complete test program that should compile if you have everything set up.
So it begins! We will construct a vector value from four 64-bit literals and then add it to itself.
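Here is a minimal sketch of what that could look like, assuming the literals are 64-bit integers. The function name `add_to_self` is hypothetical. The `target("avx2")` attribute is a gcc/clang extension I've added so the snippet builds even if you forget -mavx2; with the flag you can drop it.

```c
#include <immintrin.h>
#include <stdint.h>

/* Builds the vector {1, 2, 3, 4} from literals, adds it to itself,
   and writes the four 64-bit lanes to out[0..3]. */
__attribute__((target("avx2")))
void add_to_self(int64_t out[4])
{
    /* _mm256_set_epi64x takes its arguments from the highest lane down,
       so the last argument lands in lane 0. */
    __m256i v = _mm256_set_epi64x(4, 3, 2, 1);

    /* Lane-wise 64-bit integer addition. */
    __m256i sum = _mm256_add_epi64(v, v);

    /* Unaligned store, so out needs no special alignment. */
    _mm256_storeu_si256((__m256i *)out, sum);
}
```

After the call, out holds 2, 4, 6, 8. The reversed argument order of `_mm256_set_epi64x` is the part that tripped me up; `_mm256_setr_epi64x` takes the lanes in memory order if you prefer that.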
We will assume not only that the inputs are correctly aligned, but also that their lengths are multiples of 256 bits.
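Under those two assumptions, adding two arrays of float might be sketched like this. The name `add_aligned` is mine, and the asserts are only there to make the assumptions loud: every pointer must be 32-byte aligned and the element count must be a multiple of 8 (8 floats = 256 bits).

```c
#include <assert.h>
#include <immintrin.h>
#include <stddef.h>
#include <stdint.h>

/* dst[i] = a[i] + b[i] for n floats.
   Assumes dst, a and b are each 32-byte aligned and n is a multiple of 8.
   The target attribute is a gcc/clang extension; -mavx2 makes it redundant. */
__attribute__((target("avx2")))
void add_aligned(float *dst, const float *a, const float *b, size_t n)
{
    assert((uintptr_t)dst % 32 == 0);
    assert((uintptr_t)a % 32 == 0);
    assert((uintptr_t)b % 32 == 0);
    assert(n % 8 == 0);

    for (size_t i = 0; i < n; i += 8) {
        __m256 va = _mm256_load_ps(a + i);   /* aligned load of 8 floats */
        __m256 vb = _mm256_load_ps(b + i);
        _mm256_store_ps(dst + i, _mm256_add_ps(va, vb));  /* aligned store */
    }
}
```

To get suitably aligned storage you can declare arrays with `_Alignas(32)` (C11) or allocate them with `aligned_alloc(32, size)`.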
Are there necessary restrictions on alignment with respect to each other, or can we take any two arrays of float anywhere in memory?
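As far as I can tell from the Intrinsics Guide, `_mm256_load_ps` only asks that each address be 32-byte aligned on its own; the two arrays need no particular relationship to each other. The unaligned variants `_mm256_loadu_ps` and `_mm256_storeu_ps` drop the requirement entirely. Below is a sketch that takes any float arrays of any length; `add_unaligned` is a hypothetical name, and the scalar loop at the end handles the leftover elements.

```c
#include <immintrin.h>
#include <stddef.h>

/* dst[i] = a[i] + b[i] for n floats, with no alignment or length
   requirements. The target attribute is a gcc/clang extension;
   -mavx2 makes it redundant. */
__attribute__((target("avx2")))
void add_unaligned(float *dst, const float *a, const float *b, size_t n)
{
    size_t i = 0;

    /* Main loop: 8 floats per iteration using unaligned loads and stores. */
    for (; i + 8 <= n; i += 8) {
        __m256 va = _mm256_loadu_ps(a + i);
        __m256 vb = _mm256_loadu_ps(b + i);
        _mm256_storeu_ps(dst + i, _mm256_add_ps(va, vb));
    }

    /* Scalar tail for the last n % 8 elements. */
    for (; i < n; i++)
        dst[i] = a[i] + b[i];
}
```

Whether the unaligned loads cost anything measurable is something I have not benchmarked.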
Let's add a reduction to the mix.
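I'm taking "reduction" to mean summing every element of a float array, which is an assumption on my part. A sketch: keep eight running partial sums in one vector, then fold the eight lanes down to one at the end. The name `sum_floats` is hypothetical.

```c
#include <immintrin.h>
#include <stddef.h>

/* Returns a[0] + a[1] + ... + a[n-1].
   The target attribute is a gcc/clang extension; -mavx2 makes it redundant. */
__attribute__((target("avx2")))
float sum_floats(const float *a, size_t n)
{
    __m256 acc = _mm256_setzero_ps();   /* 8 partial sums */
    size_t i = 0;

    for (; i + 8 <= n; i += 8)
        acc = _mm256_add_ps(acc, _mm256_loadu_ps(a + i));

    /* Fold 8 lanes -> 4 lanes by adding the upper half to the lower half. */
    __m128 lo = _mm256_castps256_ps128(acc);
    __m128 hi = _mm256_extractf128_ps(acc, 1);
    __m128 s  = _mm_add_ps(lo, hi);

    /* 4 lanes -> 2 lanes: lanes 0 and 1 now hold s0+s2 and s1+s3. */
    s = _mm_add_ps(s, _mm_movehl_ps(s, s));

    /* 2 lanes -> 1 lane: add lane 1 into lane 0. */
    s = _mm_add_ss(s, _mm_shuffle_ps(s, s, 0x55));

    float total = _mm_cvtss_f32(s);

    /* Scalar tail for the last n % 8 elements. */
    for (; i < n; i++)
        total += a[i];

    return total;
}
```

Note that this adds the elements in a different order than a plain scalar loop would, so with floating point the result can differ in the last few bits.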
I have seen it asserted online that brute force linear search can beat binary search for arrays of size up to 10K. The calculations people give to support this claim involve vector instructions. Let's try writing a vectorized linear search.
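Here is a first attempt, assuming an array of 32-bit integers. The idea is to compare eight elements against the key at once, squeeze the comparison result into an 8-bit mask, and use the lowest set bit to find the position. `find_i32` is a hypothetical name, and `__builtin_ctz` is a gcc/clang builtin rather than standard C.

```c
#include <immintrin.h>
#include <stddef.h>
#include <stdint.h>

/* Returns the index of the first element equal to key, or -1 if absent.
   The target attribute is a gcc/clang extension; -mavx2 makes it redundant. */
__attribute__((target("avx2")))
ptrdiff_t find_i32(const int32_t *a, size_t n, int32_t key)
{
    __m256i vkey = _mm256_set1_epi32(key);   /* key copied into all 8 lanes */
    size_t i = 0;

    for (; i + 8 <= n; i += 8) {
        __m256i v  = _mm256_loadu_si256((const __m256i *)(a + i));
        __m256i eq = _mm256_cmpeq_epi32(v, vkey);  /* all-ones where equal */

        /* Take the top bit of each 32-bit lane: one mask bit per element. */
        int mask = _mm256_movemask_ps(_mm256_castsi256_ps(eq));

        if (mask != 0)
            return (ptrdiff_t)(i + (size_t)__builtin_ctz((unsigned)mask));
    }

    /* Scalar tail for the last n % 8 elements. */
    for (; i < n; i++)
        if (a[i] == key)
            return (ptrdiff_t)i;

    return -1;
}
```

I have not measured this against binary search, so it says nothing yet about the 10K claim.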