Wheest closed this issue 4 years ago
The reason seems to be the compiler flags you provide in your CMakeLists.txt. You are using the Vec8f type, which is a 256-bit SIMD vector, but you only enable -msse3. This makes your explicitly vectorized code use the emulation code path, which does not deliver performance. Consider enabling all relevant architecture flags by specifying e.g. -march=haswell or -march=skylake.
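One way to confirm which instruction sets the compiler flags actually enabled is to test the macros that GCC and Clang predefine for them. The sketch below is only an illustration: simd_level and has_native_256bit_float are hypothetical helper names, not part of the vector class library, and it assumes that 256-bit float vectors need at least AVX to avoid emulation.

```cpp
#include <string>

// Reports the widest x86 SIMD extension the compiler was allowed to target,
// based on macros that GCC and Clang predefine for -msse3, -mavx, -mavx2, etc.
// (simd_level is a hypothetical helper, not a vector class library function.)
inline std::string simd_level() {
#if defined(__AVX512F__)
    return "AVX512F";
#elif defined(__AVX2__)
    return "AVX2";
#elif defined(__AVX__)
    return "AVX";
#elif defined(__SSE3__)
    return "SSE3";
#elif defined(__SSE2__)
    return "SSE2";
#else
    return "none";
#endif
}

// Assumption: 256-bit float vectors map onto native instructions once AVX is on.
inline bool has_native_256bit_float() {
#if defined(__AVX__)
    return true;
#else
    return false;
#endif
}
```

Printing simd_level() at program start makes it obvious when a build was compiled with only -msse3, in which case has_native_256bit_float() returns false.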
Thanks, that's a good tip.
I've updated the codebase, using the -march=native flag.
I also now perform each experiment 2000 times to better show any difference in speed.
Previous version, with the -msse3 flag:
Baseline: 1956 ms
SIMD: 5812 ms
New version, with the -march=native flag:
Baseline: 1976 ms
SIMD: 2353 ms
A great improvement, but still slower than our baseline. I've tried adding and removing a few other flags, but the times do not change.
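The repeated-timing setup described above (each experiment run many times, initialisation excluded) could look roughly like the following. This is a sketch, not the code from the MWE: time_ms and add_scalar are hypothetical names, and std::chrono::steady_clock is just one reasonable choice of clock.

```cpp
#include <chrono>
#include <cstddef>
#include <vector>

// Scalar baseline: element-wise addition of two float arrays.
inline void add_scalar(const float* a, const float* b, float* out, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) out[i] = a[i] + b[i];
}

// Runs `kernel` `reps` times and returns the total wall time in milliseconds.
// Only the calls to `kernel` are timed; set up the arrays before calling this.
template <typename Kernel>
double time_ms(Kernel kernel, int reps) {
    const auto start = std::chrono::steady_clock::now();
    for (int r = 0; r < reps; ++r) kernel();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}
```

A call such as time_ms([&] { add_scalar(a.data(), b.data(), out.data(), out.size()); }, 2000) would time 2000 repetitions. Reading or printing an element of the output afterwards helps stop the optimiser from discarding the work.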
Your vector version has three additional arrays: avec, bvec, and resvec. This means you are copying all the data an extra time and using more data cache. The vector class objects are intended to stay in registers all the time.
This should be faster:
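Vec8f a, b, c;
for (int i = 0; i < SIZE; i += 8)
{
    a.load(&vec_a[i]);    // load 8 floats from the first input
    b.load(&vec_b[i]);    // load 8 floats from the second input
    c = a + b;            // add in registers
    c.store(&resvec[i]);  // write 8 results back
}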
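For comparison, the same load, add, store pattern can be written with raw AVX intrinsics, which makes it easier to see what the loop should compile to. The sketch below is not the vector class library API: add_arrays is a hypothetical name, and the code assumes AVX is enabled at compile time (otherwise it falls back to a scalar loop). It also handles sizes that are not a multiple of 8, which the loop above does not.

```cpp
#include <cstddef>
#if defined(__AVX__)
#include <immintrin.h>
#endif

// Adds a[i] + b[i] into out[i] for n floats.
// add_arrays is a hypothetical name used only for this sketch.
inline void add_arrays(const float* a, const float* b, float* out, std::size_t n) {
    std::size_t i = 0;
#if defined(__AVX__)
    // Main loop: 8 floats per iteration. va, vb and their sum stay in
    // registers; no intermediate arrays are involved.
    for (; i + 8 <= n; i += 8) {
        const __m256 va = _mm256_loadu_ps(a + i);  // unaligned 256-bit load
        const __m256 vb = _mm256_loadu_ps(b + i);
        _mm256_storeu_ps(out + i, _mm256_add_ps(va, vb));
    }
#endif
    // Scalar tail for the last n % 8 elements
    // (and the whole array when AVX is not enabled).
    for (; i < n; ++i) out[i] = a[i] + b[i];
}
```

The only memory traffic per iteration is two loads and one store, which is the minimum for this operation.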
Please remember to specify which compiler you are using, and the command line options.
Conceptually, the technique of having a small number of vectorclass objects existing in cache, and being copied to as needed makes sense.
However, using this approach I see a significant slowdown:
Baseline: 1974 ms
SIMD: 7893 ms
You can see the new version at the HEAD of the example, and the previous version here at a7588.
I'm compiling with clang-7.
I use the -march=native flag, which should activate all SSE extensions available on the x86 platform I am using.
To disable auto-vectorisation of the scalar baseline loop, I added the following flags, as described in the LLVM docs:
-fno-vectorize -fno-tree-vectorize -fno-slp-vectorize
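As an alternative to global flags, Clang also accepts a per-loop hint that turns off the loop vectoriser for a single loop, which keeps the rest of the program unaffected. The sketch below assumes Clang; add_baseline is a hypothetical name, and on other compilers the pragma is simply skipped. The hint only covers the loop vectoriser, so checking the assembly is still the reliable way to confirm the baseline is scalar.

```cpp
#include <cstddef>

// Scalar baseline with vectorisation disabled for this loop only.
// add_baseline is a hypothetical name used for this sketch.
inline void add_baseline(const float* a, const float* b, float* out, std::size_t n) {
#if defined(__clang__)
// Clang loop hint: do not vectorise or interleave the loop that follows.
#pragma clang loop vectorize(disable) interleave(disable)
#endif
    for (std::size_t i = 0; i < n; ++i) {
        out[i] = a[i] + b[i];
    }
}
```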
The build process is still as described in the README:
git clone --recurse-submodules https://github.com/Wheest/vectorclass_mwe
mkdir -p vectorclass_mwe/_build
cd vectorclass_mwe/_build
cmake ..
make
./simd_mwe
Building using the CMake flag -DCMAKE_BUILD_TYPE=Debug to activate -O0 does not change the times.
Why do you have the vector loop twice? Please compile with -O2 or -O3.
Why do you have the vector loop twice?
A mistake on my part, fixed in the new commit.
Compiling with the -O3 flag (via the -DCMAKE_BUILD_TYPE=Release flag), we see the following times:
Baseline: 515 ms
SIMD: 539 ms
This is a great speedup, but the baseline is still faster. It is possible that with -O3 my flags for disabling auto-vectorisation are being ignored. However, this document makes me suspect this is not the case:
-fvectorize, -fno-vectorize: Enables or disables the generation of Advanced SIMD vector instructions directly from C or C++ code at optimization levels -O1 and higher.
You have three arrays of 1.2 MB each = 3.6 MB. This is probably bigger than your level-2 data cache. Cache access or memory access is the bottleneck, not CPU throughput.
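The working-set arithmetic above is easy to check for other array sizes. The sketch below uses hypothetical helper names, and the numbers in the usage note are assumptions: 300,000 floats per array is what 1.2 MB corresponds to at 4 bytes per float, and the 1 MiB level-2 cache size is an example value, not a measurement of the machine in question.

```cpp
#include <cstddef>

// Total bytes touched by one pass over `arrays` arrays of `elems` elements.
inline std::size_t working_set_bytes(std::size_t elems, std::size_t arrays,
                                     std::size_t elem_size) {
    return elems * arrays * elem_size;
}

// True when the whole working set fits in a cache of `cache_bytes`.
inline bool fits_in_cache(std::size_t working_set, std::size_t cache_bytes) {
    return working_set <= cache_bytes;
}
```

With the assumed sizes, working_set_bytes(300000, 3, sizeof(float)) gives 3,600,000 bytes, and fits_in_cache reports false for a 1 MiB cache. In that situation both versions of the loop spend most of their time waiting for memory, so SIMD arithmetic has little room to help.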
You may want to check what the compiler is doing by looking at assembly output (option -S). If you dislike the AT&T assembly syntax, you can make an object file (option -c) and disassemble it with objconv (https://www.agner.org/optimize/#objconv).
Hello all,
I'm looking at integrating the library into a project I'm working on.
However, I want to make sure that I set off on the right foot.
Thus, I have made a very simple minimum working example (vector addition), using CMake and git submodules.
You can find the MWE here, which I will improve in response to this thread.
However, I'm finding an ~2x slowdown using SIMD, which is not what I would expect. I've taken steps to disable automatic vectorisation of my baseline, I think.
I've also not included initialisation in my timing.
Before integrating, I want to make sure I avoid stumbling blocks such as this.
Does anyone have any insight into what's going on?