vectorclass / version2

Vector class library, latest version
Apache License 2.0

Unexpected slowdown in basic example (provided) #5

Closed: Wheest closed this issue 4 years ago

Wheest commented 5 years ago

Hello all,

I'm looking at integrating the library into a project I've been working on.

However, I want to make sure that I set off on the right foot.

Thus, I have made a very simple minimum working example (vector addition), using CMake and git submodules.

You can find the MWE here, which I will improve in responses to this thread.

However, I'm finding a ~2x slowdown using SIMD, which is not what I would expect. I believe I've taken steps to disable automatic vectorisation of my baseline.

I've also not included initialisation in my timing.

Baseline: 1 ms
SIMD: 3 ms
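
For reference, the scalar baseline is essentially a plain loop like this (names are illustrative; the actual code is in the MWE repo):

for (int i = 0; i < SIZE; i++)
{
    resvec[i] = vec_a[i] + vec_b[i];
}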

Before integrating, I want to make sure I avoid stumbling blocks such as this.

Does anyone have any insight into what's going on?

dokempf commented 5 years ago

The reason seems to be the compiler flags you provide in your CMakeLists.txt. You are using the Vec8f type, which is a 256-bit SIMD vector, but only enable -msse3. This makes your explicitly vectorized code fall back to the emulation codepath, which does not deliver SIMD performance. Consider adding all relevant arch flags by specifying e.g. -march=haswell or -march=skylake.
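
For example (a sketch; file names are assumed and the right arch flag depends on your CPU):

clang++ -O2 -march=haswell main.cpp -o simd_mwe    # AVX2: Vec8f maps to real 256-bit instructions
clang++ -O2 -march=native main.cpp -o simd_mwe     # or: use everything the build machine supports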

Wheest commented 5 years ago

Thanks, that's a good tip.

I've updated the codebase, using the -march=native flag.

I also now perform each experiment 2000 times to better show any difference in speed.

Previous version, with the -msse3 flag:

Baseline: 1956 ms
SIMD: 5812 ms

New version, with the -march=native flag:

Baseline: 1976 ms
SIMD: 2353 ms

A great improvement, but still slower than the baseline. I've added and removed a few other flags; however, there was no change.

AgnerF commented 5 years ago

Your vector version has three additional arrays: avec, bvec, resvec. This means you are copying all the data around an extra time and using more data cache. The vector class objects are intended to stay in registers all the time.

This should be faster:

Vec8f a, b, c;

for (int i = 0; i < SIZE; i += 8)
{
    a.load(&vec_a[i]);    // load 8 floats into a SIMD register
    b.load(&vec_b[i]);
    c = a + b;            // a single 256-bit add, no intermediate arrays
    c.store(&resvec[i]);  // write the result straight back to memory
}
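
If SIZE is not a multiple of 8, the leftover tail needs separate handling; a sketch using VCL's load_partial/store_partial:

int n = SIZE - SIZE % 8;              // elements covered by the full-width loop
int rest = SIZE - n;                  // leftover elements, 0..7
a.load_partial(rest, &vec_a[n]);
b.load_partial(rest, &vec_b[n]);
c = a + b;
c.store_partial(rest, &resvec[n]);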

Please remember to specify which compiler you are using, and the command line options.

Wheest commented 5 years ago

Conceptually, the technique of keeping a small number of vectorclass objects in registers and loading data into them as needed makes sense.

However, using this approach I see a significant slowdown:

Baseline: 1974 ms
SIMD: 7893 ms

You can see the new version at the HEAD of the example repo, and the previous version at commit a7588.

I'm compiling with clang-7.

I use the -march=native flag, which should activate all of the SIMD instruction-set extensions available on the x86 platform I am using.

To disable auto-vectorisation of the scalar baseline loop, I added the following flags, as described in the LLVM docs:

-fno-vectorize -fno-tree-vectorize -fno-slp-vectorize
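
For reference, the resulting compile line for the baseline translation unit looks roughly like this (file name assumed):

clang++-7 -march=native -fno-vectorize -fno-tree-vectorize -fno-slp-vectorize -c baseline.cpp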

The build process is still as described in the README:

git clone --recurse-submodules https://github.com/Wheest/vectorclass_mwe
mkdir -p vectorclass_mwe/_build
cd vectorclass_mwe/_build
cmake ..
make
./simd_mwe

Building using the CMake flag -DCMAKE_BUILD_TYPE=Debug to activate -O0 does not change the times.

AgnerF commented 5 years ago

Why do you have the vector loop twice? Please compile with -O2 or -O3.

Wheest commented 5 years ago

Why do you have the vector loop twice?

A mistake on my part, fixed in the new commit.

Compiling with the -O3 flag (via the -DCMAKE_BUILD_TYPE=Release flag), we see the following times:

Baseline: 515 ms
SIMD: 539 ms

This is a great speedup, but the baseline is still faster. It is possible that with -O3 my auto-vectorisation-disabling flags are being ignored; however, this document makes me suspect that is not the case:

-fvectorize, -fno-vectorize: Enables or disables the generation of Advanced SIMD vector instructions directly from C or C++ code at optimization levels -O1 and higher.

AgnerF commented 5 years ago

You have three arrays of 1.2 MB each = 3.6 MB. This is probably bigger than your level-2 data cache. Cache access or memory access is the bottleneck, not CPU throughput.
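
(At 4 bytes per float, 1.2 MB per array is roughly 300,000 elements; typical L2 caches are in the 256 KB to 1 MB per-core range, so a 3.6 MB working set spills into L3 or main memory.)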

You may want to check what the compiler is doing by looking at assembly output (option -S). If you dislike the AT&T assembly syntax, you can make an object file (option -c) and disassemble it with objconv (https://www.agner.org/optimize/#objconv).
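
Concretely, that inspection could look like this (file names assumed; see the objconv manual for output-dialect options):

clang++ -O3 -march=native -S main.cpp -o main.s    # AT&T-syntax assembly listing
clang++ -O3 -march=native -c main.cpp -o main.o
objconv -fasm main.o main.asm                      # disassemble the object file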