p12tic / libsimdpp

Portable header-only C++ low level SIMD library
Boost Software License 1.0

Slowdown on several vector variations #147

Open TheTryton opened 4 years ago

TheTryton commented 4 years ago

[MSVC] (CPU: Xeon 1231v3) I noticed slowdowns while experimenting with several combinations of matrix sizes and simdpp instruction sets. I implemented matrix multiplication in three ways: plain C, unmodified simdpp, and my own modification of simdpp. All implementations resemble the code below:

    matrix<float, 4, 4> result;

    // Load the four rows of m2 once; they are reused for every row of m1.
    __m128 m2rows[4] =
    {
        _mm_load_ps(m2.row(0).data()),
        _mm_load_ps(m2.row(1).data()),
        _mm_load_ps(m2.row(2).data()),
        _mm_load_ps(m2.row(3).data()),
    };

    auto m1row = m1.begin();
    auto resultrow = result.begin();
    for (; m1row != m1.end(); m1row += 4, resultrow += 4)
    {
        // Broadcast each element of the current m1 row across a full vector.
        __m128 m1row_v[4] =
        {
            _mm_load_ps1(m1row),
            _mm_load_ps1(m1row + 1),
            _mm_load_ps1(m1row + 2),
            _mm_load_ps1(m1row + 3),
        };

        // Scale each row of m2 by the matching broadcast element.
        __m128 temp_mul[4] =
        {
            _mm_mul_ps(m1row_v[0], m2rows[0]),
            _mm_mul_ps(m1row_v[1], m2rows[1]),
            _mm_mul_ps(m1row_v[2], m2rows[2]),
            _mm_mul_ps(m1row_v[3], m2rows[3]),
        };

        // Sum the four partial products to get one row of the result.
        __m128 resultrow_v = _mm_add_ps(_mm_add_ps(temp_mul[0], temp_mul[1]), _mm_add_ps(temp_mul[2], temp_mul[3]));
        _mm_store_ps(resultrow, resultrow_v);
    }

    return result;
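
For anyone who wants to reproduce this, here is a minimal self-contained sketch of the same kernel next to a plain scalar reference, so the two results can be compared. The `matrix` class from the issue is not shown, so `Mat4` (a row-major, 16-byte-aligned array of 16 floats), `mul4x4_scalar`, `mul4x4_simd` and the `HAVE_SSE_SKETCH` macro are all hypothetical names introduced only for this illustration. On targets without SSE the sketch simply falls back to the scalar path.

```cpp
#include <cstddef>

// Use SSE intrinsics only where the compiler reports that they are available.
#if defined(__SSE__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 1)
#include <xmmintrin.h>
#define HAVE_SSE_SKETCH 1
#endif

// Hypothetical stand-in for the issue's matrix<float, 4, 4>:
// row-major storage, aligned so that _mm_load_ps/_mm_store_ps are valid.
struct alignas(16) Mat4 {
    float v[16];
};

// Plain scalar reference: out[r][c] = sum over k of a[r][k] * b[k][c].
inline Mat4 mul4x4_scalar(const Mat4& a, const Mat4& b) {
    Mat4 out{};
    for (std::size_t r = 0; r < 4; ++r) {
        for (std::size_t c = 0; c < 4; ++c) {
            float sum = 0.0f;
            for (std::size_t k = 0; k < 4; ++k) {
                sum += a.v[4 * r + k] * b.v[4 * k + c];
            }
            out.v[4 * r + c] = sum;
        }
    }
    return out;
}

// Same structure as the snippet in the issue: each result row is a sum of
// the rows of b, scaled by the broadcast elements of the matching row of a.
inline Mat4 mul4x4_simd(const Mat4& a, const Mat4& b) {
#ifdef HAVE_SSE_SKETCH
    const __m128 b_rows[4] = {
        _mm_load_ps(b.v),
        _mm_load_ps(b.v + 4),
        _mm_load_ps(b.v + 8),
        _mm_load_ps(b.v + 12),
    };
    Mat4 out;
    for (std::size_t r = 0; r < 4; ++r) {
        const float* a_row = a.v + 4 * r;
        const __m128 p0 = _mm_mul_ps(_mm_load_ps1(a_row), b_rows[0]);
        const __m128 p1 = _mm_mul_ps(_mm_load_ps1(a_row + 1), b_rows[1]);
        const __m128 p2 = _mm_mul_ps(_mm_load_ps1(a_row + 2), b_rows[2]);
        const __m128 p3 = _mm_mul_ps(_mm_load_ps1(a_row + 3), b_rows[3]);
        _mm_store_ps(out.v + 4 * r,
                     _mm_add_ps(_mm_add_ps(p0, p1), _mm_add_ps(p2, p3)));
    }
    return out;
#else
    // No SSE on this target: fall back to the scalar reference.
    return mul4x4_scalar(a, b);
#endif
}
```

The two versions add the partial products in a different order, so for arbitrary inputs the results may differ in the last bits; comparing with a small tolerance is safer than exact equality.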

Every structure is properly aligned and correctly sized for loads and stores of the corresponding SIMD type. Below I explain the performance problems seen with each combination of matrix size and instruction set. Performance was measured as the average over 100000 iterations and 10 runs of each implementation. Everything was compiled with the /O2 and /Ob2 optimizations (Release).
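
The measurement described above could be sketched roughly as follows. This is not the harness used for the numbers in this issue, which is not shown; `average_ns_per_call` is a hypothetical helper that times a callable over a number of runs and iterations with `std::chrono::steady_clock` and returns the mean time per call.

```cpp
#include <chrono>
#include <cstddef>

// Hypothetical helper: mean wall-clock time per call, in nanoseconds,
// over `runs` timed runs of `iterations` calls each.
// Assumes iterations > 0 and runs > 0.
template <class F>
double average_ns_per_call(F&& kernel,
                           std::size_t iterations = 100000,
                           std::size_t runs = 10) {
    using clock = std::chrono::steady_clock;
    double total_ns = 0.0;
    for (std::size_t run = 0; run < runs; ++run) {
        const auto start = clock::now();
        for (std::size_t i = 0; i < iterations; ++i) {
            kernel();
        }
        const auto stop = clock::now();
        total_ns += std::chrono::duration<double, std::nano>(stop - start).count();
    }
    return total_ns / (static_cast<double>(iterations) * static_cast<double>(runs));
}
```

When timing a kernel this small, the callable should write its result somewhere the optimizer cannot discard (for example a `volatile` variable), otherwise the whole loop may be removed in a Release build.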

The problems stated above are more common in the simdpp implementation (e.g. the add, mul and sub functions). I haven't tested the above code with G++ or Clang on my platform, but I suppose some of these problems can still occur (e.g. the one with _mm_broadcast_ss). I also haven't tested the performance of these matrix multiplications on AVX512F because I don't have a CPU that supports it, but similar issues may be present with that instruction set.

abique commented 4 years ago

It would be interesting to reproduce this benchmark with gcc and clang.

TheTryton commented 4 years ago

> It would be interesting to reproduce this benchmark with gcc and clang.

I've tested them now, and it seems these slowdowns are present with neither gcc (9.3.0) nor clang (9.0.1). From my investigation, the slowdowns are mainly due to MSVC's inferior optimization.