Wunkolo opened this issue 5 years ago
Determine it at compile time, juggling around function pointers of hot functions at runtime based on cpuid is just going to prevent inlining, which is bad. Personally I'd either target SSSE3 or AVX2. We still support 32-bit x86 for the time being, and we also support non-x86 based architectures such as ARM or AArch64.
I won't merge anything that isn't properly profiled and benchmarked though to ensure that it actually does make things faster.
Sounds fair. GCC and Clang both provide compile-time preprocessor definitions for the target architecture, though Windows only defines them up to AVX2, I think:
gcc -march=skylake-avx512 -dM -E - < /dev/null | egrep "SSE|AVX"
#define __AVX512F__ 1
#define __SSE4_1__ 1
#define __SSE4_2__ 1
#define __AVX512BW__ 1
#define __SSE2_MATH__ 1
#define __AVX__ 1
#define __AVX512VL__ 1
#define __AVX512CD__ 1
#define __AVX2__ 1
#define __SSE_MATH__ 1
#define __AVX512DQ__ 1
#define __SSE2__ 1
#define __SSSE3__ 1
#define __SSE__ 1
#define __SSE3__ 1
It will expose the code to some "preprocessor smell" the more it scales across functions and architectures, if you are fine with that (there are mitigations in C++, but in C the only metaprogramming at your disposal is the preprocessor anyway).
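For illustration, here is the kind of compile-time selection those macros enable. This is a hypothetical sketch, not code from the repo; sum16 is a made-up "hot function" standing in for any small routine you'd want inlined rather than dispatched through a function pointer at runtime:

```c
#include <stdint.h>

#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Hypothetical hot function: horizontally sum 16 bytes. The preprocessor
 * picks the widest implementation the target flags allow, so the compiler
 * can still inline it -- no runtime function-pointer dispatch needed. */
static uint32_t sum16(const uint8_t buf[16])
{
#if defined(__SSE2__)
    /* _mm_sad_epu8 against zero is a horizontal byte sum, producing two
     * partial sums in the low and high 64-bit halves of the register. */
    __m128i v = _mm_loadu_si128((const __m128i *)buf);
    __m128i s = _mm_sad_epu8(v, _mm_setzero_si128());
    return (uint32_t)(_mm_cvtsi128_si32(s)
                    + _mm_cvtsi128_si32(_mm_srli_si128(s, 8)));
#else
    /* Portable fallback for targets without SSE2 (32-bit x86, ARMv7, ...) */
    uint32_t total = 0;
    int i;
    for (i = 0; i < 16; ++i)
        total += buf[i];
    return total;
#endif
}
```

Both branches compute the same result, so callers don't care which one was compiled in.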
As a quick example and case study, here's how something like the is_stairs
function, which is called multiple times in critical code in iterate.c, would be vectorized.
Benchmarks show the vectorized version is about 2x faster than the serial method:
#if defined(__i386__) || defined(__x86_64__)
#include <immintrin.h>
#endif
int is_stairs(int block) {
    /*
     * Determines if a block is stairs of any material
     */
#ifdef __SSE2__
    const __m128i stair_blocks = _mm_set_epi8(
         53, /* oak wood stairs */
         67, /* cobblestone stairs */
        108, /* brick stairs */
        109, /* stone brick stairs */
        114, /* nether brick stairs */
        128, /* sandstone stairs */
        134, /* spruce wood stairs */
        135, /* birch wood stairs */
        136, /* jungle wood stairs */
        156, /* quartz stairs */
        163, /* acacia wood stairs */
        164, /* dark wood stairs */
        180, /* red sandstone stairs */
        203, /* purpur stairs */
        203, /* just repeating the last element */
        203  /* as "filler" */
    );
    const __m128i block_vec = _mm_set1_epi8(block);
    return !!_mm_movemask_epi8(_mm_cmpeq_epi8(stair_blocks, block_vec));
#else
    switch (block) {
    case 53:  /* oak wood stairs */
    case 67:  /* cobblestone stairs */
    case 108: /* brick stairs */
    case 109: /* stone brick stairs */
    case 114: /* nether brick stairs */
    case 128: /* sandstone stairs */
    case 134: /* spruce wood stairs */
    case 135: /* birch wood stairs */
    case 136: /* jungle wood stairs */
    case 156: /* quartz stairs */
    case 163: /* acacia wood stairs */
    case 164: /* dark wood stairs */
    case 180: /* red sandstone stairs */
    case 203: /* purpur stairs */
        return 1;
    }
    return 0;
#endif
}
Here's how it compiles across different platforms: pretty much all x64 compilers emit the SSE2 vectorized version, while the 32-bit and ARM64 compilers fall back to the switch-case method. A NEON-accelerated implementation would add another #elif to the include chain and within is_stairs. These vectorized checks would benefit a lot of the huge number == 4, number == 52, number == 67, number == ... chains found up and down a lot of the C code.
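For reference, the AArch64 #elif branch might look something like the sketch below. This is an illustrative standalone function (is_stairs_sketch is not repo code) with a scalar fallback so it builds anywhere; vmaxvq_u8 is an AArch64-only horizontal reduction, which is one of the things that makes AArch64 NEON nicer than ARMv7 NEON here:

```c
#include <stdint.h>

#if defined(__aarch64__)
#include <arm_neon.h>
#endif

/* Sketch of the NEON branch for is_stairs: the same 16-byte membership
 * test, collapsing the per-lane compare mask with a horizontal max. */
static int is_stairs_sketch(uint8_t block)
{
    static const uint8_t stair_ids[16] = {
         53,  67, 108, 109, 114, 128, 134, 135,
        136, 156, 163, 164, 180, 203,
        203, 203 /* repeat the last element as filler */
    };
#if defined(__aarch64__)
    uint8x16_t ids = vld1q_u8(stair_ids);
    uint8x16_t blk = vdupq_n_u8(block);
    /* vceqq_u8 yields 0xFF in each matching lane; any nonzero lane is a hit */
    return vmaxvq_u8(vceqq_u8(ids, blk)) != 0;
#else
    /* Scalar fallback so this sketch also builds on non-ARM targets */
    int i;
    for (i = 0; i < 16; ++i)
        if (stair_ids[i] == block)
            return 1;
    return 0;
#endif
}
```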
Is this a pattern you're fine with?
Can we somehow have this be a macro so we don't need to duplicate the stair IDs, or is this too much preprocessor magic?
I have an idea for how to approach this. The mass switch-case enum checks that determine whether something is within a certain set of blocks can instead be static uint8_t block-set arrays plus a block_in_set sort of function that checks whether the block is found within the set. That way all the vectorization and speedups can live in one place by keeping things functional.
#include <stdint.h>
#include <stddef.h>

#define count_of(arr) (sizeof(arr) / sizeof((arr)[0]))

static const uint8_t block_set_stairs[] = {
    53, 67, 108, 109, 114, 128, 134, 135, 136, 156, 163, 164, 180, 203
};

int block_in_set(uint8_t block, const uint8_t* block_set, size_t block_set_size)
{
    size_t i;
    for (i = 0; i < block_set_size; ++i) {
        if (block == block_set[i]) {
            return 1;
        }
    }
    return 0;
}

/* Somewhere in your code where you need to check if the block is a stair block */
if (block_in_set(stage->block, block_set_stairs, count_of(block_set_stairs))) {
    ...
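If every call site goes through such a function, the SSE2 fast path could then live in that one function alone. Here's a sketch under the assumption that unaligned 16-byte loads plus a scalar tail loop are acceptable; block_in_set_sse2 is an illustrative name, not existing code:

```c
#include <stdint.h>
#include <stddef.h>

#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Hypothetical SSE2-accelerated membership test: scan 16 ids per
 * iteration, then finish any remainder with a scalar tail loop. */
static int block_in_set_sse2(uint8_t block, const uint8_t *set, size_t n)
{
    size_t i = 0;
#if defined(__SSE2__)
    const __m128i needle = _mm_set1_epi8((char)block);
    for (; i + 16 <= n; i += 16) {
        __m128i chunk = _mm_loadu_si128((const __m128i *)(set + i));
        /* Any matching byte sets a bit in the movemask result */
        if (_mm_movemask_epi8(_mm_cmpeq_epi8(chunk, needle)))
            return 1;
    }
#endif
    for (; i < n; ++i)
        if (set[i] == block)
            return 1;
    return 0;
}
```

Callers stay identical to the scalar version, so swapping the implementation later (NEON, AVX2) touches only this one function.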
The current disassembly of the compiled C code with its expansive if-statements creates incredibly vast forests of cmp and jmp instructions, due to guaranteed short-circuiting, which thrash the CPU's branch predictor and blow up the assembly code.
But with a functional approach like this it would all reuse the same assembly and allow the compiler to make some cleaner optimization choices.
I've done some research on what is widely supported these days, to determine which SIMD level we could require as the minimum in our prebuilt binaries.
For future reference, AVX is 2011 for both AMD and Intel, and AVX2 is 2013 for Intel and 2015 for AMD.
I think turning on SSE2 in all Windows builds is the obvious choice, and if code using SSE4.2 intrinsics is written that does make things faster, I can also see us turning on SSE4.2 for all x86_64 builds. The only non-SIMD builds we'd then be left with would be 32-bit Linux, and I need to check web log stats to see how many people actually use the 32-bit Linux repos.
According to a multimedia developer I just spoke to, SSE2 is a good baseline to start from.
Referencing this PR to show one of the changes that resulted from this: https://github.com/overviewer/Minecraft-Overviewer/pull/1541
Any insight on the ARM demographic of hardware running overviewer? What are some safe SIMD assumptions since all implementations of AArch64(ARMv8) are mandated to have NEON as a standard(similar to how x64 requires SSE2). ARM in particular could use all the speedups it can get.
Most people running on ARM are probably either using an rpi or similar trash SBC (so arm7hf and AArch64) or possibly something like a scaleway instance. AArch64 NEON is significantly different from arm7 NEON from what I've heard, but even most ARMv7 boards we're interested in would have NEON.
Can SIMD support also optimize a Spigot server?
While most compilers will default to compiling up to SSE2, I've been able to get optimized versions of the native code with export CFLAGS="-march=native"
on Linux, which enables even further instruction set features. So I've added additional levels of SIMD to block_class_is_subset
to complete the scope of this particular issue of defining a SIMD precedent.
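I don't know the exact shape of block_class_is_subset offhand, but assuming 16-bit block ids, an additional AVX2 level layered over an SSE2/scalar fallback might be sketched like this (class_contains is a hypothetical stand-in, not the real function):

```c
#include <stdint.h>
#include <stddef.h>

#if defined(__AVX2__)
#include <immintrin.h>
#endif

/* Hypothetical AVX2 level: compare 16 16-bit block ids per iteration.
 * Assumes 16-bit ids; the real block_class_is_subset may differ. */
static int class_contains(uint16_t block, const uint16_t *cls, size_t n)
{
    size_t i = 0;
#if defined(__AVX2__)
    const __m256i needle = _mm256_set1_epi16((short)block);
    for (; i + 16 <= n; i += 16) {
        __m256i chunk = _mm256_loadu_si256((const __m256i *)(cls + i));
        /* Any matching 16-bit lane produces set bits in the movemask */
        if (_mm256_movemask_epi8(_mm256_cmpeq_epi16(chunk, needle)))
            return 1;
    }
#endif
    /* Scalar tail, also the full path on non-AVX2 builds */
    for (; i < n; ++i)
        if (cls[i] == block)
            return 1;
    return 0;
}
```

With -march=native the preprocessor picks the AVX2 loop up automatically; prebuilt binaries would get whatever baseline is decided here.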
Can SIMD support also optimize a Spigot server?
There is no shared code between these two projects. This is to speed up map rendering operations.
Any insight on the ARM demographic of hardware running overviewer?
My experience was that running Minecraft Server >1.16 on anything smaller than a Raspberry Pi 4 w/4GB RAM was too painful. Overviewer on four cores also uses over 2GB RAM, which limits practical usage to the ARMv8 RPI 4.
Just my two cents, but don't think it's worth targeting any 32-bit stuff anymore, including ARM. I know RaspiOS still has a 32-bit version available, but the class of ARMs people would use for a Minecraft server have supported 64-bit operating systems that perform better than the 32-bit ones. I think 64-bit ARMv8 would be a reasonable floor for Overviewer if a decision about how to support SIMD had to be made.
I already added aarch64 simd for one part, and yes I'm not going to bother with 32-bit stuff that's only used by rpi people for dumb proprietary footguns
A lot of the core critical C code relating to image and block manipulation is implemented in a very serial manner, though processors these days provide vectorized instruction sets where multiple fragments of data can be processed in parallel rather than one at a time. The x64 instruction set architecture includes up to SSE2, and I feel that a lot of the C code can take advantage of these instruction sets through SIMD intrinsics, exposed through headers like immintrin.h and such, to possibly get huge improvements in speed for certain operations like image blitting, block/tile iteration, and bit/byte manipulation.
I might look to contribute once I get a good feel of the code base, but I am wondering how you would feel about having PRs that use these intrinsics to vectorize and speed up critical sections of code, as this would make some assumptions about the minimum requirements for the end-user. In particular it would require that the user be on a 64-bit computer, which is kind of reasonable in 2019, or it can be something determined at compile time so that these intrinsics are not used unless the user opts in to them (which would then also allow the user to opt in to even more vectorization extensions such as AVX2 or AVX-512).