Wunkolo opened this issue 5 years ago
Determine it at compile time, juggling around function pointers of hot functions at runtime based on cpuid is just going to prevent inlining, which is bad. Personally I'd either target SSSE3 or AVX2. We still support 32-bit x86 for the time being, and we also support non-x86 based architectures such as ARM or AArch64.
I won't merge anything that isn't properly profiled and benchmarked though to ensure that it actually does make things faster.
Sounds fair. GCC and Clang both provide compile-time preprocessor definitions for the target architecture, though Windows only defines them up to AVX2, I think:
gcc -march=skylake-avx512 -dM -E - < /dev/null | egrep "SSE|AVX"
#define __AVX512F__ 1
#define __SSE4_1__ 1
#define __SSE4_2__ 1
#define __AVX512BW__ 1
#define __SSE2_MATH__ 1
#define __AVX__ 1
#define __AVX512VL__ 1
#define __AVX512CD__ 1
#define __AVX2__ 1
#define __SSE_MATH__ 1
#define __AVX512DQ__ 1
#define __SSE2__ 1
#define __SSSE3__ 1
#define __SSE__ 1
#define __SSE3__ 1
It will expose the code to some "preprocessor smell" the more it scales across functions and architectures, if you are fine with that (there are mitigations in C++, but in C the only metaprogramming at your disposal is the preprocessor anyway).
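For illustration, here is the kind of compile-time selection those macros enable. This is a hypothetical sketch, not code from the repo; sum16 is a made-up "hot function" standing in for any small routine you'd want inlined rather than dispatched through a function pointer at runtime:

```c
#include <stdint.h>

#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Hypothetical hot function: horizontally sum 16 bytes. The preprocessor
 * picks the widest implementation the target flags allow, so the compiler
 * can still inline it -- no runtime function-pointer dispatch needed. */
static uint32_t sum16(const uint8_t buf[16])
{
#if defined(__SSE2__)
    /* _mm_sad_epu8 against zero is a horizontal byte sum, producing two
     * partial sums in the low and high 64-bit halves of the register. */
    __m128i v = _mm_loadu_si128((const __m128i *)buf);
    __m128i s = _mm_sad_epu8(v, _mm_setzero_si128());
    return (uint32_t)(_mm_cvtsi128_si32(s)
                    + _mm_cvtsi128_si32(_mm_srli_si128(s, 8)));
#else
    /* Portable fallback for targets without SSE2 (32-bit x86, ARMv7, ...) */
    uint32_t total = 0;
    int i;
    for (i = 0; i < 16; ++i)
        total += buf[i];
    return total;
#endif
}
```

Both branches compute the same result, so callers don't care which one was compiled in.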
As a quick example and case study, here's how something like the is_stairs
function, which is called multiple times in critical code in iterate.c, would be vectorized.
Benchmarks show the vectorized version is about 2x faster than the serial method:
#if defined(__i386__) || defined(__x86_64__)
#include <immintrin.h>
#endif
int is_stairs(int block) {
    /*
     * Determines if a block is stairs of any material
     */
#ifdef __SSE2__
    const __m128i stair_blocks = _mm_set_epi8(
         53, /* oak wood stairs */
         67, /* cobblestone stairs */
        108, /* brick stairs */
        109, /* stone brick stairs */
        114, /* nether brick stairs */
        128, /* sandstone stairs */
        134, /* spruce wood stairs */
        135, /* birch wood stairs */
        136, /* jungle wood stairs */
        156, /* quartz stairs */
        163, /* acacia wood stairs */
        164, /* dark wood stairs */
        180, /* red sandstone stairs */
        203, /* purpur stairs */
        203, /* just repeating the last element */
        203  /* as "filler" */
    );
    const __m128i block_vec = _mm_set1_epi8(block);
    return !!_mm_movemask_epi8(_mm_cmpeq_epi8(stair_blocks, block_vec));
#else
    switch (block) {
    case 53:  /* oak wood stairs */
    case 67:  /* cobblestone stairs */
    case 108: /* brick stairs */
    case 109: /* stone brick stairs */
    case 114: /* nether brick stairs */
    case 128: /* sandstone stairs */
    case 134: /* spruce wood stairs */
    case 135: /* birch wood stairs */
    case 136: /* jungle wood stairs */
    case 156: /* quartz stairs */
    case 163: /* acacia wood stairs */
    case 164: /* dark wood stairs */
    case 180: /* red sandstone stairs */
    case 203: /* purpur stairs */
        return 1;
    }
    return 0;
#endif
}
Here's how it compiles across different platforms: pretty much all x64 compilers emit the SSE2 vectorized version, while the 32-bit and ARM64 compilers fall back to the switch-case method. A NEON-accelerated implementation would add another #elif to the include chain and within is_stairs. These vectorized checks would benefit a lot of the huge number == 4, number == 52, number == 67, number == ... chains found up and down a lot of the C code.
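For reference, the AArch64 #elif branch might look something like the sketch below. This is an illustrative standalone function (is_stairs_sketch is not repo code) with a scalar fallback so it builds anywhere; vmaxvq_u8 is an AArch64-only horizontal reduction, which is one of the things that makes AArch64 NEON nicer than ARMv7 NEON here:

```c
#include <stdint.h>

#if defined(__aarch64__)
#include <arm_neon.h>
#endif

/* Sketch of the NEON branch for is_stairs: the same 16-byte membership
 * test, collapsing the per-lane compare mask with a horizontal max. */
static int is_stairs_sketch(uint8_t block)
{
    static const uint8_t stair_ids[16] = {
         53,  67, 108, 109, 114, 128, 134, 135,
        136, 156, 163, 164, 180, 203,
        203, 203 /* repeat the last element as filler */
    };
#if defined(__aarch64__)
    uint8x16_t ids = vld1q_u8(stair_ids);
    uint8x16_t blk = vdupq_n_u8(block);
    /* vceqq_u8 yields 0xFF in each matching lane; any nonzero lane is a hit */
    return vmaxvq_u8(vceqq_u8(ids, blk)) != 0;
#else
    /* Scalar fallback so this sketch also builds on non-ARM targets */
    int i;
    for (i = 0; i < 16; ++i)
        if (stair_ids[i] == block)
            return 1;
    return 0;
#endif
}
```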
Is this a pattern you're fine with?
Can we somehow have this be a macro so we don't need to duplicate the stair IDs, or is this too much preprocessor magic?
I have an idea for how to approach this. The mass switch-case enum checks that determine whether something is within a certain set of blocks can instead be static uint8_t block-set arrays plus a block_in_set sort of function that checks whether the block is found within the set. That way all the vectorization and speedups can live in one place by keeping things functional.
#include <stdint.h>
#include <stddef.h>

#define count_of(arr) (sizeof(arr) / sizeof((arr)[0]))

static const uint8_t block_set_stairs[] = {
    53, 67, 108, 109, 114, 128, 134, 135, 136, 156, 163, 164, 180, 203
};

int block_in_set(uint8_t block, const uint8_t* block_set, size_t block_set_size)
{
    size_t i;
    for (i = 0; i < block_set_size; ++i) {
        if (block == block_set[i]) {
            return 1;
        }
    }
    return 0;
}

/* Somewhere in your code where you need to check if the block is a stair block */
if (block_in_set(stage->block, block_set_stairs, count_of(block_set_stairs))) {
    ...
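If every call site goes through such a function, the SSE2 fast path could then live in that one function alone. Here's a sketch under the assumption that unaligned 16-byte loads plus a scalar tail loop are acceptable; block_in_set_sse2 is an illustrative name, not existing code:

```c
#include <stdint.h>
#include <stddef.h>

#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Hypothetical SSE2-accelerated membership test: scan 16 ids per
 * iteration, then finish any remainder with a scalar tail loop. */
static int block_in_set_sse2(uint8_t block, const uint8_t *set, size_t n)
{
    size_t i = 0;
#if defined(__SSE2__)
    const __m128i needle = _mm_set1_epi8((char)block);
    for (; i + 16 <= n; i += 16) {
        __m128i chunk = _mm_loadu_si128((const __m128i *)(set + i));
        /* Any matching byte sets a bit in the movemask result */
        if (_mm_movemask_epi8(_mm_cmpeq_epi8(chunk, needle)))
            return 1;
    }
#endif
    for (; i < n; ++i)
        if (set[i] == block)
            return 1;
    return 0;
}
```

Callers stay identical to the scalar version, so swapping the implementation later (NEON, AVX2) touches only this one function.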
The current disassembly of the compiled C code with its expansive if-statements creates incredibly vast forests of cmp and jmp instructions, due to guaranteed short-circuiting, which thrash the CPU's branch predictor and blow up the assembly code.
But with a functional approach like this it would all reuse the same assembly and allow the compiler to make some cleaner optimization choices.
I've done some research on what is widely supported these days, to determine which SIMD level we could require as the minimum in our prebuilt binaries.
For future reference, AVX is 2011 for both AMD and Intel, and AVX2 is 2013 for Intel and 2015 for AMD.
I think turning on SSE2 in all Windows builds is the obvious choice, and if code using SSE4.2 intrinsics is written that does make things faster, I can also see us turning on SSE4.2 for all x86_64 builds. The only non-SIMD builds we'd then be left with would be 32-bit Linux, and I need to check web log stats to see how many people actually use the 32-bit Linux repos.
According to a multimedia developer I just spoke to, SSE2 is a good baseline to start from.
Referencing this PR to show one of the changes that resulted from this: https://github.com/overviewer/Minecraft-Overviewer/pull/1541
Any insight on the ARM demographic of hardware running overviewer? What are some safe SIMD assumptions since all implementations of AArch64(ARMv8) are mandated to have NEON as a standard(similar to how x64 requires SSE2). ARM in particular could use all the speedups it can get.
Most people running on ARM are probably either using an rpi or similar trash SBC (so arm7hf and AArch64) or possibly something like a scaleway instance. AArch64 NEON is significantly different from arm7 NEON from what I've heard, but even most ARMv7 boards we're interested in would have NEON.
Can SIMD support also optimize a Spigot server?
While most compilers will default to compiling up to SSE2, I've been able to get optimized versions of the native code with export CFLAGS="-march=native"
on Linux, which enables even further instruction set features. So I've added additional levels of SIMD to block_class_is_subset
to complete the scope of this particular issue of defining a SIMD precedent.
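I don't know the exact shape of block_class_is_subset offhand, but assuming 16-bit block ids, an additional AVX2 level layered over an SSE2/scalar fallback might be sketched like this (class_contains is a hypothetical stand-in, not the real function):

```c
#include <stdint.h>
#include <stddef.h>

#if defined(__AVX2__)
#include <immintrin.h>
#endif

/* Hypothetical AVX2 level: compare 16 16-bit block ids per iteration.
 * Assumes 16-bit ids; the real block_class_is_subset may differ. */
static int class_contains(uint16_t block, const uint16_t *cls, size_t n)
{
    size_t i = 0;
#if defined(__AVX2__)
    const __m256i needle = _mm256_set1_epi16((short)block);
    for (; i + 16 <= n; i += 16) {
        __m256i chunk = _mm256_loadu_si256((const __m256i *)(cls + i));
        /* Any matching 16-bit lane produces set bits in the movemask */
        if (_mm256_movemask_epi8(_mm256_cmpeq_epi16(chunk, needle)))
            return 1;
    }
#endif
    /* Scalar tail, also the full path on non-AVX2 builds */
    for (; i < n; ++i)
        if (cls[i] == block)
            return 1;
    return 0;
}
```

With -march=native the preprocessor picks the AVX2 loop up automatically; prebuilt binaries would get whatever baseline is decided here.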
Can SIMD support also optimize a Spigot server?
There is no shared code between these two projects. This is to speed up map rendering operations.
Any insight on the ARM demographic of hardware running overviewer?
My experience was that running Minecraft Server >1.16 on anything smaller than a Raspberry Pi 4 w/4GB RAM was too painful. Overviewer on four cores also uses over 2GB RAM, which limits practical usage to the ARMv8 RPI 4.
Just my two cents, but don't think it's worth targeting any 32-bit stuff anymore, including ARM. I know RaspiOS still has a 32-bit version available, but the class of ARMs people would use for a Minecraft server have supported 64-bit operating systems that perform better than the 32-bit ones. I think 64-bit ARMv8 would be a reasonable floor for Overviewer if a decision about how to support SIMD had to be made.
I already added aarch64 simd for one part, and yes I'm not going to bother with 32-bit stuff that's only used by rpi people for dumb proprietary footguns
A lot of the core critical C code relating to image and block manipulation is implemented in a very serial manner, though processors these days provide vectorized instruction sets where multiple fragments of data can be processed in parallel rather than one at a time. The x64 instruction set architecture includes up to SSE2, and I feel that a lot of the C code can take advantage of these instruction sets through SIMD intrinsics, exposed through headers like immintrin.h and such, to possibly get huge improvements in speed for certain operations like image blitting, block/tile iteration, and bit/byte manipulation.
I might look to contribute once I get a good feel of the code base, but I am wondering how you would feel about having PRs that use these intrinsics to vectorize and speed up critical sections of code, as this would make some assumptions about the minimum requirements for the end-user. In particular it would require that the user be on a 64-bit computer, which is kind of reasonable in 2019, or it can be something determined at compile time so that these intrinsics are not used unless the user opts in to them (which would then also allow the user to opt in to even more vectorization extensions such as AVX2 or AVX-512).