**Issue №2468 opened by robertpfeiffer at 2021-02-05 17:49:47**

[ ] make tests CPU-aware: test all available impls in test suite (CPU arch permitting)
[ ] ensure ALL impls are built in bdist_wheel (this includes code for better than current CPU)
[ ] Inspect generated assembly on MSVC, GCC, Clang, ICC, Apple Clang
[ ] decide/document which compilers are supported
[ ] document CPU dispatch guidelines
[ ] nice to have: optionally disable dispatch for arch-specific builds, hard-code best path (for Gentoo users)
[ ] nice to have: optionally disable all dispatch in setup.py
[ ] either: annotate arch-specific functions (compiler-specific)
[ ] or: move arch-specific functions into different files (add arch to Setup.in)
[ ] maybe: annotate certain functions with identical code but different target arch, trust C compiler to vectorise
[ ] x86
[ ] x86_64
[ ] ARMv6
[ ] ARM64
[ ] ensure backward compatibility with most C99 compilers

Comments

**robertpfeiffer* commented at 2021-02-19 11:42:31*

@MyreMylar @illume, I'd like to hear some feedback on this before I continue.

I wrote up my initial thoughts on CPU dispatch here: https://blubberquark.tumblr.com/post/642089979599798272/cpu-dispatch-is-a-minefield

I also think MMX-only code does not hurt, but it probably obsolete. There are no new processors with MMX and no SSE any more, and there haven't been for 10 years.

I assume we won't use GCC-specific CPU dispatch at load time: https://www.agner.org/optimize/blog/read.php?i=167 Clang has something similar: https://releases.llvm.org/10.0.0/tools/clang/docs/AttributeReference.html# cpu-dispatch

The most portable thing is to keep using SDL2 run-time CPU queries. They are cached, and should be fast enough.

The problem is just getting GCC and MSVC to accept code that uses the intrinsics in a way that is guaranteed to emit the correct code. Here is an example somebody accidentally using intrinsics without effect: https://stackoverflow.com/questions/6427473/how-to-generate-a-sse4-2-popcnt-machine-instruction

Using ifdefs to ensure it still runs on tcc, scc, and zig cc is easy enough.

**MyreMylar* commented at 2021-02-19 14:54:30*

I wrote up my initial thoughts on CPU dispatch here: https://blubberquark.tumblr.com/post/642089979599798272/cpu-dispatch-is-a-minefield

I think I agree with all of that. It is a complicated mess for sure :) Ideally we would have some kind of regularly produced performance testing report across a wider range of platforms for all our 'CPU dispatch' style functions. While it doesn't need performance testing every time someone adds a PR it'd be nice if it could exist more frequently than 'somebody tries it on hardware they personally have and reports performance'. That way we could more easily spot regressions and problematic platforms when writing this sort of code.

Is it certain that a compiler running on hardware that doesn't support SSE2 will still generate SSE2 code if asked? I mean that makes sense to me but who knows. I think currently lots of our SSE2/Neon code is gated with __SSE2__ and NEON # ifdefs that could probably be x86_x64/Arm ones instead. Something to bear in mind is that because of the conversion library the 'intrinsics' we are using in alphablit.c are actually redirected via the conversion library so the see2 looking intrinsic functions are actually compiled on both Arm and Intel/AMD from there. I assume there is an architecture split in the sse2neon.h header file. Likely we should still check for these two architectures anyway in case PowerPC makes a comeback

I also think MMX-only code does not hurt, but it probably obsolete. There are no new processors with MMX and no SSE any more, and there haven't been for 10 years.

Agreed. I think I ported something from SDL that was mmx when I first started tinkering with this before settling on SSE2 as being a more reasonable standard to use given it's widespread adoption. Probably we could remove that old mmx port if it's not doing anything useful anymore (can't recall off the top of my head).

I assume we won't use GCC-specific CPU dispatch at load time: https://www.agner.org/optimize/blog/read.php?i=167 Clang has something similar: https://releases.llvm.org/10.0.0/tools/clang/docs/AttributeReference.html# cpu-dispatch

That sounds too annoying to me. It's enough of a pain wrestling with intrinsic functions that only exist on 64bit.

The most portable thing is to keep using SDL2 run-time CPU queries. They are cached, and should be fast enough.

Yep. We are using these now both on pygame's side and of course via SDL using them in the same way. No need to re-invent the wheel here.

The problem is just getting GCC and MSVC to accept code that uses the intrinsics in a way that is guaranteed to emit the correct code. Here is an example somebody accidentally using intrinsics without effect: https://stackoverflow.com/questions/6427473/how-to-generate-a-sse4-2-popcnt-machine-instruction

Always a struggle, though we could also optimise to the compiler too much and then the compiler changes what it's doing and the work goes in the bin. I still think intrinsic functions are preferable to assembly, just because they are slightly more readable and maintainable. Plus, we also have that auto conversion from SSE2 to NEON via a library which saves a lot of thinking about writing Neon versions as well.

Anyway to summarise again my knowledge of this:

Right now I think the only relevant CPU dispatch functions on 'our side' are in alphablit.c, they are 90% written with SSE2 style intrinsic functions and likely the one mmx one can be retired if needed.
These SSE2 style intrinsic functions are also used for Arm neon support. To do this we use a header called sse2neon that turns the SSE2 intrinsic functions into equivalent Arm Neon versions. So these functions code in alphablit.c need to be available to both x86/x64 hardware and Arm hardware.
Right now we are using the SDL runtime detection hardware functions to decide which function to 'dispatch' to. This seems fine and is the way SDL does it as well. Usually it is case of 'does this hardware support SSE2 or NEON?' if yes run the SSE2/ported SSE2 version of the function - if no run a fall-back, non-SIMD all platform version.
In a few cases the intrinsic functions used are slightly different in a 64bit compile, this is just because the 64 bit load is a lot faster to load 128bits of data into the registers than the 32 bit version for fairly obvious reasons. generally loading and unloading data into the SIMD registers is the slowest part so this seemed useful enough to bother with. Since we have 64 bit wheels anyway, why not make use of them.

I think the goal of splitting all the SIMD functions into their own .c file was basically to make a single .o file/compile unit or whatever that is called. I don't believe we ever finally determined or tested if this would make a difference over the current situation where the functions are all jumbled in the single alphablit.c file. If code exists in a binary that a computer cannot run, but it also never runs it due to runtime checks - does that still cause problems? If it does, does putting the code in an isolated .c file mitigate them? Those seemed to be the fundamental questions that initially stopped us from compiling with neon enabled by default in the compiled wheels for ARM (which was basically 32bit raspberry pi at the time).

I think we should have all this dispatch stuff on by default for sure. A generic 'disable all dispatch' flag could replace the current one that enables dispatch just for Arm Neon. That was added when we decided to release without SIMD intrinsic functions on Pi. Piwheels is now able to support separate arm6 and arm7 wheels so there isn't any reason to disable it by default anymore so long as we let piwheels know to build the two separate wheels (if necessary) once we have finished tinkering with this.

I think there are several more blitting blend modes beyond just the default alpha blit (ADD, SUB, MUL) that would benefit a lot from SIMD - which I'm willing to tackle at some point. Assuming that we get it into a state everyone is happy with. Though there is also a point I've probably already reached where it is likely more worthwhile to focus on the new hardware accelerated SDL Texture stuff for performance gains :)

Any specific questions I might be able to help with, always feel free to ask.

**robertpfeiffer* commented at 2021-02-21 16:02:51*

Is it certain that a compiler running on hardware that doesn't support SSE2 will still generate SSE2 code if asked?

Well, no compiler is guaranteed to actually generate SSE2 code. But as far as I know, all x86_64 processors have MMX and SSE2, and the big compilers (MSVC, ICC, GCC, clang) all generate SSE2 correctly.

I assume I am a pedant/you meant to ask "Can the feature set of my current CPU somehow leak into the compiler backend during code generation? Do I need to actually have to run the compiler on Ryzen 7 to generate code for the Ryzen 7?" Barring some edge cases with gcc's libgcc that require three compilation passes when you bootstrap GCC or build a cross-compiler toolchain (reflections on trusting trust and all that), the answer is no. As long as you have a recent-ish version of GCC or clang, you can emit code for a zen3 ryzen 7 or a tiger lake core i9 even on an old atom CPU and all major compilers respect the CPU flags.

That sounds too annoying to me. It's enough of a pain wrestling with intrinsic functions that only exist on 64bit.

100% agreed, but I thought I'd link to these resources anyway. I am pretty sure these features won't come to MSVC because of obscure reasons (or they would be implemented already), and also not to ICC for rather obvious reasons.

I think the goal of splitting all the SIMD functions into their own .c file was basically to make a single .o file/compile unit or whatever that is called.

Having all SSE SIMD functions in one .c file (and all NEON/AVX/riscv bitmanip SIMD functions their own files as well) allows us to compile that file with the appropriate cflags, which may differ from the cflags for the rest of the code. The cflags are probably much easier to get right in a portable fashion, especially compared to pragmas and annotations, but we can probably use # ifdefs to make every compiler only use its own compatible compiler-specific annotations and ignore the others.

If code exists in a binary that a computer cannot run, but it also never runs it due to runtime checks - does that still cause problems?

As far as I can tell, the CPU would have to solve the halting problem to complain about this ;-)

I think currently lots of our SSE2/Neon code is gated with SSE2 and NEON # ifdefs that could probably be x86_x64/Arm ones instead.

This is probably the key takeaway from my blog post, which was more or less prompted by my investigation into # 2455. If we have a run-time check for NEON, checking for -mneon at compile time is redundant, but then we better actually generate NEON instructions in the NEON path!

The thing with the lesser-known compilers like tcc, scc, and zig cc is just making sure that the dispatch code is skipped and we default to the generic implementation instead of making the compiler quit with an error when it encounters unknown annotations or intrinsics.

**robertpfeiffer* commented at 2021-03-05 06:46:22*

We should probably look at the way dispatch is implemented inside numpy and SDL itself.

**MyreMylar* commented at 2022-04-07 10:26:22*

Some work on this has been done with the RGBA_MULT PR here:

https://github.com/pygame/pygame/pull/2988

The part about dividing some code up into a separate .c file and setting a specific compiler flags for that one file is used for the avx2 code.

Assuming this approach works out satisfactorily I expect a lot of the existing SSE/NEON/ARM code could be refactored to more closely model after this.

I think there are also some performance wins awaiting in all that somewhere (AVX2 version of basic alpha blitters for windows and other x86/x64 & better SSE & better ports for arm platforms).

pygame-community / pygame-ce

Refactor/Unify CPU dispatch tasks (2468) #1266

**Issue №2468 opened by robertpfeiffer at 2021-02-05 17:49:47**

Comments

**robertpfeiffer* commented at 2021-02-19 11:42:31*

**MyreMylar* commented at 2021-02-19 14:54:30*

**robertpfeiffer* commented at 2021-02-21 16:02:51*

**robertpfeiffer* commented at 2021-03-05 06:46:22*

**MyreMylar* commented at 2022-04-07 10:26:22*

pygame-community / pygame-ce

Refactor/Unify CPU dispatch tasks (2468) #1266

Issue №2468 opened by robertpfeiffer at 2021-02-05 17:49:47

Comments

*robertpfeiffer commented at 2021-02-19 11:42:31*

*MyreMylar commented at 2021-02-19 14:54:30*

*robertpfeiffer commented at 2021-02-21 16:02:51*

*robertpfeiffer commented at 2021-03-05 06:46:22*

*MyreMylar commented at 2022-04-07 10:26:22*

**Issue №2468 opened by robertpfeiffer at 2021-02-05 17:49:47**

**robertpfeiffer* commented at 2021-02-19 11:42:31*

**MyreMylar* commented at 2021-02-19 14:54:30*

**robertpfeiffer* commented at 2021-02-21 16:02:51*

**robertpfeiffer* commented at 2021-03-05 06:46:22*

**MyreMylar* commented at 2022-04-07 10:26:22*