picnixz opened this issue 1 month ago
My general opinion about managing SIMD logic on the CPython side: https://github.com/python/cpython/issues/124951#issuecomment-2395455022
And if we begin to depend on SIMD detection, do you have any concerns that an unexpected illegal-instruction error could occur on an unsupported machine because of differences between the build machine and the execution machine?
I do have concerns, and that's why I'd like to hear from people who 1) know about weird architectures and 2) deal with real-life scenarios.
What I have in mind: -mavx may not be sufficient (e.g., we also need to handle XSAVE and how it handles the YMM registers). So, yes, I definitely have concerns about the differences. Using SIMD instructions could probably make local builds, or builds managed by distributions themselves, faster, though we should be really careful. This is also the reason why I want to keep runtime detection: to avoid such issues.
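To make the XSAVE point concrete, here is a minimal sketch of a safe AVX runtime check on x86-64 (GCC/Clang only; the helper names are mine, not an existing CPython API): CPUID must report both AVX and OSXSAVE, and XGETBV must confirm that the OS actually saves and restores the XMM/YMM state.

```c
/* Minimal sketch (GCC/Clang, x86-64): AVX is only safe to use when
 * (1) CPUID reports AVX, (2) the OS has enabled XSAVE (OSXSAVE), and
 * (3) XGETBV confirms the OS saves/restores XMM and YMM state.
 * Helper names are illustrative, not an existing CPython API. */
#include <cpuid.h>

static unsigned long long
read_xcr0(void)
{
    unsigned int lo, hi;
    /* XGETBV with ECX=0 reads XCR0; only valid once OSXSAVE is set. */
    __asm__ volatile ("xgetbv" : "=a"(lo), "=d"(hi) : "c"(0));
    return ((unsigned long long)hi << 32) | lo;
}

static int
can_use_avx(void)
{
    unsigned int eax, ebx, ecx, edx;
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx)) {
        return 0;
    }
    if (!(ecx & bit_AVX) || !(ecx & bit_OSXSAVE)) {
        return 0;  /* no AVX, or the OS did not enable XSAVE */
    }
    /* XCR0 bit 1 (SSE/XMM state) and bit 2 (AVX/YMM state) must be set. */
    return (read_xcr0() & 0x6) == 0x6;
}
```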
The idea was to open a wider discussion on SIMD support itself. If you want, we can move to Discourse, though I'm not sure whether it's better to keep it internal for now (the PR is just a PoC and it probably won't cover those cases we're worried about).
I don't think we should add SIMD to every possible part of the library, only to those that are critical enough IMO, and those should be carefully worked out. However, in order to investigate them (and test them in the CI), I think having an internal detection framework would at least be the first step (or maybe I'm wrong here?).
I've hardened the detection of AVX instructions. I've also learned that macOS may not like AVX-512 at all (or at least some register states won't be restored correctly upon context switching). So there are real-life issues that we should address. What I'll maybe do first is a PoC for str.translate to see how AVX could be used and how it could improve Python; then I'll come back (as Gregory said on the other issue, we are targeting relatively simple algorithms).
Hello, I'm one of the maintainers of pygame-ce, a Python C extension library that uses SIMD extensively to speed up pixel processing operations. We've had various bits of SIMD for a long time and use runtime checks to manage it. I'd like to share some information about our approach, in the hope it is helpful.
We SIMD-accelerate at the SSE2 and AVX2 levels. SSE2 is part of the x86_64 baseline, but we've also had no problems with it on our 32-bit builds. AVX2 is where isolation and runtime checking become much more important.
Each SIMD level of a module has its own file and is compiled into its own object. See https://github.com/pygame-community/pygame-ce/blob/6e0e0c67c799c7cc1fa9c96a71598a7751ae2fba/src_c/simd_transform_avx2.c for an example. Our build config for this looks like so: https://github.com/pygame-community/pygame-ce/blob/6e0e0c67c799c7cc1fa9c96a71598a7751ae2fba/src_c/meson.build#L215-L254. In this example, our transform module is not compiled with any special flags, but it is linked with objects that expose functions that can be called to get SIMD acceleration. An example of how the dispatch looks: https://github.com/pygame-community/pygame-ce/blob/6e0e0c67c799c7cc1fa9c96a71598a7751ae2fba/src_c/transform.c#L2158-L2181.
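The dispatch itself boils down to a branch on a runtime check; roughly like this (a simplified sketch with hypothetical names, not pygame-ce's actual code):

```c
/* Simplified sketch of the dispatch pattern described above; the names
 * are hypothetical, not pygame-ce's actual API. The AVX2 variant lives
 * in a separate translation unit built with -mavx2, while this file is
 * built without special flags and only decides which variant to call. */
void grayscale_scalar(unsigned char *pixels, int n);
void grayscale_avx2(unsigned char *pixels, int n);   /* in *_avx2.c */
int  has_avx2(void);  /* runtime check, e.g. CPUID-based */

void
grayscale(unsigned char *pixels, int n)
{
    if (has_avx2()) {
        grayscale_avx2(pixels, n);
    }
    else {
        grayscale_scalar(pixels, n);
    }
}
```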
The SIMD compilation itself is very conservative: it will only compile the backend if the computer doing the build supports that backend, using compile-time macros to check that. I'm not sure if this is actually necessary.
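For reference, one common way such a compile-time guard looks (again a hypothetical sketch; GCC and Clang define __AVX2__ when the compiler itself targets AVX2, e.g. via -mavx2):

```c
/* Hypothetical sketch of a compile-time guard: the AVX2 body is only
 * compiled when the compiler targets AVX2; otherwise a stub is emitted
 * so the symbol always exists for the dispatcher to link against. */
#if defined(__AVX2__)
#include <immintrin.h>

void
grayscale_avx2(unsigned char *pixels, int n)
{
    /* ... AVX2 intrinsics would go here ... */
    (void)pixels; (void)n;
}
#else
void
grayscale_avx2(unsigned char *pixels, int n)
{
    /* Never reached as long as the runtime dispatch checks AVX2 first. */
    (void)pixels; (void)n;
}
#endif
```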
About our SIMD code itself: we use intrinsics rather than hardcoded assembly or frameworks like https://github.com/google/highway. https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html is a great reference for this. Intrinsics are better for us than hardcoded assembly because they are more portable, both between compilers and even between architectures. For example, we compile all of our "SSE2" code to NEON for ARM support using https://github.com/DLTcollab/sse2neon. Emscripten also allows compile-time translation of SIMD intrinsics to WebAssembly SIMD (https://emscripten.org/docs/porting/simd.html), although we do not take advantage of this currently.
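A tiny illustration of that intrinsics style (a sketch, not actual pygame-ce code): saturating brightness addition, 16 pixels per iteration with SSE2. The same source compiles for ARM NEON by swapping the include for sse2neon.h.

```c
/* Sketch of a typical SSE2 intrinsics kernel: add a constant to each
 * byte with unsigned saturation (clamped at 255), 16 bytes at a time. */
#include <emmintrin.h>  /* SSE2; replace with "sse2neon.h" on ARM */

void
add_brightness(unsigned char *px, int n, unsigned char amount)
{
    const __m128i v_amount = _mm_set1_epi8((char)amount);
    int i = 0;
    for (; i + 16 <= n; i += 16) {
        __m128i v = _mm_loadu_si128((const __m128i *)(px + i));
        v = _mm_adds_epu8(v, v_amount);  /* saturating unsigned add */
        _mm_storeu_si128((__m128i *)(px + i), v);
    }
    for (; i < n; i++) {  /* scalar tail for the last n % 16 bytes */
        int s = px[i] + amount;
        px[i] = (unsigned char)(s > 255 ? 255 : s);
    }
}
```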
For runtime detection, we rely on https://github.com/libsdl-org/SDL, which is very easy for us because our entire library is built on top of the functionality provided by SDL. If you'd like to check your PR against their implementation of runtime checks, the source seems to be here: https://github.com/libsdl-org/SDL/tree/main/src/cpuinfo
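For a sense of how simple the SDL route is in practice, SDL2's cpuinfo API boils down to calls like these (real SDL2 functions; the include path may differ per platform):

```c
/* SDL2's runtime CPU feature checks, from SDL_cpuinfo.h. */
#include <SDL2/SDL_cpuinfo.h>
#include <stdio.h>

int
main(void)
{
    printf("SSE2: %d\n", SDL_HasSSE2());
    printf("AVX:  %d\n", SDL_HasAVX());
    printf("AVX2: %d\n", SDL_HasAVX2());
    printf("NEON: %d\n", SDL_HasNEON());
    return 0;
}
```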
I think there could be value in exposing runtime SIMD support checking in the public C API for extension authors who aren't lucky enough to have an existing dependency to rely on for this. I've followed issues about Pillow and Pillow-SIMD where the authors essentially say that such changes are impossible to merge because they don't have the resources to figure out runtime SIMD checks. I don't think it would get a ton of usage, but it would be extremely valuable functionality for anyone who needs it.
In terms of CPython and SIMD I'm not sure how much potential there is, but there may be cool things that could be done. Could 4 or 8 or 16 PyObjects get their reference count changed at once? Could unboxed integers do arithmetic in parallel? Could the JIT decide to use more efficient templates because it knows AVX2 is around? Knowing the runtime SIMD level is an advantage for a JIT over an AOT compiler. But these are just my musings.
@Starbuck5 Thank you very much for all these insights!
I'd like to share some information about our approach, in the hope it is helpful.
It was definitely helpful.
Each SIMD level of a module has its own file and is compiled into its own object
Yup, that's what the blake2 authors did, so we'll probably do something similar. One concern is that it could blow up the amount of code if we have many different levels... or if we decide to split up files by architecture itself (like it is done in https://github.com/aklomp/base64/tree/22a3e9d421ee25b25bc6af7a02d4076c49dd323f/lib/arch for instance). I personally think it's nicer to split them by architecture and folders, but this leads to too many similar files (which is not very nice for maintaining the whole thing).
The SIMD compilation itself is very conservative, it will only compile the backend if the computer doing the build supports that backend, using compile time macros to check that. I'm not sure if this is actually necessary.
I think it's always better to be safe than sorry, unless we're absolutely sure that we won't cause a #UD (invalid opcode) exception at runtime.
About our SIMD code itself, we use intrinsics rather than hardcoded assembly
I also think it's better to use intrinsics, for the same reasons you cited, but also because you don't need to know ASM :') (and portability is key). The projects for translating intrinsics will definitely be helpful if we eventually use SIMD instructions.
For runtime detection, we rely on libsdl-org/SDL, which is very easy for us because our entire library is built on top of the functionality provided by SDL
Thanks for this. I'll probably borrow some of their ideas but I don't think we can vendor this specific part of their library in CPython :( However it will definitely help in improving the detection algorithm (for now the algorithm is quite crude).
I think there could be value in exposing runtime SIMD support checking into the public C API for extension authors that aren't lucky enough to have an existing dependency to rely on for this [...] In terms of CPython and SIMD I'm not sure how much potential there is, but there may be cool things that could be done
That was my original intent, though limited to the CPython internals. Python is great, but it's sometimes slow in some areas, and it'd be great if we could make it faster. We can always make Python faster by changing algorithms, but if we have the possibility of making it faster using CPU features, then we should probably try to benefit from them, at least in the important areas.
Could 4 or 8 or 16 PyObjects get their reference count changed at once
In some situations we could, but this will probably need to be synchronized with the ongoing work on deferred reference counting (or so I think).
Could unboxed integers do arithmetic in parallel
@skirpichev do we have places where the arithmetic could be sped up using SIMD instructions? I think we rely on either mpdecimal or glibc directly for "advanced" arithmetic, and I don't know whether we have many places where we do additions in batches, for instance.
Could the JIT decide to use more efficient templates because it knows AVX2 is around? Knowing the runtime SIMD level is an advantage for a JIT over an AOT compiler
I'm not a JIT expert :') so let's ask someone who knows about it: @brandtbucher
do we have places where the arithmetic could be sped up using SIMD instructions?
AFAIK, GMP doesn't utilize this too much so far.
Could the JIT decide to use more efficient templates because it knows AVX2 is around? Knowing the runtime SIMD level is an advantage for a JIT over an AOT compiler
I'm not a JIT expert :') so let's ask someone who knows about it: @brandtbucher
For now, we precompile the JIT code from templates, so it depends on the machine we use to build the official binary (but the JIT is not a default feature yet). As far as I know, we don't have any instruction detection in the JIT build script; in other words, whether SIMD is used or not depends on the compiler's decision.
Could unboxed integers do arithmetic in parallel?
I think arithmetic is not a common use case. People would need to take care of their data layout to fit the parallelism requirements; otherwise, it may be slower than the normal operation.
IMHO, some string operations are more suitable for SIMD, like JSON or pickle processing. FYI: https://github.com/simdjson/simdjson
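As a small illustration of why string scanning maps so well onto SIMD, here is a memchr-style sketch (SSE2, using GCC/Clang's __builtin_ctz; a simplified example, not simdjson's actual technique):

```c
/* Find the first occurrence of byte `c`, scanning 16 bytes per
 * iteration: compare all lanes at once, collapse the result into a
 * 16-bit mask, and locate the first set bit. */
#include <emmintrin.h>
#include <stddef.h>

const char *
find_byte(const char *s, size_t n, char c)
{
    const __m128i needle = _mm_set1_epi8(c);
    size_t i = 0;
    for (; i + 16 <= n; i += 16) {
        __m128i chunk = _mm_loadu_si128((const __m128i *)(s + i));
        int mask = _mm_movemask_epi8(_mm_cmpeq_epi8(chunk, needle));
        if (mask != 0) {
            return s + i + __builtin_ctz(mask);  /* first matching lane */
        }
    }
    for (; i < n; i++) {  /* scalar tail */
        if (s[i] == c) {
            return s + i;
        }
    }
    return NULL;
}
```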
For now, we precompile the JIT code from templates, so it depends on the machine we use to build the official binary (but the JIT is not a default feature yet). As far as I know, we don't have any instruction detection in the JIT build script; in other words, whether SIMD is used or not depends on the compiler's decision.
On x86, the default compilation (at least on MSVC) goes up to SSE2, so there could be auto-vectorization opportunities. I actually investigated this a while ago and got the JIT templates to compile with an explicit AVX2 flag, and it barely changed the templates at all. I think it would have to be set up more intentionally: if unboxed integer arithmetic becomes a thing, there could be a uop that does 2/4/8 operations at once, and the compiler would then be able to do some auto-vectorization there.
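To make that concrete, something like the following (an entirely hypothetical uop body, assuming GCC/Clang with -O2 -mavx2) is simple enough for the compiler to auto-vectorize into a handful of vector instructions (a single vpaddq plus loads/stores):

```c
/* Hypothetical sketch: a uop body adding four unboxed 64-bit integers
 * at once. Not an existing CPython uop; just the kind of loop shape
 * that compilers auto-vectorize reliably. */
#include <stdint.h>

void
batch_add4(int64_t *restrict dst,
           const int64_t *restrict a,
           const int64_t *restrict b)
{
    for (int i = 0; i < 4; i++) {
        dst[i] = a[i] + b[i];
    }
}
```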
But I'm fully aware all my ideas about this and the refcount thing are just ideas, not anywhere close to a concrete proposal.
I think arithmetic is not a common use case. People would need to take care of their data layout to fit the parallelism requirements; otherwise, it may be slower than the normal operation.
I think the popularity of Numba showcases demand for higher performance number crunching.
In terms of actionable SIMD items, an internal API for runtime detection seems like a great step to take. An external API could also be helpful to certain projects.
I haven't looked in detail at the blake2 SIMD implementations, but if they only support x86 right now, it would be possible to bring those speedups to ARM using sse2neon.
Personally I've never done any string operations with SIMD, but I agree with @Zheaoli that there is certainly potential to speed up things with it!
Hello, thanks for raising this. I think there is definitely some room for vector instructions in CPython. In the coming weeks I'll spend some time investigating and I'll be watching this space as well.
Feature or enhancement
Proposal:
In https://github.com/python/cpython/issues/124951, there has been some initial discussion on improving the performance of base64 and possibly {bytearray,bytes,str}.translate using SIMD instructions. More generally, if we want to use specific SIMD instructions, it'd be good to at least know whether the processor supports them or not. Note that we already support SIMD in blake2 when possible. As such, I suggest an internal framework for detecting SIMD features for other parts of the library, as well as detection of compiler flag support. Note that a single part of the code could benefit from some SIMD calls without having to link the entire library against the entire SIMD-128 or SIMD-256 instruction sets. Having a way to detect SIMD support should probably be independent of whether we actually use it (apart from the blake2 module), because it could only benefit the standard library if we were to include it.
The blake2 module's SIMD support is fairly... complicated, due to the wide variety of platforms that need to be supported and the mixture of many SIMD instructions. So I don't think I want to touch that part and make it work under the new interface (at least, not for now). While I'm confident in detecting features on "widely used" systems, there are definitely systems that I don't know, so I'd appreciate any help on this topic.
Has this already been discussed elsewhere?
I don't want to open a Discourse thread for now since it's mainly something that will be used internally and not to be exposed to the world.
Links to previous discussion of this feature:
There has already been some discussion on Discourse about SIMD in general and whether to include it (e.g., https://discuss.python.org/t/standard-library-support-for-simd/35138), but the number of results containing "SIMD" or "AVX" is very small. Either the topic is too advanced (detecting CPU features is NOT fun, and there is a lack of documentation, the best resource being the Wikipedia page) or the feature request is too broad.
Linked PRs