itamarst opened 1 month ago
This is what NumPy does for its hand-rolled SIMD support: you compile one version of the function for the backwards-compatible baseline, another for AVX, another for AVX512, and so on. Dispatch happens at runtime, so you typically have to do it at a fairly high level.
Numba does this automatically, so no user intervention is needed.
C and C++ compilers have support for this (e.g. GCC's function multiversioning); Rust has a library that implements it.
Cython, notably, does not have a way to do this.
Another approach is to compile multiple versions of the whole extension, each with a different CPU target (e.g. one for each microarchitecture level on x86-64). Then, at import time, pick the one that works on the current CPU.
This requires no changes to the compiled code, and only a tiny bit of boilerplate on the Python and packaging sides. It should be possible to write a library, e.g. a setuptools extension, that reduces the boilerplate.
As a library author, I might want faster code without doing (much) extra work. Autovectorization can sometimes help: compilers can automatically convert scalar code to use SIMD instructions. (Note that this is "vectorization" with a different meaning than the one typically used in the Python world, where it means replacing Python loops with array operations.)
Modern CPUs have many newer instruction families that can potentially speed up code even more; in the x86-64 world there are AVX, AVX2, AVX512, and more. There is a problem with using these, however: a binary built with them will crash on CPUs that lack them.
The typical solution is to target the least common denominator, but for x86-64 that means targeting CPUs as of 2003!
A better solution is to compile multiple versions of the code, each targeting a different level of SIMD support. The question is how to do this.