homm opened this issue 7 years ago
In general, SIMD code looks like this:
#if defined(__SSE4_1__)
#include <emmintrin.h>
#include <mmintrin.h>
#include <smmintrin.h>
#if defined(__AVX2__)
#include <immintrin.h>
#endif
#endif
void
ImagingResampleSomething(UINT32 *lineOut, UINT32 *lineIn, int xmax)
{
int x = 0;
#if defined(__AVX2__)
for (; x < xmax - 7; x += 8) {
// AVX2 code
}
#endif
#if defined(__SSE4_1__)
for (; x < xmax - 3; x += 4) {
// SSE4 code
}
#endif
for (; x < xmax; x += 1) {
// General x86 code
}
}
If we compile the code with the -mavx2 flag, the general + SSE4 + AVX2 versions are generated; with -msse4, the general + SSE4 versions; without flags, only the general version is compiled.
We have to set compiler flags in setup.py, so we'll need access to CPUID from the setup. There is a PyCPUID package, but it doesn't support AVX at all.
Cons:
Suggested by @socketpair. There are three .so files compiled with different flags: no SIMD flags, -msse4, and -mavx2. We detect CPU capabilities via CPUID and load the appropriate module from Python code.
Pros:
Cons:
Suggested by @ToSHiC here (rus). We pass compiler flags to each .c file, not to the whole module. So we can compile several .c files to different .o files, or even the same file with different flags to different .o files, and link them all.
void
#if defined(__AVX2__)
ImagingResampleSomething_avx2(UINT32 *lineOut, UINT32 *lineIn, int xmax)
#elif defined(__SSE4_1__)
ImagingResampleSomething_sse4(UINT32 *lineOut, UINT32 *lineIn, int xmax)
#else
ImagingResampleSomething_general(UINT32 *lineOut, UINT32 *lineIn, int xmax)
#endif
{
// Same code as for compile-time decision
}
Pros:
Cons:
I think that 3x compile time is not a problem at all nowadays. Simplicity is the key, especially when there is no performance impact.
Linking .o files compiled with different options may cause some nasty errors and AFAIK is not guaranteed to work. The compiler may use its own conventions for alignment and other internal details. Also, these options affect builtins like memcpy, so even non-specialized code will run faster when compiled with the appropriate options.
I vote for my variant :)
Also, it would be nice to have an API to force loading of the .so with a specific optimisation level: say, for testing how effective the optimisations are, or in case the automatic CPUID detection chooses the wrong version.
There was a similar conversation on the Tensorflow issue tracker with some suggested approaches: https://github.com/tensorflow/tensorflow/issues/7257#issuecomment-295814405.
Would it be possible to package just the SIMD-accelerated routines, i.e. not duplicate all of Pillow in the fork?
For example:
from PIL.Image import Image
from pillow_simd import resize
i = Image.open(...)
i = resize(i, (500, 500), ...)
I have the same problem with projects at work, so I am working on this to try to find an easy-to-use solution.
Hi, I am interested in this issue, and I have implemented AVX2 / SSE4 runtime switching (https://github.com/grafi-tt/pillow-simd/commit/96f18c6292df46c47e1bc8a9aa382e9658ba16d5).
The usage is like:
import PIL
from PIL import Image
import timeit
IMG_FILE = "image-1920x1200.png"
print(PIL.get_available_builds())
print(PIL.get_build())
img = Image.open(IMG_FILE)
result = timeit.timeit('img.resize((960, 600), resample=Image.BICUBIC)', globals=globals(), number=1000)
print(result)
PIL.set_build("SSE4")
img = Image.open(IMG_FILE)
result = timeit.timeit('img.resize((960, 600), resample=Image.BICUBIC)', globals=globals(), number=1000)
print(result)
PIL.set_build("AVX2")
img = Image.open(IMG_FILE)
result = timeit.timeit('img.resize((960, 600), resample=Image.BICUBIC)', globals=globals(), number=1000)
print(result)
PIL.set_build("SSE4")
img = Image.open(IMG_FILE)
result = timeit.timeit('img.resize((960, 600), resample=Image.BICUBIC)', globals=globals(), number=1000)
print(result)
and the following is a possible output:
['AVX2', 'SSE4', 'GENERIC']
AVX2
5.207448414999817
6.874838170999283
5.21245184000054
6.8702469989993915
Following the strategy @socketpair and @homm mentioned, I made the core imaging module (_imaging) build twice, with a distinct suffix appended to each name. Runtime CPU checking is done in a tiny C module that calls the cpuid instruction using inline assembly. I believe py-cpuinfo is overkill, especially for a module with few dependencies.
A generic build (with neither AVX2 nor SSE4) is not available yet, as making one requires back-porting Pillow's non-SIMD code. Though I could manage to do this, it is difficult since I'm new to Pillow and Pillow-SIMD. I think it is straightforward work for developers familiar with the Pillow-SIMD codebase.
If someone is interested, feel free to fork and improve my implementation; or if preferred, I'm happy to send a PR.
The project linked below is complete and may be a solution to this problem. I didn't try it with Pillow-SIMD, but it should be very easy to integrate with something like this:
# Top of "PIL/Image.py"
try:
import compilertools
except ImportError:
pass
# "setup.py"
# [...]
from setuptools import Extension, setup, find_packages
try:
import compilertools.build
compilertools.build.ConfigBuild.suffixes_includes = ['avx2', 'sse4']
except ImportError:
pass
# [...]
setup(
# [...]
install_requires=['...', 'compilertools'],
# [...]
)
The best way to solve the compile-time decision problem on recent-ish gcc versions would be to use the target attribute (see the gcc documentation) on functions. This causes the compiler to emit several versions of a function, each for a different architecture, with some magic to have the dynamic linker fix up the relocation tables based on run-time CPU detection at library load time. The problem is that the attribute isn't well supported by other compilers. clang, for example, should be getting support for it in version 7, which will be out in September, hopefully.
Well, using poor inline assembly code (like what I've written) is admittedly fragile, and it can cause tricky issues (e.g. https://stackoverflow.com/questions/12221646/how-to-call-cpuid-instruction-in-a-mac-framework).
Using the target attribute seems reasonable. Its support is actually not so recent (it dates from GCC 4.8: https://gcc.gnu.org/wiki/FunctionMultiVersioning), and waiting for Clang's support is acceptable. Even ARM NEON support would be possible in the future :) The Intel C Compiler also has a similar feature, so someone interested can write a patch to support it.
Regarding manual detection, the problem is that there is no simple, portable, easy-to-use and stable open-source library for CPU detection. (https://wiki.linaro.org/LEG/Engineering/OPTIM/Assembly#Hardware_identification) Using the __builtin_cpu_supports intrinsic is one possibility. It is supported in GCC 4.8.0 and Clang 3.9.0 or later (Clang 3.7 and 3.8 had a broken implementation: https://bugs.llvm.org/show_bug.cgi?id=25510). NumPy already uses this (https://github.com/numpy/numpy/blob/e0f884692f629d931d3673adc8dd6b0938cefa52/numpy/core/src/umath/cpuid.c). The Intel C Compiler again has a similar feature and MSVC has a more primitive cpuid intrinsic, so supporting them is possible if desired.
Different functions in different .o files
If you want to go down this route, you might find some inspiration in pyfastnoisesimd. At build time, compiler support is queried here and the relevant files are enabled. At runtime, this function is used to determine support and the correct function is dispatched based on the results.
It remains not straightforward to have Pillow-SIMD installed through dependencies (e.g. pytorch can only list Pillow).
The README mentions "Why do not contribute SIMD to the original Pillow", but doesn't OpenCV have the same concern?
It seems that OpenCV manages to distribute avx2-enabled builds through the anaconda repo by using CPU-dispatch code generation, so that rebuilding on the target platform is not necessary. Is there any other difficulty preventing this kind of approach?
As stated before, CPU dispatch is tricky to get right in a platform-independent way. OpenCV is a much larger project with more resources to maintain the build apparatus for doing such things properly, in particular doing dynamic dispatch in a way that doesn't severely impact runtime performance.
If you were only targeting gcc, then the target_clones attribute would basically solve the problem for you. Otherwise things get a bit more complicated.
Both clang and gcc support the target attribute for compiling a given function with a different architecture from the rest of the code. So you can kind of fake target_clones by making e.g. foo_avx, foo_sse and so on, having each of them call foo_base (also using the flatten attribute to make sure that function gets inlined into the arch-specific versions), and then using the ifunc attribute to set up the resolver for dynamic dispatch at load time. So it can be done, just a lot of yak shaving. In terms of manually writing the resolver, https://github.com/google/cpu_features might be overkill but does solve that problem for you.
NumPy now has a pretty comprehensive SIMD build and dispatch system, but I don't know how portable it would be to Pillow since it doesn't depend on NumPy.
I'll just drop https://github.com/google/highway in here.
Problems and solutions of merging SIMD code to the main codebase.