homm opened this issue 7 years ago
In general, SIMD code looks like this:
#if defined(__SSE4_1__)
#include <emmintrin.h>
#include <mmintrin.h>
#include <smmintrin.h>
#if defined(__AVX2__)
#include <immintrin.h>
#endif
#endif
void
ImagingResampleSomething(UINT32 *lineOut, UINT32 *lineIn, int xmax)
{
int x = 0;
#if defined(__AVX2__)
for (; x < xmax - 7; x += 8) {
// AVX2 code
}
#endif
#if defined(__SSE4_1__)
for (; x < xmax - 3; x += 4) {
// SSE4 code
}
#endif
for (; x < xmax; x += 1) {
// General x86 code
}
}
If we compile the code with the -mavx2 flag, the general + SSE4 + AVX2 versions are generated; with -msse4, the general + SSE4 versions; without flags, only the general version is compiled.
We have to set compiler flags in setup.py, so we'll need access to CPUID from the setup. There is a PyCPUID package, but it doesn't support AVX at all.
Cons:
Suggested by @socketpair. There are three .so files compiled with different flags: no SIMD flags, -msse4, and -mavx2. We detect CPU capabilities via CPUID and load the appropriate module from Python code.
Pros:
Cons:
Suggested by @ToSHiC here (rus). We pass compiler flags to each .c file, not to the whole module. So we can compile several .c files to different .o files, or even the same file with different flags to different .o files, and link them all.
void
#if defined(__AVX2__)
ImagingResampleSomething_avx2(UINT32 *lineOut, UINT32 *lineIn, int xmax)
#elif defined(__SSE4_1__)
ImagingResampleSomething_sse4(UINT32 *lineOut, UINT32 *lineIn, int xmax)
#else
ImagingResampleSomething_general(UINT32 *lineOut, UINT32 *lineIn, int xmax)
#endif
{
// Same code as for compile-time decision
}
Pros:
Cons:
I think that 3x compile time is not a problem at all nowadays. Simplicity is the key, especially when there is no performance impact.
Linking .o files compiled with different options may cause some nasty errors and AFAIK is not guaranteed to work. The compiler may use its own conventions for alignment and other internal details. Also, these options affect builtins like memcpy, so even non-specialized code will run faster when compiled with the appropriate options.
I vote for my variant :)
Also, it would be nice to have an API to force loading of the .so with a specific optimisation level: say, for testing how effective the optimisations are, or in case the automatic CPUID detection chooses the wrong version.
There was a similar conversation on the Tensorflow issue tracker with some suggested approaches: https://github.com/tensorflow/tensorflow/issues/7257#issuecomment-295814405.
Would it be possible to package just the SIMD-accelerated routines, i.e. not duplicate all of Pillow in the fork?
For example:
from PIL.Image import Image
from pillow_simd import resize
i = Image.open(...)
i = resize(i, (500, 500), ...)
I have the same problem with projects at work, so I am working on this to try to find an easy-to-use solution.
Hi, I am interested in this issue, and I have implemented AVX2 / SSE4 runtime switching (https://github.com/grafi-tt/pillow-simd/commit/96f18c6292df46c47e1bc8a9aa382e9658ba16d5).
The usage is like:
import PIL
from PIL import Image
import timeit
IMG_FILE = "image-1920x1200.png"
print(PIL.get_available_builds())
print(PIL.get_build())
img = Image.open(IMG_FILE)
result = timeit.timeit('img.resize((960, 600), resample=Image.BICUBIC)', globals=globals(), number=1000)
print(result)
PIL.set_build("SSE4")
img = Image.open(IMG_FILE)
result = timeit.timeit('img.resize((960, 600), resample=Image.BICUBIC)', globals=globals(), number=1000)
print(result)
PIL.set_build("AVX2")
img = Image.open(IMG_FILE)
result = timeit.timeit('img.resize((960, 600), resample=Image.BICUBIC)', globals=globals(), number=1000)
print(result)
PIL.set_build("SSE4")
img = Image.open(IMG_FILE)
result = timeit.timeit('img.resize((960, 600), resample=Image.BICUBIC)', globals=globals(), number=1000)
print(result)
and the following is a possible output:
['AVX2', 'SSE4', 'GENERIC']
AVX2
5.207448414999817
6.874838170999283
5.21245184000054
6.8702469989993915
Following the strategy @socketpair and @homm mentioned, I made the core imaging module (_imaging) build twice, with a distinct suffix appended to each name. Runtime CPU checking is done in a tiny C module that calls the cpuid instruction using inline assembly. I believe py-cpuinfo is overkill, especially for a module with few dependencies.
A generic build (with neither AVX2 nor SSE4) is not available yet, as making one requires back-porting Pillow's non-SIMD code. Though I could manage to do this, it is difficult since I'm new to Pillow and Pillow-SIMD. I think it is straightforward work for developers familiar with the Pillow-SIMD codebase.
If someone is interested, feel free to fork and improve my implementation; or if preferred, I'm happy to send a PR.
The project linked below is complete and may be a solution to this problem. I didn't try it with Pillow-SIMD, but it should be very easy to integrate with something like this:
# Top of "PIL/Image.py"
try:
import compilertools
except ImportError:
pass
# "setup.py"
# [...]
from setuptools import Extension, setup, find_packages
try:
import compilertools.build
compilertools.build.ConfigBuild.suffixes_includes = ['avx2', 'sse4']
except ImportError:
pass
# [...]
setup(
# [...]
install_requires=['...', 'compilertools'],
# [...]
)
The best way to solve the compile-time decision problem on recent-ish gcc versions would be to use the target attribute (see the gcc documentation) on functions. This causes the compiler to emit several versions of a function, each for a different architecture, with some magic to have the dynamic linker fix up the relocation tables based on run-time CPU detection at library load time. The problem is that the attribute isn't well supported by other compilers. clang, for example, should be getting support for it in version 7, which will be out in September, hopefully.
Well, using poor inline assembly code (like what I've written) is admittedly fragile, and it can cause tricky issues (e.g. https://stackoverflow.com/questions/12221646/how-to-call-cpuid-instruction-in-a-mac-framework).
Using the target attribute seems reasonable. Its support is actually not so recent (it dates from GCC 4.8: https://gcc.gnu.org/wiki/FunctionMultiVersioning), and waiting for Clang's support is acceptable. Even ARM NEON support would be possible in the future :) The Intel C Compiler also has a similar feature, so someone interested can write a patch to support it.
Regarding manual detection, the problem is that there is no simple, portable, easy-to-use and stable open-source library for CPU detection. (https://wiki.linaro.org/LEG/Engineering/OPTIM/Assembly#Hardware_identification) Using the __builtin_cpu_supports intrinsic is one possibility. It is supported in GCC 4.8.0 and Clang 3.9.0 or later (Clang 3.7 and 3.8 had a broken implementation: https://bugs.llvm.org/show_bug.cgi?id=25510). NumPy already uses this (https://github.com/numpy/numpy/blob/e0f884692f629d931d3673adc8dd6b0938cefa52/numpy/core/src/umath/cpuid.c). The Intel C Compiler again has a similar feature and MSVC has a more primitive cpuid intrinsic, so supporting them is possible if desired.
Different functions in different .o files
If you want to go down this route, you might find some inspiration in pyfastnoisesimd. At build time, compiler support is queried here and the relevant files are enabled. At runtime, this function is used to determine support and the correct function is dispatched based on the results.
It remains not straightforward to have Pillow-SIMD installed through dependencies (e.g. pytorch can only list Pillow).
The README mentions "Why do not contribute SIMD to the original Pillow", but doesn't OpenCV have the same concern?
It seems that OpenCV manages to distribute avx2-enabled builds through the anaconda repo by using CPU-dispatch code generation, so that rebuilding on the target platform is not necessary. Is there any other difficulty preventing this kind of approach?
As stated before, CPU dispatch is tricky to get right in a platform-independent way. OpenCV is a much larger project with more resources to maintain the build apparatus for doing such things properly, in particular doing dynamic dispatch in a way that doesn't severely impact runtime performance.
If you were only targeting gcc, then the target_clones attribute would basically solve the problem for you. Otherwise things get a bit more complicated.
Both clang and gcc support the target attribute for compiling a given function with a different architecture from the rest of the code. So you can kind of fake target_clones by making e.g. foo_avx, foo_sse and so on, having each of them call foo_base (also using the flatten attribute to make sure that function gets inlined into the arch-specific versions), and then using the ifunc attribute to set up the resolver for dynamic dispatch at load time. So it can be done, just a lot of yak shaving. In terms of manually writing the resolver, https://github.com/google/cpu_features might be overkill but does solve that problem for you.
NumPy now has a pretty comprehensive SIMD build and dispatch system, but I don't know how portable it would be to Pillow since it doesn't depend on NumPy.
I'll just drop https://github.com/google/highway in here.
Problems and solutions of merging SIMD code to the main codebase.