shibatch / sleef

SIMD Library for Evaluating Elementary Functions, vectorized libm and DFT
https://sleef.org
Boost Software License 1.0
631 stars 128 forks

Sleef functions can't be inlined #230

Open colesbury opened 5 years ago

colesbury commented 5 years ago

In PyTorch, we'd like to use Sleef's vectorized implementation of elementary functions as building blocks. For example, we'd like to implement a vectorized sigmoid() function using exp. However, calling into Sleef's exp() is expensive because it incurs the cost of a non-inlineable function call.

It would be great if Sleef provided the instruction-set specific vectorized functions in a header file, or some other way that can be inlined by the compiler.

fpetrogalli commented 5 years ago

Hi @colesbury , what compiler do you use to build PyTorch? I am asking because SLEEF can be built as an LLVM bit code library. If you rely on clang in your tool-chain, it should be possible to inline all SLEEF functions.

colesbury commented 5 years ago

We support GCC, clang, and MSVC. We can't rely on LTO.

fpetrogalli commented 5 years ago

We support GCC, clang, and MSVC. We can't rely on LTO.

The clang mechanism wouldn't rely on LTO. In any case, LLVM bitcode is not viable since you also have to support GCC and MSVC.

Would adding sigmoid() to SLEEF itself help your specific case? Or do you have many functions you would like to add?

colesbury commented 5 years ago

Would adding sigmoid() to SLEEF itself help your specific case? Or do you have many functions you would like to add?

There are many other functions. (Some are listed here.)

Even if Sleef provided a sigmoid() implementation, there would still be an overhead issue, because Sleef incurs a function call per SIMD vector (and we're operating on large arrays).

Libraries like MKL's VML avoid the overhead issue, but they don't compose well.

shibatch commented 5 years ago

Adding a VML-style API with all math functions inlined is not impossible, but it would take some time to implement.

shibatch commented 5 years ago

@colesbury I wrote a prototype of SleefVML. Please check my gist and let me know if this is what you want.

https://gist.github.com/shibatch/70f48e153ecc9d586386dc1676d564cc

In this file, a few activation functions taken from the Wikipedia entry (https://en.wikipedia.org/wiki/Activation_function) are implemented. It is pretty easy to add other functions.

These functions can be computed with a VML-style API.

void SleefVML_vmsSigmoid(size_t n, float *arg0, float *ret, uint64_t mode);

All the functions are vectorized. The vector extension to be used can be easily changed.

g++ -fopenmp -Wno-attributes -I./include -L./lib -march=avx2 -O3 sleefvml_test.cpp

For example, the vectorized assembly output of the sigmoid function is as follows.

.L50:
        vmovups 0(%r13,%r15), %xmm0
        vinsertf128     $0x1, 16(%r13,%r15), %ymm0, %ymm0
        vmovaps %ymm1, -80(%rbp)
        vxorpd  .LC3(%rip), %ymm0, %ymm0
        call    Sleef_expf8_u10avx2
        vmovaps -80(%rbp), %ymm1
        vaddps  %ymm1, %ymm0, %ymm0
        vdivps  %ymm0, %ymm1, %ymm0
        vmovups %xmm0, (%r12,%r15)
        vextractf128    $0x1, %ymm0, 16(%r12,%r15)
        addq    $32, %r15
        cmpq    %r14, %r15
        jne     .L50

In the assembly code, a call to Sleef_expf8_u10avx2 is generated. This call can be inlined with LTO.

The above code is just a prototype. The final code will have dispatchers.

colesbury commented 5 years ago

@shibatch this may be useful for other people, but it doesn't solve the problem I wrote about. The problem is the call to Sleef_expf8_u10avx2. MKL-style APIs do not compose well and therefore don't solve the problem I'm talking about.

For our use-case, we can't rely on LTO or LLVM bitcode.

I'd like Sleef_expf8_u10avx2 to be available inline in an instruction-set-specific header so that we can call it without incurring the cost of a function call. We don't need Sleef's dispatchers, because we run our own dispatcher earlier in program execution.

shibatch commented 5 years ago

The VML-style functions will be part of SLEEF, so LTO can be used when building the SLEEF library, not PyTorch. Is the cost of a function call really a problem? I don't think inlining the function would speed up execution that much. Could you point me to the part of the PyTorch source code where you are having the problem?

colesbury commented 5 years ago

VML-style functions are not a good solution for us because they don't compose well. We have potentially many functions and (their derivatives) that we may want to vectorize. sigmoid was just one example.

shibatch commented 5 years ago

I don't understand what you mean by "compose well".

I wrote the code in such a way that adding functions is very easy. In the prototype code, the binary step, ISRU, soft exponential, and soft clipping functions are implemented in addition to sigmoid.

extern "C" void SleefVML_vmsBinaryStep(size_t n, float *arg0, float *ret, uint64_t mode) {
  float *args[] = { arg0 };
  executef(mode, n, args, ret, sellt(load(0), constant(0), constant(0), constant(1)));
}

extern "C" void SleefVML_vmsSigmoid(size_t n, float *arg0, float *ret, uint64_t mode) {
  float *args[] = { arg0 };
  executef(mode, n, args, ret, 1.0 / (1.0 + Sleef_exp_u10(-load(0))));
}

extern "C" void SleefVML_vmsSoftSign(size_t n, float *arg0, float *ret, uint64_t mode) {
  float *args[] = { arg0 };
  executef(mode, n, args, ret, load(0) / (1.0 + abs(load(0))));
}

extern "C" void SleefVML_vmsISRU(size_t n, float *arg0, double alpha, float *ret, uint64_t mode) {
  float *args[] = { arg0 };
  executef(mode, n, args, ret, load(0) / sqrt(1.0 + alpha * (load(0) * load(0))));
}

extern "C" void SleefVML_vmsSoftExponential(size_t n, float *arg0, double alpha, float *ret, uint64_t mode) {
  float *args[] = { arg0 };
  if (alpha < 0) {
    executef(mode, n, args, ret, -Sleef_log_u10(1.0-alpha*(load(0)+alpha))/alpha);
  } else {
    executef(mode, n, args, ret, (Sleef_exp_u10(alpha*load(0))-1.0)/alpha + alpha);
  }
}

extern "C" void SleefVML_vmsSoftClipping(size_t n, float *arg0, double alpha, float *ret, uint64_t mode) {
  float *args[] = { arg0 };
  executef(mode, n, args, ret, Sleef_log_u10((1.0+Sleef_exp_u10(alpha*load(0)))/(1.0+Sleef_exp_u10(alpha*(load(0)-1.0))))/alpha);
}
colesbury commented 5 years ago

Are those examples intended to show how you would write the functions as part of the Sleef library? Or as a user of the Sleef library?

I'm not interested in implementing all the high-level functions that we care about as part of the Sleef library. There are too many, and some require more complicated reduction semantics like softmax.

We can't write functions in that style as a user of the Sleef library because of the overhead of a function call per SIMD vector (without LTO).

By VML-style API, I mean something that operates on a "large" array of data instead of a single "short" vector. VML-style APIs don't compose well because you can't efficiently write something like SleefVML_vmsSigmoid from other VML-style function calls. You can only efficiently write it using things like Sleef_exp_u10, which is not a VML-style API.

colesbury commented 5 years ago

Is the cost of a function call really a problem? I think inlining the function does not speed up the execution so much.

In my experience, the non-inlineable call to something like Sleef_exp_u10 per SIMD vector makes something like the sigmoid calculation much slower (I think it was ~3x slower).

shibatch commented 5 years ago

Are those examples intended to show how you would write the functions as part of the Sleef library? Or as a user of the Sleef library?

I haven't decided on the detailed design. My idea is to define all the NN-related functions inside the SleefVML library. It would be possible to expose the C++ classes for writing VML-style functions to users, but in that case it is difficult to apply LTO.

Are all the functions to compute known before execution of the program? Or are they decided dynamically during execution?

In my experience, the non-inlineable call to something like Sleef_exp_u10 per SIMD vector makes something like the sigmoid calculation much slower (I think it was ~3x slower).

That sounds like too much of a slowdown; I would expect less than 30%. If you are trying to evaluate the sigmoid function using many calls to VML-style functions, then 3x computation time makes sense. If this is the case, the main cost is not the function-call overhead but the time spent traversing the same memory region again and again.

chriselrod commented 5 years ago

I've been playing around with optimizing some code. I've been maintaining both a Julia version, and a C++ version. I had been using SLEEF in both, but I recently switched to a fork of a Julia-port of SLEEF version 2.

Forcing inlining of the elementary functions resulted in >50% speed up of the slowest loop in the code, and roughly 25% improvement in speed of the entire program.

There are 9 loop-invariant constants across the loop iterations, but the compiler would spill them to the stack and reload them on each iteration rather than keeping them in registers. There are of course also constants inside the elementary functions themselves, e.g. the polynomial kernels.

Computers with AVX-512 (like the ones I've been benchmarking on) have 32 vector registers, so they can hold quite a few values in registers across loop iterations. But even with just 16 registers, inlining can save a lot of load operations between loop iterations.

I would like to be able to inline these calls in the C++ code, too. While I have been compiling with both gcc and Clang and would have a preference for supporting both compilers, I could use a Clang-only solution. Clang was producing slightly faster code already. Any guide / instructions on building and linking SLEEF as an LLVM bitcode library?

I followed these instructions for adding SLEEF to the project: https://sleef.org/compile.xhtml#import

Alternatively, LTO would also work for me. I did set SLEEF to build as a static library. Googling turned up some information on successfully compiling static libraries with LTO support, but I'd have to figure out how to make that work with cmake.

shibatch commented 5 years ago

Any guide / instructions on building and linking SLEEF as an LLVM bitcode library?

@xoofx Could you give us a bit of comment about this?

shibatch commented 5 years ago

It seems that this is a good time to add official support for generating a bitcode library. A few things to consider:

- It is better to support both llvm and gcc.
- I don't know about compatibility between bitcode generated by different versions of llvm.
- I don't know how well bitcode is supported by cmake and other tools.

So this may or may not be easy.

xoofx commented 5 years ago

Any guide / instructions on building and linking SLEEF as an LLVM bitcode library?

Today, there is only a mode where you can produce separate LLVM bitcode files for each variation. To enable it, you need to set SLEEF_ENABLE_LLVM_BITCODE to TRUE in cmake and run it on a platform where CLANG_EXE_PATH can be resolved (currently it tries "clang-5.0", "clang-4.0", "clang-3.9").

I also have a branch, not yet merged into SLEEF, that allows cross-compiling SLEEF for different CPUs in a single run.

It is better to support both llvm and gcc.

I don't know anything about GCC bitcode and if it is supported at all there

I don't know about compatibility between bitcode generated by different versions of llvm.

They are not always compatible, so it can break easily with a new version.

I don't know how well bitcode is supported by cmake and other tools.

Mainly on our own; that's what I'm doing in libm. Considering that the code for handling LLVM there is relatively small, it should not be an issue.

shibatch commented 5 years ago

Another possibility is to include a JIT compiler utilizing LLVM. If people like this idea, I will seriously consider this direction.

xoofx commented 5 years ago

Another possibility is to include a JIT compiler utilizing LLVM. If people like this idea, I will seriously consider this direction.

I would not go that route, as I don't think that anybody would want to have a JIT in their runtime.

You can produce LTO object files with clang and the folks would only have to compile their project with SLEEF directly. But I would stay away at trying to provide LTO binaries. This is a compile time problem that depends on an installed compiler/linker toolchain that needs to be compatible.

chriselrod commented 5 years ago

Regarding my problem, for now I just split the loop body into pieces. Looping over each piece separately allows the invariants to be kept in registers, speeding things up nicely. But I'm guessing inlining the SLEEF functions would still provide additional speedup, letting the constants they use also be loaded only once per loop.

SLEEF_ENABLE_LLVM_BITCODE

Okay, I will experiment with that. I'm not very familiar with cmake or build systems in general.

I'm guessing I will have to do something different from:

add_dependencies(MY_LIBRARY libsleef)
target_link_libraries(MY_LIBRARY sleef)

to get cmake to build my_library with the sleef llvm bitcode files?

For example, the cmake documentation says if you compile a library simply into object files, you need

add_library(... $<TARGET_OBJECTS:objlib> ...)

When compiling an object library, you'd get a bunch of *.o files instead of a single libsleef.so/libsleef.a. If compiling into bitcode, I imagine you're similarly getting a bunch of *.ll files, and will need to handle combining with your library similarly?

EDIT: Actually, it looks like building it as an OBJECT library is exactly what I should do?

The OBJECT library type defines a non-archival collection of object files resulting from compiling the given source files. The object files collection may be used as source inputs to other targets:


add_library(archive OBJECT archive.cpp zip.cpp lzma.cpp)

add_library(archiveExtras STATIC $<TARGET_OBJECTS:archive> extras.cpp)

add_executable(test_exe $<TARGET_OBJECTS:archive> test.cpp)

xoofx commented 5 years ago

Using SLEEF_ENABLE_LLVM_BITCODE is only for generating bitcode files, but in your case, you probably just need to configure SLEEF to compile with LTO flags

chriselrod commented 5 years ago

I had been building SLEEF into a static library, and then linking it. A shared library obviously wouldn't work.

However, from the sound of things, to get LTO to work with static libraries with gcc would require setting several cmake variables for SLEEF's build:

SET(CMAKE_C_FLAGS "${CMAKE_C_FLAGS} -flto")
SET(CMAKE_EXE_LINKER_FLAGS "${CMAKE_EXE_LINKER_FLAGS} -flto")
SET(CMAKE_AR  "gcc-ar")
SET(CMAKE_C_ARCHIVE_CREATE "<CMAKE_AR> qcs <TARGET> <LINK_FLAGS> <OBJECTS>")
SET(CMAKE_C_ARCHIVE_FINISH   true)

A short test seems to suggest Clang has the same undefined-reference problems (though I'd probably use llvm-ar instead for Clang?). Tomorrow I'll try passing these into the ExternalProject as cmake args, like -DCMAKE_AR=gcc-ar.

If SLEEF could be compiled as an object library, that would probably be easier. That's also why I thought the bitcode library could be simpler (although ideally I'd support both clang and gcc).

Is there a simpler approach I'm missing?

shibatch commented 5 years ago

Computers with avx512 (like the ones I've been benchmarking on) have 32 registers. They can hold quite a bit of numbers in their registers across loop iterations. But even with just 16 registers, inlining can save a lot of load operations between loop iterations.

How about this: I define a data type that contains, say, 8 vector registers, and each function call evaluates the values in those 8 registers at a time. The overhead of loading constants would then be 1/8.

colesbury commented 5 years ago

@shibatch Would you be willing to accept a patch that moves SIMD functions (sleefsimdsp, sleefsimddp) into a header file?

I have tried this strategy out here: https://github.com/colesbury/sleef/tree/sleef_header. The header contains all the static INLINE functions; the C file has all the exported functions and still handles the renames. The <sleef.h> API is unchanged.

This works well for PyTorch. Since we have our own CPU dispatcher we can include sleefsimdsp.h directly and the function calls to Sleef are inlined.

#define ENABLE_AVX2
#include <sleef/sleefsimdsp.h>

void test_function() {
  float data[8] = { 1.0f, 2.0f, 3.0f, 4.0f, 5.0f, 6.0f, 7.0f, 8.0f};
  __m256 value = _mm256_loadu_ps(data);
  value = vsinf_u1(value);
  _mm256_storeu_ps(data, value);
}

I would be happy to make any other necessary changes.

shibatch commented 5 years ago

@fpetrogalli What do you think?

colesbury commented 5 years ago

@shibatch @fpetrogalli thoughts?

shibatch commented 5 years ago

@colesbury Could you send me an e-mail?

dnbaker commented 4 years ago

It would be great if these calls could be inlined. I've been working with the blaze linear algebra project to incorporate Sleef into its expression templates. It currently works, but assumes the model of a header plus dynamic linking via libsleef.

For my applications, I'm seeing that the sleef log functions are only 2-3x slower than a simple multiplication (compared to 4-5x for std::log), and I imagine inlining would make a difference for functions that fast.

What kind of changes to compilation time would a header-only version of the library cause?

shibatch commented 4 years ago

Actually, I haven't started working on this yet. It may require a dramatic change to the structure of the source code, so I have no idea at this point.

chriselrod commented 4 years ago

@dnbaker Because you're already using C++, have you considered xsimd?

dnbaker commented 4 years ago

Thanks for the pointer, I'll give it a look/benchmark. I've heard of it but haven't tried.

chriselrod commented 4 years ago

I'd be interested in your results. I'm not happy with what I've been doing in Julia, which is using a fork of a Julia port of an outdated version of SLEEF, and on Linux checking for libmvec.so (which comes with recent versions of GLIBC) because I found it to be fast, although its selection of functions is unfortunately quite small.

The Julia versions can be inlined (when called from Julia), but of course the shared library cannot.

shibatch commented 4 years ago

Have you tried LTO?

dnbaker commented 4 years ago

Can -flto inline functions from a dynamically-linked library? If there was a way to make a .a archive, I think gcc could, but I'm not familiar enough with CMake to make the changes to create one (alongside the .so/.dylib).

At least with gcc, -flto produces an identical binary.

shibatch commented 4 years ago

LLVM bitcode can be generated by specifying -DSLEEF_ENABLE_LLVM_BITCODE=TRUE as a cmake option.

dnbaker commented 4 years ago

I'm not seeing a measurable performance difference between the code compiled with the llvm bitcode and the dynamically linked version, so at least for logarithms, I wouldn't consider this a short-term need.

chriselrod commented 4 years ago

LLVM bitcode can be generated by specifying -DSLEEF_ENABLE_LLVM_BITCODE=TRUE as a cmake option.

FWIW, copy-and-pasting the bitcode into llvmcall works well when the functions are inlined into a single function. Julia's llvmcall lets you declare functions your code calls, like @llvm.fma.v8f64, but not define them. This means lgamma is problematic, because lgamma calls qgamma, which was not inlined in the LLVM .ll files. Maybe I can try adjusting Clang's inlining threshold to fix this; I'd rather do that than manually inline qgamma, because I'm not about to rename all the SSA variables by hand.

Also, interestingly, the AVX512 u35 erf function was slower than the u15 version.

shibatch commented 4 years ago

erf and gamma are not optimized properly.

shibatch commented 4 years ago

I have been working on this issue, and I found that it is not as hard as I thought. Please see the new issue.

https://github.com/shibatch/sleef/issues/282