google / highway

Performance-portable, length-agnostic SIMD with runtime dispatch
Apache License 2.0

Improving dynamic dispatch for multiple targets for x86-64/AArch64/PPC64 #1782

Open johnplatts opened 11 months ago

johnplatts commented 11 months ago

There are some dynamic dispatch scenarios that require compiling the same C++ source files more than once (but with different C++ flags for some of the compilation phases), such as x86-64 with MSVC if AVX2/AVX3 targets are enabled, AArch64 if SVE/SVE2 targets are enabled, or PPC if PPC8/PPC9/PPC10 targets are enabled.

Here are the compilation phases for multi-phase compilation with MSVC on x86-64:

Here are the compilation phases for multi-phase compilation for AArch64 with SVE/SVE2 enabled:

Here are the compilation phases for multi-phase compilation for PPC64:

There are real-world use cases for multiple compilation dynamic dispatch, including improved performance on PPC9/PPC10/AArch64.

johnplatts commented 11 months ago

Here is an example of Highway dynamic dispatch code updated to support multi-phase compilation (compiled more than once with different compiler options for the different compilation phases):

// Generates code for every target that this compiler can support.
#undef HWY_TARGET_INCLUDE
#define HWY_TARGET_INCLUDE "example.cpp"  // this file
#include <hwy/foreach_target.h>  // must come before highway.h
#include <hwy/highway.h>

HWY_BEFORE_NAMESPACE();
namespace project {
namespace HWY_NAMESPACE {  // required: unique per target

// Can skip hn:: prefixes if already inside hwy::HWY_NAMESPACE.
namespace hn = hwy::HWY_NAMESPACE;

using T = float;

void MulAddLoop(const T* HWY_RESTRICT mul_array,
                const T* HWY_RESTRICT add_array,
                const size_t size, T* HWY_RESTRICT x_array);

#if HWY_IN_PER_TARGET_PHASE
void MulAddLoop(const T* HWY_RESTRICT mul_array,
                const T* HWY_RESTRICT add_array,
                const size_t size, T* HWY_RESTRICT x_array) {
  const hn::ScalableTag<T> d;
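  // Assumes size is a multiple of hn::Lanes(d); otherwise the final Load/Store
  // would access memory past the end of the arrays.
  for (size_t i = 0; i < size; i += hn::Lanes(d)) {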
    const auto mul = hn::Load(d, mul_array + i);
    const auto add = hn::Load(d, add_array + i);
    auto x = hn::Load(d, x_array + i);
    x = hn::MulAdd(mul, x, add);
    hn::Store(x, d, x_array + i);
  }
}
#endif

}  // namespace HWY_NAMESPACE
}  // namespace project
HWY_AFTER_NAMESPACE();

// The table of pointers to the various implementations in HWY_NAMESPACE must
// be compiled only once (foreach_target #includes this file multiple times).
// HWY_ONCE is true for only one of these 'compilation passes'.
#if HWY_ONCE && HWY_IN_DYN_DISPATCH_PHASE

namespace project {

// This macro declares a static array used for dynamic dispatch.
HWY_EXPORT(MulAddLoop);

void CallMulAddLoop(const float* HWY_RESTRICT mul_array,
                    const float* HWY_RESTRICT add_array,
                    const size_t size, float* HWY_RESTRICT x_array) {
  // This must reside outside of HWY_NAMESPACE because it references (calls the
  // appropriate one from) the per-target implementations there.
  // For static dispatch, use HWY_STATIC_DISPATCH.
  return HWY_DYNAMIC_DISPATCH(MulAddLoop)(mul_array, add_array, size, x_array);
}

}  // namespace project
#endif  // HWY_ONCE && HWY_IN_DYN_DISPATCH_PHASE

The example above is available on Compiler Explorer, compiled with different settings of HWY_IN_PER_TARGET_PHASE/HWY_IN_DYN_DISPATCH_PHASE: https://gcc.godbolt.org/z/63xTfh1bj
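For readers unfamiliar with what the dispatch table does, here is a rough conceptual sketch in plain C++. It is not Highway's actual implementation: the namespace names N_WIDE/N_BASELINE, the ChooseTarget function, and its bool parameter are all hypothetical stand-ins for the per-target namespaces and the runtime CPU detection. The only point is that the per-target functions can be compiled separately (possibly with different flags), while the table of pointers is compiled once.

```cpp
#include <cstddef>

namespace sketch {

// Hypothetical stand-ins for per-target namespaces. In a real build each of
// these would be compiled with different target flags or attributes.
namespace N_WIDE {
void MulAddLoop(const float* mul, const float* add, std::size_t size,
                float* x) {
  for (std::size_t i = 0; i < size; ++i) x[i] = mul[i] * x[i] + add[i];
}
}  // namespace N_WIDE

namespace N_BASELINE {
void MulAddLoop(const float* mul, const float* add, std::size_t size,
                float* x) {
  for (std::size_t i = 0; i < size; ++i) x[i] = mul[i] * x[i] + add[i];
}
}  // namespace N_BASELINE

using MulAddFn = void (*)(const float*, const float*, std::size_t, float*);

enum TargetIndex { kWide = 0, kBaseline = 1, kNumTargets = 2 };

// Compiled exactly once: one function pointer per target.
const MulAddFn kMulAddTable[kNumTargets] = {&N_WIDE::MulAddLoop,
                                            &N_BASELINE::MulAddLoop};

// Assumption: a placeholder for real CPU feature detection.
inline TargetIndex ChooseTarget(bool cpu_has_wide_vectors) {
  return cpu_has_wide_vectors ? kWide : kBaseline;
}

// Looks up the implementation for the detected target and calls it.
inline void CallMulAddLoop(bool cpu_has_wide_vectors, const float* mul,
                           const float* add, std::size_t size, float* x) {
  kMulAddTable[ChooseTarget(cpu_has_wide_vectors)](mul, add, size, x);
}

}  // namespace sketch
```

Callers only see sketch::CallMulAddLoop; which table entry runs is decided at runtime.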

jan-wassenberg commented 11 months ago

Nice, I understand we want to compile with differing compile flags. This makes sense for MSVC; one could argue that clang/gcc supersede MSVC even on Windows, but certainly MSVC is still being used. Even for clang/gcc we still have the situation that currently it's not possible to generate both SVE2 and SVE code, or RVV and scalar, or NEON vs NEON_WITHOUT_AES. My understanding is that this has actually been fixed for SVE in clang-16, but my distro doesn't have that package yet.

It seems reasonable to support something like this, at least as a stopgap. But one very important constraint: can we ensure that old code with the new headers still compiles?
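One possible way to keep old code compiling, sketched under assumptions: if the two phase macros from the example default to 1 whenever the build does not define them, a single-phase build sees both the per-target bodies and the dispatch code, exactly as before. Where these defaults would live in the headers is not specified in this thread, and the Square/CallSquare functions are hypothetical placeholders.

```cpp
// Assumption: defaults applied when the build system defines neither macro,
// so existing single-phase builds behave as they do today.
#ifndef HWY_IN_PER_TARGET_PHASE
#define HWY_IN_PER_TARGET_PHASE 1
#endif
#ifndef HWY_IN_DYN_DISPATCH_PHASE
#define HWY_IN_DYN_DISPATCH_PHASE 1
#endif

namespace sketch {

// The declaration is visible in every phase.
int Square(int v);

#if HWY_IN_PER_TARGET_PHASE
// Per-target phase: the function body is compiled here.
int Square(int v) { return v * v; }
#endif

#if HWY_IN_DYN_DISPATCH_PHASE
// Dispatch phase: only needs the declaration above.
int CallSquare(int v) { return Square(v); }
#endif

}  // namespace sketch
```

A multi-phase build would instead pass something like -DHWY_IN_PER_TARGET_PHASE=1 -DHWY_IN_DYN_DISPATCH_PHASE=0 for the per-target compilations and the reverse for the dispatch compilation, then link the resulting objects together.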

Pflugshaupt commented 9 months ago

Is it possible to do dynamic dispatch across all targets in one step in Visual Studio when compiling with clang-cl, or does it have the same restrictions as the MSVC compiler when it comes to VEX code, and thus require multiple compilation phases as described above?

jan-wassenberg commented 9 months ago

Hi @Pflugshaupt, we differentiate between HWY_COMPILER_MSVC and HWY_COMPILER_CLANGCL. I believe runtime dispatch would work with the latter, regardless of whether it is invoked via Visual Studio or not.

Pflugshaupt commented 9 months ago

Thank you for your time and quick answer. It made me keep trying and I was able to find the true problem. I can confirm things work fine in general with Visual Studio driving clang-cl. But there appears to be an issue with templates.

The problems I am seeing come from using templates to stay DRY and to avoid branches inside loops. It appears Visual Studio insists on always creating instantiations for templates even if they are fully inlined. Often these would be removed during linking, but in this special case they just don't compile. Combined with the multiple includes done by the dynamic dispatch logic and the changing compiler flags, this seems to lead to disaster, as there seems to be a mixup of namespaces, templates and compiler flags :(. Clearly this was not designed for compiler flags that change inside the same compile unit.

I keep getting "always_inline function 'Load' requires target feature 'ssse3' but would be inlined into function (..) that is compiled without support for ssse3", as soon as I use templates inside the HWY_NAMESPACE inside my own namespace and instantiate these from other functions inside the same namespace. The kind of template I'm using should be 100% inlined. These are just shortcuts for writing less code.

Maybe I'll find some magical compiler trick to get rid of the instantiation, but if not, I'd probably still have to split everything into multiple compile units. And then I might as well not use clang-cl.

Pflugshaupt commented 9 months ago

Update: Just got it to work thanks to this: https://stackoverflow.com/questions/71720201/why-does-msvc-compiler-put-template-instantiation-binaries-in-assembly

However my solution (msvc 2022 + clang-cl) so far is somewhat inelegant and seems to defy logic. It requires

This seems to get rid of the troublesome instances as long as the template is only used in the same compile unit. Hopefully there's a simpler way.

jan-wassenberg commented 9 months ago

hm, the "requires target feature" usually means we are missing a pragma. It is important for all of your SIMD-using code to be between HWY_BEFORE_NAMESPACE/HWY_AFTER_NAMESPACE: these set up a pragma to cover all 'functions' between them. Also, any lambdas require an extra HWY_ATTR before the opening { because lambdas do not count as 'functions'.

Is it possible that this could be an easier solution to the problem?
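A minimal sketch of where HWY_ATTR goes on a lambda: after the parameter list and before the opening {. The helper names ApplyInPlace and ScaleAndOffset are hypothetical, and the lambda body is plain scalar arithmetic so the sketch also compiles without Highway installed; the no-op fallback macros below are an assumption made for that purpose only. In real code the lambda would contain hn:: calls, which is where a missing HWY_ATTR triggers the "requires target feature" error.

```cpp
#include <cstddef>

#if defined(__has_include)
#if __has_include(<hwy/highway.h>)
#include <hwy/highway.h>
#endif
#endif

#ifndef HWY_ATTR
// Assumption for this sketch only: no-op stand-ins when Highway is absent.
#define HWY_ATTR
#define HWY_NAMESPACE N_SKETCH
#define HWY_BEFORE_NAMESPACE() static_assert(true, "")
#define HWY_AFTER_NAMESPACE() static_assert(true, "")
#endif

HWY_BEFORE_NAMESPACE();
namespace project {
namespace HWY_NAMESPACE {

// Hypothetical DRY helper: applies func to every element in place.
// As a function template it lies between BEFORE/AFTER_NAMESPACE, so it is
// covered by the target pragma they set up.
template <class Func>
void ApplyInPlace(float* x, std::size_t size, const Func& func) {
  for (std::size_t i = 0; i < size; ++i) x[i] = func(x[i]);
}

void ScaleAndOffset(float* x, std::size_t size, float scale, float offset) {
  // Lambdas do not count as 'functions' for that pragma, hence the explicit
  // HWY_ATTR between the parameter list and the body.
  ApplyInPlace(x, size,
               [scale, offset](float v) HWY_ATTR { return v * scale + offset; });
}

}  // namespace HWY_NAMESPACE
}  // namespace project
HWY_AFTER_NAMESPACE();
```

The same placement applies to generic lambdas passed to templates, which is the pattern discussed above.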

Pflugshaupt commented 9 months ago

Wow - thanks heaps! That was it! I was aware of HWY_BEFORE_NAMESPACE/HWY_AFTER_NAMESPACE, but I was mixing lambdas and templates with lambda arguments to get as DRY as possible. Adding HWY_ATTR to all lambdas has fixed the issues I was seeing with Visual Studio + clang-cl.

Looking at the docs again I see that there's a HWY_ATTR in the Transform1 example on the main readme (which is similar to what I'm doing) and I unfortunately missed that. Hopefully this conversation helps someone else in the future.

On macOS, things already compiled fine without HWY_ATTR.

jan-wassenberg commented 9 months ago

Nice, glad to hear that was it :)