johnplatts opened this issue 1 year ago (status: Open)
Here is an example of Highway dynamic dispatch code updated to support multi-phase compilation (compiled more than once with different compiler options for the different compilation phases):
// Generates code for every target that this compiler can support.
#undef HWY_TARGET_INCLUDE
#define HWY_TARGET_INCLUDE "example.cpp" // this file
#include <hwy/foreach_target.h> // must come before highway.h
#include <hwy/highway.h>
HWY_BEFORE_NAMESPACE();
namespace project {
namespace HWY_NAMESPACE { // required: unique per target
// Can skip hn:: prefixes if already inside hwy::HWY_NAMESPACE.
namespace hn = hwy::HWY_NAMESPACE;
using T = float;
void MulAddLoop(const T* HWY_RESTRICT mul_array,
                const T* HWY_RESTRICT add_array,
                const size_t size, T* HWY_RESTRICT x_array);
#if HWY_IN_PER_TARGET_PHASE
void MulAddLoop(const T* HWY_RESTRICT mul_array,
                const T* HWY_RESTRICT add_array,
                const size_t size, T* HWY_RESTRICT x_array) {
  const hn::ScalableTag<T> d;
  for (size_t i = 0; i < size; i += hn::Lanes(d)) {
    const auto mul = hn::Load(d, mul_array + i);
    const auto add = hn::Load(d, add_array + i);
    auto x = hn::Load(d, x_array + i);
    x = hn::MulAdd(mul, x, add);
    hn::Store(x, d, x_array + i);
  }
}
#endif
} // namespace HWY_NAMESPACE
} // namespace project
HWY_AFTER_NAMESPACE();
// The table of pointers to the various implementations in HWY_NAMESPACE must
// be compiled only once (foreach_target #includes this file multiple times).
// HWY_ONCE is true for only one of these 'compilation passes'.
#if HWY_ONCE && HWY_IN_DYN_DISPATCH_PHASE
namespace project {
// This macro declares a static array used for dynamic dispatch.
HWY_EXPORT(MulAddLoop);
void CallMulAddLoop(const float* HWY_RESTRICT mul_array,
                    const float* HWY_RESTRICT add_array,
                    const size_t size, float* HWY_RESTRICT x_array) {
  // This must reside outside of HWY_NAMESPACE because it references (calls the
  // appropriate one from) the per-target implementations there.
  // For static dispatch, use HWY_STATIC_DISPATCH.
  return HWY_DYNAMIC_DISPATCH(MulAddLoop)(mul_array, add_array, size, x_array);
}
} // namespace project
#endif // HWY_ONCE && HWY_IN_DYN_DISPATCH_PHASE
Here is a link to the above example on Compiler Explorer that shows the above code compiled with different options for HWY_IN_PER_TARGET_PHASE/HWY_IN_DYN_DISPATCH_PHASE: https://gcc.godbolt.org/z/63xTfh1bj
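To illustrate what "a static array used for dynamic dispatch" means in the example above, here is a minimal plain C++ sketch of the idea. It does not use Highway and is not how HWY_EXPORT or HWY_DYNAMIC_DISPATCH are actually implemented; the names `target_a`, `target_b`, `kMulAddTable` and `ChosenTargetIndex` are hypothetical stand-ins chosen for illustration.

```cpp
#include <cstddef>

namespace project {

// Common signature shared by all per-target implementations.
using MulAddFn = void (*)(const float* mul, const float* add,
                          std::size_t size, float* x);

// Stand-ins for per-target code. In a multi-phase build, each of these would
// be compiled in the per-target phase with that target's compiler flags.
namespace target_a {
void MulAddLoop(const float* mul, const float* add, std::size_t size,
                float* x) {
  for (std::size_t i = 0; i < size; ++i) x[i] = mul[i] * x[i] + add[i];
}
}  // namespace target_a

namespace target_b {
void MulAddLoop(const float* mul, const float* add, std::size_t size,
                float* x) {
  for (std::size_t i = 0; i < size; ++i) x[i] = mul[i] * x[i] + add[i];
}
}  // namespace target_b

// Hypothetical dispatch table: one function pointer per compiled target.
// This part only needs the declarations, so it can be compiled once, in a
// separate phase with baseline flags.
constexpr MulAddFn kMulAddTable[] = {&target_a::MulAddLoop,
                                     &target_b::MulAddLoop};

// Hypothetical placeholder for CPU detection; a real implementation would
// query the CPU's capabilities. Here it always picks the first entry.
inline std::size_t ChosenTargetIndex() { return 0; }

// Target-independent entry point: forwards to the chosen implementation.
void CallMulAddLoop(const float* mul, const float* add, std::size_t size,
                    float* x) {
  kMulAddTable[ChosenTargetIndex()](mul, add, size, x);
}

}  // namespace project
```

The point of the split is that the table and the wrapper refer to the per-target functions only through their declarations, which is why the declaration of MulAddLoop in the example above sits outside the `#if HWY_IN_PER_TARGET_PHASE` block.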
Nice, I understand we want to compile with differing compile flags. This makes sense for MSVC; one could argue that clang/gcc supersede MSVC even on Windows, but certainly MSVC is still being used. Even for clang/gcc we still have the situation that currently it's not possible to generate both SVE2 and SVE code, or RVV and scalar, or NEON vs NEON_WITHOUT_AES. My understanding is that this has actually been fixed for SVE in clang-16, but my distro doesn't have that package yet.
It seems reasonable to support something like this, at least as a stopgap. But one very important constraint: can we ensure that old code with the new headers still compiles?
Is it possible to do dynamic dispatch across all targets in one step in Visual Studio when compiling with clang-cl, or does it have the same restrictions as the MSVC compiler when it comes to VEX-encoded code, and would it thus require multiple compilation phases as described above?
Hi @Pflugshaupt , we differentiate between HWY_COMPILER_MSVC and HWY_COMPILER_CLANGCL. I believe runtime dispatch would work with the latter, independently of whether invoked via Visual Studio or not.
Thank you for your time and quick answer. It made me keep trying and I was able to find the true problem. I can confirm things work fine with visual studio driving clang-cl in general. But there appears to be an issue with templates.
The problems I am seeing come from using templates to stay DRY and to avoid branches inside loops. It appears Visual Studio insists on always creating instantiations for templates even if they are fully inlined. Often these would be removed during linking, but in this special case they just don't compile. Combined with the multiple includes performed by the dynamic dispatch logic and the changing compiler flags, this seems to lead to disaster, as there seems to be a mixup of namespaces, templates and compiler flags :(. Clearly this was not designed for compiler flags that change inside the same compile unit.
I keep getting "always_inline function 'Load' requires target feature 'ssse3' but would be inlined into function (..) that is compiled without support for ssse3", as soon as I use templates inside the HWY_NAMESPACE inside my own namespace and instantiate these from other functions inside the same namespace. The kind of template I'm using should be 100% inlined. These are just shortcuts for writing less code.
Maybe I'll find some magical compiler trick to get rid of the instantiation, but if not, I'd probably still have to split everything into multiple compile units. And then I might as well not use clang-cl.
Update: Just got it to work thanks to this: https://stackoverflow.com/questions/71720201/why-does-msvc-compiler-put-template-instantiation-binaries-in-assembly
However my solution (msvc 2022 + clang-cl) so far is somewhat inelegant and seems to defy logic. It requires
This seems to get rid of the troublesome instances as long as the template is only used in the same compile unit. Hopefully there's a simpler way.
hm, the "requires target feature" usually means we are missing a pragma. It is important for all of your SIMD-using code to be between HWY_BEFORE_NAMESPACE/HWY_AFTER_NAMESPACE: these set up a pragma to cover all 'functions' between them. Also, any lambdas require an extra HWY_ATTR before the opening { because lambdas do not count as 'functions'.
Is it possible that this could be an easier solution to the problem?
Wow - thanks heaps! That was it! I was aware of HWY_BEFORE_NAMESPACE/HWY_AFTER_NAMESPACE, but I was mixing lambdas and templates with lambda arguments to get as DRY as possible and adding HWY_ATTR to all lambdas has fixed the issues I was seeing on msvs + clang-cl.
Looking at the docs again I see that there's a HWY_ATTR in the Transform1 example on the main readme (which is similar to what I'm doing) and I unfortunately missed that. Hopefully this conversation helps someone else in the future.
Even before this change, things compiled fine on macOS without HWY_ATTR.
Nice, glad to hear that was it :)
There are some dynamic dispatch scenarios that require compiling the same C++ source files more than once (but with different C++ flags for some of the compilation phases), such as x86-64 with MSVC if AVX2/AVX3 targets are enabled, AArch64 if SVE/SVE2 targets are enabled, or PPC if PPC8/PPC9/PPC10 targets are enabled.
Here are the compilation phases for multi-phase compilation with MSVC on x86-64:
Here are the compilation phases for multi-phase compilation for AArch64 with SVE/SVE2 enabled:
Here are the compilation phases for multi-phase compilation for PPC64:
There are real-world use cases for multiple compilation dynamic dispatch, including improved performance on PPC9/PPC10/AArch64.