Open llvmbot opened 7 years ago
Having parameterized target attributes as suggested in #7 would help a lot. AFAIK the only other alternative is to implement the 'portable' SIMD algorithms as a macro that is expanded inside two noinline functions carrying the SSE4 and AVX2 target attributes.
Apologies if my compiler-fu is rusty. Is it possible to do the following: given a function f that is called under __invoke_with_attribute, clone it, allow all functions to be inlined into it, and if one is chosen for inlining, clone it as well (in case its address is taken). Do this transitively.
lib/Target/X86/X86TargetTransformInfo.cpp:bool X86TTIImpl::areInlineCompatible(const Function *Caller,
:)
We already do the #1.
For reference, that TTI hook is:

```cpp
bool areInlineCompatible(const Function *Caller,
                         const Function *Callee) const {
  return (Caller->getFnAttribute("target-cpu") ==
          Callee->getFnAttribute("target-cpu")) &&
         (Caller->getFnAttribute("target-features") ==
          Callee->getFnAttribute("target-features"));
}
```

which the inliner gets to from: functionsHaveCompatibleAttributes()
Indeed. This could be better (we could allow inlining when the caller has a superset of the capabilities of the callee).
Function level attributes are amazingly easier to implement. The other isn't impossible, but the ramifications inside the compiler are unknown (and Intel's documentation really isn't clear here either).
The function-level attributes are easier to implement, but we either need to:
or we end up back with this more-general problem. Unfortunately, properly implementing this is going to end up touching a large fraction of the common passes (specifically, lots of MI-level passes -- there are now registers and instructions that can only be used in part of the CFG -- and IR users of TTI will need to make their queries CFG-dependent). Not a small project.
It will be enough, but the library will become less ergonomic.
Instead of adding only a template parameter, we would have to add the attribute as well. If the dispatch point is high enough, this means annotating the transitive closure of all functions.
I am guessing that a function-level switch of features is easier to implement than a basic-block-level switch. If that's the case, something like this will work:

__invoke_with_attribute((xxx), fn, args);

The semantics are: calls fn(args...) as if fn had __attribute__((xxx)).
Would it be enough for your purposes if we extended __attribute__((target)) to allow stuff like this:
template
SelectCPU(BroadWordSearch, &words, 123);
If Clang was smart enough to take constexpr C strings as targets in addition to string literals, would that be workable?
Let me expand on what I am doing and why it is important.
I have a library approach to function multiversioning, which gives the developer the ability to build binaries that run on a collection of microarchs at close-to-optimal speed, with full control over code size and amortization of dispatch cost, while at the same time allowing maximum code sharing.
Please bear with the code; I tried to keep it to a minimum. I have put comments where each piece of the puzzle is implemented and how it all fits together:
https://gist.github.com/alkis/7fd9678e64ae885fd9a4135ee7411360
To see the benefits of this approach, assume that our higher level algorithm (or kernel) is BroadWordSearch. We want to find the position of the n'th 1 in a binary stream:
template
SelectCPU(BroadWordSearch, &words, 123);
So now we have a single version of the algorithm, BroadWordSearch, while the microarch-specific bits are extracted out into a library. More importantly, we can also choose to put the microarch dispatch at BroadWordSearch, or at any higher-level function we want (to amortize the cost of dispatch). We can even put the dispatch inside a constructor and instantiate N different subclasses of a base class, effectively turning the table-based dispatch into an indirect-function-call dispatch at will. We can also choose how many copies of the code we get by changing the number of CPUs we handle in SelectCPU(). Unfortunately, the compiler neither inlines our ops nor compiles our kernel assuming it will really run under the target we specify. With an __allow_cpu_features intrinsic inside each case of the dispatch code, this becomes possible.
I think there is very little interest in tracking CPU feature availability in the control-flow graph. We have function-level CPU feature tracking through __attribute__((target)), and I hope this is enough for most users.
The compiler should be able to (depending on port) inline functions with compatible features. There's a hook in the inliner to check this sort of thing.
This looks useful indeed. Is the compiler smart enough to inline functions with the same set of features? More generally, can it inline functions for implied features? Specifically, if f calls g, f has feature bmi2, and g has feature bmi, then in practice g can always be inlined, because there is no arch that supports bmi2 but not bmi.
I don't know. Eric, do you know?
Not exactly the same thing, but you might find this useful: http://clang.llvm.org/docs/AttributeReference.html#target-gnu-target
Extended Description
https://software.intel.com/en-us/articles/new-intrinsic-allow-cpu-features-support
This intrinsic, among other things, makes it possible to build composable components that are optimal for all microarchitectures in a single binary (something like function multiversioning, but better).
Bonus points if all instructions dominated by a call to the intrinsic are also scheduled as per the best matching CPU:
```c
if (skylake) {
  __allow_cpu_features(_FEATURE_BMI2);
  // do stuff
  // use bmi2 intrinsics here even if -march=X doesn't support them
}
```
Code in "// do stuff", etc. should be tuned as if the CPU were Haswell, even if the rest of the code is tuned for Westmere. The rationale is that Haswell is the earliest processor supporting _FEATURE_BMI2, and it is no older than the processor selected for the file.