Open llvmbot opened 7 years ago
Having parameterized target attributes as suggested in #7 would help a lot. AFAIK the only other alternative is to implement the 'portable' SIMD algorithms as a macro that is expanded inside two noinline functions carrying the SSE4 and AVX2 target attributes.
Apologies if my compiler-fu is rusty. Is it possible to do the following: given a function f that is called under __invoke_with_attribute, clone it, allow all functions to be inlined into it, and if one is chosen for inlining, clone it as well (in case its address is taken). Do this transitively.
lib/Target/X86/X86TargetTransformInfo.cpp:bool X86TTIImpl::areInlineCompatible(const Function *Caller,
:)
We already do the #1.
For reference, that TTI hook is:

```cpp
bool areInlineCompatible(const Function *Caller,
                         const Function *Callee) const {
  return (Caller->getFnAttribute("target-cpu") ==
          Callee->getFnAttribute("target-cpu")) &&
         (Caller->getFnAttribute("target-features") ==
          Callee->getFnAttribute("target-features"));
}
```

which the inliner gets to from: functionsHaveCompatibleAttributes()
Indeed. This could be better (we could allow inlining when the caller has a superset of the capabilities of the callee).
Function level attributes are amazingly easier to implement. The other isn't impossible, but the ramifications inside the compiler are unknown (and Intel's documentation really isn't clear here either).
The function-level attributes are easier to implement, but we either need to:
or we end up back with this more-general problem. Unfortunately, properly implementing this is going to end up touching a large fraction of the common passes (specifically, lots of MI-level passes -- there are now registers and instructions that can only be used in part of the CFG -- and IR users of TTI will need to make their queries CFG-dependent). Not a small project.
It will be enough, but the library will become less ergonomic.
Instead of adding only a template parameter, we would have to add the attribute as well. If the dispatch point is high enough, this means annotating the transitive closure of all functions.
I am guessing that a function-level switch of features is easier to implement than a basic-block-level switch. If that's the case, something like this will work:

__invoke_with_attribute((xxx), fn, args);

The semantics are: calls fn(args...) as if fn had __attribute__((xxx)).
Would it be enough for your purposes if we extended __attribute__((target)) to allow stuff like this:
template
SelectCPU(BroadWordSearch, &words, 123);
If Clang was smart enough to take constexpr C strings as targets in addition to string literals, would that be workable?
Let me expand on what I am doing and why it is important.
I have a library approach to function multiversioning, which gives the developer the ability to build binaries that run on a collection of microarchs at close-to-optimal speed, with full control over code size and amortization of dispatch cost, while at the same time allowing maximum code sharing.
Please bear with the code; I tried to keep it to a minimum. I have put comments where each piece of the puzzle is implemented and how it all fits together:
https://gist.github.com/alkis/7fd9678e64ae885fd9a4135ee7411360
To see the benefits of this approach, assume that our higher level algorithm (or kernel) is BroadWordSearch. We want to find the position of the n'th 1 in a binary stream:
template
SelectCPU(BroadWordSearch, &words, 123);
So now we have a single version of the algorithm, BroadWordSearch, while the microarch-specific bits are extracted out into a library. More importantly, we can also choose to put the microarch dispatch at BroadWordSearch, or at any higher-level function we want (to amortize the cost of dispatch). We can even put the dispatch inside a constructor and instantiate N different subclasses of a base class, effectively turning the table-based dispatch into an indirect-function-call dispatch at will. We can also choose how many copies of the code we get by changing the number of CPUs we handle in SelectCPU(). Unfortunately, the compiler neither inlines our ops nor compiles our kernel assuming it will really run under the target we specify. With an __allow_cpu_features intrinsic inside each case of the dispatch code, this becomes possible.
I think there is very little interest in tracking CPU feature availability in the control-flow graph. We have function-level CPU feature tracking through __attribute__((target)), and I hope this is enough for most users.
The compiler should be able to (depending on port) inline functions with compatible features. There's a hook in the inliner to check this sort of thing.
This looks useful indeed. Is the compiler smart enough to inline functions with the same set of features? More generally, can it inline functions for implied features? Specifically, if f calls g, f has feature bmi2, and g has feature bmi, then in practice g can always be inlined, because there is no arch that supports bmi2 but not bmi.
I don't know. Eric, do you know?
Not exactly the same thing, but you might find this useful: http://clang.llvm.org/docs/AttributeReference.html#target-gnu-target
Extended Description
https://software.intel.com/en-us/articles/new-intrinsic-allow-cpu-features-support
This intrinsic, among other things, makes it possible to build composable components that are optimal for all microarchitectures in a single binary (something like function multiversioning, but better).
Bonus points if all instructions dominated by a call to the intrinsic are also scheduled as per the best matching CPU:
```c
if (skylake) {
  __allow_cpu_features(_FEATURE_BMI2);
  // do stuff
  // use bmi2 intrinsics here even if -march=X doesn't support them
}
```
Code in "// do stuff", etc. should be tuned as if the CPU were Haswell, even if the rest of the code is tuned for Westmere. The rationale is that Haswell is the earliest processor supporting _FEATURE_BMI2, and it is no older than the processor selected for the file.