llvm / llvm-project

The LLVM Project is a collection of modular and reusable compiler and toolchain technologies.
http://llvm.org

llvm.sin.v4f32 with -vector-library=SVML fails to lower to SVML function call #37876

Open 54aefcd4-c07d-4252-8441-723563c8826f opened 6 years ago

54aefcd4-c07d-4252-8441-723563c8826f commented 6 years ago
Bugzilla Link 38528
Version trunk
OS All
CC @chandlerc,@hfinkel,@RKSimon,@rotateright

Extended Description

Given

```llvm
define <4 x float> @sin_4x32(<4 x float> %a) {
  %b = tail call <4 x float> @llvm.sin.v4f32(<4 x float> %a)
  ret <4 x float> %b
}
declare <4 x float> @llvm.sin.v4f32(<4 x float>)
```

this always produces a sequence of four libm sinf calls when run through opt and/or llc with -vector-library=SVML.

Instead, this should lower to a single call to the appropriate SVML (or Accelerate, libmvec, ...) function.
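For illustration, the desired output would look something like the following. This is a hand-written sketch, not actual compiler output; `__svml_sinf4` is the entry-point name LLVM's SVML mapping uses for 4-wide single-precision sin.

```llvm
; Hypothetical desired lowering: one vector-library call for the
; whole <4 x float>, instead of four scalar sinf calls.
define <4 x float> @sin_4x32(<4 x float> %a) {
  %b = tail call <4 x float> @__svml_sinf4(<4 x float> %a)
  ret <4 x float> %b
}
declare <4 x float> @__svml_sinf4(<4 x float>)
```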

rotateright commented 6 years ago

Possibly related to this bug - I posted a legalization proposal to try to fix bug 38527 here: https://reviews.llvm.org/D50791

hfinkel commented 6 years ago

> > I don't think it will change that, unless we also start using this infrastructure to handle scalar calls too.
>
> How are the scalar calls handled differently?
>
> I mean, LLVM still has to replace them with calls to libm, or whatever the system's library for this is. Is this library configurable just like the vector library is? If so, I would expect both situations to be fairly similar. Or what is the difference, and why do they have to be handled differently?

They're not handled the same way today, and no, the mechanism for scalar calls is tied to the target triple (to provide the fall-back function-call lowering) and other target features (to handle cases where the operations are legal or have some backend-provided lowering), and thus not configurable in the same way as the vector-lib state is.

We could handle them in the same way, but that just implies using the vector-lib infrastructure in SDAG. This might indeed be a good idea.

54aefcd4-c07d-4252-8441-723563c8826f commented 6 years ago

> I don't think it will change that, unless we also start using this infrastructure to handle scalar calls too.

How are the scalar calls handled differently?

I mean, LLVM still has to replace them with calls to libm, or whatever the system's library for this is. Is this library configurable just like the vector library is? If so, I would expect both situations to be fairly similar. Or what is the difference, and why do they have to be handled differently?

hfinkel commented 6 years ago

> For reference, some links to the last SVML-related llvm-dev discussion:
> https://groups.google.com/forum/#!topic/llvm-dev/sSnAen0qbiQ
> http://lists.llvm.org/pipermail/llvm-dev/2018-June/124357.html
> http://lists.llvm.org/pipermail/llvm-dev/2018-July/124393.html
>
> I still don't have a good understanding of all the details, but a late IR pass to translate intrinsics/libcalls to (other) libcalls seems straightforward.
>
> If that lets us remove some of the existing code from SelectionDAGLegalize::ConvertNodeToLibcall(), even better?

I don't think it will change that, unless we also start using this infrastructure to handle scalar calls too.

rotateright commented 6 years ago

For reference, some links to the last SVML-related llvm-dev discussion:
https://groups.google.com/forum/#!topic/llvm-dev/sSnAen0qbiQ
http://lists.llvm.org/pipermail/llvm-dev/2018-June/124357.html
http://lists.llvm.org/pipermail/llvm-dev/2018-July/124393.html

I still don't have a good understanding of all the details, but a late IR pass to translate intrinsics/libcalls to (other) libcalls seems straightforward.

If that lets us remove some of the existing code from SelectionDAGLegalize::ConvertNodeToLibcall(), even better?

hfinkel commented 6 years ago

Yea, the infrastructure here is used by the vectorizer. The problem here is actually one of generality. We can't scale this process by creating intrinsics for every possible call. We do already have some intrinsics, for sin for example, because we specifically optimize those functions, but many more vector math functions exist than those for which we have intrinsics.

Having the intrinsics lowering use the vector-math-library infrastructure in order to call SVML, etc. when enabled makes perfect sense to me too, but no one has written that code. You'd probably want a late IR pass so that it works with FastISel, GISel, and SDAG alike, but I recommend an RFC to flesh out the design.
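The late IR pass suggested here boils down to a (scalar function, vector width, element type) → library-entry-point lookup when rewriting something like llvm.sin.v4f32 into a library call. A minimal sketch of that mapping logic, assuming the `__svml_sinf4`/`__svml_sin2`-style names used by LLVM's SVML tables (the function set and width set below are illustrative, not LLVM's actual data):

```rust
// Illustrative (scalar function, lanes, element type) -> SVML
// entry-point lookup, mimicking the table a late IR pass would
// consult when rewriting a vector math intrinsic into a library call.
fn svml_name(scalar: &str, lanes: u32, is_f32: bool) -> Option<String> {
    // A toy subset of functions; assumed, not exhaustive.
    let known = ["sin", "cos", "exp", "log"];
    if !known.contains(&scalar) {
        return None;
    }
    match (is_f32, lanes) {
        // Single precision takes an 'f' suffix: __svml_sinf4, __svml_sinf8, ...
        (true, 4) | (true, 8) | (true, 16) => {
            Some(format!("__svml_{}f{}", scalar, lanes))
        }
        // Double precision has no suffix: __svml_sin2, __svml_sin4, ...
        (false, 2) | (false, 4) | (false, 8) => {
            Some(format!("__svml_{}{}", scalar, lanes))
        }
        // No vector-library variant for this shape: fall back to
        // scalarization, which is what the bug report observes today.
        _ => None,
    }
}

fn main() {
    assert_eq!(svml_name("sin", 4, true).as_deref(), Some("__svml_sinf4"));
    assert_eq!(svml_name("sin", 2, false).as_deref(), Some("__svml_sin2"));
    assert_eq!(svml_name("tan", 4, true), None); // not in the toy table
    println!("ok");
}
```

The real pass would also have to thread through fast-math and ABI constraints; the lookup above only shows the naming side of the translation.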

54aefcd4-c07d-4252-8441-723563c8826f commented 6 years ago

Also, after going through some patches, mailing lists, etc. trying to figure out what I was doing wrong, it wasn't even clear to me which LLVM component is at fault here, and everything seemed to suggest that the Rust front-end was doing it wrong.

IIUC, the backends don't want anything to do with short vector math libraries AFAICT, and since there are no loops in the IR provided, the loop vectorizer does not apply, so...

Is it actually the job of every LLVM frontend to replace generic vector calls with calls to short vector math libraries when profitable? That's what it seems like, but it just feels wrong.

54aefcd4-c07d-4252-8441-723563c8826f commented 6 years ago

Yes. The Rust portable packed SIMD vector library allows Rust programs to manipulate packed vectors portably. It is not a SPMD programming model, and people use it to write a lot of vectorized code that has no vectorizable loops, such as crypto algorithms, regex engines, and UTF-8 parsers (although some people use it to emulate SPMD, we explicitly discourage that, since something like ISPC would be better).

It generates this IR for the mathematical floating-point vector methods (not just sin, but cos, sqrt, exp, ...). And since all of this is translated to scalar code, the performance is bad.

My mental model of how this was supposed to work in LLVM was completely wrong. I thought that the front-ends would generate either scalar code or generic LLVM-IR vector code, and that the loop and SLP vectorizers would then, with knowledge of the available short vector math libraries and target information from the backend, transform this generic vector code further, so that in the end the backend replaces these operations with calls to SVML or libmvec where profitable.

Instead, what I am actually doing is wrapping SVML and libmvec in Rust directly and calling them myself from the front end.

For v2f32 sin, I widen it to a v4f32, call an SVML function, and then shuffle the first two elements back out. I have to do this with only the architecture and target-feature information, and I have no idea how this will interact with optimizations. If I had to bet, I'd say it will inhibit all of them, so that sinf(v2f32(0.0, 0.0)) won't be a no-op, but will instead call SVML unconditionally.

I think the approach I am currently following is therefore horrible, and I am very much open to better suggestions about how to work around this bug.
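The widen-call-shuffle workaround described above can be sketched as follows. `svml_sinf4` is a hypothetical stand-in for the real `__svml_sinf4` entry point (which would be an extern "C" call into libsvml); here it is emulated with scalar sin calls so the example is self-contained and runnable:

```rust
// Hypothetical stand-in for SVML's __svml_sinf4. In the actual
// workaround this would be an opaque foreign call, which is exactly
// why the optimizer can no longer fold sin(0.0) away.
fn svml_sinf4(v: [f32; 4]) -> [f32; 4] {
    [v[0].sin(), v[1].sin(), v[2].sin(), v[3].sin()]
}

// The v2f32 workaround: widen to v4f32, call the 4-wide routine,
// then shuffle the first two lanes back out.
fn sin_v2f32(v: [f32; 2]) -> [f32; 2] {
    let widened = [v[0], v[1], 0.0, 0.0]; // pad the upper lanes
    let r = svml_sinf4(widened);
    [r[0], r[1]]
}

fn main() {
    let r = sin_v2f32([0.0, 0.0]);
    assert_eq!(r, [0.0, 0.0]);
    println!("ok");
}
```

With the emulated function the compiler could in principle constant-fold this; behind a real opaque library call it cannot, which is the optimization-inhibiting effect described above.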

rotateright commented 6 years ago

There was a patch related to this recently: https://reviews.llvm.org/D47610

...but in this case, the vector call is created outside of the vectorizers? The problem occurs for all potential SVML functions, not just 'sin'?