llvm / llvm-project

The LLVM Project is a collection of modular and reusable compiler and toolchain technologies.
http://llvm.org

NEON intrinsics prevent removing redundant store and load on armv7 #22152

Closed llvmbot closed 8 years ago

llvmbot commented 9 years ago
Bugzilla Link 21778
Resolution FIXED
Resolved on Feb 29, 2016 10:29
Version trunk
OS All
Reporter LLVM Bugzilla Contributor
CC @ahmedbougacha, @hfinkel, @rengolin

Extended Description

I noticed that LLVM was unable to eliminate redundant stores and loads when I added some NEON intrinsics to my 4x4 matrix multiply function.

I've created a smaller example that shows the same problem. This is using "Apple LLVM version 6.0 (clang-600.0.56) (based on LLVM 3.5svn)" (from the latest Xcode bundle with the iOS SDK), targeting armv7, using -O3.
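For reference, an invocation along these lines should reproduce the assembly shown below (the file names and the SDK selection are my assumptions, not taken from the report):

    xcrun --sdk iphoneos clang++ -arch armv7 -O3 -S vec4_test.cpp -o vec4_test.s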

Here is the test code without intrinsics:

struct vec4 { float data[4]; };

vec4 operator* (const vec4& a, const vec4& b)
{
    vec4 result;
    for (int i = 0; i < 4; ++i)
        result.data[i] = a.data[i] * b.data[i];
    return result;
}

void TestVec4Multiply(vec4& a, vec4& b, vec4& result) { result = a * b; }

void TestVec4Multiply3(vec4& a, vec4& b, vec4& c, vec4& result) { result = a * b * c; }

In this case, the vectorizer actually generates the optimal code:

__Z16TestVec4MultiplyR4vec4S0_S0_:
@ BB#0:
        vld1.32 {d16, d17}, [r1]
        vld1.32 {d18, d19}, [r0]
        vmul.f32 q8, q9, q8
        vst1.32 {d16, d17}, [r2]
        bx lr

__Z17TestVec4Multiply3R4vec4S0_S0_S0_:
@ BB#0:
        vld1.32 {d16, d17}, [r1]
        vld1.32 {d18, d19}, [r0]
        vmul.f32 q8, q9, q8
        vld1.32 {d18, d19}, [r2]
        vmul.f32 q8, q8, q9
        vst1.32 {d16, d17}, [r3]
        bx lr

With my actual matrix multiply code the vectorizer is not as successful, hence my wanting to help the compiler out with some intrinsics. Here's a replacement of the operator* with an implementation using NEON intrinsics:

vec4 operator* (const vec4& a, const vec4& b)
{
    vec4 result;

    float32x4_t result_data = vmulq_f32(vld1q_f32(a.data), vld1q_f32(b.data));
    vst1q_f32(result.data, result_data);

    return result;
}

Unfortunately the generated code now has some redundant stores and loads:

__Z16TestVec4MultiplyR4vec4S0_S0_:
@ BB#0:
        sub sp, #16
        vld1.32 {d16, d17}, [r1]
        vld1.32 {d18, d19}, [r0]
        mov r0, sp
        vmul.f32 q8, q9, q8
        vst1.32 {d16, d17}, [r0]
        vld1.32 {d16, d17}, [r0]
        vst1.32 {d16, d17}, [r2]
        add sp, #16
        bx lr

__Z17TestVec4Multiply3R4vec4S0_S0_S0_:
@ BB#0:
        sub sp, #32
        vld1.32 {d16, d17}, [r1]
        vld1.32 {d18, d19}, [r0]
        mov r0, sp
        vmul.f32 q8, q9, q8
        vst1.32 {d16, d17}, [r0]
        vld1.32 {d16, d17}, [r2]
        vld1.32 {d18, d19}, [r0]
        add r0, sp, #16
        vmul.f32 q8, q9, q8
        vst1.32 {d16, d17}, [r0]
        vld1.32 {d16, d17}, [r0]
        vst1.32 {d16, d17}, [r3]
        add sp, #32
        bx lr

These seem to be especially bad news on many ARM cores. See here: http://lists.freedesktop.org/archives/pixman/2011-August/001398.html

In my testing of 4x4 matrix multiply, the version with the temporaries ends up about 3x slower than code that has them eliminated.

rengolin commented 8 years ago

I'm closing this bug, since it was related to the (already fixed) problem with intrinsics. The remaining problem with the SLP vectoriser should be covered by bug #26773.

rengolin commented 8 years ago

Revisiting this bug, I realised that Clang can now compile both the intrinsic and pointer variants identically and optimally:

_ZmlRK4vec4S1_:                 @ @_ZmlRK4vec4S1_
        .fnstart
@ BB#0:                         @ %entry
        vld1.32 {d16, d17}, [r0]
        vld1.32 {d18, d19}, [r1]
        vmul.f32 q0, q9, q8
        bx lr

The loop version:

vec4 operator* (const vec4& a, const vec4& b)
{
    vec4 result;
    for (int i = 0; i < 4; ++i)
        result.data[i] = a.data[i] * b.data[i];
    return result;
}

Still produces loads into S registers from GPRs, because the arguments are not vectors. The loop vectorizer doesn't get a chance at that loop, because it's scalarized before it sees it (the iteration space is too small). The SLP vectorizer doesn't kick in because there are no stores in that loop.

Forcing the loop vectorizer to run earlier would break a lot of other assumptions about the state of the IR.

Changing the SLP vectorizer to treat returns as stores may get us what we want and, in turn, add a whole new class of optimisations. Hal should know more about it.
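As an illustration of the "no stores" point (this variant is my own sketch, not code from the report): writing the result through a reference out-parameter gives the SLP vectoriser real stores it could root on, unlike the return-by-value loop, whose local result is optimised away before the vectoriser runs.

struct vec4 { float data[4]; };

// Hypothetical variant of the multiply: the four scalar stores into 'out'
// are a seed the SLP vectoriser can try to combine into one vector store,
// whereas the return-by-value form leaves it nothing to root on.
void multiply_into(const vec4& a, const vec4& b, vec4& out)
{
    for (int i = 0; i < 4; ++i)
        out.data[i] = a.data[i] * b.data[i];
}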

llvmbot commented 9 years ago

However, that issue now has lower priority, since the workaround is sane, correct code.

Thanks for confirming that - I was intending to follow up here rather than on the mailing list.

This helps on GCC too, which also suffers from the redundant instructions when using the standard intrinsics (with both 4.8 and 4.9 from the Android NDK toolchain).

rengolin commented 9 years ago

Simon has found a workaround that is actually better than NEON intrinsics, by using pointers:

// aligned(4): vec4's float array is only guaranteed 4-byte alignment, so
// loads and stores through this type must not assume float32x4_t's usual
// 16-byte alignment.
typedef float32x4_t __attribute__((aligned(4))) f32x4_align4_t;

vec4 operator* (const vec4& a, const vec4& b)
{
    vec4 result;

    float32x4_t a_data = *((f32x4_align4_t*)a.data);
    float32x4_t b_data = *((f32x4_align4_t*)b.data);

    float32x4_t result_data = vmulq_f32(a_data, b_data);

    *((f32x4_align4_t*)result.data) = result_data;

    return result;
}

This gives exactly the same result as the auto-vectorised code that doesn't use vmulq_f32(). While this is a good thing (the load/store intrinsics are unnecessary anyway), the issue remains that the load/store intrinsics are not correctly mapped.

However, that issue now has lower priority, since the workaround is sane, correct code.

rengolin commented 9 years ago

However, replacing the specialization with a manually unrolled C one:

    tmp[0] = l[0]*r[0] + l[4]*r[1] + l[8]*r[2] + l[12]*r[3];
    tmp[1] = l[1]*r[0] + l[5]*r[1] + l[9]*r[2] + l[13]*r[3];
    etc...

That one does vectorize, and seems to have reasonable redundant temporary elimination for chained multiplies, but is still not perfect (it's around 2x slower than the manual vectorization).

So, this opens yet another problem: Clang is not generating IR from templates that makes it easier to vectorize. :)

The vectorizer itself seems to be doing a pretty good job, reaching 7.5x performance improvements.

That's why I was focussing on the redundant temporary issue rather than the non-perfect vectorization.

I'd say the most important thing here is to get it right for user code. Template programming is really important for scientific computing, and in that domain, it creates a lot of very redundant and very obvious vectorization candidates. Making that code vectorize, either by changing it or Clang, is the most important thing we can do. However, that's better done on the list or even over IRC, than it is on a bug, I think.

The other two alternatives, to either hand-craft special cases in C++, or hand-craft even more special cases in NEON intrinsics, would be workarounds that I'd rather not have to do.

Regardless of that, we still have the temporaries under NEON intrinsics; that is a bug that needs fixing, and it's the reason this report exists.

llvmbot commented 9 years ago

Simon,

Your simplified scalar version does vectorize, so I'd guess your best bet is to show us the code that doesn't vectorize, so we can find why not. Since you're not doing anything magical, it should be just a case of finding the reason and fixing it.

It's a heavily templated linear algebra library, TooN: https://github.com/edrosten/TooN

I've stuck the intrinsics in a specialization of the matrix multiply operator:

inline Matrix<4, 4, float, RowMajor> operator*(const Matrix<4, 4, float, RowMajor>& m1,
                                               const Matrix<4, 4, float, RowMajor>& m2)
{
    return intrinsics_mat4_multiply(m1, m2);
}
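The helper intrinsics_mat4_multiply itself isn't shown in the report; purely as an illustrative sketch, a hand-vectorized NEON 4x4 row-major multiply along those lines might look roughly like this, written against flat float[16] arrays rather than TooN's Matrix type:

#include <arm_neon.h>

// Sketch only: out = m1 * m2 for row-major 4x4 matrices stored as flat
// float[16] arrays. Each output row is accumulated as a linear combination
// of the rows of m2, scaled by the elements of the corresponding row of m1.
static inline void mat4_multiply_neon(const float* m1, const float* m2, float* out)
{
    const float32x4_t r0 = vld1q_f32(m2 + 0);
    const float32x4_t r1 = vld1q_f32(m2 + 4);
    const float32x4_t r2 = vld1q_f32(m2 + 8);
    const float32x4_t r3 = vld1q_f32(m2 + 12);

    for (int i = 0; i < 4; ++i) {
        float32x4_t acc = vmulq_n_f32(r0, m1[i * 4 + 0]);
        acc = vmlaq_n_f32(acc, r1, m1[i * 4 + 1]);
        acc = vmlaq_n_f32(acc, r2, m1[i * 4 + 2]);
        acc = vmlaq_n_f32(acc, r3, m1[i * 4 + 3]);
        vst1q_f32(out + i * 4, acc);
    }
}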

The standard TooN implementation doesn't appear to vectorize at all (or at least not well - it's 15x slower than the hand-vectorized one).

I agree it's interesting to work out why the standard TooN multiplies don't vectorize well; for small fixed-size matrices the data is allocated on the stack, and the numbers of rows and columns should all be compile-time constants. I'll definitely provide more and better examples of that code if you'd like.

However, replacing the specialization with a manually unrolled C one:

    tmp[0] = l[0]*r[0] + l[4]*r[1] + l[8]*r[2] + l[12]*r[3];
    tmp[1] = l[1]*r[0] + l[5]*r[1] + l[9]*r[2] + l[13]*r[3];
    etc...

That one does vectorize, and seems to have reasonable redundant temporary elimination for chained multiplies, but is still not perfect (it's around 2x slower than the manual vectorization). That's why I was focussing on the redundant temporary issue rather than the non-perfect vectorization.

rengolin commented 9 years ago

Simon,

Your simplified scalar version does vectorize, so I'd guess your best bet is to show us the code that doesn't vectorize, so we can find why not. Since you're not doing anything magical, it should be just a case of finding the reason and fixing it.

That leaves us with two bugs: the unvectorized code (which could continue on the list), and the temporaries being left over, which should be followed up here.

cheers, --renato

llvmbot commented 9 years ago

I posted this on the mailing list. renato.golin@linaro.org responded with this:


If I had to guess, I'd say the intrinsic got in the way of recognising the pattern. vmulq_f32 got correctly lowered to IR as "fmul", but vld1q_f32 is still kept as an intrinsic, so register allocators and schedulers get confused and, when lowering to assembly, you're left with garbage around it.

I've tried expanding out the vst1q_f32 to scalar operations, which has helped a bit:

vec4 operator* (const vec4& a, const vec4& b)
{
    vec4 result;

    float32x4_t result_data = vmulq_f32(vld1q_f32(a.data), vld1q_f32(b.data));
    result.data[0] = vgetq_lane_f32(result_data, 0);
    result.data[1] = vgetq_lane_f32(result_data, 1);
    result.data[2] = vgetq_lane_f32(result_data, 2);
    result.data[3] = vgetq_lane_f32(result_data, 3);

    return result;
}

__Z16TestVec4MultiplyR4vec4S0_S0_:
@ BB#0:
        vld1.32 {d16, d17}, [r1]
        vld1.32 {d18, d19}, [r0]
        vmul.f32 q8, q9, q8
        vst1.32 {d16, d17}, [r2]
        bx lr

__Z17TestVec4Multiply3R4vec4S0_S0_S0_:
@ BB#0:
        sub sp, #16
        vld1.32 {d16, d17}, [r1]
        vld1.32 {d18, d19}, [r0]
        mov r0, sp
        vmul.f32 q8, q9, q8
        vst1.32 {d16, d17}, [r0]
        vld1.32 {d16, d17}, [r2]
        vld1.32 {d18, d19}, [r0]
        vmul.f32 q8, q9, q8
        vst1.32 {d16, d17}, [r3]
        add sp, #16
        bx lr

That's got rid of one of the temporaries, but I imagine the load is still not properly visible to the right bit of the optimizer. I attempted to replace that with some vsetq_lane intrinsics but they didn't generate valid code. I might be using them wrong, or that may be another bug...

If anyone can come up with a scalar incantation that will be auto-vectorized to a vld1.32 then that might provide a workaround for now.
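Regarding the vsetq_lane attempt mentioned above, a correct-looking use of those intrinsics would be something like the sketch below. This is my own guess at the incantation; whether a given toolchain actually folds the four lane inserts into a single vld1.32 is untested and is essentially the open question here.

#include <arm_neon.h>

// Hypothetical replacement for vld1q_f32: build the vector lane by lane.
// The lane indices must be compile-time constants; whether the backend
// recognises this as a single 128-bit load is not guaranteed.
static inline float32x4_t load_f32x4_by_lane(const float* p)
{
    float32x4_t v = vdupq_n_f32(0.0f);
    v = vsetq_lane_f32(p[0], v, 0);
    v = vsetq_lane_f32(p[1], v, 1);
    v = vsetq_lane_f32(p[2], v, 2);
    v = vsetq_lane_f32(p[3], v, 3);
    return v;
}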

llvmbot commented 9 years ago

assigned to @rengolin