Performance of unaligned NEON loads via intrinsics is poor

johannkoenig commented 9 years ago


Bugzilla Link	24421
Resolution	FIXED
Resolved on	Dec 08, 2016 17:05
Version	3.6
OS	Linux
Attachments	Force unaligned access
CC	@asl,@efriedma-quic,@jmolloy

Extended Description

In libvpx we have disabled vp8_bilinear_predict4x4_neon (and the sixtap 4x4) because of issues with clang. At first it was only the Apple version but it is now in the versions being tested by Chromium.

When presented with src0 = vld1_lane_u32((const uint32_t *)src_ptr, src0, 0);

clang will automatically add a 32 bit alignment: vld1.32 {d18[0]}, [r0:32]

whether or not it is safe to do so. In this particular case it is much easier (and substantially faster) to cast and load 4 bytes instead of 4 individual 1 byte loads.

https://code.google.com/p/webm/issues/detail?id=817 https://chromium.googlesource.com/webm/libvpx/+/master/vp8/common/arm/neon/sixtappredict_neon.c

https://code.google.com/p/webm/issues/detail?id=892 https://chromium.googlesource.com/webm/libvpx/+/master/vp8/common/arm/neon/bilinearpredict_neon.c

/Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/clang -isysroot /Applications/Xcode.app/Contents/Developer/Platforms/iPhoneOS.platform/Developer/SDKs/iPhoneOS8.3.sdk -arch armv7 -isysroot /Applications/Xcode.app/Contents/Developer/Platforms/iPhoneOS.platform/Developer/SDKs/iPhoneOS8.3.sdk -DNDEBUG -O3 -fno-strict-aliasing -c -o bug.c.o bug.c

Apple LLVM version 6.1.0 (clang-602.0.53) (based on LLVM 3.6.0svn) Target: x86_64-apple-darwin14.4.0 Thread model: posix

$ otool -tv bug.c.o
bug.c.o: (TEXT,text) section _main: 00000000 2000 movs r0, #0x0 00000002 4770 bx lr _bug: 00000004 f1000212 add.w r2, r0, #0x12 00000008 f9e0283f vld1.32 {d18[0]}, [r0:32] 0000000c f9e2083f vld1.32 {d16[0]}, [r2:32] 00000010 f100021b add.w r2, r0, #0x1b 00000014 3009 adds r0, #0x9 00000016 f9e208bf vld1.32 {d16[1]}, [r2:32] 0000001a f9e028bf vld1.32 {d18[1]}, [r0:32] 0000001e f1010009 add.w r0, r1, #0x9 00000022 ff4208a0 vsub.i8 d16, d18, d16 00000026 f9c1083f vst1.32 {d16[0]}, [r1:32] 0000002a f9c008bf vst1.32 {d16[1]}, [r0:32] 0000002e ef200110 vorr d0, d0, d0 00000032 4770 bx lr

efriedma-quic commented 7 years ago

Trunk clang generates the expected code given the "attribute((aligned(1)))" annotation. If you want to avoid gcc extensions, you can also memcpy to a local uint32_t... but that doesn't optimize well using gcc.

johannkoenig commented 9 years ago

If you fix it in trunk I can add a version check on that particular function. We build in too many places and have too many compiler versions to track down.

Having the ability to hint to the compiler that the data is unaligned is valuable though and I can see using this in the future when it's more prevalent. In that case, it would be nice to make sure it works with other data types as well.

jmolloy commented 9 years ago

Excellent. So the next problem is fixing the performance of the generated code.

If I fix it in trunk, is that good enough for you, or are you using a released version (and we need potentially a more horrific workaround)?

johannkoenig commented 9 years ago

Thanks for the information! Yes, using an unaligned type for casts that we know may be unaligned would be acceptable. I realize I'm forcing something that is unexpected so being able to annotate that would be ideal.

jmolloy commented 9 years ago

Hi Johann,

Unfortunately you've broken the alignment rules of C behind the compiler's back, which is why it produces incorrect code. uint32_t must be 4-byte aligned, and it knows that, so it provides the hint.

After chatting a bit with our NEON intrinsics architect (who also speaks for GCC) unfortunately the only correct way to do this is rather nastily to create an unaligned uint32_t type using attribute((aligned)):

typedef uint32_t attribute((aligned(1))) uint32_unaligned_t;

uint32x2_t f(uint32x2_t src0, uint8_t src_ptr) { // I will break. return vld1_lane_u32((const uint32_t )src_ptr, src0, 0); }

uint32x2_t g(uint32x2_t src0, uint8_t src_ptr) { // I won't produce broken code. return vld1_lane_u32((const uint32_unaligned_t )src_ptr, src0, 0); }

Unfortunately, our codegen for the latter snippet is very bad (we bail out to scalar registers). This should be fairly easily fixable however.

Is the above edit one that you can reasonably make to your source code?

Cheers,

James

johannkoenig commented 9 years ago

assigned to @jmolloy

llvm / llvm-project

Performance of unaligned NEON loads via intrinsics is poor #24795

Extended Description