llvm / llvm-project

ARMv7a: inefficient code generated from memcpy + bswap builtin #50963

Open llvmbot opened 3 years ago

llvmbot commented 3 years ago
Bugzilla Link 51621
Version trunk
OS Linux
Reporter LLVM Bugzilla Contributor
CC @DMG862, @smithp35

Extended Description

Also applies to: armv7-a clang 11.0.1

Godbolt: https://godbolt.org/z/9f1GKf5qe

Consider this code:

```c
#include <stdint.h>

uint32_t read_unaligned_memcpy_bswap_32(const uint8_t *buf, int offset) {
    uint32_t val;
    __builtin_memcpy(&val, buf + offset, 4);
    return __builtin_bswap32(val);
}

uint32_t read_unaligned_shift_add_32(const uint8_t *buf, int offset) {
    return (((uint32_t)buf[offset]) << 24) +
           (((uint32_t)buf[offset + 1]) << 16) +
           (((uint32_t)buf[offset + 2]) << 8) +
           (((uint32_t)buf[offset + 3]) << 0);
}
```

On many architectures, e.g. ARMv8, these produce identical and efficient code. On ARMv7-A, the __builtin_bswap32 version produces what looks like worse code than the shift+add version (I admit I don't know the architecture well enough to be sure, but the result is 14 instructions as opposed to 8):

```asm
read_unaligned_memcpy_bswap_32(unsigned char const*, int):
        ldrb    r1, [r0, r1]!
        ldrb    r2, [r0, #1]
        ldrb    r3, [r0, #2]
        ldrb    r0, [r0, #3]
        orr     r1, r1, r2, lsl #8
        orr     r0, r3, r0, lsl #8
        mov     r2, #16711680
        orr     r0, r1, r0, lsl #16
        mov     r1, #65280
        and     r1, r1, r0, lsr #8
        and     r2, r2, r0, lsl #8
        orr     r1, r1, r0, lsr #24
        orr     r0, r2, r0, lsl #24
        orr     r0, r0, r1
        bx      lr

read_unaligned_shift_add_32(unsigned char const*, int):
        ldrb    r1, [r0, r1]!
        ldrb    r2, [r0, #1]
        ldrb    r3, [r0, #2]
        ldrb    r0, [r0, #3]
        lsl     r2, r2, #16
        orr     r1, r2, r1, lsl #24
        orr     r1, r1, r3, lsl #8
        orr     r0, r1, r0
        bx      lr
```

The same applies to the 16-bit version (see the Godbolt link for code), but the difference is much less dramatic (also, trunk generates one more instruction than 11.0.1 does for the 16-bit bswap version; I don't know how significant that is).

llvmbot commented 3 years ago

Ah, I see. That makes sense. Thanks!

4c958a03-4ad7-407c-a85b-320d9a7221b0 commented 3 years ago

The default architecture for Arm is very old now, and you really have to specify something newer. This is with -march=armv7-a, which is in line with what I would expect: https://godbolt.org/z/qnWrzGKMs
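For reference, passing the architecture explicitly looks like this (the target triple and file names here are illustrative, not taken from the issue):

```shell
# Tell clang to target ARMv7-A so it can use REV and unaligned LDR.
clang --target=armv7a-none-eabi -march=armv7-a -O2 -S read.c -o read.s
```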

```asm
read_unaligned_memcpy_bswap_32(unsigned char const*, int):
        ldr     r0, [r0, r1]
        rev     r0, r0
        bx      lr
read_unaligned_shift_add_32(unsigned char const*, int):
        ldr     r0, [r0, r1]
        rev     r0, r0
        bx      lr
```