llvm / llvm-project

ARMv7a: inefficient code generated from memcpy + bswap builtin #50963

Open llvmbot opened 3 years ago

llvmbot commented 3 years ago
Bugzilla Link 51621
Version trunk
OS Linux
Reporter LLVM Bugzilla Contributor
CC @DMG862, @smithp35

Extended Description

Also applies to: armv7-a clang 11.0.1

Godbolt: https://godbolt.org/z/9f1GKf5qe

Consider this code:

```c
#include <stdint.h>

uint32_t read_unaligned_memcpy_bswap_32(const uint8_t *buf, int offset) {
    uint32_t val;
    __builtin_memcpy(&val, buf + offset, 4);
    return __builtin_bswap32(val);
}

uint32_t read_unaligned_shift_add_32(const uint8_t *buf, int offset) {
    return (((uint32_t)buf[offset]) << 24) +
           (((uint32_t)buf[offset + 1]) << 16) +
           (((uint32_t)buf[offset + 2]) << 8) +
           (((uint32_t)buf[offset + 3]) << 0);
}
```

On many architectures, e.g. ARMv8, these produce identical and efficient code. On ARMv7-A, the __builtin_bswap32 version produces what looks like worse code than the shift+add version (I admit I don't know the architecture well enough to be sure, but the result is 14 instructions as opposed to 8):

```asm
read_unaligned_memcpy_bswap_32(unsigned char const*, int):
        ldrb    r1, [r0, r1]!
        ldrb    r2, [r0, #1]
        ldrb    r3, [r0, #2]
        ldrb    r0, [r0, #3]
        orr     r1, r1, r2, lsl #8
        orr     r0, r3, r0, lsl #8
        mov     r2, #16711680
        orr     r0, r1, r0, lsl #16
        mov     r1, #65280
        and     r1, r1, r0, lsr #8
        and     r2, r2, r0, lsl #8
        orr     r1, r1, r0, lsr #24
        orr     r0, r2, r0, lsl #24
        orr     r0, r0, r1
        bx      lr

read_unaligned_shift_add_32(unsigned char const*, int):
        ldrb    r1, [r0, r1]!
        ldrb    r2, [r0, #1]
        ldrb    r3, [r0, #2]
        ldrb    r0, [r0, #3]
        lsl     r2, r2, #16
        orr     r1, r2, r1, lsl #24
        orr     r1, r1, r3, lsl #8
        orr     r0, r1, r0
        bx      lr
```

The same applies to the 16-bit version (see the Godbolt link for code), but the difference is much less dramatic (also, trunk generates one more instruction than 11.0.1 does for the 16-bit bswap version; I don't know how significant that is).

llvmbot commented 3 years ago

Ah, I see. That makes sense. Thanks!

4c958a03-4ad7-407c-a85b-320d9a7221b0 commented 3 years ago

The default architecture for Arm is very old now, and you really have to specify something newer. This is with -march=armv7-a, which is in line with what I would expect: https://godbolt.org/z/qnWrzGKMs
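For reference, passing the architecture explicitly looks like this (the target triple and file names here are illustrative, not taken from the issue):

```shell
# Tell clang to target ARMv7-A so it can use REV and unaligned LDR.
clang --target=armv7a-none-eabi -march=armv7-a -O2 -S read.c -o read.s
```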

```asm
read_unaligned_memcpy_bswap_32(unsigned char const*, int):
        ldr     r0, [r0, r1]
        rev     r0, r0
        bx      lr
read_unaligned_shift_add_32(unsigned char const*, int):
        ldr     r0, [r0, r1]
        rev     r0, r0
        bx      lr
```