llvmbot commented 3 years ago


Bugzilla Link	51263
Version	trunk
OS	All
Reporter	LLVM Bugzilla Contributor
CC	@efriedma-quic,@Kojoley

Extended Description

The following code (https://godbolt.org/z/fcqexoYMe):

include

struct normal_storage { uint8_t a[8]; };

struct atomic_storage { std::atomic a[8]; };

void f(normal_storage ns, atomic_storage as) { for (size_t i = 0; i < 8; ++i) { as->a[i].store(ns->a[i], std::memory_order_relaxed); } }

generates 8 single-byte loads and stores when compiled for x86_64 with -O3 with clang trunk.

f(normal_storage, atomic_storage): # @f(normal_storage, atomic_storage) mov al, byte ptr [rdi] mov byte ptr [rsi], al mov al, byte ptr [rdi + 1] mov byte ptr [rsi + 1], al mov al, byte ptr [rdi + 2] mov byte ptr [rsi + 2], al mov al, byte ptr [rdi + 3] mov byte ptr [rsi + 3], al mov al, byte ptr [rdi + 4] mov byte ptr [rsi + 4], al mov al, byte ptr [rdi + 5] mov byte ptr [rsi + 5], al mov al, byte ptr [rdi + 6] mov byte ptr [rsi + 6], al mov al, byte ptr [rdi + 7] mov byte ptr [rsi + 7], al ret

Given that relaxed atomics provide no guarantees about ordering, it seems to me like it may be legal to coalesce this into a single 8-byte store instead.

Is this a missed optimization? Or is there some specific language in the standard that makes such an optimization illegal?

For what it's worth, from what I can tell it looks like none of GCC, ICC, nor MSVC make this optimization.

Clearly the above code could be changed to use a single 8-byte atomic instead, but that's somewhat besides the point. I can provide more context on why this is useful if it helps.

efriedma-quic commented 3 years ago

For example, the ARMv8 architecture manual says "All accesses to any byte are single-copy atomic", which would allow combining the testcase in comment 0 to a single store instruction. But if the testcase were changed to use uint16_t, it isn't atomic. Unless we use an appropriate NEON store instruction, in which case special rules for NEON stores kick in.

efriedma-quic commented 3 years ago

Does the standard allow us to turn multiple atomic operations into a single operation? Yes; there isn't any way for another thread to observe that.

Will the result actually be atomic if we "mov qword ptr [rsi], rax" here? Not sure. The architecture guarantees are weaker, especially if the pointer isn't aligned.

llvm / llvm-project

Possible missed optimization: relaxed stores to narrow atomic types are not coalesced #50607

Extended Description

include

include