[libc++] Incorrect memory order in atomic wait #109290

aharrison24 commented 2 days ago

I think I've found an issue in __libcpp_atomic_wait, where an atomic operation with an overly-relaxed memory order argument could theoretically result in a lost wakeup. For what it's worth, the equivalent part of libstdc++'s atomic wait implementation uses stronger memory ordering, which I believe to be correct.

To be clear, I don't think this is a problem in practice on x86 or ARM, but according to the C++ standard it looks like a bug.

Background

The atomic wait implementation in libc++ uses platform wait primitives such as futexes (Linux) and __ulock_wait/__ulock_wake (macOS) to perform wait and notify operations. Because the platform wait primitives usually only work for values of a particular width (32-bit on Linux), you can't always wait on the user's value directly. The library instead does a laundering trick where internally it waits on a proxy value of the correct size. To do this, it uses the address of the user's atomic value to look up an entry in a 'contention table'. Each entry in the table contains two atomic values of an appropriate size for the platform: one is a monotonically increasing counter which is used as the proxy for the user's value, and the other holds a count of the number of threads currently waiting on the variable. The latter enables a nifty optimisation which allows the platform notify call to be elided when there are no threads waiting on the value.
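
To make that structure concrete, here is a rough sketch of the contention-table idea. The names, table size and hash are made up for illustration; the real code lives in libcxx/src/atomic.cpp and differs in detail:

#include <atomic>
#include <cstddef>
#include <cstdint>

// Illustrative sketch only, not the actual libc++ internals.
struct ContentionTableEntry {
  std::atomic<std::uint32_t> platform_state;   // monotonic counter, the proxy that is waited on
  std::atomic<std::uint32_t> contention_state; // number of threads currently waiting
};

constexpr std::size_t kTableSize = 256;        // hypothetical size
ContentionTableEntry contention_table[kTableSize];

// The address of the user's atomic selects an entry in the table.
ContentionTableEntry& entry_for(const volatile void* address) {
  auto key = reinterpret_cast<std::uintptr_t>(address);
  return contention_table[(key >> 2) % kTableSize]; // hypothetical hash
}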

The key parts of the algorithm

The algorithm that does the contention tracking is split across a few functions: __libcpp_contention_wait, __libcpp_atomic_notify and __libcpp_contention_notify.

After inlining and removing the parts that are not relevant here, it essentially boils down to this pseudocode:

std::atomic<uint32_t> platform_state;   // Monotonically increasing count
std::atomic<uint32_t> contention_state; // Waiter count

-------------------------------
// Notifying thread
platform_state.fetch_add(1, mo_release);      // (1) <---- Here is the problem
if (0 != contention_state.load(mo_seq_cst)) { // (2) 
  platform_notify_all(platform_state);
}

-------------------------------
// Waiting thread
contention_state.fetch_add(1, mo_seq_cst);    // (3)
platform_wait(platform_state, old_value);     // (4)
contention_state.fetch_sub(1, mo_release);    // (5)

The line marked "Here is the problem" corresponds to this line in atomic.cpp.

The problem

According to the C++ memory model, we should reason about code involving atomics in terms of "synchronizes-with" and "happens-before" relations, rather than potential reorderings. The contention tracking algorithm cannot be reasoned about solely in terms of synchronization between release-acquire pairs. It also requires the waiting thread and the notifying thread to agree on a single total order for operations (1), (2), (3) and (4). The way to achieve that (according to the standard) is to make all four operations seq-cst.

Based on that, the mo_release on (1) looks insufficient. Informally, under the C++ memory model, I do not believe there is anything stopping (1) from 'reordering' with (2). We need some sort of StoreLoad barrier to get the single total order property.

A Herd7 simulation shows that a lost wakeup could occur under the C++ memory model

Inspired by @jiixyj's Herd7 simulation of a similar contention tracking mechanism in the semaphore implementation, I thought I'd have a go here too. I'm certainly no expert with Herd, so treat this with a pinch of salt.

I've modelled the existing algorithm as:

prog.litmus

C prog

{
}

P0 (atomic_int* plat, atomic_int* waiters) {
  atomic_fetch_add_explicit(plat, 1, memory_order_release);
  int num_waiters = atomic_load_explicit(waiters, memory_order_seq_cst);
  int do_notify = (num_waiters != 0);
}

P1 (atomic_int* plat, atomic_int* waiters) {
  atomic_fetch_add_explicit(waiters, 1, memory_order_seq_cst);
  int do_wait = (atomic_load_explicit(plat, memory_order_seq_cst) == 0);

  // Deliberately left out - it's not relevant to the analysis
  //atomic_fetch_sub_explicit(waiters, 1, memory_order_release);
}

exists
(0:do_notify=0 /\ 1:do_wait=1)

It's not possible to model read-compare-wait operations in Herd. Instead I've modelled the platform wait as a sequentially consistent load; I'm genuinely unsure if that's justified. I've found some discussion on Stack Overflow but I don't find it fully convincing. In D114119 it's modelled as a relaxed compare-exchange-strong. That looks like it might be overkill for this situation, but I can't say for sure. Guidance on this would be welcome.

The proposition at the bottom looks for the case where thread P1 enters a waiting state, but thread P0 decides not to call notify, which could result in deadlock. The /\ token represents logical conjunction.

Executing this in Herd with:

mkdir -p output
rm -f output/prog.dot
herd7 -c11 -show prop -graph columns -o output -showinitwrites false -oneinit false prog.litmus
dot -Tpdf output/prog.dot > prog.pdf

shows that there is one possible execution exhibiting the lost wake-up behaviour. It corresponds to thread P0 running to the end without calling notify, while P1 decides that a 'wait' is necessary. Crucially, even though P0 has incremented plat 'before' checking waiters, P1 observes an older value in the modification order of plat and therefore decides to enter a waiting state.

In the Herd simulation, the issue can be solved by enforcing a total order on the operations involving plat. Upgrading the increment of plat in thread P0 from release to seq_cst memory order results in Herd reporting no remaining executions satisfying the proposition.

Would a compiler ever perform such a reordering?

I've not found good sources on this. My suspicion is that it's unlikely to happen in practice. But I believe the standard allows it.

Could it happen on x86?

On x86, even a fully relaxed RMW operation has sequentially consistent behaviour. So no, it can't happen on x86.
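
For illustration, even a relaxed fetch_add typically compiles to a lock-prefixed instruction on x86-64, and the lock prefix acts as a full barrier (the codegen in the comment is typical output, worth double-checking on Compiler Explorer):

#include <atomic>
#include <cstdint>

std::atomic<std::uint32_t> platform_state;

void bump_platform_state() {
  // Typically becomes `lock add dword ptr [platform_state], 1` (or `lock xadd`
  // when the old value is needed); the lock prefix is a full StoreLoad barrier.
  platform_state.fetch_add(1, std::memory_order_relaxed);
}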

Could it happen on ARM architectures?

On AArch64, the 'notify' side of the algorithm compiles down to a load-linked/store-conditional (LL/SC) loop for incrementing the platform counter, followed by a load-acquire (LDAR) for reading the number of waiters. The STLXR in the LL/SC loop is a store-release and the LDAR that follows it is a load-acquire; AArch64 keeps a load-acquire ordered after an earlier store-release, so these two instructions will not reorder.

For what it's worth, if the LDAR is relaxed to a plain LDR (i.e. std::memory_order_relaxed) then Herd shows that it can reorder with the STLXR and result in a lost wakeup. The relevant parts of the algorithm are modelled here:

AArch64 Contention_Tracking
{
0:X0=plat; 0:X1=count;
1:X0=plat; 1:X1=count;
}
 P0                 | P1;
 label1:            | label2:;
 LDXR W8, [X0]      | LDAXR W8, [X1];
 ADD W8, W8, #1     | ADD W8, W8, #1;
 STLXR W9, W8, [X0] | STLXR W9, W8, [X1];
 CBNZ W9, label1    | CBNZ W9, label2;
 LDR W2, [X1]       | LDAR W3, [X0];
exists
(0:X2=0 /\ 1:X3=0)

You can try this in the online Herd sim. The proposition at the bottom corresponds to the notify side seeing no waiters (and therefore not calling the platform notify operation), and the waiter side not seeing the increment to the platform variable.

Could it happen on more relaxed architectures?

It looks like it is possible on POWER, but I haven't done any further investigation.

Potential solutions

I think the best solution is probably to upgrade (1) to mo_seq_cst. This is what libstdc++ does. On x86, even a fully relaxed RMW operation has sequentially consistent behaviour, so the change will only affect potential compiler reorderings. On AArch64, upgrading an RMW op from release to seq_cst turns the LDXR into an LDAXR in the LL/SC loop. This ought to be far cheaper than any sort of fence.
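
In terms of the pseudocode above, the change would look roughly like this (a sketch, not the actual patch):

// Notifying thread (sketch of the proposed fix)
platform_state.fetch_add(1, mo_seq_cst);      // (1) upgraded from mo_release
if (0 != contention_state.load(mo_seq_cst)) { // (2) unchanged
  platform_notify_all(platform_state);
}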

Other options would be to turn (2) into a read-don't-modify-write operation (fetch_add(0, mo_acq_rel)), or to insert seq_cst fences between (1) and (2) and between (3) and (4) (while fully relaxing the other operations). Both of these look more expensive to me.

Concrete proposal

Update this line to be memory_order_seq_cst: https://github.com/llvm/llvm-project/blob/7e56a092781b094307155457f129df7deda411ae/libcxx/src/atomic.cpp#L166

jiixyj commented 2 days ago

Fantastic analysis! I've only skimmed over it, but I believe you are correct.

Here are some additional thoughts:

I don't think that modelling futex_wait with anything involving seq_cst is 100% correct. The linked StackOverflow post models it with a.compare_exchange_strong(..., std::memory_order_seq_cst, std::memory_order_seq_cst);, but I believe this is a misreading of the man page. Involving seq_cst would mean that futex_wait joins the single total order of all atomic operations that are tagged with seq_cst (including seq_cst operations on other atomic variables!). However, the futex man page states (emphasis mine):

The loading of the futex word's value, the comparison of that value with the expected value, and the actual blocking will happen atomically and will be totally ordered with respect to concurrent operations performed by other threads on the same futex word.

Together with the following quote:

Analogously to an atomic compare-and-exchange operation that potentially changes shared memory, blocking via a futex is an atomic compare-and-block operation.

...I think it is most appropriate to model futex_wait as a.compare_exchange_strong(..., std::memory_order_relaxed, std::memory_order_relaxed);. My mental model of this is that it's analogous to condition variables -- cv_wait/cv_notify don't provide any memory ordering on their own, but their associated mutex does! Similarly, futex_wait/futex_notify don't provide any memory ordering on their own -- it has to be ensured by the user in some other way.

That being said, Hyrum's Law is still in effect. Internally, one of the first things futex_wait does is call smp_mb(), which I believe is a full memory barrier and more or less equivalent to a std::atomic_thread_fence(std::memory_order_seq_cst) fence. So a more "real life" model of futex_wait that depends on implementation details of the Linux kernel might be std::atomic_thread_fence(std::memory_order_seq_cst); a.compare_exchange_strong(..., std::memory_order_relaxed, std::memory_order_relaxed);.
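
To make the two models explicit, here is a sketch in C++ terms (the function names are invented for illustration; the real futex interface is a syscall on a 32-bit word):

#include <atomic>

// "Pure" model: a relaxed compare-and-block, providing no ordering of its own
// (analogous to cv_wait, which relies on its mutex for ordering).
void futex_wait_pure_model(std::atomic<int>& word, int expected) {
  if (word.compare_exchange_strong(expected, expected,
                                   std::memory_order_relaxed,
                                   std::memory_order_relaxed)) {
    // value matched: block until a futex_wake on the same word
  }
}

// "Real life" model: the Linux implementation issues smp_mb() first, so a model
// that leans on that implementation detail adds a seq_cst fence up front.
void futex_wait_linux_model(std::atomic<int>& word, int expected) {
  std::atomic_thread_fence(std::memory_order_seq_cst);
  if (word.compare_exchange_strong(expected, expected,
                                   std::memory_order_relaxed,
                                   std::memory_order_relaxed)) {
    // value matched: block until a futex_wake on the same word
  }
}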

The proposed fix (upgrading (1) to mo_seq_cst) relies on futex_wait providing a std::atomic_thread_fence(std::memory_order_seq_cst). This is OK if we assume that futex_wait provides this. If we don't, then we have to write that fence ourselves before calling into the platform wait function. I think that (3) being seq_cst is not enough. The seq_cst fence has to be there, after (3).

As noted by cbloom in this blog post, what we actually need here is not seq_cst but a store/load barrier between (1) and (2). One indication that seq_cst is not needed is that there are only two threads in our example; seq_cst should only be truly needed when three or more threads are involved.

So all the seq_cst operations are only there to work around C++'s missing store/load barrier. Because of this, I prefer the other solution of turning (2) into fetch_add(0, mo_rel) (I believe rel is enough here) and weakening (3) to acq. With this we work around C++'s lack of a store/load barrier by turning the load into a store. On x86_64 this should optimize nicely with Clang (it will emit a memory fence followed by a load instead of an RMW operation, IIRC). On other platforms/compilers it might not be as well optimized, though, which might make that solution a non-starter in practice... One additional upside, though, is that we can drop the assumption that futex_wait provides a std::atomic_thread_fence(std::memory_order_seq_cst) fence. It all works out even if we assume futex_wait is purely a relaxed compare-and-block operation.
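
For concreteness, here is that alternative applied to the pseudocode from the original post (orderings as proposed above; a sketch, not verified):

// Notifying thread
platform_state.fetch_add(1, mo_release);              // (1) unchanged
if (0 != contention_state.fetch_add(0, mo_release)) { // (2) read-don't-modify-write
  platform_notify_all(platform_state);
}

// Waiting thread
contention_state.fetch_add(1, mo_acquire);            // (3) weakened from seq_cst
platform_wait(platform_state, old_value);             // (4)
contention_state.fetch_sub(1, mo_release);            // (5)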

If we are going with the seq_cst solution, I think it would be a good idea to put a std::atomic_thread_fence(std::memory_order_seq_cst) fence before the platform wait function to avoid depending on implementation details of futex_wait. We are entering the kernel anyway, and this is on the "slow" path, so the overhead of this in practice should be minimal.
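
In terms of the pseudocode above, the waiting side would then look something like this (a sketch):

// Waiting thread (sketch): no reliance on futex_wait's internal smp_mb()
contention_state.fetch_add(1, mo_seq_cst);    // (3)
std::atomic_thread_fence(mo_seq_cst);         // explicit StoreLoad barrier before blocking
platform_wait(platform_state, old_value);     // (4)
contention_state.fetch_sub(1, mo_release);    // (5)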