matklad / once_cell

Rust library for single assignment cells and lazy statics without macros
Apache License 2.0
1.87k stars 109 forks

Factor out thread parking from initialization #69

Closed pitdicker closed 5 years ago

pitdicker commented 5 years ago

I made an effort to convince myself the std implementation of initialize was correct. Mixing thread parking with initialization made it tricky, so I tried to factor out the parking.

The existing logic was very careful and seems quite optimal. All the parts should still be there, but I imagine that may be hard to review. I did change the atomic orderings, as they did not seem all that hard to get right here, and the change makes it a bit easier to see what's going on.

You may not want this PR, as being similar to std::sync::Once is an advantage. But I think the code is now much more readable, and I personally think it is worth it.

matklad commented 5 years ago

Hm, are the new relaxed orderings correct though? I think we can get a lost wakeup with the current impl.

Let's say we have Thread 1 in the wait loop, and Thread 2 finishing initialization unsuccessfully.

Thread 2 sets state_and_queue to EMPTY using Relaxed, and then signals Thread 1 using Relaxed.

Thread 1 sees this relaxed signal, wakes up, and loads state_and_queue using Acquire. I think Thread 1 is now allowed to see an old value of state_and_queue, RUNNING. That means that Thread 1 might go back into the waiting state, and get stuck.


Moreover, I feel like there's an even more serious problem here. I think with the relaxed store here and the relaxed load in the wait loop, we don't have a guarantee that we load the thread while it is alive! In other words, the compiler might as well swap the two lines, so that first we do (*queue).signaled.store(true, Ordering::Relaxed);, and then let thread = (*queue).thread.take().unwrap();. And if the waiting thread wakes up in between and exits the wait loop, that thread load would be a use-after-free.

pitdicker commented 5 years ago

Let's say we have Thread 1 in the wait loop, and Thread 2 finishing initialization unsuccessfully.

Thread 2 sets state_and_queue to EMPTY using Relaxed, and then signals Thread 1 using Relaxed.

Thread 1 sees this relaxed signal, wakes up, and loads state_and_queue using Acquire. I think Thread 1 is now allowed to see an old value of state_and_queue, RUNNING. That means that Thread 1 might go back into the waiting state, and get stuck.

No matter what ordering is used, other threads will always see the most recent value of an atomic. Just maybe not the surrounding data (as I understand atomics anyway).

Moreover, I feel like there's an even more serious problem here. I think with the relaxed store here and the relaxed load in the wait loop, we don't have a guarantee that we load the thread while it is alive! In other words, the compiler might as well swap the two lines, so that first we do (*queue).signaled.store(true, Ordering::Relaxed);, and then let thread = (*queue).thread.take().unwrap();. And if the waiting thread wakes up in between and exits the wait loop, that thread load would be a use-after-free.

You are completely right. I know (but forgot for a moment) that while dropping the Waiters the function is reaching into the memory of another thread and changing it (there goes my confidence :smile: )!

matklad commented 5 years ago

No matter what ordering is used, other threads will always see the most recent value of an atomic.

Hm, I am pretty sure that's not the case. What Relaxed and stronger orderings guarantee is that all threads will observe the same modification order for each location. That is, if one thread does

x.store(1, Relaxed);
x.store(2, Relaxed);
x.store(3, Relaxed);

and the other thread does

let a = x.load(Relaxed);
let b = x.load(Relaxed);
let c = x.load(Relaxed);

it is allowed to see 111, 333, 123, 133, but not 132. There's no guarantee that the loading thread sees the last store, unless there's a happens-before edge between the load and the store.

Moreover, I am not even sure that "the most recent value" is defined for orderings weaker than SeqCst.

Consider this example (from the recent thread on internals.rust-lang.org):

// Thread1
x.store(0, Release);
x.store(1, Release);
y.load(Acquire); // may return 0

// Thread2
y.store(0, Release);
y.store(1, Release);
x.load(Acquire); // may return 0 as well!

From Thread1 pov, the most recent value of x is 1, and the most recent value of y is 0. From Thread2 pov, the most recent value of x is 0, and the most recent value of y is 1.

pitdicker commented 5 years ago

Yes, my wording was imprecise. I was thinking about one atomic, and what happens to it over time.

Let's take this Relaxed ordering. There may be multiple threads trying to insert their node at the head of the linked list. When one wins, it updates that atomic, and all other threads that try to read or set it later in time will see the update, even with just the Relaxed ordering. That is the fundamental feature an atomic provides.

When we talk about reordering instructions around atomics, making changes to non-atomic memory available to other threads, or reordering when multiple atomics are involved, we have to watch out and carefully choose the right ordering. But when we care about just one atomic, Relaxed should be enough.

matklad commented 5 years ago

Hm, I still think that this reasoning is flawed: to say that one thread does write and the other thread observes this write, we need some kind of global time, and this brings us into multiple locations territory. Let me try to describe the first problem in terms of compiler reorderings.

Essentially, what we have is

static state: AtomicUsize = EMPTY;
static signaled: AtomicBool = false;

// Thread1
loop {
    let state = state.load(Acquire);
    if state == RUNNING {
        while !signaled.load(Relaxed) {}
    } else { /* do something useful */ break; }
}

// Thread2

state.store(EMPTY, Relaxed);
signaled.store(true, Relaxed);

The compiler can reorder the two stores in Thread2, because Relaxed does not forbid that.

pitdicker commented 5 years ago

Thank you for being patient with me. Now I see the tricky area you mean. When waking up a thread there are two atomics in play, with no clear relation between them.

I will think about it for a while longer before replying (intuitively, a thread getting parked feels like some kind of barrier, but the compiler may think otherwise).

But even if these orderings don't work out, that is something that is good to know and document. Then SeqCst would not only be used to 'minimize the chances of something going wrong', but out of necessity.

pitdicker commented 5 years ago

We have two places with the trouble of two atomics: in wait when the thread gets unparked, and in the wakeup routine of WaiterQueue.

In WaiterQueue there is no problem. It first does a swap on state_and_queue to get the pointer to the first node, and then does a store to signaled through that node. There is a clear ordering there.

I think it is also possible to establish a clear order in wait, by making the load on state_and_queue dependent on signaled.

// We have enqueued ourselves, now let's wait.
// Guard against spurious wakeups by reparking ourselves until we are signaled.
loop {
    if !node.signaled.load(Ordering::Acquire) {
        thread::park();
    } else {
        // We are dealing with two atomics here, `node.signaled` and `state_and_queue`.
        // A load on `state_and_queue` can not simply be done after this loop: that would cause
        // there to be no explicit relationship between the two atomics, which gives the
        // compiler freedom to reorder the load to before the loop (and before parking).
        //
        // Instead we load `state_and_queue` in this else block after checking `node.signaled`,
        // establishing an order.
        return state_and_queue.load(Ordering::Acquire);
    }
}

wait now has to give state_and_queue as return value, but that does not seem all that strange.

What do you think? I believe that with this part we have tackled these concerns.

matklad commented 5 years ago

Yeah, I think making operations on signaled release/acquire should work.

However, I think I would prefer to stick with the implementation which is as close to std as possible. I can find some problems in concurrent code, but I am not expert enough to guarantee the absence of thread-safety bugs with high probability :)

I also agree that using SeqCst is actually harder to reason about than Release/Acquire, but sticking with code copy-pasted from std seems more reliable.

As for splitting this into functions, I think I actually prefer to review atomics-based code with fewer functions: usually, you need to think about interactions of reads and writes from different parts of the code, and that's easier to do if all these reads/writes are on a single screen with as little abstraction as possible.

So, let's close this for now. However, if a similar cleanup is applied to std::sync::Once, we should definitely port the improvements here as well.