threadexio / rcurs

A simple RCU with an oxidized interface.
https://docs.rs/rcurs
MIT License

Race Condition which can lead to Undefined Behavior #1

Open hildebrandmw opened 1 month ago

hildebrandmw commented 1 month ago

This is in reference to this: https://github.com/threadexio/rcurs/blob/c4728e8e7b0d812bc62ef808066ae27ae48e2332/src/rcu.rs#L58-L62

pub fn get(&self) -> Guard<'_, T> {                            // 1
    let inner = self.ptr.load(Ordering::Relaxed).cast_const(); // 2
    unsafe { (*inner).refs.take_ref() };                       // 3
    Guard { _marker: PhantomData, inner }                      // 4
}

The race condition exists between lines 2 and 3. Suppose thread A finishes line 2, acquiring a pointer to inner, and is then pre-empted by the OS before it can increment the ref count on line 3. While thread A is asleep, thread B invokes update, swaps the pointer, decrements the old ref count, and deletes the old object.

When thread A resumes, inner is now dangling and line 3 triggers a segfault (if you are lucky) or writes randomly to memory (if you are not).
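
To make the interleaving concrete, here is a minimal single-threaded model of the two steps. The Inner here is a simplified stand-in for the crate's type, and the "free" is modeled with a flag instead of a real deallocation:

use std::sync::atomic::{AtomicBool, AtomicUsize, Ordering};

// Simplified stand-in for Inner: a ref count plus a flag that models
// the allocation having been freed.
struct Inner {
    refs: AtomicUsize,
    freed: AtomicBool,
}

fn main() {
    let old = Inner { refs: AtomicUsize::new(1), freed: AtomicBool::new(false) };

    // Thread A, line 2: loads the pointer to `old` but has not yet taken a ref.
    let a_sees = &old;

    // Thread B, update: swaps in a new value, drops its reference, and
    // "frees" `old` because the count reached zero.
    if old.refs.fetch_sub(1, Ordering::AcqRel) == 1 {
        old.freed.store(true, Ordering::Release);
    }

    // Thread A, line 3: only now increments the ref count -- too late.
    assert!(
        !a_sees.freed.load(Ordering::Acquire),
        "use-after-free: thread B already dropped Inner"
    );
    a_sees.refs.fetch_add(1, Ordering::AcqRel);
}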

threadexio commented 1 month ago

So I have managed to replicate the race condition by inserting a hacky sleep inside the get method between lines 2 and 3 and then updating the RCU from another thread.
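
For reference, the hack looks roughly like this (the sleep and its 100 ms duration are debugging aids only, not something that belongs in the crate):

pub fn get(&self) -> Guard<'_, T> {
    let inner = self.ptr.load(Ordering::Relaxed).cast_const();
    // Artificial delay that widens the race window: an `update` on another
    // thread running during this sleep frees the allocation behind `inner`.
    std::thread::sleep(std::time::Duration::from_millis(100));
    unsafe { (*inner).refs.take_ref() };
    Guard { _marker: PhantomData, inner }
}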

I have examined the race condition a bit. The easiest fix would probably be wrapping the problematic lines inside a critical section to prevent preemption. But I think this fix would not be testable with my current sleep hack. Maybe something can be done with clever use of atomics. More investigation is needed.

I will examine this further tomorrow as it has been a long day.

threadexio commented 1 month ago

I think the race can happen even if no preemption takes place.

Suppose thread A is executing get:

https://github.com/threadexio/rcurs/blob/c4728e8e7b0d812bc62ef808066ae27ae48e2332/src/rcu.rs#L58-L62

Now suppose thread B is executing update:

https://github.com/threadexio/rcurs/blob/c4728e8e7b0d812bc62ef808066ae27ae48e2332/src/rcu.rs#L38-L42

If thread A loads the pointer to inner, and thread B then swaps it for the new one and calls drop_inner before thread A has had time to call take_ref, the inner structure is dropped by thread B while A still holds a pointer to it. When A then tries to increment the ref count, it writes through a dangling pointer and the bug happens.

Therefore, wrapping get in a critical section does not solve the issue; it just makes the race far less likely to occur.

hildebrandmw commented 1 month ago

Agreed. A context switch is not necessary to trigger the race condition, though I find it helps my mental model to picture a malicious entity that chooses to suspend execution at the worst possible moments and then work back from there :smile:.

This particular problem feels to me almost identical to a lock-free atomic shared pointer, about which there are several decent presentations:

https://www.youtube.com/watch?v=gTpubZ8N0no
https://www.youtube.com/watch?v=lNPZV9Iqo3U

I forget the details at the moment (sorry to be so unhelpful!), but maybe there is some inspiration there?

threadexio commented 1 month ago

I will definitely check out the talks you have linked.

I have implemented a fix for the race in https://github.com/threadexio/rcurs/commit/16d4c9ff533ae053ee74fcb0f587c9432d76fd84. The fix wraps the problematic sections (see my comment above) with a spinlock, so that only one thread has control over ptr at a time, but only while updating the ref count. Spinning here is, I think, acceptable because it only happens if thread A and thread B enter the get and update methods at the same time.

Spinning might be problematic in update, though, because drop_inner is called inside the locked section. drop_inner calls the destructor of Inner, which in turn calls the destructor of T, which can take an arbitrary amount of time to execute. Holding the lock for the duration of T's destructor might not be such a good idea.
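
Roughly, the shape of the fix is the following (a simplified sketch, not the exact code in the commit; the Inner layout and the acquire/release helpers here are illustrative):

use std::sync::atomic::{AtomicBool, AtomicPtr, AtomicUsize, Ordering};

// Simplified stand-ins for the crate's types.
struct Inner<T> {
    refs: AtomicUsize,
    value: T,
}

struct Rcu<T> {
    ptr: AtomicPtr<Inner<T>>,
    // Serializes "load ptr + bump refs" in `get` against "swap ptr + drop" in `update`.
    locked: AtomicBool,
}

impl<T> Rcu<T> {
    fn acquire(&self) {
        while self
            .locked
            .compare_exchange_weak(false, true, Ordering::Acquire, Ordering::Relaxed)
            .is_err()
        {
            core::hint::spin_loop();
        }
    }

    fn release(&self) {
        self.locked.store(false, Ordering::Release);
    }

    fn get(&self) -> *const Inner<T> {
        self.acquire();
        // With the lock held, the pointer load and the ref-count increment
        // become one step with respect to `update`, closing the race window.
        let inner = self.ptr.load(Ordering::Relaxed).cast_const();
        unsafe { (*inner).refs.fetch_add(1, Ordering::AcqRel) };
        self.release();
        inner
    }

    fn update(&self, new: Box<Inner<T>>) {
        self.acquire();
        let old = self.ptr.swap(Box::into_raw(new), Ordering::AcqRel);
        // Downside discussed above: if this was the last reference, the old
        // Inner (and T's destructor) is dropped while the lock is held.
        unsafe {
            if (*old).refs.fetch_sub(1, Ordering::AcqRel) == 1 {
                drop(Box::from_raw(old));
            }
        }
        self.release();
    }
}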

Maybe the talks you have linked will provide another solution that does not involve locks.

threadexio commented 3 weeks ago

I apologize for the late reply, uni has kept me busy for a while.

> This particular problem feels to me almost identical to a lock-free atomic shared pointer, about which there are several decent presentations:
>
> https://www.youtube.com/watch?v=gTpubZ8N0no
> https://www.youtube.com/watch?v=lNPZV9Iqo3U

I did learn about an algorithm called "split reference counting", but I could not figure out whether or how it can solve this race.

Anyway, I have rewritten the fix with the spinlock and added a test for exactly this race in fc60b6b. My fix seems to pass that test; however, I would like another pair of eyes to verify it. @hildebrandmw
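
To give an idea of the kind of test I mean, here is a simplified sketch (not the exact test in fc60b6b; Rcu::new, an update taking the new value, and a dereferenceable Guard are assumptions about the public API):

use std::sync::Arc;
use std::thread;

fn main() {
    let rcu = Arc::new(rcurs::Rcu::new(0u64));

    let reader = {
        let rcu = Arc::clone(&rcu);
        thread::spawn(move || {
            for _ in 0..1_000_000 {
                // Each `get` races against a concurrent `update` swapping out
                // and freeing the value it just loaded.
                let guard = rcu.get();
                std::hint::black_box(&*guard);
            }
        })
    };

    let writer = {
        let rcu = Arc::clone(&rcu);
        thread::spawn(move || {
            for i in 1..=1_000_000u64 {
                rcu.update(i);
            }
        })
    };

    reader.join().unwrap();
    writer.join().unwrap();
}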

It seems critical that only Rcu::update has the capacity to deallocate the backing memory. This means we now need a mechanism in update to wait for some event (all references being dropped) to occur. I foresee a comeback of the Notify interface from 0.1.0, or something similar. Another path worth exploring, in my opinion, is to either make update async or offer an async version of update so the RCU can play nice with that whole ecosystem.
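
To sketch the shape I have in mind, a minimal std-based Notify could look something like this (not the actual 0.1.0 interface, just one possible form):

use std::sync::{Condvar, Mutex};

struct Notify {
    signaled: Mutex<bool>,
    cvar: Condvar,
}

impl Notify {
    fn new() -> Self {
        Self { signaled: Mutex::new(false), cvar: Condvar::new() }
    }

    // Called by whoever drops the last reference to the old Inner.
    fn notify(&self) {
        *self.signaled.lock().unwrap() = true;
        self.cvar.notify_all();
    }

    // Called by `update` instead of spinning with `core::hint::spin_loop()`.
    fn wait(&self) {
        let mut signaled = self.signaled.lock().unwrap();
        while !*signaled {
            signaled = self.cvar.wait(signaled).unwrap();
        }
    }
}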

For testing purposes, all blocking is currently done by spinning with core::hint::spin_loop(), which is not exactly an elegant solution. This is why we need something like the Notify interface.