hildebrandmw opened this issue 1 month ago
So I have managed to replicate the race condition by inserting a hacky sleep inside the `get` method between lines 2 and 3 and then updating the RCU from another thread.

I have examined the race condition a bit. The easiest fix would probably be wrapping the problematic lines inside a critical section to prevent preemption, but I think that fix would not be testable with my current sleep hack. Maybe something can be done with clever usage of atomics. More investigation is needed.

I will examine this further tomorrow as it has been a long day.
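For reference, here is roughly what the hack looks like as a minimal, self-contained sketch (this is not rcurs's actual code; the struct, field, and method names are illustrative). The sleep between the pointer load and the ref-count increment widens the window in which another thread can swap the pointer and free the old object, so running this under Miri or a sanitizer should flag the use-after-free:

```rust
use std::sync::atomic::{AtomicPtr, AtomicUsize, Ordering};
use std::{thread, time::Duration};

struct Inner {
    refs: AtomicUsize,
    value: u64,
}

struct Rcu {
    ptr: AtomicPtr<Inner>,
}

impl Rcu {
    fn get(&self) -> *mut Inner {
        let inner = self.ptr.load(Ordering::Acquire); // "line 2": load the pointer
        thread::sleep(Duration::from_millis(50)); // the hacky sleep: widen the race window
        unsafe { (*inner).refs.fetch_add(1, Ordering::AcqRel) }; // "line 3": may touch freed memory
        inner
    }

    fn update(&self, value: u64) {
        let new = Box::into_raw(Box::new(Inner { refs: AtomicUsize::new(0), value }));
        let old = self.ptr.swap(new, Ordering::AcqRel);
        if unsafe { (*old).refs.load(Ordering::Acquire) } == 0 {
            drop(unsafe { Box::from_raw(old) }); // frees `old` while `get` may still hold it
        }
    }
}

fn main() {
    // Leak the Rcu so both threads can borrow it for 'static.
    let rcu: &'static Rcu = Box::leak(Box::new(Rcu {
        ptr: AtomicPtr::new(Box::into_raw(Box::new(Inner {
            refs: AtomicUsize::new(0),
            value: 1,
        }))),
    }));

    let a = thread::spawn(move || {
        rcu.get(); // sleeps between the load and the ref-count increment
    });
    let b = thread::spawn(move || {
        thread::sleep(Duration::from_millis(10)); // land inside A's window
        rcu.update(2); // swaps and frees the Inner that A just loaded
    });
    a.join().unwrap();
    b.join().unwrap();
}
```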
I think the race can happen even if no preemption takes place. Suppose thread A is executing `get`:
https://github.com/threadexio/rcurs/blob/c4728e8e7b0d812bc62ef808066ae27ae48e2332/src/rcu.rs#L58-L62
Now suppose thread B is executing `update`:
https://github.com/threadexio/rcurs/blob/c4728e8e7b0d812bc62ef808066ae27ae48e2332/src/rcu.rs#L38-L42
If thread A loads the pointer to `inner`, and then thread B swaps it for the new one and calls `drop_inner` before thread A has time to call `take_ref`, then the inner structure is dropped by thread B while A still holds a pointer to it. A then tries to increment the ref count of freed memory, and the bug happens.
Therefore, wrapping `get` in a critical section does not solve the issue, it just makes it way less likely to occur.
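To make the interleaving explicit, here is the failing schedule laid out side by side (assuming `get` and `update` have roughly the shape of the linked snippets, with the names used above):

```text
thread A (get)                        thread B (update)
-----------------------------------   -----------------------------------
let inner = self.ptr.load(...);
                                      let old = self.ptr.swap(new, ...);
                                      self.drop_inner(old);  // Inner freed
unsafe { (*inner).take_ref() };
// ^ increments the ref count of freed memory
```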
Agreed. A context switch is not necessary to trigger the race condition, though I find it helps my mental model to picture a malicious entity that chooses to suspend execution at the worst possible moments and then work back from there :smile:.
This particular problem feels to me almost identical to a lock-free atomic shared pointer, about which there are several decent presentations:

https://www.youtube.com/watch?v=gTpubZ8N0no
https://www.youtube.com/watch?v=lNPZV9Iqo3U

I forget the details at the moment (sorry to be so unhelpful!), but maybe there is some inspiration there?
I will definitely check out the talks you have linked.
I have implemented a fix for the race in https://github.com/threadexio/rcurs/commit/16d4c9ff533ae053ee74fcb0f587c9432d76fd84. Basically, my fix wraps the problematic sections (see my comment above) with a spinlock, such that only one thread has control over `ptr`, but only while updating the ref count. Spinning, in this case, I think is acceptable because it only happens if thread A and thread B enter the `get` and `update` methods at the same time. Spinning might be problematic in `update`, though, because `drop_inner` is called inside the locked section. `drop_inner` will call the destructor of `Inner`, which will in turn call the destructor of `T`, which can take an arbitrary amount of time to execute. So holding the lock for the duration of `T`'s destructor might not be such a good idea.
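A minimal sketch of the kind of lock involved (illustrative only; the real fix is in the commit above):

```rust
use core::hint::spin_loop;
use std::sync::atomic::{AtomicBool, Ordering};

// `get` would hold this across the pointer load + `take_ref`,
// `update` across the swap + `drop_inner`.
struct SpinLock(AtomicBool);

impl SpinLock {
    const fn new() -> Self {
        SpinLock(AtomicBool::new(false))
    }

    fn lock(&self) {
        // Spin until we win the flip from `false` to `true`.
        while self
            .0
            .compare_exchange_weak(false, true, Ordering::Acquire, Ordering::Relaxed)
            .is_err()
        {
            spin_loop();
        }
    }

    fn unlock(&self) {
        self.0.store(false, Ordering::Release);
    }
}
```

One possible mitigation (just an idea, not what the commit does) would be to perform the swap inside the locked section but run the actual `drop` only after `unlock()`, so `T`'s destructor never executes while the lock is held.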
Maybe the talks you have linked will provide another solution that does not involve locks.
I apologize for the late reply, uni has kept me busy for a while.
> This particular problem feels to me almost identical to a lock-free atomic shared pointer, about which there are several decent presentations:
> https://www.youtube.com/watch?v=gTpubZ8N0no
> https://www.youtube.com/watch?v=lNPZV9Iqo3U
I did learn about an algorithm called "split-reference counting", but I could not figure out whether or how it can solve this race.

Anyways, I have rewritten the fix with the spinlock and added a test for exactly this race in fc60b6b. My fix seems to pass that test, but I would like another pair of eyes to verify that. @hildebrandmw
It seems critical that only `Rcu::update` has the capacity to deallocate the backing memory. This means we now need a mechanism to wait in `update` until some event (all references getting dropped) occurs. I foresee a comeback of the `Notify` interface from `0.1.0`, or something similar. Another path worth exploring, in my opinion, is to either make `update` async or offer an async version of `update` so the RCU can play nice with that whole ecosystem.
For testing purposes, all blocking is currently done by spinning with `core::hint::spin_loop()`, which is not exactly an elegant solution. This is why we need something like the `Notify` interface.
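As a sketch of the difference (the spin version mirrors what the code does today; the `Notify` version is purely illustrative, using tokio's `Notify` as a stand-in since the old `0.1.0` interface may well look different):

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

// What `update` effectively does today: burn CPU until every reference is gone.
fn wait_for_zero_spin(refs: &AtomicUsize) {
    while refs.load(Ordering::Acquire) != 0 {
        core::hint::spin_loop();
    }
}

// Illustrative async alternative; the last guard's `Drop` would decrement
// `refs` and then call `dropped.notify_one()` to wake the updater.
async fn wait_for_zero_notify(refs: &AtomicUsize, dropped: &tokio::sync::Notify) {
    while refs.load(Ordering::Acquire) != 0 {
        dropped.notified().await;
    }
}
```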
This is in reference to this: https://github.com/threadexio/rcurs/blob/c4728e8e7b0d812bc62ef808066ae27ae48e2332/src/rcu.rs#L58-L62
The race condition exists between lines 2 and 3. Suppose thread A finishes line 2, acquiring a pointer to `inner`, and is then pre-empted by the OS before it can increment the ref-count. While thread A is asleep, thread B invokes `update`, swaps the pointer, decrements the old ref-count, and deletes the old object. When thread A resumes, `inner` is now dangling and line 3 triggers a segfault (if you are lucky) or writes randomly to memory (if you are not).