kanidm / concread

Concurrently Readable Data Structures for Rust
Mozilla Public License 2.0

RFC: Make LinCowCell::read sync as it never blocks #111

Closed adamreichold closed 6 months ago

adamreichold commented 6 months ago

Using MVCC, readers never have to wait for writers, so opening a read transaction never blocks. We should therefore be able to make LinCowCell::read sync by using a SyncMutex for its active field. This in turn implies that LinCowCellWriteTxn::commit can also become sync, and only the potentially blocking LinCowCell::write has to stay async so that one writer can yield to others.
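The read side of this idea can be sketched with std-only types. This is a hypothetical miniature, not the actual LinCowCell implementation: `active` holds the current generation behind a sync Mutex, and `read` only clones the Arc, so it never needs to await anything.

```rust
use std::sync::{Arc, Mutex};

// Hypothetical sketch of the proposal: the cell's `active` field is a
// std (sync) Mutex guarding the current generation.
struct Cell<T> {
    active: Mutex<Arc<T>>,
}

impl<T> Cell<T> {
    fn new(value: T) -> Self {
        Cell { active: Mutex::new(Arc::new(value)) }
    }

    // Sync read: the lock is held only long enough to clone the Arc,
    // so readers never wait for a writer to finish its transaction.
    fn read(&self) -> Arc<T> {
        Arc::clone(&self.active.lock().unwrap())
    }
}
```

A reader keeps its snapshot alive via the Arc even after a writer later publishes a new generation.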

This does imply a change to the test_concurrent_create test case: the readers are now fully blocking code that never yields to the executor, so the first one would starve all other readers and the writers, and the test case would never finish. This is mitigated by using spawn_blocking for these actually blocking tasks. Alternatively, they could be forced to yield using yield_now or spawned onto a multi-threaded runtime with at least one more worker thread than reader tasks.
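The mitigation can be illustrated without tokio: fully blocking readers belong on dedicated threads (which is what spawn_blocking provides) rather than on cooperative tasks. A minimal std-only analog, with hypothetical names:

```rust
use std::sync::{Arc, Mutex};
use std::thread;

// Std-only stand-in for spawn_blocking (hypothetical sketch): each fully
// blocking reader gets its own OS thread, so no reader can starve the
// others by never yielding to an executor.
fn run_blocking_readers(active: Arc<Mutex<Arc<u64>>>, readers: usize) -> Vec<u64> {
    let handles: Vec<_> = (0..readers)
        .map(|_| {
            let active = Arc::clone(&active);
            thread::spawn(move || {
                // A blocking read: clone the snapshot and use it.
                let snapshot = Arc::clone(&active.lock().unwrap());
                *snapshot
            })
        })
        .collect();
    handles.into_iter().map(|h| h.join().unwrap()).collect()
}
```

On a current-thread async runtime, the same readers written as plain tasks would monopolize the single executor thread; here each one has its own thread and all complete.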

Firstyear commented 6 months ago

Honestly, now I'm curious why I did it this way in the past. I suspect it was so that if you have two tasks trying to take a write, then they queue correctly and can await on the call to .write. What I want to avoid is the tasks repeatedly spinning on the .lock() until they know they "probably" will get the lock.

I'm wondering if we could achieve something similar with some other construction while retaining the SyncMutex. Perhaps something like:

loop {
    if let Ok(guard) = self.try_lock() {
        break guard;
    } else {
        // either a yield/yield_now, or some kind of wait on a queue or something?
    }
}

We could probably use https://docs.rs/tokio/latest/tokio/sync/struct.Notify.html to achieve this. Finally, the problem here is what happens if we drop the write without a commit? We need to ensure that we call notify on drop so that waiting tasks wake up.
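The wait-then-retry pattern described here can be sketched with std's Condvar as a sync stand-in for tokio::sync::Notify. This is a hypothetical illustration, not concread code: waiters sleep instead of spinning on try_lock, and release() (which a Drop impl for the write transaction would also need to call, covering the drop-without-commit case) wakes the next waiter.

```rust
use std::sync::{Condvar, Mutex};

// Hypothetical sketch of queueing writers without spinning:
// a condvar plays the role tokio's Notify would play in async code.
struct WriteGate {
    busy: Mutex<bool>,
    freed: Condvar,
}

impl WriteGate {
    fn new() -> Self {
        WriteGate { busy: Mutex::new(false), freed: Condvar::new() }
    }

    // Block (rather than spin) until the gate is free, then claim it.
    fn acquire(&self) {
        let mut busy = self.busy.lock().unwrap();
        while *busy {
            busy = self.freed.wait(busy).unwrap();
        }
        *busy = true;
    }

    // Release and wake one waiter; a write txn's Drop impl would call
    // this so that an uncommitted drop still wakes queued writers.
    fn release(&self) {
        *self.busy.lock().unwrap() = false;
        self.freed.notify_one();
    }
}
```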

I think the other possible issue here is then we have a mutex being held in the write txn over a possible await point which clippy can warn about.

adamreichold commented 6 months ago

> Honestly, now I'm curious why I did it this way in the past. I suspect it was so that if you have two tasks trying to take a write, then they queue correctly and can await on the call to .write. What I want to avoid is the tasks repeatedly spinning on the .lock() until they know they "probably" will get the lock.

I think they will queue on the write Mutex which is still a fair/FIFO async mutex even after this change? I only modified the read side which should never have to wait for the writers to make progress.

And indeed, the active lock which became sync here is only held for short, internal critical sections: once in read, only to clone an Arc, and once in commit, to publish the new generation.

But AFAIU, the part in commit will only ever fight against readers as only the single writer already having acquired write will be committing at any point in time.
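The commit side described above can be sketched in the same hypothetical miniature as the read side: the single writer (already serialized by the async write lock) takes the sync active lock only for the instant it swaps in the new generation, so it contends at most with readers.

```rust
use std::sync::{Arc, Mutex};

// Hypothetical sketch of the commit side: publishing a new generation
// is just an Arc swap under a very short sync critical section.
fn commit(active: &Mutex<Arc<u64>>, next: u64) {
    let mut guard = active.lock().unwrap();
    *guard = Arc::new(next);
    // Guard dropped immediately; readers holding old Arcs keep their
    // snapshots, new readers see the new generation.
}
```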

Firstyear commented 6 months ago

> Honestly, now I'm curious why I did it this way in the past. I suspect it was so that if you have two tasks trying to take a write, then they queue correctly and can await on the call to .write. What I want to avoid is the tasks repeatedly spinning on the .lock() until they know they "probably" will get the lock.

> I think they will queue on the write Mutex which is still a fair/FIFO async mutex even after this change? I only modified the read side which should never have to wait for the writers to make progress.

Ahh true. I made the mistake of assuming you were changing the write side too.

> And indeed the active lock which became sync here is only held for short/limited/internal critical sections, once in read but only to clone an Arc and once in commit to publish the new generation.

> But AFAIU, the part in commit will only ever fight against readers as only the single writer already having acquired write will be committing at any point in time.

Yep, and considering that in those cases the whole process doesn't block or await, we should be good. Let me check the concerns in the test case and we can go from there.

Firstyear commented 6 months ago

Given this does change the behaviour to remove the async I think we will need a major version bump of the lib before we release it.

adamreichold commented 6 months ago

> Given this does change the behaviour to remove the async I think we will need a major version bump of the lib before we release it.

Indeed, this is a semver-breaking change. One could play games with keeping read async (without actually awaiting anything) and adding a read_sync variant for the actual implementation, but considering that the downstream changes to adjust to this should be as trivial as removing .await, I do not think this is worth it and would just go for the breaking release, especially since concread is still a pre-1.0 crate.

Also, while testing this in our code base using a Git dependency, I found one paper cut, the fix for which is #112, ensuring that migration is painless. (For anyone like me already using Serde support in 0.4.5 and 0.4.6.)

Firstyear commented 6 months ago

@adamreichold Okay, I'll upgrade to the git release in some other projects that heavily rely on this crate to check for any other issues before we go ahead and do a major release.

Firstyear commented 6 months ago

@adamreichold Okay, tested with no issues. So I'm happy to go ahead and do a major release provided there aren't any other major API changes you want to make :)

adamreichold commented 6 months ago

I also hope to get our downstream usage reviewed and merged using a Git dependency tomorrow and would deploy it to our staging environment for a few days.

So I do not have any additional changes planned, but I could offer some production-like testing to increase our confidence in shipping a 0.5.0 release next week? But if you are already sufficiently confident, I certainly won't mind using a crates.io dependency either.

Firstyear commented 6 months ago

Next week sounds great to me. I'm already very confident having reviewed the code, and the test cases are very exhaustive, so if they pass we're already in a good spot. I'll wait for your production-like tests and then do 0.5.0 next week.

adamreichold commented 6 months ago

Just wanted to say that the new code has been running since the middle of last week without issue for us. I will keep it running until 0.5.0 is released and speak up if anything unexpected happens, but for now everything looks good to me.

Firstyear commented 6 months ago

Thank you! Release made!