mbdavid / LiteDB

LiteDB - A .NET NoSQL Document Store in a single data file
http://www.litedb.org
MIT License

Don't capture synchronization context in DiskWriterQueue #2447

Closed alexbereznikov closed 3 months ago

alexbereznikov commented 6 months ago

This addresses #2446

Task.Factory.StartNew uses the task scheduler of the current task by default, so TaskScheduler.Default needs to be passed explicitly here.
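
A minimal sketch of the shape of the change, with simplified names (the actual DiskWriterQueue code differs):

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

static void ExecuteQueue() => Console.WriteLine("writer queue running");

// Before: no scheduler specified, so StartNew picks up TaskScheduler.Current,
// which may be backed by a captured SynchronizationContext.
var before = Task.Factory.StartNew(ExecuteQueue);

// After (this PR): explicitly pin the queue to the default ThreadPool scheduler.
var after = Task.Factory.StartNew(
    ExecuteQueue,
    CancellationToken.None,
    TaskCreationOptions.None,
    TaskScheduler.Default);

Task.WaitAll(before, after);
```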

alexbereznikov commented 3 months ago

@pictos As I described in the related issue (#2446), a breaking change was introduced after 5.0.17: Task.Run (which uses the default task scheduler) was changed to Task.Factory.StartNew (which uses the current task scheduler by default). More precisely, the breaking change was made in this commit: https://github.com/mbdavid/LiteDB/commit/f21cd8405eda7c7b59b018c9a541b0c3605eb5da

So this PR aims to revert this breaking change, not to introduce one.

Regardless of whether our custom context is implemented correctly, I don't think it is a good idea to execute the writer queue on a captured context (e.g. a UI context). I would say this rather highlights the issue.
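
To make the mechanism concrete, here is a small standalone demo (my own illustration, not LiteDB code): a task started with Task.Factory.StartNew inherits the scheduler of the task that starts it, while Task.Run always uses TaskScheduler.Default.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Stand-in for a non-default scheduler, as a UI framework would install.
var custom = new ConcurrentExclusiveSchedulerPair().ExclusiveScheduler;

var inherited = Task.CompletedTask;
var parent = Task.Factory.StartNew(() =>
{
    // No scheduler argument: the child inherits TaskScheduler.Current (== custom).
    inherited = Task.Factory.StartNew(() =>
        Console.WriteLine($"StartNew inherited the custom scheduler: {TaskScheduler.Current == custom}"));
}, CancellationToken.None, TaskCreationOptions.None, custom);

parent.Wait();
inherited.Wait(); // prints True

// Task.Run is immune: it always queues to TaskScheduler.Default.
Task.Run(() =>
    Console.WriteLine($"Task.Run on the default scheduler: {TaskScheduler.Current == TaskScheduler.Default}"))
    .Wait(); // prints True
```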

pictos commented 3 months ago

So, looking more closely at the implementation, I strongly believe the issue is about more than capturing the current task's TaskScheduler. The task is being created with the LongRunning flag and uses an async delegate, and as you may know, those two don't fit together: the dedicated thread only runs the delegate up to its first await, after which continuations are scheduled elsewhere.

So the first thing we should change here is to make ExecuteQueue a void method and block the created thread on the async operations, or rethink this whole engine, which I believe will not be done in this PR if we choose that path.
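
A rough sketch of that first option, with hypothetical names (the real DiskWriterQueue is more involved): a synchronous delegate actually keeps the dedicated thread that LongRunning creates, blocking on the queue instead of awaiting.

```csharp
using System.Collections.Concurrent;
using System.Threading.Tasks;

class DiskWriterQueueSketch
{
    private readonly BlockingCollection<byte[]> _queue = new();
    private readonly Task _task;

    public DiskWriterQueueSketch()
    {
        // Synchronous delegate + LongRunning: the dedicated thread stays in use.
        _task = Task.Factory.StartNew(ExecuteQueue, TaskCreationOptions.LongRunning);
    }

    public void EnqueuePage(byte[] page) => _queue.Add(page);

    private void ExecuteQueue()
    {
        // Blocks this dedicated thread until pages arrive; no await, so the
        // work never hops to another scheduler or context.
        foreach (var page in _queue.GetConsumingEnumerable())
        {
            WritePageToStream(page); // synchronous Write
        }
    }

    private void WritePageToStream(byte[] page) { /* sync _stream.Write(...) */ }
}
```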

edit: I did a PoC using the main implementation and it will not flow the UI context; adding the TaskScheduler or not does not change the behavior. It will only go through the UI thread if you pass TaskScheduler.FromCurrentSynchronizationContext(), so I would say that your PR will not fix the issue in the end.

alexbereznikov commented 3 months ago

@pictos I saw this discussion (until I was removed from the #moderators channel :)). Yes, this could solve the issue as well, but what are the real benefits of having a separate thread for this? I mean, this is not like a Windows message loop thread, where we have to keep the same thread id. From my point of view, having a separate thread here has some drawbacks:

- Losing the ability to use async primitives - roughly half of the time the thread will do nothing while waiting for an event

I agree that LongRunning is probably redundant here; we could just as well revert to Task.Run.

Regarding the PoC - I couldn't reproduce the issue in a small PoC either, but I'm 100% sure that our integration tests started to fail because of this regression, and this fix resolved it for us (we have a fork with some fixes applied).

pictos commented 3 months ago

@alexbereznikov, let me answer your comments in pieces.

> Yes, this could solve the issue as well, but what are the real benefits of having a separate thread for this? I mean, this is not like a Windows message loop thread, where we have to keep the same thread id

First of all, we don't need to manage the dedicated thread; the BCL does that for us, since we're using Task.Factory.StartNew. So we don't deal with the thread directly, we deal with the Task, which is a higher-level abstraction.

> Losing the ability to use async primitives - roughly half of the time the thread will do nothing while waiting for an event

We don't need async primitives in this specific case. We want a dedicated thread to handle the DiskWriter work.

> I agree that LongRunning is probably redundant here; we could just as well revert to Task.Run.

I didn't say that LongRunning is redundant; I said that passing the TaskScheduler as a parameter is redundant. And we can't use Task.Run here, because that would block a ThreadPool thread with heavy work, and that's wrong by design: ThreadPool threads aren't meant for long-blocking work like that. That said, using Task.Factory.StartNew is a good approach here.
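
For reference, a quick standalone check of that distinction (my own example, not LiteDB code):

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Task.Run borrows a ThreadPool thread...
await Task.Run(() =>
    Console.WriteLine($"Task.Run on pool thread: {Thread.CurrentThread.IsThreadPoolThread}")); // True

// ...while LongRunning asks the scheduler for a dedicated (non-pool) thread.
await Task.Factory.StartNew(
    () => Console.WriteLine($"LongRunning on pool thread: {Thread.CurrentThread.IsThreadPoolThread}"), // False
    TaskCreationOptions.LongRunning);
```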

> Regarding the PoC - I couldn't reproduce the issue in a small PoC either, but I'm 100% sure that our integration tests started to fail because of this regression, and this fix resolved it for us (we have a fork with some fixes applied)

Asking in a friendly way: are you still confident that the custom SynchronizationContext implemented in your tests doesn't have an issue?

alexbereznikov commented 3 months ago

@pictos Ok, we do use sync I/O operations here, and the thread pool is really not the best choice in that case.

But what if we change WritePageToStream to use WriteAsync instead of Write, and the FlushToDisk extension to use FlushAsync instead of Flush? In that case we won't block a thread pool thread with heavy work - its only responsibility will be to pass pages from the queue to _stream.WriteAsync. Also, the thread won't block while waiting on WaitAsync.
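
Something along these lines (a sketch with hypothetical names; the real WritePageToStream and FlushToDisk signatures differ):

```csharp
using System.Collections.Concurrent;
using System.IO;
using System.Threading;
using System.Threading.Tasks;

class AsyncWriterSketch
{
    private readonly ConcurrentQueue<byte[]> _queue = new();
    private readonly SemaphoreSlim _signal = new(0);
    private readonly Stream _stream;

    public AsyncWriterSketch(Stream stream) => _stream = stream;

    public void EnqueuePage(byte[] page)
    {
        _queue.Enqueue(page);
        _signal.Release();
    }

    public Task Start(CancellationToken token) =>
        Task.Run(() => ExecuteQueueAsync(token), token); // a pool thread is fine here: it never blocks

    private async Task ExecuteQueueAsync(CancellationToken token)
    {
        while (!token.IsCancellationRequested)
        {
            await _signal.WaitAsync(token); // yields the thread instead of blocking it
            while (_queue.TryDequeue(out var page))
            {
                await _stream.WriteAsync(page, 0, page.Length); // WriteAsync instead of Write
            }
            await _stream.FlushAsync(token); // FlushAsync instead of Flush
        }
    }
}
```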

Regarding the custom SynchronizationContext - it works perfectly fine, and it is not a "test" context; we use it in the main app. Its only constraints are that it is disposable and can perform only one operation at a time. So the issue we had initially was that a continuation from ExecuteQueue got posted to an already disposed synchronization context.

pictos commented 3 months ago

@alexbereznikov, why are you so against using a dedicated thread for this? It's the right tool for this job.

For this PR, at this time, we plan not to change how the core works, so I'll follow Mauricio's recommendations. Feel free to open an issue proposing this change; if it's approved, we can merge it in the next major release.

alexbereznikov commented 3 months ago

@pictos I just feel that it is a step backwards. I don't think that changing Write to WriteAsync can break anything.

Regarding the custom sync context and the PoC - please take a look at this: https://github.com/alexbereznikov/LiteDB/tree/2446-poc

Here I added a new console project called Poc, which uses a custom synchronization context and a task scheduler created from it. It prints a stack trace for each call to Send/Post, and it is clear that the ExecuteQueue calls get redirected to this context.

The context itself does not flow, for sure.
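
The core of it looks roughly like this (a simplified sketch, not the exact Poc project code):

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Install the logging context, then build a scheduler from it, as the PoC does.
SynchronizationContext.SetSynchronizationContext(new LoggingSynchronizationContext());
var scheduler = TaskScheduler.FromCurrentSynchronizationContext();
Console.WriteLine($"scheduler id: {scheduler.Id}");

class LoggingSynchronizationContext : SynchronizationContext
{
    public override void Post(SendOrPostCallback d, object state)
    {
        Console.WriteLine($"Post called from:{Environment.NewLine}{Environment.StackTrace}");
        base.Post(d, state); // default behavior: queue to the ThreadPool
    }

    public override void Send(SendOrPostCallback d, object state)
    {
        Console.WriteLine($"Send called from:{Environment.NewLine}{Environment.StackTrace}");
        base.Send(d, state); // default behavior: run synchronously
    }
}
```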

pictos commented 3 months ago

@alexbereznikov I have a feeling that you aren't listening to me and just want to push your point through. What you did in your PoC simply confirms what I said here, which in summary is: "the only way your custom SyncContext will flow to the custom Task is if you call TaskScheduler.FromCurrentSynchronizationContext(). Passing TaskScheduler.Default or omitting the argument will not touch the SyncContext, no matter what."

As per a discussion with @mbdavid, for now we will not change the approach. But feel free to open an issue where you can add more info and we can discuss another approach for this; just remember it must be thread-safe.

alexbereznikov commented 3 months ago

@pictos But in this case Task.Factory.StartNew uses the current task scheduler, which in turn uses that context. I don't think that is correct behavior, because in this PoC I didn't change anything in the LiteDB sources, i.e. I never passed TaskScheduler.FromCurrentSynchronizationContext() when starting ExecuteQueue. And passing TaskScheduler.Default fixes the issue.

alexbereznikov commented 3 months ago

I'm honestly quite surprised that such a simple one-line regression fix has spawned this much discussion. If you insist, I'll do it the way you proposed, because I have no other options :) But I'm still 100% sure that this is a step backwards: instead of fixing the issue and moving on, we're rolling back to the way it was done a while ago.

pictos commented 3 months ago

@alexbereznikov, let's try one more time... Today that API is sensitive to the SyncContext, so there may be devs who rely on that. Changing this behavior is a breaking change that we can't ship in a service release (SR).

So, for this PR, which will be released in an SR, we need to fix the fact that right now the task is leaking to the ThreadPool. For the change you want, we will need a new PR, and we will also have to check the other places in the code base that are sensitive to SyncContexts and change those. I hope you understand now.

Thanks for your help and for bringing this to light. I made the necessary changes and created #2505 to follow up on the next changes. If you want to take the lead on that and change all the places in our code base to be SyncContext-agnostic, let us know and we will assign the issue to you.