mbdavid / LiteDB

LiteDB - A .NET NoSQL Document Store in a single data file
http://www.litedb.org
MIT License

Simplified DiskWriterQueue with blocking concurrency #2411

Closed: ltetak closed this 9 months ago

ltetak commented 9 months ago

It is relatively easy to put the DiskWriterQueue into a state where it does nothing. The root cause is a set of races in which the logic loses track of which _task is the current one, so the queue silently stops draining. This has caused a number of problems:

e.g. https://github.com/mbdavid/LiteDB/issues/2307
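
For illustration, here is a schematic of the kind of task-tracking pattern that can lose a wake-up (my own sketch, not the actual LiteDB code):

```csharp
using System.Collections.Concurrent;
using System.Threading.Tasks;

// Schematic only: a queue that tracks a "current" _task and restarts it on demand.
public class TaskTrackingQueue
{
    private readonly ConcurrentQueue<byte[]> _queue = new();
    private Task _task;

    public void EnqueuePage(byte[] page)
    {
        _queue.Enqueue(page);

        // Check-then-act race: the worker may have just seen an empty queue
        // and be about to complete, while _task.IsCompleted is still false.
        // In that window no new task is started, and the page sits in the
        // queue until a later enqueue happens to observe a completed task.
        if (_task == null || _task.IsCompleted)
        {
            _task = Task.Run(ExecuteQueue);
        }
    }

    private void ExecuteQueue()
    {
        while (_queue.TryDequeue(out var page))
        {
            // ... write page to disk ...
        }
        // a page enqueued right here is lost until the next wake-up
    }
}
```

Once that happens, a later Wait() blocks on a queue that nobody is draining.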

My repro steps were to run a lot of Inserts and Deletes in parallel (to fill up the disk queue), and then every couple of seconds run _db.Checkpoint() to force a full db lock and a Wait() invocation.
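
Roughly, the repro looks like this (a minimal sketch; the file name, document shape, worker count, and timings are illustrative):

```csharp
using System;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;
using LiteDB;

class Repro
{
    static void Main()
    {
        using var db = new LiteDatabase("repro.db");
        var col = db.GetCollection<BsonDocument>("items");
        using var cts = new CancellationTokenSource(TimeSpan.FromMinutes(2));

        // Hammer the disk queue with parallel inserts and deletes.
        var workers = Enumerable.Range(0, 8).Select(_ => Task.Run(() =>
        {
            var rnd = new Random();
            while (!cts.IsCancellationRequested)
            {
                var id = col.Insert(new BsonDocument { ["payload"] = new string('x', 1024) });
                if (rnd.Next(2) == 0) col.Delete(id);
            }
        })).ToArray();

        // Every couple of seconds force a full db lock and a Wait() on the queue.
        while (!cts.IsCancellationRequested)
        {
            Thread.Sleep(2000);
            db.Checkpoint();
        }

        Task.WaitAll(workers);
    }
}
```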

The fix is to use a much simpler blocking approach, with one thread dedicated to draining the queue. It is a good tradeoff IMO for now, and it can later be replaced with an awaitable mutex version. Edit: I have since added an async version of the semaphore that does not block a thread.
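
The shape of the fix is roughly the following (a simplified sketch of the pattern, not the exact code in this PR): producers enqueue and release an async semaphore, and a single dedicated consumer awaits it and drains the queue, so there is no "current task" to lose track of.

```csharp
using System.Collections.Concurrent;
using System.IO;
using System.Threading;
using System.Threading.Tasks;

// Sketch of the pattern only: shutdown/disposal is omitted for brevity.
public class SimpleDiskWriterQueue
{
    private readonly ConcurrentQueue<MemoryStream> _queue = new();
    private readonly SemaphoreSlim _signal = new(0);
    private readonly Stream _stream;
    private readonly Task _consumer; // single dedicated worker, kept for later await/teardown

    public SimpleDiskWriterQueue(Stream stream)
    {
        _stream = stream;
        _consumer = Task.Run(ConsumeAsync);
    }

    public void EnqueuePage(MemoryStream page)
    {
        _queue.Enqueue(page);
        _signal.Release(); // wake the consumer; no task state to race on
    }

    public void Wait()
    {
        // Naive drain used by Checkpoint-style callers: block until the
        // single consumer has emptied the queue, then flush.
        while (!_queue.IsEmpty) Thread.Sleep(1);
        _stream.Flush();
    }

    private async Task ConsumeAsync()
    {
        while (true)
        {
            await _signal.WaitAsync(); // async wait: does not block a thread
            if (_queue.TryDequeue(out var page))
            {
                page.Position = 0;
                await page.CopyToAsync(_stream);
            }
        }
    }
}
```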

mbdavid commented 9 months ago

Thanks! This is old code that needed to be updated.

jdtkw commented 8 months ago

Thanks @ltetak - this indeed resolved our issue (#2307 - I work with @dgodwin1175), but v5.0.18 and v5.0.19 cause us to hit #2435 before we can validate this with an official build. A custom build of #2436 on top of v5.0.19 (which includes #2411) suggests that we can have a stable solution.

ltetak commented 8 months ago

hi @jdtkw, transactions (and especially the AutoTransaction class) were the next thing I wanted to look at. I know about a couple of problems there:

  1. AutoTransaction can fail while reverting the transaction - this is bad by itself, but doubly bad because it hides the original exception (see the sketch after this list).
  2. Error handling in transactions is wrong, which leads to wrong counts. #2436 may fix it, but we need to be sure the DB ends up in a good state. There are a lot of "ENSURE" errors; my guess is that some transaction does not return the DB to a valid state and that breaks it. We run the database in single-threaded mode (we serialize every access to the db with locks), so it must be either a problem in the algorithm somewhere or some external exception. I have some evidence that external exceptions make the problem much worse, so I would start there - it means an unstable storage medium that throws random exceptions may lead to a corrupted database (which should not happen, thanks to the journal approach).
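
To make point 1 concrete, here is a hypothetical sketch (not LiteDB's actual AutoTransaction code) of how a failing rollback inside a catch block hides the root-cause exception, and one way to keep both failures visible:

```csharp
using System;

// Illustrative only: the method names and shape are invented for this sketch.
public static class AutoTransactionSketch
{
    public static void RunBroken(Action body, Action rollback)
    {
        try
        {
            body();
        }
        catch
        {
            rollback(); // if this throws, the original exception is lost
            throw;
        }
    }

    public static void RunSafer(Action body, Action rollback)
    {
        try
        {
            body();
        }
        catch (Exception original)
        {
            try
            {
                rollback();
            }
            catch (Exception fromRollback)
            {
                // surface both failures instead of hiding the root cause
                throw new AggregateException(original, fromRollback);
            }
            throw;
        }
    }
}
```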