Closed oleksii-datsiuk closed 4 months ago
In addition to fixing the problem described in the issue, this PR has 2 more improvements:
In Snapshot.NewPage() there is comment: // there is need for _header.Savepoint() because changes here will incremental and will be persist later // if any problem occurs here, rollback will catch this changes" However there is no savepoint in this function (and according to the history never was). So I added savepoint here. But not exactly in the place where comment is written, but in the "else" block inside this function, where savepoint seems to be important.
I made WalIndexService.CurrentReadVersion synchronized. Its backing field (_currentReadVersion) is accessed (read and modified) in the same class in Clear() and ConfirmTransaction() functions. These functions are atomic by using _indexLock. Especially ConfirmTransaction is interesting, because it first increases _currentReadVersion and then adds some data based on this increased value to the WAL index. So it seems unsafe to be able to access WalIndexService.CurrentReadVersion property while ConfirmTransaction() is still being executed.
Hi @oleksii-datsiuk,
Today I managed to do the tests I needed to better understand the problem. I could see the competition problem occurring with your example. I want to thank you again for your attention and effort, as I know it is a very complex job. The whole community thanks you!
Hi @mbdavid
Thank you very much for your appreciation! In fact, we worked on this issue together with Volodymyr Dombrovsky (https://github.com/dombrovsky), so we'll share thanks together ))
Honestly, we had almost no choice other than to investigate this problem and try to fix it. Our product strongly depends on LiteDB, and the more clients are using it, the more we see problems with DB corruption.
After creating this PR I also have found that the fix provided here solves not only scenarios described in original issue (https://github.com/mbdavid/LiteDB/issues/2503), but also some other related scenarios that finally may cause DB corruption. So for us it was highly important to have this fix. Right now we are using our own fork from version 5.0.17 where we apply our fixes. We had some additional problems with 5.0.19, so we are not risking to switch to the latest version right now.
Btw, I'm not sure that you noticed one more issue https://github.com/mbdavid/LiteDB/issues/2480 and PR https://github.com/mbdavid/LiteDB/pull/2481, as it was merged not by you. There we have discovered another scenario in which we were getting DB corruption quite often. My PR prevents DB corruption by preventing usage of disposed snapshots, but maybe there could be some better solution that will also allow problematic use-case to work properly (see issue description).
We also would like to thank you for the great product. We hope that all critical issues will soon be stabilized, and we all will enjoy it to the fullest.
This PR fixes "ENSURE: empty page must be defined as empty type" error which is described in this issue: https://github.com/mbdavid/LiteDB/issues/2503