cristianoalvesdasilva opened 4 years ago
@cristianoalvesdasilva Could you provide a datafile where this happened?
@lbnascimento - Unfortunately I can't - There's sensitive data in it.
@cristianoalvesdasilva What's really weird about your stack trace is that apparently there were two exceptions thrown.
Take a look at the LiteEngine ctor. LiteEngine.Dispose(bool) was called right after the ctor, and that only happens in the catch block. The catch block disposes the engine and runs a checkpoint, re-throwing the original exception at the end. Except this never happened, because another exception was thrown during the checkpoint (more specifically, during the FileStream.Seek).
If you could try to open the datafile in debug mode and find out what was the original exception, that would be helpful.
@lbnascimento - From what I could see, all of the reported issues happen when the user has previously 'killed' the app (swiped it away in Android).
I tried to reproduce it locally, and noticed that when a call to LiteDB.Dispose() happens, it gets stuck (it looks like some sort of internal deadlock).
This is how we dispose the aforementioned db instance:
public void Dispose()
{
if (this.DatabaseLazy.IsValueCreated)
{
this.Database.Dispose();
}
}
My debug output:
"<unnamed thread>" at LiteDB.Engine.AesStream.Read (byte[],int,int) [0x00000] in <7b8c154640e6442cbc93e4644732d128>:0
at LiteDB.Engine.DiskService/<ReadFull>d__22.MoveNext () [0x000ab] in <7b8c154640e6442cbc93e4644732d128>:0
at LiteDB.Engine.WalIndexService/<>c__DisplayClass19_0/<<CheckpointInternal>g__source|0>d.MoveNext () [0x000d0] in <7b8c154640e6442cbc93e4644732d128>:0
at LiteDB.Engine.DiskService.Write (System.Collections.Generic.IEnumerable`1<LiteDB.Engine.PageBuffer>,LiteDB.Engine.FileOrigin) [0x00068] in <7b8c154640e6442cbc93e4644732d128>:0
at LiteDB.Engine.WalIndexService.CheckpointInternal () [0x00031] in <7b8c154640e6442cbc93e4644732d128>:0
at LiteDB.Engine.WalIndexService.TryCheckpoint () [0x00030] in <7b8c154640e6442cbc93e4644732d128>:0
at LiteDB.Engine.LiteEngine.Dispose (bool) [0x00045] in <7b8c154640e6442cbc93e4644732d128>:0
at LiteDB.Engine.LiteEngine.Dispose () [0x00002] in <7b8c154640e6442cbc93e4644732d128>:0
at LiteDB.LiteDatabase.Dispose (bool) [0x00011] in <7b8c154640e6442cbc93e4644732d128>:0
at LiteDB.LiteDatabase.Dispose () [0x00002] in <7b8c154640e6442cbc93e4644732d128>:0
at App.Mobile.Infra.Offline.TransientRecordRepository.Dispose () [0x00011] in C:\java\gitlab\dfsm\source\App\App.Mobile.Infra.Offline\TransientRecordRepository.cs:220
at App.Mobile.Infra.Offline.TransientRecordSynchronizer.Dispose () [0x0000c] in C:\java\gitlab\dfsm\source\App\App.Mobile.Infra.Offline\TransientRecordSynchronizer.cs:370
at App.Mobile.Infra.Services.Offline.OfflineManager.Dispose () [0x00008] in C:\java\gitlab\dfsm\source\App\App.Mobile.Infra.Services\Offline\OfflineManager.cs:104
at App.Mobile.Ui.Services.Boostrap.AppManager.Dispose () [0x00022] in C:\java\gitlab\dfsm\source\App\App.Mobile.Ui.Services\Boostrap\AppManager.cs:132
at App.Mobile.Ui.Services.Boostrap.AppManager/<ShutdownAsync>d__57.MoveNext () [0x0009b] in C:\java\gitlab\dfsm\source\App\App.Mobile.Ui.Services\Boostrap\AppManager.cs:153
at System.Runtime.CompilerServices.AsyncMethodBuilderCore/MoveNextRunner.InvokeMoveNext (object) [0x00000] in /Users/builder/jenkins/workspace/archive-mono/2020-02/android/release/mcs/class/referencesource/mscorlib/system/runtime/compilerservices/AsyncMethodBuilder.cs:1092
at System.Threading.ExecutionContext.RunInternal (System.Threading.ExecutionContext,System.Threading.ContextCallback,object,bool) [0x00071] in /Users/builder/jenkins/workspace/archive-mono/2020-02/android/release/mcs/class/referencesource/mscorlib/system/threading/executioncontext.cs:968
at System.Threading.ExecutionContext.Run (System.Threading.ExecutionContext,System.Threading.ContextCallback,object,bool) [0x00000] in /Users/builder/jenkins/workspace/archive-mono/2020-02/android/release/mcs/class/referencesource/mscorlib/system/threading/executioncontext.cs:910
at System.Runtime.CompilerServices.AsyncMethodBuilderCore/MoveNextRunner.Run () [0x00024] in /Users/builder/jenkins/workspace/archive-mono/2020-02/android/release/mcs/class/referencesource/mscorlib/system/runtime/compilerservices/AsyncMethodBuilder.cs:1073
at System.Threading.Tasks.SynchronizationContextAwaitTaskContinuation/<>c.<.cctor>b__7_0 (object) [0x00000] in /Users/builder/jenkins/workspace/archive-mono/2020-02/android/release/external/corert/src/System.Private.CoreLib/src/System/Threading/Tasks/TaskContinuation.cs:379
at Android.App.SyncContext/<>c__DisplayClass2_0.<Post>b__0 () [0x0000c] in <eaa205f580954a64824b74a79fa87c62>:0
at Java.Lang.Thread/RunnableImplementor.Run () [0x0000e] in <eaa205f580954a64824b74a79fa87c62>:0
at Java.Lang.IRunnableInvoker.n_Run (intptr,intptr) [0x00008] in <eaa205f580954a64824b74a79fa87c62>:0
at (wrapper dynamic-method) Android.Runtime.DynamicMethodNameCounter.1 (intptr,intptr) [0x00011] in <eaa205f580954a64824b74a79fa87c62>:0
at (wrapper native-to-managed) Android.Runtime.DynamicMethodNameCounter.1 (intptr,intptr) [0x00033] in <eaa205f580954a64824b74a79fa87c62>:0
Eventually, the OS itself will simply kill the app since it's stuck... then, what I noticed (and also got quite a few user reports about) is data loss - data that was there before killing the app is no longer available once we re-launch the application. My bet is: since Dispose could not fulfill its job of persisting the checkpoint back to the data file, the data somehow got lost. This piece I can easily reproduce every single time Dispose is called.
In the meantime I'm going through the code to see if there could be another culprit.
Thanks in advance!
P.S.: This happens whenever there is data for the checkpoint to store, meaning after any calls to Upsert, Remove, etc. If there are no pending checkpoints, Dispose works just fine.
I already tried decreasing the CheckpointSize, but no luck. Also, if I manually call Checkpoint(), I get the same result as above: it gets stuck.
P.S II: After debugging a little further I got to the very line which is causing the app to hang: https://github.com/mbdavid/LiteDB/blob/5d88c20a14f962e685d255347b3d407d38538248/LiteDB/Engine/Services/WalIndexService.cs#L291
After digging, I noticed there's a task.Wait() call which deadlocks against the main UI thread.
I'm guessing this is a known issue already pointed out on: https://github.com/mbdavid/LiteDB/issues/1490 https://github.com/mbdavid/LiteDB/issues/1511
A possible fix is to add ConfigureAwait(false) to the task being waited on: https://github.com/mbdavid/LiteDB/blob/215135bedf93d365a8f331338f92b99c29414ccb/LiteDB/Engine/Disk/DiskWriterQueue.cs#L73
Reason behind it (from the very guy who wrote most of the .NET TPL himself): https://devblogs.microsoft.com/dotnet/configureawait-faq/#why-would-i-want-to-use-configureawaitfalse
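To illustrate why a plain task.Wait() hangs here, below is a minimal, hypothetical C# sketch (not LiteDB's actual code) of the classic sync-over-async deadlock: the continuation after the await is posted back to the captured UI SynchronizationContext, but the UI thread is blocked inside Wait(), so neither side can make progress.

```csharp
using System.Threading.Tasks;

class DeadlockSketch
{
    static async Task QueueWriteAsync()
    {
        // without ConfigureAwait(false), the code after the await is posted
        // back to the captured context (the Android UI thread in this case)
        await Task.Delay(100);
        // ... flush pages to disk ...
    }

    static void DisposeOnUiThread()
    {
        // the UI thread blocks here, but the continuation above needs the
        // UI thread in order to run -> deadlock
        QueueWriteAsync().Wait();
    }

    static async Task QueueWriteSafeAsync()
    {
        // ConfigureAwait(false) resumes on a thread-pool thread instead,
        // so a blocking Wait() on the UI thread can still complete
        await Task.Delay(100).ConfigureAwait(false);
        // ... flush pages to disk ...
    }
}
```

This is why ConfigureAwait(false) in DiskWriterQueue would unblock the shutdown path, even though it doesn't change what the queue actually writes.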
@cristianoalvesdasilva If adding ConfigureAwait(false) prevents it from happening, then it is definitely something to be considered; however, I think it may only mitigate the issue rather than fix the root cause.
However, I have a suspicion regarding what happened. If you're not willing to send me your data and log files, would you be at least willing to send me part of them? The first 9kB of each should be enough (I don't even need the password).
@lbnascimento - In order to perform a more detailed debugging, this morning I downloaded the master branch (LiteDB) to hook it up into my application... to my surprise, I was no longer getting the issues described above. I know the master branch is just a few commits ahead of 5.0.9 (which I also tried, with no luck), which I honestly couldn't get my head around.
So I ended up shipping a fresh version (with a locally compiled DLL built from the master branch, pretty much a -pre if you will) to a few pilot users to see whether the original issue keeps happening. If possible, I'd like to keep this bug open until Friday, tops, so I can at least get a fresh database from the field (the users will run a fresh install, which deletes the existing .db) and share it with you.
Sounds good?
Apart from that, thank you so much for your prompt responses so far!
@lbnascimento - Well, it turns out we're still getting the original error System.IO.IOException: Win32 IO returned 25, but we're having a hard time getting our hands on the .db files. Hopefully soon we'll ship a new version where collecting the db remotely is possible. For the time being we've disabled the auto-checkpoint (by setting CheckpointSize = 0) and we're calling it manually when needed.
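For reference, the workaround described above looks roughly like this (a sketch assuming LiteDB 5's CheckpointSize property and Checkpoint() method; the file name and collection are illustrative):

```csharp
using LiteDB;

using (var db = new LiteDatabase("Filename=app.db;Connection=direct"))
{
    // 0 disables the automatic checkpoint entirely
    db.CheckpointSize = 0;

    var col = db.GetCollection<BsonDocument>("entries");
    col.Upsert(new BsonDocument { ["_id"] = 1, ["payload"] = "..." });

    // run the checkpoint manually at a moment we control,
    // ideally off the UI thread
    db.Checkpoint();
}
```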
Should you want to close this for now given I could not provide you with a file, please be my guest.
If I may, I just wanted to share a few suggestions:
When the ctor's catch block disposes the engine and that disposal itself throws, the original exception is lost. You could use an AggregateException to report both:
catch (Exception ex)
{
LOG(ex.Message, "ERROR");
try
{
// explicit dispose (but do not run shutdown operation)
this.Dispose(true);
}
catch (Exception dex)
{
throw new AggregateException(ex, dex);
}
throw;
}
Anyway, I think this would be a good change to make here: https://github.com/mbdavid/LiteDB/blob/81b3970eaaca366e17ecc0e4d179f448191b1f8f/LiteDB/Engine/LiteEngine.cs#L109
Current code:
/// <summary>
/// Shutdown process:
/// - Stop any new transaction
/// - Stop operation loops over database (throw in SafePoint)
/// - Wait for writer queue
/// - Close disks
/// </summary>
protected virtual void Dispose(bool disposing)
{
    // this method can be called from the ctor, so many of
    // these members may still be null (even if they are readonly)
    if (_disposed) return;

    if (disposing)
    {
        // stop running all transactions
        _monitor?.Dispose();

        // do a soft checkpoint (only if exclusive lock is possible)
        if (_header?.Pragmas.Checkpoint > 0) _walIndex?.TryCheckpoint();

        // close all disk streams (and delete log if empty)
        _disk?.Dispose();

        // delete sort temp file
        _sortDisk?.Dispose();

        // dispose lockers
        _locker?.Dispose();
    }

    LOG("engine disposed", "ENGINE");

    _disposed = true;
}
When the line below fails, you end up not disposing _disk, _sortDisk and _locker - which in turn leaves dead streams in memory. You can and should report any error while disposing, but ideally all unused resources should be released first, and only then should you throw whatever exception(s) you caught.
if (_header?.Pragmas.Checkpoint > 0) _walIndex?.TryCheckpoint();
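One way to implement that suggestion is to wrap the checkpoint in a try/finally so the remaining members are always disposed; this is just a sketch of the idea against the fields shown above, not the actual LiteDB fix:

```csharp
protected virtual void Dispose(bool disposing)
{
    if (_disposed) return;

    if (disposing)
    {
        try
        {
            _monitor?.Dispose();

            // the checkpoint may throw (e.g. during FileStream.Seek)...
            if (_header?.Pragmas.Checkpoint > 0) _walIndex?.TryCheckpoint();
        }
        finally
        {
            // ...but the streams, temp file and lockers are released anyway,
            // and any exception still propagates to the caller
            _disk?.Dispose();
            _sortDisk?.Dispose();
            _locker?.Dispose();
        }
    }

    _disposed = true;
}
```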
@cristianoalvesdasilva did you solve this? Perhaps also disposing _disk, _sortDisk and _locker in case of exceptions?
> @cristianoalvesdasilva did you solve this? Perhaps also disposing _disk, _sortDisk and _locker in case of exceptions?
The issue itself is still intermittent. My attempt here was more towards ensuring no leftovers remain in memory.
LiteDB 5.0.8, Xamarin.Forms 4.7.0.1239
Hello,
We've recently launched a brand new version of our app, and we noticed some crash reports in AppCenter when first trying to open a LiteDB instance.
We're using LiteDB in 'Direct' mode, where the instance is created only once per app lifetime (it resides in a singleton object).
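For context, our setup is roughly the following (names are illustrative, not our actual code):

```csharp
using System;
using LiteDB;

public sealed class DatabaseHolder : IDisposable
{
    // one Direct-mode instance for the whole app lifetime, created lazily
    private readonly Lazy<LiteDatabase> _db =
        new Lazy<LiteDatabase>(() => new LiteDatabase("Filename=app.db;Connection=direct"));

    public LiteDatabase Database => _db.Value;

    public void Dispose()
    {
        // only dispose if the instance was actually created
        if (_db.IsValueCreated) _db.Value.Dispose();
    }
}
```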
That's the error stack trace we got from our crashing reports:
Environment specs we saw this happening on: OS: Android 10 Device: SM-G975F (SAMSUNG)
One suspicion I had was the database size, but the app was freshly installed and we store just a couple of JSON entries a day.
We're not able to reproduce it in our test lab, but nevertheless I'm seeking your help here. Not sure if you've ever run into this issue before... or if there's some sort of routine/method we could call on top of LiteDB to make the database healthy again, because it seems that once this error happens, the database gets corrupted.
Thanks in advance! Cris