mbdavid / LiteDB

LiteDB - A .NET NoSQL Document Store in a single data file
http://www.litedb.org
MIT License

[BUG] - System.IO.IOException: Win32 IO returned 25 #1785

Open cristianoalvesdasilva opened 4 years ago

cristianoalvesdasilva commented 4 years ago

LiteDB 5.0.8 Xamarin Forms 4.7.0.1239

Hello,

We've recently launched a brand new version of our app and we noticed some crash reports in AppCenter when first trying to open a LiteDB instance.

We're using LiteDB in 'Direct' mode, where the instance is created only once per app lifetime (it resides in a singleton object).

private Lazy<LiteDatabase> DatabaseLazy { get; }

private LiteDatabase Database => this.DatabaseLazy.Value;

public TransientRecordRepository(ISchemaNameProvider schemaNameProvider)
{
  this.SchemaNameProvider = schemaNameProvider;
  this.DatabaseLazy = new Lazy<LiteDatabase>(() => this.CreateConnection(FileSystem.AppDataDirectory), LazyThreadSafetyMode.ExecutionAndPublication);
}

...

private LiteDatabase CreateConnection(string baseDirectory)
{
  const string DbNameFile = "offline.db";
  var database = new LiteDatabase($"FileName={Path.Combine(baseDirectory, DbNameFile)};Password=...", new TransientRecordSerializationMapper());
  var collection = database.GetCollection<TransientRecord>();

  // Creating index.
  collection.EnsureIndex(_ => _.Key);
  collection.EnsureIndex(_ => _.Group);
  collection.EnsureIndex(_ => _.Status);
  collection.EnsureIndex(_ => _.Owner);
  collection.EnsureIndex(_ => _.Schema);

  return database;
}

Here's the error stack trace we got from our crash reports:

System.IO.IOException: Win32 IO returned 25. Path: /data/user/0/App.Mobile.Android/files/offline.db
  at System.IO.FileStream.Seek (System.Int64 offset, System.IO.SeekOrigin origin) [0x000bd] in <c5ec781d4140439c90c5e21f6abebca3>:0 
  at LiteDB.Engine.AesStream.Seek (System.Int64 offset, System.IO.SeekOrigin origin) [0x00000] in <7b8c154640e6442cbc93e4644732d128>:0 
  at LiteDB.Engine.AesStream.set_Position (System.Int64 value) [0x00000] in <7b8c154640e6442cbc93e4644732d128>:0 
  at LiteDB.Engine.DiskService.Write (System.Collections.Generic.IEnumerable`1[T] pages, LiteDB.Engine.FileOrigin origin) [0x00044] in <7b8c154640e6442cbc93e4644732d128>:0 
  at LiteDB.Engine.WalIndexService.CheckpointInternal () [0x00030] in <7b8c154640e6442cbc93e4644732d128>:0 
  at LiteDB.Engine.WalIndexService.TryCheckpoint () [0x0002f] in <7b8c154640e6442cbc93e4644732d128>:0 
  at LiteDB.Engine.LiteEngine.Dispose (System.Boolean disposing) [0x00043] in <7b8c154640e6442cbc93e4644732d128>:0 
  at LiteDB.Engine.LiteEngine..ctor (LiteDB.Engine.EngineSettings settings) [0x0016c] in <7b8c154640e6442cbc93e4644732d128>:0 
  at LiteDB.ConnectionString.CreateEngine () [0x0004a] in <7b8c154640e6442cbc93e4644732d128>:0 
  at LiteDB.LiteDatabase..ctor (LiteDB.ConnectionString connectionString, LiteDB.BsonMapper mapper) [0x0002e] in <7b8c154640e6442cbc93e4644732d128>:0 
  at LiteDB.LiteDatabase..ctor (System.String connectionString, LiteDB.BsonMapper mapper) [0x00007] in <7b8c154640e6442cbc93e4644732d128>:0 
  at App.Mobile.Infra.Offline.TransientRecordRepository.CreateConnection (System.String baseDirectory) [0x0001f] in <d2778a08005f4e6b83a31c51545cd890>:0 
  at App.Mobile.Infra.Offline.TransientRecordRepository.<.ctor>b__8_0 () [0x00006] in <d2778a08005f4e6b83a31c51545cd890>:0 
  at System.Lazy`1[T].ViaFactory (System.Threading.LazyThreadSafetyMode mode) [0x0001c] in <c5ec781d4140439c90c5e21f6abebca3>:0 

Environment specs we saw this happening on: OS: Android 10; Device: SM-G975F (SAMSUNG)

One suspicion I had was the database size, but the app was freshly installed and we only store a couple of JSON entries a day.

We're not able to reproduce it in our test lab, but nevertheless I'm seeking your help here. Not sure if you've ever run into this issue before... or if there's some sort of routine/method we could call on top of LiteDB to make the database healthy again, because it seems that once this error happens, the database gets corrupted.

Thanks in advance! Cris

lbnascimento commented 4 years ago

@cristianoalvesdasilva Could you provide a datafile where this happened?

cristianoalvesdasilva commented 4 years ago

@lbnascimento - Unfortunately I can't - There's sensitive data in it.

lbnascimento commented 4 years ago

@cristianoalvesdasilva What's really weird about your stack trace is that apparently there were two exceptions thrown.

Take a look at the LiteEngine ctor. LiteEngine.Dispose(bool) was called right after the ctor, and this only happens in the catch block. The catch block disposes the engine and runs a checkpoint, throwing the original exception back at the end. Except this never happened because another exception was thrown during the checkpoint (more specifically, during the FileStream.Seek).
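
For reference, here is a simplified sketch of the constructor flow being described (illustrative only, not the exact LiteDB source):

public LiteEngine(EngineSettings settings)
{
    try
    {
        // ... open the data file, initialize disk/WAL services, read the header ...
    }
    catch (Exception)
    {
        // Dispose the partially-constructed engine; Dispose(true) runs
        // TryCheckpoint(), and in this crash the checkpoint itself threw
        // (FileStream.Seek -> "Win32 IO returned 25"), so the original
        // exception was never surfaced.
        this.Dispose(true);
        throw;
    }
}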

If you could try to open the datafile in debug mode and find out what was the original exception, that would be helpful.

cristianoalvesdasilva commented 4 years ago

@lbnascimento - From what I could see, all of the reported issues happen when the user has previously 'killed' the app (by swiping it away in Android).

I tried to reproduce it locally, and noticed that when LiteDB.Dispose() is called, it gets stuck (it looks like some sort of internal deadlock):

Here's how we dispose the aforementioned db instance:

public void Dispose()
{
  if (this.DatabaseLazy.IsValueCreated)
  {
    this.Database.Dispose();
  }
}

My debug output:

"<unnamed thread>"  at LiteDB.Engine.AesStream.Read (byte[],int,int) [0x00000] in <7b8c154640e6442cbc93e4644732d128>:0
  at LiteDB.Engine.DiskService/<ReadFull>d__22.MoveNext () [0x000ab] in <7b8c154640e6442cbc93e4644732d128>:0
  at LiteDB.Engine.WalIndexService/<>c__DisplayClass19_0/<<CheckpointInternal>g__source|0>d.MoveNext () [0x000d0] in <7b8c154640e6442cbc93e4644732d128>:0
  at LiteDB.Engine.DiskService.Write (System.Collections.Generic.IEnumerable`1<LiteDB.Engine.PageBuffer>,LiteDB.Engine.FileOrigin) [0x00068] in <7b8c154640e6442cbc93e4644732d128>:0
  at LiteDB.Engine.WalIndexService.CheckpointInternal () [0x00031] in <7b8c154640e6442cbc93e4644732d128>:0
  at LiteDB.Engine.WalIndexService.TryCheckpoint () [0x00030] in <7b8c154640e6442cbc93e4644732d128>:0
  at LiteDB.Engine.LiteEngine.Dispose (bool) [0x00045] in <7b8c154640e6442cbc93e4644732d128>:0
  at LiteDB.Engine.LiteEngine.Dispose () [0x00002] in <7b8c154640e6442cbc93e4644732d128>:0
  at LiteDB.LiteDatabase.Dispose (bool) [0x00011] in <7b8c154640e6442cbc93e4644732d128>:0
  at LiteDB.LiteDatabase.Dispose () [0x00002] in <7b8c154640e6442cbc93e4644732d128>:0
  at App.Mobile.Infra.Offline.TransientRecordRepository.Dispose () [0x00011] in C:\java\gitlab\dfsm\source\App\App.Mobile.Infra.Offline\TransientRecordRepository.cs:220
  at App.Mobile.Infra.Offline.TransientRecordSynchronizer.Dispose () [0x0000c] in C:\java\gitlab\dfsm\source\App\App.Mobile.Infra.Offline\TransientRecordSynchronizer.cs:370
  at App.Mobile.Infra.Services.Offline.OfflineManager.Dispose () [0x00008] in C:\java\gitlab\dfsm\source\App\App.Mobile.Infra.Services\Offline\OfflineManager.cs:104
  at App.Mobile.Ui.Services.Boostrap.AppManager.Dispose () [0x00022] in C:\java\gitlab\dfsm\source\App\App.Mobile.Ui.Services\Boostrap\AppManager.cs:132
  at App.Mobile.Ui.Services.Boostrap.AppManager/<ShutdownAsync>d__57.MoveNext () [0x0009b] in C:\java\gitlab\dfsm\source\App\App.Mobile.Ui.Services\Boostrap\AppManager.cs:153
  at System.Runtime.CompilerServices.AsyncMethodBuilderCore/MoveNextRunner.InvokeMoveNext (object) [0x00000] in /Users/builder/jenkins/workspace/archive-mono/2020-02/android/release/mcs/class/referencesource/mscorlib/system/runtime/compilerservices/AsyncMethodBuilder.cs:1092
  at System.Threading.ExecutionContext.RunInternal (System.Threading.ExecutionContext,System.Threading.ContextCallback,object,bool) [0x00071] in /Users/builder/jenkins/workspace/archive-mono/2020-02/android/release/mcs/class/referencesource/mscorlib/system/threading/executioncontext.cs:968
  at System.Threading.ExecutionContext.Run (System.Threading.ExecutionContext,System.Threading.ContextCallback,object,bool) [0x00000] in /Users/builder/jenkins/workspace/archive-mono/2020-02/android/release/mcs/class/referencesource/mscorlib/system/threading/executioncontext.cs:910
  at System.Runtime.CompilerServices.AsyncMethodBuilderCore/MoveNextRunner.Run () [0x00024] in /Users/builder/jenkins/workspace/archive-mono/2020-02/android/release/mcs/class/referencesource/mscorlib/system/runtime/compilerservices/AsyncMethodBuilder.cs:1073
  at System.Threading.Tasks.SynchronizationContextAwaitTaskContinuation/<>c.<.cctor>b__7_0 (object) [0x00000] in /Users/builder/jenkins/workspace/archive-mono/2020-02/android/release/external/corert/src/System.Private.CoreLib/src/System/Threading/Tasks/TaskContinuation.cs:379
  at Android.App.SyncContext/<>c__DisplayClass2_0.<Post>b__0 () [0x0000c] in <eaa205f580954a64824b74a79fa87c62>:0
  at Java.Lang.Thread/RunnableImplementor.Run () [0x0000e] in <eaa205f580954a64824b74a79fa87c62>:0
  at Java.Lang.IRunnableInvoker.n_Run (intptr,intptr) [0x00008] in <eaa205f580954a64824b74a79fa87c62>:0
  at (wrapper dynamic-method) Android.Runtime.DynamicMethodNameCounter.1 (intptr,intptr) [0x00011] in <eaa205f580954a64824b74a79fa87c62>:0
  at (wrapper native-to-managed) Android.Runtime.DynamicMethodNameCounter.1 (intptr,intptr) [0x00033] in <eaa205f580954a64824b74a79fa87c62>:0

Eventually, the OS itself will forcibly kill the app since it's stuck... Then, what I noticed (and also got quite a few user reports about) is data loss: data that was there before killing the app is no longer available once we re-launch the application. My bet is that since Dispose could not fulfill its job of persisting the checkpoint back to the data file, the data somehow got lost. This part I can easily reproduce every single time Dispose is called.

In the meantime I'm going through the code to see if there could be another culprit.

Thanks in advance!

cristianoalvesdasilva commented 4 years ago

P.S.: This happens whenever there is data for the checkpoint to store, meaning after any calls to Upsert, Remove, etc. If there is nothing to checkpoint, Dispose works just fine.

I already tried decreasing the CheckpointSize, but no luck. Also, if I manually call Checkpoint() I get the same result as above: it gets stuck.

cristianoalvesdasilva commented 4 years ago

P.S II: After debugging a little further I got to the very line which is causing the app to hang: https://github.com/mbdavid/LiteDB/blob/5d88c20a14f962e685d255347b3d407d38538248/LiteDB/Engine/Services/WalIndexService.cs#L291

After digging, I noticed there's a task.Wait() which is blocking against the main UI thread.

I'm guessing this is a known issue already pointed out on: https://github.com/mbdavid/LiteDB/issues/1490 https://github.com/mbdavid/LiteDB/issues/1511

A possible fix is to add ConfigureAwait(false) to the awaited task here: https://github.com/mbdavid/LiteDB/blob/215135bedf93d365a8f331338f92b99c29414ccb/LiteDB/Engine/Disk/DiskWriterQueue.cs#L73

Reason behind it (from the very guy who wrote most of the .NET TPL himself): https://devblogs.microsoft.com/dotnet/configureawait-faq/#why-would-i-want-to-use-configureawaitfalse
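
To make the deadlock concrete, here is a minimal, self-contained sketch of the sync-over-async pattern described above (illustrative only; the names are made up and this is not LiteDB code). On a thread with a UI SynchronizationContext (the Xamarin/Android main thread), blocking with Wait() on a task whose continuation must resume on that same thread hangs forever; ConfigureAwait(false) lets the continuation run on a thread-pool thread instead:

using System.Threading.Tasks;

class DeadlockSketch
{
    // Called on the UI thread (Xamarin/Android main thread), which has a
    // SynchronizationContext installed.
    public void Dispose()
    {
        // Sync-over-async: block the UI thread until FlushAsync completes.
        FlushAsync().Wait();
    }

    private async Task FlushAsync()
    {
        // Without ConfigureAwait(false), the continuation after this await is
        // posted back to the captured UI SynchronizationContext. But the UI
        // thread is stuck in Wait() above, so the continuation never runs and
        // Wait() never returns: a classic deadlock.
        //
        // With ConfigureAwait(false), the continuation runs on a thread-pool
        // thread instead, FlushAsync completes, and Wait() can return.
        await Task.Delay(10).ConfigureAwait(false);
    }
}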

lbnascimento commented 4 years ago

@cristianoalvesdasilva If adding ConfigureAwait(false) prevents it from happening, then it is definitely something to be considered; however, I think it may only mitigate the issue rather than fix the root cause.

However, I have a suspicion regarding what happened. If you're not willing to send me your data and log files, would you be at least willing to send me part of them? The first 9kB of each should be enough (I don't even need the password).

cristianoalvesdasilva commented 4 years ago

@lbnascimento - In order to do some more detailed debugging, this morning I downloaded the master branch (LiteDB) and hooked it into my application... and to my surprise, I was no longer seeing the issues described above. I know the master branch is only a few commits ahead of 5.0.9 (which I had also tried, with no luck), so I honestly couldn't get my head around it.

So, I ended up shipping a fresh version (with a locally compiled DLL built from the master branch - pretty much a pre-release, if you will) to a few pilot users to see whether the original issue keeps happening. If possible, I'd like to keep this bug open until Friday, tops, so that I can at least get a fresh database from the field (the users will run a fresh install, which deletes the existing .db) and share it with you.

Sounds good?

Apart from that, thank you so much for your prompt responses so far!

cristianoalvesdasilva commented 4 years ago

@lbnascimento - Well, it turns out we're still getting the original error System.IO.IOException: Win32 IO returned 25, but we're having a hard time getting our hands on the .db files. Hopefully we'll soon ship a new version where collecting the db remotely is possible. For the time being we've disabled the auto-checkpoint (by using CheckpointSize = 0) and we're calling it manually when needed.
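
For anyone else hitting this, here is a minimal sketch of that workaround, assuming LiteDB 5's LiteDatabase.CheckpointSize property and Checkpoint() method (dbPath and password are placeholders):

using LiteDB;

// dbPath and password are placeholders for your actual values.
var db = new LiteDatabase($"FileName={dbPath};Password={password}");

// Disable the automatic checkpoint, so Dispose() no longer tries to flush the
// WAL back into the data file on its own.
db.CheckpointSize = 0;

// ... Upsert / Remove / queries as usual ...

// Flush the WAL into the data file at a moment we control (e.g. when the app
// is backgrounded), instead of relying on the checkpoint inside Dispose().
db.Checkpoint();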

Should you want to close this for now given I could not provide you with a file, please be my guest.

If I may, I just wanted to share a few suggestions:

  1. I've just shipped a v5.0.9.2 to collect both the original exception (thrown while constructing the Engine) and the subsequent exception thrown while disposing the Engine. I'm using an AggregateException to accomplish that:
catch (Exception ex)
{
  LOG(ex.Message, "ERROR");

  try
  {
    // explicit dispose (but do not run shutdown operation)
    this.Dispose(true);
  }
  catch (Exception dex)
  {
    throw new AggregateException(ex, dex);
  }

  throw;
}

Anyway, I think it would be a good change to have here: https://github.com/mbdavid/LiteDB/blob/81b3970eaaca366e17ecc0e4d179f448191b1f8f/LiteDB/Engine/LiteEngine.cs#L109

  2. Given that the 'second' exception happens when attempting to dispose the not-yet-fully-created LiteEngine instance, IMO you need to ensure that the already-created resources are disposed regardless of any error.

https://github.com/mbdavid/LiteDB/blob/81b3970eaaca366e17ecc0e4d179f448191b1f8f/LiteDB/Engine/LiteEngine.cs#L150

Current code:

        /// <summary>
        /// Shutdown process:
        /// - Stop any new transaction
        /// - Stop operation loops over database (throw in SafePoint)
        /// - Wait for writer queue
        /// - Close disks
        /// </summary>
        protected virtual void Dispose(bool disposing)
        {
            // this method can be called from Ctor, so many 
            // of this members can be null yet (even if are readonly). 
            if (_disposed) return;

            if (disposing)
            {
                // stop running all transactions
                _monitor?.Dispose();

                // do a soft checkpoint (only if exclusive lock is possible)
                if (_header?.Pragmas.Checkpoint > 0) _walIndex?.TryCheckpoint();

                // close all disk streams (and delete log if empty)
                _disk?.Dispose();

                // delete sort temp file
                _sortDisk?.Dispose();

                // dispose lockers
                _locker?.Dispose();
            }

            LOG("engine disposed", "ENGINE");

            _disposed = true;
        }

    }

When the line below fails, you end up not disposing _disk, _sortDisk and _locker - which in turn leaves dead streams in memory. You can and should report any error while disposing, but ideally all unused resources should be released first, and only then should you throw whatever exception(s) you caught - see the sketch after the line below.

if (_header?.Pragmas.Checkpoint > 0) _walIndex?.TryCheckpoint();
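
A sketch of what that could look like (illustrative only, not a tested patch; it just reorders the existing calls so the disposals always run and any checkpoint error is rethrown afterwards):

protected virtual void Dispose(bool disposing)
{
    if (_disposed) return;

    Exception checkpointError = null;

    if (disposing)
    {
        // stop running all transactions
        _monitor?.Dispose();

        try
        {
            // soft checkpoint (only if exclusive lock is possible);
            // this is the call that can throw (e.g. "Win32 IO returned 25")
            if (_header?.Pragmas.Checkpoint > 0) _walIndex?.TryCheckpoint();
        }
        catch (Exception ex)
        {
            checkpointError = ex;
        }

        // always release the remaining resources, even if the checkpoint failed
        _disk?.Dispose();
        _sortDisk?.Dispose();
        _locker?.Dispose();
    }

    _disposed = true;

    // surface the checkpoint failure only after everything has been released
    // (a real fix might use ExceptionDispatchInfo to preserve the stack trace)
    if (checkpointError != null) throw checkpointError;
}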

gasparecoolshop commented 3 years ago

@cristianoalvesdasilva did you solve this? Perhaps by disposing _disk, _sortDisk and _locker in case of exceptions too?

cristianoalvesdasilva commented 3 years ago

@cristianoalvesdasilva did you solve this? Perhaps by disposing _disk, _sortDisk and _locker in case of exceptions too?

The issue itself is still intermittent. My attempt here was more towards ensuring no leftovers remain in memory.