microsoft / FASTER

Fast persistent recoverable log and key-value store + cache, in C# and C++.
https://aka.ms/FASTER
MIT License
6.32k stars 567 forks source link

Sporadic AccessViolationException on iteration GetNext #769

Closed avonwyss closed 1 year ago

avonwyss commented 1 year ago

Using FASTER 2.0.22 (we cannot update to 2.1.0 right now due to NuGet dependencies).

We have a custom compaction method which scans the log as follows:

fasterKv.Log.FlushAndEvict(true);
var firstDataAddress = default(long?);
var session = GetPooledSession();
try {
    var alive = 0L;
    using (var iter = fasterKv.Log.Scan(fasterKv.Log.BeginAddress, fasterKv.Log.TailAddress)) {
        while (iter.GetNext(out var recordInfo) && !recordInfo.Tombstone) {
            var spanByteAndMemory = iter.GetKey().Deserialize().ToSpanByteAndMemory();
            keyConverter.FromSpanByte(ref spanByteAndMemory, out var key);
            if (!isDeleted(in key)) {
                if (!firstDataAddress.HasValue) {
                    firstDataAddress = fasterKv.Log.TailAddress;
                }
                session.Upsert(ref iter.GetKey(), ref iter.GetValue());
                alive++;
            } else {
                session.Delete(ref iter.GetKey());
            }
        }
    }
    logger.Info("{count} alive cache entries found", alive);
}
finally {
    ReleasePooledSession(session);
}
fasterKv.Log.ShiftBeginAddress(Math.Min(fasterKv.Log.SafeReadOnlyAddress, Math.Max(firstDataAddress.GetValueOrDefault(), fasterKv.Log.SafeReadOnlyAddress - (maxSizeBytes - maxSizeBytes / 5))), true);

This is running concurrently to other operations on the log in a ASP.NET MVC application.

On the server under load we get sporadic (but pretty frequent) AccessViolationException in the iterator's GetNext:

Application: w3wp.exe
Framework Version: v4.0.30319
Description: The process was terminated due to an unhandled exception.
Exception Info: System.AccessViolationException
   at FASTER.core.SpanByteVarLenStruct.GetLength(FASTER.core.SpanByte ByRef)
   at FASTER.core.VariableLengthBlittableAllocator`2[[FASTER.core.SpanByte, FASTER.core, Version=1.0.0.0, Culture=neutral, PublicKeyToken=eb19722ac09e9af2],[FASTER.core.SpanByte, FASTER.core, Version=1.0.0.0, Culture=neutral, PublicKeyToken=eb19722ac09e9af2]].GetRecordSize(Int64)
   at FASTER.core.VariableLengthBlittableScanIterator`2[[FASTER.core.SpanByte, FASTER.core, Version=1.0.0.0, Culture=neutral, PublicKeyToken=eb19722ac09e9af2],[FASTER.core.SpanByte, FASTER.core, Version=1.0.0.0, Culture=neutral, PublicKeyToken=eb19722ac09e9af2]].GetNext(FASTER.core.RecordInfo ByRef)
   at Seram.Web.Cache.FasterCache`2[[Seram.Area.Indicators.Values.ValueCacheKey, Seram.Area.Indicators, Version=2.2790.8356.3393, Culture=neutral, PublicKeyToken=null],[System.__Canon, mscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089]].Compact(Seram.Web.Cache.KeyPredicate`1<Seram.Area.Indicators.Values.ValueCacheKey>)
   at Seram.Area.Indicators.CacheMaintenance+<CacheCleanupInternal>d__2.MoveNext()
   at System.Runtime.CompilerServices.AsyncTaskMethodBuilder.Start[[Seram.Area.Indicators.CacheMaintenance+<CacheCleanupInternal>d__2, Seram.Area.Indicators, Version=2.2790.8356.3393, Culture=neutral, PublicKeyToken=null]](<CacheCleanupInternal>d__2 ByRef)
   at Seram.Area.Indicators.CacheMaintenance.CacheCleanupInternal(Sirius.Mvc.IServiceLocator, Boolean, System.Threading.CancellationToken)
   at Seram.Area.Indicators.CacheMaintenance.CacheCleanup(Sirius.Mvc.IServiceLocator, System.Threading.CancellationToken)
   at Seram.Area.TenantManager.MultiTenantProvider+<>c__DisplayClass9_1+<<PerformMaintenance>b__0>d.MoveNext()
   at System.Runtime.CompilerServices.AsyncTaskMethodBuilder.Start[[Seram.Area.TenantManager.MultiTenantProvider+<>c__DisplayClass9_1+<<PerformMaintenance>b__0>d, Seram.Area.TenantManager, Version=2.2790.8356.3394, Culture=neutral, PublicKeyToken=null]](<<PerformMaintenance>b__0>d ByRef)
   at Seram.Area.TenantManager.MultiTenantProvider+<>c__DisplayClass9_1.<PerformMaintenance>b__0()
   at System.Threading.Tasks.Task`1[[System.__Canon, mscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089]].InnerInvoke()
   at System.Threading.Tasks.Task.Execute()
   at System.Threading.ExecutionContext.RunInternal(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object, Boolean)
   at System.Threading.ExecutionContext.Run(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object, Boolean)
   at System.Threading.Tasks.Task.ExecuteWithThreadLocal(System.Threading.Tasks.Task ByRef)
   at System.Threading.Tasks.Task.ExecuteEntry(Boolean)
   at System.Threading.ThreadPoolWorkQueue.Dispatch()

Are we somehow causing this by inadequate memory-related code or is this a problem of FASTER?

badrishc commented 1 year ago

not sure, we have never seen this issue on our side. if you can provide a stand alone repro we will be able to diagnose it further, thanks.

badrishc commented 1 year ago

i think 2.1.0 should also be upgraded to as it does contain some important fixes.

avonwyss commented 1 year ago

Thank you for the reply. As I noted the problem occurred only under load, e.g. with a high degree of concurrency read and write while the scan was started. We also noticed memory runaway (with an OoO exception after more than 60GB were used) and really poor performance.

Upon further inspection it seemed that TakeHybridLogCheckpointAsync() was performing poorly when invoked during high load and that the scan may have been failing if the TakeHybridLogCheckpointAsync() was still running in the background. At least the issue seems to be gone now that we are only taking the checkpoint when SystemState.Phase == Phase.REST, e.g. avoiding the checkpoint under load.

avonwyss commented 1 year ago

@badrishc We have now upgraded to 2.2.0 now but still face issues related to scanning and snapshotting.

For one, when upsert is done as per the code above, the memory grows since there seems to be a pin to memory which is not released, thus the GC cannot work:

image

image

All of these are sparse (only zeros) byte[] arrays of 32MB in size.

However, when I remove the upsert call and do the snapshot, odd things start happening. At some point in time I get the following exception when accessing the data:

System.NullReferenceException: Object reference not set to an instance of an object.
   at FASTER.core.ClientSession`6.InternalFasterSession.TryLockEphemeralExclusive(RecordInfo& recordInfo)
   at FASTER.core.FasterKV`2.InternalUpsert[Input,Output,Context,FasterSession](Key& key, Input& input, Value& value, Output& output, Context& userContext, PendingContext`3& pendingContext, FasterSession fasterSession, FasterExecutionContext`3 sessionCtx, Int64 lsn)
   at FASTER.core.FasterKV`2.UpsertAsync[Input,Output,Context](IFasterSession`5 fasterSession, FasterExecutionContext`3 currentCtx, PendingContext`3& pcontext, Key& key, Input& input, Value& value, Context userContext, Int64 serialNo, CancellationToken token)
   at FASTER.core.ClientSession`6.UpsertAsync(Key& key, Input& input, Value& desiredValue, Context userContext, Int64 serialNo, CancellationToken token)
   at FASTER.core.ClientSession`6.UpsertAsync(Key& key, Value& desiredValue, Context userContext, Int64 serialNo, CancellationToken token)
   at Seram.Web.Cache.FasterCache`2.<UpsertAsync>d__25.MoveNext() in C:\Hg\Seram.Web\Seram.Web.Shared\Cache\FasterCache.cs:line 133
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at System.Runtime.CompilerServices.ConfiguredValueTaskAwaitable`1.ConfiguredValueTaskAwaiter.GetResult()
   at Seram.Area.Indicators.Values.ValueContext.<>c__DisplayClass96_0.<AddValueToCache>g__UpsertContinuation|0() in C:\Hg\Seram.Web\Seram.Area.Indicators\Values\ValueContext.cs:line 437

From there on, the FASTER log seems to be corrupted. Even just querying the count causes an exception:

System.IndexOutOfRangeException: Index was outside the bounds of the array.
   at FASTER.core.FasterKV`2.GetEntryCount() in D:\a\1\s\cs\src\core\Index\FASTER\FASTER.cs:line 866

...and the snapshotted data cannot be loaded from disk (also crashes FASTER).

avonwyss commented 1 year ago

...and we still get AVs:

System.AccessViolationException
   at FASTER.core.SpanByteVarLenStruct.GetLength(FASTER.core.SpanByte ByRef)
   at FASTER.core.VariableLengthBlittableAllocator`2[[FASTER.core.SpanByte, FASTER.core, Version=1.0.0.0, Culture=neutral, PublicKeyToken=eb19722ac09e9af2],[FASTER.core.SpanByte, FASTER.core, Version=1.0.0.0, Culture=neutral, PublicKeyToken=eb19722ac09e9af2]].GetRecordSize(Int64)
   at FASTER.core.VariableLengthBlittableScanIterator`2[[FASTER.core.SpanByte, FASTER.core, Version=1.0.0.0, Culture=neutral, PublicKeyToken=eb19722ac09e9af2],[FASTER.core.SpanByte, FASTER.core, Version=1.0.0.0, Culture=neutral, PublicKeyToken=eb19722ac09e9af2]].GetNext(FASTER.core.RecordInfo ByRef)
   at Seram.Web.Cache.FasterCache`2[[Seram.Area.Indicators.Values.ValueCacheKey, Seram.Area.Indicators, Version=2.2804.8409.20143, Culture=neutral, PublicKeyToken=null],[System.__Canon, mscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089]].Compact(Seram.Web.Cache.KeyPredicate`1<Seram.Area.Indicators.Values.ValueCacheKey>, Boolean)
   at Seram.Area.Indicators.CacheMaintenance+<CacheCleanupInternal>d__2.MoveNext()