microsoft / FASTER

Fast persistent recoverable log and key-value store + cache, in C# and C++.
https://aka.ms/FASTER
MIT License

Deserializing page content despite errorCode != 0? #883

Open sebastianburckhardt opened 10 months ago

sebastianburckhardt commented 10 months ago

While investigating failures of Netherite in customer code (see here), I noticed a stack trace in which FASTER threw an OutOfMemoryException during shutdown. That is surprising: at that point all outstanding memory operations were just being cancelled, so I was not expecting any OOMs to be thrown.

System.OutOfMemoryException: Exception of type 'System.OutOfMemoryException' was thrown.
   at System.IO.BinaryReader.ReadBytes(Int32 count)
   at DurableTask.Netherite.Faster.FasterKV.Value.Serializer.Deserialize(Value& obj) in //src/DurableTask.Netherite/StorageLayer/Faster/FasterKV.cs:line 1594
   at FASTER.core.GenericAllocator`2.Deserialize(Byte* raw, Int64 ptr, Int64 untilptr, Record`2[] src, Stream stream)
   at FASTER.core.GenericAllocator`2.AsyncReadPageWithObjectsCallback[TContext](UInt32 errorCode, UInt32 numBytes, Object context)
   at DurableTask.Netherite.Faster.AzureStorageDevice.CancelAllRequests() in //src/DurableTask.Netherite/StorageLayer/Faster/AzureBlobs/AzureStorageDevice.cs:line 246
   at System.Threading.CancellationToken.<>c.b__12_0(Object obj)
   at System.Threading.CancellationTokenSource.Invoke(Delegate d, Object state, CancellationTokenSource source)
   at System.Threading.CancellationTokenSource.CallbackNode.<>c.b__9_0(Object s)
   at System.Threading.ExecutionContext.RunInternal(ExecutionContext executionContext, ContextCallback callback, Object state)
--- End of stack trace from previous location ---
   at System.Threading.ExecutionContext.RunInternal(ExecutionContext executionContext, ContextCallback callback, Object state)
   at System.Threading.CancellationTokenSource.CallbackNode.ExecuteCallback()
   at System.Threading.CancellationTokenSource.ExecuteCallbackHandlers(Boolean throwOnFirstException)

Taking a closer look at AsyncReadPageWithObjectsCallback, I can see that errorCode is essentially ignored: it is logged, but nothing else depends on it. I don't understand why it is OK for this code to read and deserialize the results when the callback is invoked for a cancellation, i.e. when the read never completed.
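For illustration, here is a minimal, self-contained sketch of the alternative behavior the question implies: a completion callback that forwards a nonzero errorCode and never touches the buffer. It is not FASTER code. `ReadResult`, `GuardedCallbackSketch`, `OnReadCompleted`, and the length-prefixed payload format are hypothetical stand-ins. The sketch assumes that after a failed or cancelled read the buffer contents are undefined; under that assumption, a garbage length prefix passed to `BinaryReader.ReadBytes` is one plausible way to end up with an OutOfMemoryException, since that method allocates an array of the requested size. The actual FASTER callback is quoted after the sketch.

```csharp
using System;
using System.IO;

// Hypothetical stand-in for a pending page read; not a FASTER type.
public sealed class ReadResult
{
    public byte[] Buffer = new byte[0];
    public byte[] Payload;                         // stays null until deserialized
    public Action<uint, uint> Callback = (e, n) => { };
}

public static class GuardedCallbackSketch
{
    // Completion callback that bails out before deserializing
    // whenever the underlying read reported an error.
    public static void OnReadCompleted(uint errorCode, uint numBytes, ReadResult result)
    {
        if (errorCode != 0)
        {
            // Assumption: the buffer was never filled, so its bytes are not
            // valid serialized data. Forward the error and stop here.
            result.Callback(errorCode, numBytes);
            return;
        }

        using (var ms = new MemoryStream(result.Buffer, 0, (int)numBytes))
        using (var reader = new BinaryReader(ms))
        {
            int length = reader.ReadInt32();           // little-endian length prefix
            result.Payload = reader.ReadBytes(length); // allocates 'length' bytes
        }

        result.Callback(errorCode, numBytes);
    }
}
```

Calling `GuardedCallbackSketch.OnReadCompleted(1, 0, result)` hands the error code to `result.Callback` and leaves `result.Payload` null. Whether an early return like this would be safe in the real allocator depends on how it tracks `freeBuffer1`, `freeBuffer2`, and partially processed pages, which the sketch does not model.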

private void AsyncReadPageWithObjectsCallback<TContext>(uint errorCode, uint numBytes, object context)
{
    if (errorCode != 0)
    {
        logger?.LogError($"AsyncReadPageWithObjectsCallback error: {errorCode}");
    }

    PageAsyncReadResult<TContext> result = (PageAsyncReadResult<TContext>)context;

    Record<Key, Value>[] src;

    // We are reading into a frame
    if (result.frame != null)
    {
        var frame = (GenericFrame<Key, Value>)result.frame;
        src = frame.GetPage(result.page % frame.frameSize);
    }
    else
        src = values[result.page % BufferSize];

    // Deserialize all objects until untilptr
    if (result.resumePtr < result.untilPtr)
    {
        MemoryStream ms = new(result.freeBuffer2.buffer);
        ms.Seek(result.freeBuffer2.offset, SeekOrigin.Begin);
        Deserialize(result.freeBuffer1.GetValidPointer(), result.resumePtr, result.untilPtr, src, ms);
        ms.Dispose();

        result.freeBuffer2.Return();
        result.freeBuffer2 = null;
        result.resumePtr = result.untilPtr;
    }

    // If we have processed entire page, return
    if (result.untilPtr >= result.maxPtr)
    {
        result.Free();

        // Call the "real" page read callback
        result.callback(errorCode, numBytes, context);
        return;
    }