microsoft / durabletask-netherite

A new engine for Durable Functions. https://microsoft.github.io/durabletask-netherite

Partitions become unresponsive #342

Closed alexjebens closed 4 months ago

alexjebens commented 8 months ago

We are frequently seeing some of our partitions becoming unresponsive in the partitions table.

The partition seems unable to start. Deleting the partitions eventhub resolves the problem but all pending messages are lost, which is not acceptable for this application.

Microsoft.Azure.DurableTask.Netherite.AzureFunctions: 1.4.1
Microsoft.Azure.WebJobs.Extensions.DurableTask: 2.13.0
Microsoft.NET.Sdk.Functions: 4.2.0

Hosting: EP2 with scaling to 2

@sebastianburckhardt I sent you an email with details on the function.

alexjebens commented 8 months ago

I have a theory: is it possible that the eh-checkpoints become desynchronized from the FASTER storage state? If so, would it be possible to implement a checkpoint provider that utilises FASTER?

alexjebens commented 7 months ago

I have an exception:

An attempt was made to move the position before the beginning of the stream.

at System.IO.MemoryStream.Seek(Int64 offset, SeekOrigin loc)
at FASTER.core.GenericAllocator`2.Deserialize(Byte* raw, Int64 ptr, Int64 untilptr, Record`2[] src, Stream stream)
at FASTER.core.GenericAllocator`2.AsyncReadPageWithObjectsCallback[TContext](UInt32 errorCode, UInt32 numBytes, Object context)
at DurableTask.Netherite.Faster.AzureStorageDevice.<>c__DisplayClass36_0.b0(Task t) in C:\Users\A60800\src\durabletask-netherite\src\DurableTask.Netherite\StorageLayer\Faster\AzureBlobs\AzureStorageDevice.cs:line 399
at System.Threading.Tasks.ContinuationTaskFromTask.InnerInvoke()
at System.Threading.Tasks.Task.<>c.<.cctor>b272_0(Object obj)
at System.Threading.ExecutionContext.RunInternal(ExecutionContext executionContext, ContextCallback callback, Object state)

alexjebens commented 7 months ago

This is not resolved by integrating the changes in FASTER.core from #343.

alexjebens commented 7 months ago

It is specifically this line in FASTER.core

The value of key_addr->Address is 0, which leads to a negative seek position. The other values are streamStartPos = 368 and start_addr = 37587456.
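To make these numbers concrete, here is a minimal Python sketch of the arithmetic. It assumes the seek position is computed as streamStartPos + key_addr->Address - start_addr, which is consistent with the values reported here but is not quoted from FASTER.core. The function name seek_offset is made up for illustration, and io.BytesIO stands in for the .NET MemoryStream.

```python
import io


def seek_offset(stream_start_pos: int, key_address: int, start_addr: int) -> int:
    """Assumed formula for the absolute stream position of a record's key object."""
    return stream_start_pos + key_address - start_addr


# Values reported in this thread; key_address = 0 is the suspicious field.
offset = seek_offset(stream_start_pos=368, key_address=0, start_addr=37587456)
print(offset)  # -37587088

# io.BytesIO, like MemoryStream, refuses to seek before the start of the stream.
stream = io.BytesIO(bytes(1024))
try:
    stream.seek(offset)
    seek_failed = False
except ValueError:
    seek_failed = True
print(seek_failed)  # True
```

Under the same assumed formula, a key address equal to start_addr would give a position of 368, so any record whose address field reads as 0 while start_addr is large ends up before the beginning of the stream.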

alexjebens commented 7 months ago

Manually updating to FASTER.core 2.0.23 did not prevent further corruption.

sebastianburckhardt commented 6 months ago

Thanks for reporting this. We currently suspect that this corruption is caused by a bug that was fixed in the latest FASTER. We are therefore prioritizing updating Netherite to use the latest FASTER version. This work is tracked here: https://github.com/microsoft/durabletask-netherite/pull/344.

sebastianburckhardt commented 4 months ago

We have released 1.5.0, which contains a much newer version of FASTER. We think this will fix the corruption of FASTER object files. Let us know if you still see these symptoms with version 1.5.0.