alexjebens closed this 4 months ago
I have a theory: is it possible that the eh-checkpoints become desynchronized from the FASTER storage state? If so, would it not be possible to implement a checkpoint provider that utilises FASTER?
I have an exception:
An attempt was made to move the position before the beginning of the stream.
at System.IO.MemoryStream.Seek(Int64 offset, SeekOrigin loc)
at FASTER.core.GenericAllocator`2.Deserialize(Byte* raw, Int64 ptr, Int64 untilptr, Record`2[] src, Stream stream)
at FASTER.core.GenericAllocator`2.AsyncReadPageWithObjectsCallback[TContext](UInt32 errorCode, UInt32 numBytes, Object context)
at DurableTask.Netherite.Faster.AzureStorageDevice.<>c__DisplayClass36_0.
This is not resolved by integrating the changes in FASTER.core from #343.
It is specifically this line in FASTER.core
The value of key_addr->Address is 0, which leads to a negative seek offset. streamStartPos = 368, start_addr = 37587456
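To make the arithmetic concrete, here is a minimal sketch using the values reported above. It assumes the seek offset is computed as streamStartPos + key_addr->Address - start_addr, which is inferred from the variable names in this report and not confirmed against the FASTER.core source. The helper names `seek_target` and `can_seek` are hypothetical, and Python's `io.BytesIO` only stands in for the .NET `MemoryStream`, which likewise rejects a seek to a position before the start of the stream.

```python
import io


def seek_target(stream_start_pos: int, key_address: int, start_addr: int) -> int:
    """Assumed offset formula: stream start plus the record's address relative to start_addr."""
    return stream_start_pos + key_address - start_addr


def can_seek(target: int, size: int = 1024) -> bool:
    """Return True if an in-memory stream accepts an absolute seek to `target`."""
    stream = io.BytesIO(b"\x00" * size)
    try:
        stream.seek(target)  # absolute seek from the beginning of the stream
        return True
    except ValueError:
        # BytesIO refuses negative absolute positions.
        return False


# Values taken from the report; key_address = 0 is the suspicious one.
target = seek_target(stream_start_pos=368, key_address=0, start_addr=37587456)
print(target)            # -37587088
print(can_seek(target))  # False
```

Under this assumption, any record whose address is below start_addr - streamStartPos produces a negative offset, so an address of 0 always fails once start_addr exceeds 368.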
Manually updating to FASTER.core 2.0.23 did not prevent further corruption.
Thanks for reporting this. We currently suspect that this corruption is caused by a bug that was fixed in the latest FASTER. We are therefore prioritizing updating Netherite to use the latest FASTER version. This work is tracked here: https://github.com/microsoft/durabletask-netherite/pull/344.
We have released 1.5.0 which contains a much newer version of FASTER, which we think will fix the corruption of FASTER object files. Let us know if you still see these symptoms with version 1.5.0.
We are frequently seeing some of our partitions becoming unresponsive in the partitions table.
The partition seems unable to start. Deleting the partitions eventhub resolves the problem but all pending messages are lost, which is not acceptable for this application.
Microsoft.Azure.DurableTask.Netherite.AzureFunctions: 1.4.1
Microsoft.Azure.WebJobs.Extensions.DurableTask: 2.13.0
Microsoft.NET.Sdk.Functions: 4.2.0
Hosting: EP2 with scaling to 2
@sebastianburckhardt I sent you an email with details on the function.