microsoft / service-fabric-issues

This repo is for the reporting of issues found with Azure Service Fabric.

Unable to restore from backup #1537

Closed nkorai closed 5 years ago

nkorai commented 5 years ago

We just had an issue in our production cluster that had some pretty catastrophic consequences. Our stateful service in charge of all authentication encountered an issue with replication, which caused it to be "stuck". After some debugging we concluded that it was causing enough downtime that it was better to delete the service and attempt to restore from a backup. The event log shows the following:

Application: Rics.AlphaService.exe
Framework Version: v4.0.30319
Description: The application requested process termination through System.Environment.FailFast(string message).
Message: input key must be greater than or equal to last key processed.
Stack:
   at System.Environment.FailFast(System.String, System.Exception)
   at Microsoft.ServiceFabric.Data.ReplicatedStore.DifferentialStore.RecoveryStoreComponent`5[[System.__Canon, mscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089],[System.__Canon, mscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089],[System.__Canon, mscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089],[System.__Canon, mscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089],[System.__Canon, mscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089]].AddOrUpdate(System.__Canon, System.Fabric.Store.TVersionedItem`1<System.__Canon>, Int64)
   at Microsoft.ServiceFabric.Data.ReplicatedStore.DifferentialStore.RecoveryStoreComponent`5+<MergeAsync>d__19[[System.__Canon, mscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089],[System.__Canon, mscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089],[System.__Canon, mscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089],[System.__Canon, mscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089],[System.__Canon, mscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089]].MoveNext()
   at System.Threading.ExecutionContext.RunInternal(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object, Boolean)
   at System.Threading.ExecutionContext.Run(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object, Boolean)
   at System.Runtime.CompilerServices.AsyncMethodBuilderCore+MoveNextRunner.Run()
   at System.Threading.Tasks.AwaitTaskContinuation.RunOrScheduleAction(System.Action, Boolean, System.Threading.Tasks.Task ByRef)
   at System.Threading.Tasks.Task.FinishContinuations()
   at System.Threading.Tasks.Task`1[[System.Boolean, mscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089]].TrySetResult(Boolean)
   at System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1[[System.Boolean, mscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089]].SetResult(Boolean)
   at System.Fabric.Store.KeyCheckpointFileAsyncEnumerator`2+<MoveNextAsync>d__24[[System.__Canon, mscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089],[System.__Canon, mscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089]].MoveNext()
   at System.Threading.ExecutionContext.RunInternal(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object, Boolean)
   at System.Threading.ExecutionContext.Run(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object, Boolean)
   at System.Runtime.CompilerServices.AsyncMethodBuilderCore+MoveNextRunner.Run()
   at System.Threading.Tasks.AwaitTaskContinuation.RunOrScheduleAction(System.Action, Boolean, System.Threading.Tasks.Task ByRef)
   at System.Threading.Tasks.Task.FinishContinuations()
   at System.Threading.Tasks.Task`1[[System.Boolean, mscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089]].TrySetResult(Boolean)
   at System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1[[System.Boolean, mscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089]].SetResult(Boolean)
   at System.Fabric.Store.KeyCheckpointFileAsyncEnumerator`2+<ReadChunkAsync>d__25[[System.__Canon, mscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089],[System.__Canon, mscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089]].MoveNext()
   at System.Threading.ExecutionContext.RunInternal(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object, Boolean)
   at System.Threading.ExecutionContext.Run(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object, Boolean)
   at System.Runtime.CompilerServices.AsyncMethodBuilderCore+MoveNextRunner.Run()
   at System.Threading.Tasks.AwaitTaskContinuation.RunOrScheduleAction(System.Action, Boolean, System.Threading.Tasks.Task ByRef)
   at System.Threading.Tasks.Task.FinishContinuations()
   at System.Threading.Tasks.Task`1[[System.Int32, mscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089]].TrySetResult(Int32)
   at System.IO.FileStream.EndReadTask(System.IAsyncResult)
   at System.IO.FileStreamAsyncResult.AsyncFSCallback(UInt32, UInt32, System.Threading.NativeOverlapped*)
   at System.Threading._IOCompletionCallback.PerformIOCompletionCallback(UInt32, UInt32, System.Threading.NativeOverlapped*)

From the event log the error seems to be coming from here: https://github.com/microsoft/service-fabric/blob/6f2a641bdfe4b7e7100780cd4a4473f8eda30d5c/src/prod/src/managed/Microsoft.ServiceFabric.Data.Impl/ReplicatedStore/DifferentialStore/RecoveryStoreComponent.cs

Or at the very least it is one of these files: https://github.com/microsoft/service-fabric/search?q=input+key+must+be+greater+than+or+equal+to+last+key+processed&unscoped_q=input+key+must+be+greater+than+or+equal+to+last+key+processed

This issue is still affecting us and is causing an outage for a percentage of our clients, so any help here would be appreciated. If we could even mount the backups locally using the backup viewing client that was alluded to a few months ago, we could at least pull the data out, but right now we're stuck.

aL3891 commented 5 years ago

Man, that sucks.. :/ We also do backup/restore like that, but mostly with actor services. Maybe you can try compiling from the SF source and see if you can hack around it somehow? That backup API they've been talking about sure would have been useful here.

raunakpandya commented 5 years ago

If this is critical, please open a support ticket with Microsoft Azure support for quick assistance.

nkorai commented 5 years ago

We have an open ticket with support trying to deal with this. We are generating a Unicode string out of a random byte array and using that as the key in the ReliableDictionary. This seems to be a possible cause of the issue but they are still looking into it. If nothing else we just want to recover the data in those backups, and yeah @aL3891 that backup API would have been very useful here to do that.

Also, as far as I am aware, there isn't any documentation about avoiding certain characters when you use a plain string as the key of a ReliableDictionary, or about the potential for that to break the entire replication/restoration mechanism. Apparently null characters ('\0') are a no-go; I don't know which other characters would cause this too.
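For anyone generating keys the same way, here is a minimal sketch of the difference between decoding random bytes as text and encoding them as hex. It is written in Python purely as a language-neutral illustration and does not use any Service Fabric API. The function names `unsafe_key`, `safe_key`, and `has_null` are hypothetical, the UTF-16 decoding is only an assumption about how such a key might be built, and hex encoding is a suggestion of this sketch, not something support recommended in this thread.

```python
import os


def unsafe_key(raw: bytes) -> str:
    """Decode raw bytes as UTF-16 text (assumed approach, for illustration).

    Any two-byte unit equal to 0x0000 becomes a '\\0' character in the key.
    """
    return raw.decode("utf-16-le", errors="replace")


def safe_key(raw: bytes) -> str:
    """Hex-encode raw bytes, so the key only contains 0-9 and a-f."""
    return raw.hex()


def has_null(key: str) -> bool:
    """Report whether the key contains a null character."""
    return "\0" in key


if __name__ == "__main__":
    # Fixed sample with an embedded zero code unit between 'A' and 'B'.
    sample = b"\x41\x00\x00\x00\x42\x00"
    print(has_null(unsafe_key(sample)))  # the decoded key contains '\0'
    print(has_null(safe_key(sample)))    # the hex key cannot contain '\0'

    # Random bytes behave the same way under hex encoding.
    print(safe_key(os.urandom(16)))
```

Hex (or Base64) keys are longer than the decoded text would be, but they stay within a small, printable alphabet, which sidesteps the null-character case described above. Whether other characters are also problematic is not established in this thread.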

nkorai commented 5 years ago

Support was able to patch our backup checkpoint file and get it back to us. They weren't able to track down the issue exactly, but they did say one possibility was .GetHashCode and .Equals returning different results when a null character \0 was involved. They said they'll be addressing this known bug. That's all I've got, folks. I'll be closing this issue now.

tastyeggs commented 5 years ago

Does anyone know if this bug has been fixed? I just hit the exact same stack trace, but while running a regular application upgrade.

zuhairp commented 5 years ago

We found that addressing the bug would not be a backwards compatible change and would break existing apps, so we decided not to go through with it.

As mentioned earlier, if this is critical, please open a support ticket with Microsoft Azure support for quick assistance.