Closed — nkorai closed this issue 5 years ago
man that sucks.. :/ we also do backup/restore like that, but mostly with actor services.. maybe you can try compiling from the SF source and see if you can hack around it somehow? that backup API they've been talking about sure would have been useful here..
If this is critical, please open a support ticket with Microsoft Azure support for quick assistance.
We have an open ticket with support trying to deal with this. We are generating a Unicode string out of a random byte array and using that as the key in the ReliableDictionary. This seems to be a possible cause of the issue, but they are still looking into it. If nothing else, we just want to recover the data in those backups, and yeah @aL3891, that backup API would have been very useful here to do that.
Also, as far as I am aware there isn't any documentation about avoiding certain characters when you use a plain string as the key of a reliable dictionary, or about the potential for that to break the entire replication/restoration mechanism. Apparently null characters ('\0') are a no-go; I don't know what other characters would do this too.
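To illustrate what I mean, here is a simplified sketch of the pattern (not our exact code; the byte length and the Base64 alternative at the end are just examples):

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

class KeyGenerationSketch
{
    static void Main()
    {
        // Simplified version of the pattern described above: random bytes
        // decoded as UTF-16 and used directly as a ReliableDictionary key.
        var bytes = new byte[16];
        using (var rng = RandomNumberGenerator.Create())
        {
            rng.GetBytes(bytes);
        }

        string riskyKey = Encoding.Unicode.GetString(bytes);

        // Random bytes can decode to any UTF-16 code unit, including '\0'
        // and unpaired surrogates, so some generated keys will contain
        // characters the store apparently cannot handle.
        Console.WriteLine(riskyKey.IndexOf('\0') >= 0);

        // One possible workaround (our assumption, not something support told us):
        // Base64-encode the bytes so the key stays within printable ASCII.
        string safeKey = Convert.ToBase64String(bytes);
        Console.WriteLine(safeKey);
    }
}
```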
Support was able to patch our backup checkpoint file and get it back to us. They weren't able to track down the issue exactly, but said one possibility could have been .GetHashCode and .Equals returning inconsistent results when a null character ('\0') was involved. They said they'll be addressing this known bug. That's all I've got, folks; I'll be closing this issue now.
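For anyone hitting this later, here's a tiny illustrative sketch of what I understand that hypothesis to mean (my own interpretation, not confirmed by support): depending on the OS's collation behavior, a culture-sensitive comparison may treat an embedded '\0' as ignorable and report two strings as equal, while ordinal equality and hashing see two different strings.

```csharp
using System;

class NullCharComparisonSketch
{
    static void Main()
    {
        string a = "session-key";
        string b = "session-key\0"; // same text with a trailing null character

        // Culture-sensitive comparison may treat '\0' as an ignorable
        // character and report 0 (equal) on some platforms.
        Console.WriteLine(string.Compare(a, b, StringComparison.CurrentCulture));

        // Ordinal comparison, equality, and hashing all see two different strings.
        Console.WriteLine(string.Compare(a, b, StringComparison.Ordinal));
        Console.WriteLine(a.Equals(b));
        Console.WriteLine(a.GetHashCode() == b.GetHashCode());
    }
}
```

If a store orders keys with one notion of comparison but hashes or checks equality with another, a key containing '\0' could plausibly trip an "input key must be greater than or equal to last key processed" check during recovery.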
Does anyone know if this bug has been fixed? I just hit the exact same stack trace, but while running a regular application upgrade.
We found that addressing the bug would not be a backwards compatible change and would break existing apps, so we decided not to go through with it.
As mentioned earlier, if this is critical, please open a support ticket with Microsoft Azure support for quick assistance.
We just had an issue in our production cluster with some pretty catastrophic consequences. Our stateful service in charge of all authentication encountered a replication issue, which caused it to get "stuck". After some debugging we concluded it was causing enough downtime that it was better to delete the service and attempt to restore from a backup.
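For reference, the restore we attempted goes through the standard OnDataLossAsync path, roughly like this (the folder path and restore policy here are placeholders, not our exact code):

```csharp
using System.Fabric;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.ServiceFabric.Data;
using Microsoft.ServiceFabric.Services.Runtime;

// Rough sketch of the restore path that fails with the stack trace below.
internal sealed class AuthService : StatefulService
{
    public AuthService(StatefulServiceContext context)
        : base(context) { }

    protected override async Task<bool> OnDataLossAsync(
        RestoreContext restoreCtx, CancellationToken cancellationToken)
    {
        // Placeholder: wherever the backup folder was copied onto the node.
        string backupFolderPath = @"D:\Backups\AuthService\latest";

        // RestorePolicy.Force skips the newer-data safety check so an older
        // backup can overwrite the current (stuck/corrupted) state.
        var description = new RestoreDescription(backupFolderPath, RestorePolicy.Force);
        await restoreCtx.RestoreAsync(description, cancellationToken);

        return true; // state was restored from the backup
    }
}
```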
From the event log the error seems to be coming from here: https://github.com/microsoft/service-fabric/blob/6f2a641bdfe4b7e7100780cd4a4473f8eda30d5c/src/prod/src/managed/Microsoft.ServiceFabric.Data.Impl/ReplicatedStore/DifferentialStore/RecoveryStoreComponent.cs
Or at the very least, it is one of these files: https://github.com/microsoft/service-fabric/search?q=input+key+must+be+greater+than+or+equal+to+last+key+processed&unscoped_q=input+key+must+be+greater+than+or+equal+to+last+key+processed
This issue is still affecting us and is causing an outage for a percentage of our clients, so any help here would be appreciated. If we could even mount the backups locally using the backup viewing client that was alluded to a few months ago, we could at least pull the data out, but right now we're stuck.