microsoft / service-fabric-issues

This repo is for the reporting of issues found with Azure Service Fabric.

Upgrade service error after expiration of already replaced certificate #1435

Closed tuhland closed 5 years ago

tuhland commented 5 years ago

We replaced our cluster certificates because the old ones were about to expire. We updated the thumbprints, switched to common name where applicable, and redeployed the ARM template. Everything worked as intended. When the old certificate finally expired, the upgrade service started reporting the following error:

Error event: SourceId='UpgradeService.Primary', Property='SFRPPoll'. Exception encountered:
System.UnauthorizedAccessException: Certificates EXPIRED CERT THUMBPRINT not authorized to access URL https://germanycentral.servicefabric.microsoftazure.de/runtime/clusters/*GUID*?api-version=6.2
   at System.Fabric.UpgradeService.RestClientHelper.d1.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at System.Fabric.UpgradeService.WrpGatewayClient.d12`2.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at System.Fabric.UpgradeService.WrpPackageRetriever.d4.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at System.Fabric.UpgradeService.ResourceCoordinator.d13.MoveNext()

It seems like some resource is still using the old certificate thumbprint, even though we replaced them all in the ARM template.

The old certificate also still worked for client authentication after we replaced it.

Could it be that there are resources which were created automatically, are not part of our initial ARM template, and have a certificate configuration that needs to be updated as well?

tuhland commented 5 years ago

We still have this issue. The applications run as expected and we can update them, but the error persists.

Does anyone have an idea what it could mean, or if we can 'safely' ignore it for now? We have to move from Azure Germany to Azure proper this year, so we have to create new clusters anyway.

gblmarquez commented 5 years ago

@tuhland @motanv We are experiencing the same issue here. The only difference is that we changed from thumbprint to a common-name certificate shortly before expiration.

Have you found any solution that you could share with us?

Thanks in advance!

tuhland commented 5 years ago

@gblmarquez We still have this issue. We also made the change from thumbprint to common name.

What I can tell from having this issue for some time: it seems to be whichever Azure component is responsible for upgrading the SF runtime of the cluster. It presents the old, expired thumbprint/certificate and gets "Access Denied", but then either retries continuously or never fails the upgrade. Our clusters have been stuck in the "Updating" state for weeks now.

We currently have a support case running. I will update this issue if that leads to a solution.

gblmarquez commented 5 years ago

Thanks for the update.

In our case the Service Fabric cluster is in a failure state.

Did you have any impact on the production environment because of this issue?

May I ask you to share the support channel that you used to open the incident case?

dragav commented 5 years ago

With apologies for the belated reply - this fell off my radar:

@tuhland The Fabric Upgrade Service is responsible for syncing with, and fetching upgrade information from, the SF Resource Provider service. It does, indeed, pick up a certificate which may be expired/not time-valid; this has been fixed in SF version 6.4.644.9590. To mitigate the issue, please delete the expired certificate from all nodes of the cluster.

To change the certificate declaration from thumbprint to common name, I recommend having a look at this document: convert existing Azure SF cluster to common name. Please ensure the certificate intended to be used by CN is installed on all nodes before making the switch.

Regarding customer support: there are ways of escalating support cases to the engineering team. If you're getting stuck or have no traction, you can ping me directly (dragosav at microsoft dot com.)

gblmarquez commented 5 years ago

@dragav Thanks for the kind reply :)

We followed this tutorial line by line.

In our case the "new" certificate is installed on all nodes; we confirmed this manually.

The message we receive is the same, but the certificate thumbprint it reports is from the "new" certificate, which uses the common name.

We are afraid of starting to have other issues and downtime in our production services. Do you have any other ideas? No other upgrades to Service Fabric are being processed, because the upgrade service is in an error state and so is the cluster.

Below is the message we are receiving:
UpgradeService.Primary reported Error for property 'SFRPPoll'. Exception encountered: System.UnauthorizedAccessException: Certificates THUMBPRINT_HERE not authorized to access URL https://westus.servicefabric.azure.com/runtime/clusters/id_here?api-version=6.5

dragav commented 5 years ago

@gblmarquez if the 'new' thumbprint is rejected by the SF Resource Provider (the server of the SFRPPoll operation), that seems to indicate that the 'current' cluster configuration (the one from which the upgrade started) does not list it as either primary or secondary.

It's difficult to make a definitive statement, though, without having the details of your cluster. If you can, please raise an issue through your customer support representative. (For instance, via Azure Support.) If that's not an option, feel free to reach out privately to me at the aforementioned address, so we can take this offline.

tuhland commented 5 years ago

@dragav It looks like that solved it for us. I logged into all VMs and removed the expired certificate from the machine store. The errors went away and nodes are going down and getting updated now.

So it basically came down to a "dumb" algorithm selecting the certificates. It takes the first one matching the common name, without taking expiration into consideration ;)
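The selection flaw described above can be illustrated with a minimal Python sketch. This is not the actual upgrade service code; the `Cert` type and the store are hypothetical stand-ins, showing the difference between "first match on common name" and "newest time-valid match":

```python
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass
class Cert:
    common_name: str
    thumbprint: str
    not_after: datetime  # expiration timestamp

def pick_cert_naive(store, cn):
    # First store entry matching the CN -- ignores expiration,
    # so a stale, expired certificate can win.
    return next((c for c in store if c.common_name == cn), None)

def pick_cert_valid(store, cn, now=None):
    # Only time-valid matches, preferring the latest expiry.
    now = now or datetime.now(timezone.utc)
    valid = [c for c in store if c.common_name == cn and c.not_after > now]
    return max(valid, key=lambda c: c.not_after) if valid else None

store = [
    Cert("mycluster.example.com", "OLD", datetime(2019, 1, 1, tzinfo=timezone.utc)),
    Cert("mycluster.example.com", "NEW", datetime(2030, 1, 1, tzinfo=timezone.utc)),
]
```

With this store, `pick_cert_naive` returns the expired "OLD" certificate, while `pick_cert_valid` returns "NEW", which is why deleting the expired certificate from the node store also fixes the naive behavior.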

I'll keep an eye on it for a while to see if the certificate reappears, because I think I had already deleted it once some time ago.

EDIT: My hunch was correct, the certificate reappeared once on one of the clusters and I had to delete it again.

dragav commented 5 years ago

@tuhland you're probably provisioning certs to the VMSS using the 'vaultCertificates' section under osProfile in your VMSS definition. Please check the certificate URLs for stale versions. Any certificate specified there (that you delete manually) will be re-provisioned onto the VM upon rebooting.
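For reference, this is roughly the fragment to check in the VMSS resource of the ARM template; the vault ID, secret name, and version GUID below are placeholders. If `certificateUrl` still points at the old certificate version, the expired certificate will keep coming back:

```json
"osProfile": {
  "secrets": [
    {
      "sourceVault": {
        "id": "[parameters('sourceVaultResourceId')]"
      },
      "vaultCertificates": [
        {
          "certificateUrl": "https://myvault.vault.azure.net/secrets/mycert/OLD-VERSION-GUID",
          "certificateStore": "My"
        }
      ]
    }
  ]
}
```

Updating the URL to the current certificate version (or removing the stale entry) stops the re-provisioning on reboot.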

As for the certificate selection logic: you are correct, that is essentially what happened. This affected the upgrade service only; you can rest assured that the runtime forming your cluster selects certificates correctly. I recommend you upgrade to at least version 6.4.644.