microsoft / service-fabric-issues

This repo is for the reporting of issues found with Azure Service Fabric.

SF application upgrade fails if there are unhealthy services _prior_ to the upgrade #239

Closed alecor191 closed 7 years ago

alecor191 commented 7 years ago

Scenario

Summary

I have a cluster with an app deployed (with 5 services):

Details

I have a cluster with 3 nodes with an app deployed (5 services, 1 instance of each service). Service "Web" reports an "Error" to ASF (note that "Web" is hosted on Node 2):

screenshot001

Now I have my fix ready and perform a rolling upgrade. The screenshot below was taken just after kicking off the upgrade; as you can see, it is still upgrading Node 0. Also note that the error shown in the screenshot is still the one mentioned above, coming from service "Web" on Node 2:

screenshot002

As soon as the services on Node 0 are deployed, the upgrade process starts to roll back the upgrade (after HealthCheckRetryTimeoutSec expires), even though all services deployed to Node 0 are healthy:

screenshot004

Expected outcome

The upgrade proceeds across all 3 nodes. Once Node 2 is upgraded and the patched "Web" service is deployed, it reports that it is healthy and the upgrade completes successfully.

Actual outcome

The upgrade process gives up after upgrading Node 0, so the patched "Web" service never gets deployed. It appears that the error reported by the buggy "Web" service affects the upgrade process.

Upgrade configuration

Here are the settings I used for my upgrade:

I also tried different variations of the timeouts, but that didn't appear to help.

Am I missing something in this scenario? If the above behaves as designed, what's the recommended way to upgrade an application that reports "Error" in its health?

vipul-modi commented 7 years ago

Please go through the following documentation.

You can either: (a) change the application health policy during the upgrade, or (b) use manual upgrade mode, in which no health checks are performed, only availability checks for the stateful services.

https://docs.microsoft.com/en-us/azure/service-fabric/service-fabric-application-upgrade
https://docs.microsoft.com/en-us/azure/service-fabric/service-fabric-application-upgrade-parameters
https://docs.microsoft.com/en-us/azure/service-fabric/service-fabric-health-introduction
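
For illustration, a minimal sketch of option (b) in PowerShell; the application name, version, and upgrade domain name below are placeholders, not taken from this issue, and the new application version is assumed to be registered already:

# Sketch: unmonitored manual upgrade (no health checks are evaluated)
Connect-ServiceFabricCluster
Start-ServiceFabricApplicationUpgrade -ApplicationName "fabric:/MyApp" `
    -ApplicationTypeVersion "2.0.0" -UnmonitoredManual

# You verify each upgrade domain yourself, then move the upgrade forward explicitly
Resume-ServiceFabricApplicationUpgrade -ApplicationName "fabric:/MyApp" -UpgradeDomainName "UD1"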

masnider commented 7 years ago

Specifically, what Vipul is directing you towards is the concept of "Delta Health": you can determine how many nodes are unhealthy at the start of the upgrade and then mask them out. The idea is that the upgrade shouldn't make things worse, but also shouldn't be penalized for existing issues (since you may be trying to fix them).

Edit: I forgot that delta health only applies to cluster upgrades. You should look at the ServiceTypeHealthPolicyMap to define a threshold for the upgrade that is above the percentages that are currently unhealthy. https://docs.microsoft.com/en-us/powershell/module/servicefabric/start-servicefabricapplicationupgrade?view=azureservicefabricps
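
As a sketch of that, assuming the unhealthy service's type is called "WebType" (a placeholder) and using the map value format "MaxPercentUnhealthyPartitionsPerService,MaxPercentUnhealthyReplicasPerPartition,MaxPercentUnhealthyServices":

# Relax the health policy only for the service type that is already unhealthy;
# "100,100,100" tolerates unhealthy partitions, replicas, and services of that type during this upgrade.
$policyMap = @{ "WebType" = "100,100,100" }

Start-ServiceFabricApplicationUpgrade -ApplicationName "fabric:/MyApp" `
    -ApplicationTypeVersion "2.0.0" `
    -Monitored -FailureAction Rollback `
    -ServiceTypeHealthPolicyMap $policyMap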

alecor191 commented 7 years ago

Thanks a lot @VipulM-MSFT and @masnider for the provided options. Using them I can indeed upgrade my (unhealthy) cluster.

However, it feels a bit like a workaround, as I'm essentially disabling health checks during upgrade for the affected services.

E.g. it could be that I introduced a regression with my fix, so I would like the upgrade to fail and get rolled back. By disabling health checks for that service during upgrade, the upgrade will succeed, resulting in the regression being deployed.

What I would expect is a way to upgrade my unhealthy cluster, by still keeping all the health checks during the upgrade in place. Is there such an option available?

oanapl commented 7 years ago

Not as of now. We are tracking this as an improvement for the future.

vipul-modi commented 7 years ago

The manual mode of upgrade gives you the ability to roll back if you notice any issues after rolling out to the first UD. As @oanapl mentioned, we have an enhancement planned to allow specifying delta health policies to make this easier.
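
To illustrate that manual rollback path (the application name is a placeholder):

# After the first UD, inspect health yourself; if the new version looks bad, roll back explicitly
Get-ServiceFabricApplicationHealth -ApplicationName "fabric:/MyApp"
Start-ServiceFabricApplicationRollback -ApplicationName "fabric:/MyApp"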

Is it OK to close this issue, since you can upgrade unhealthy applications through a manual upgrade or by changing the health policy?

alecor191 commented 7 years ago

Thanks everyone for your input. I'm unblocked. So feel free to either close the issue or leave it open to track the enhancements.

masnider commented 7 years ago

Glad you're up and running!

alexgman commented 6 years ago

I've got no healthy nodes, yet upgrading always times out, with the following output:

27>Waiting for upgrade...
27>Waiting for upgrade...
    (line repeated while the upgrade was in progress)
27>Publish-UpgradedServiceFabricApplication : Upgrade Failed.
27>At F:\DEV\SolutionLocation\Configuration\Scripts\Deploy-FabricApplication.ps1:243 char:5
27>+     Publish-UpgradedServiceFabricApplication @PublishParameters
27>+     ~~~~~~~~~~~~~~~
27>    + CategoryInfo          : NotSpecified: (:) [Write-Error], WriteErrorException
27>    + FullyQualifiedErrorId : Microsoft.PowerShell.Commands.WriteErrorException,Publish-UpgradedServiceFabricApplication
27>
27>Finished executing script 'Deploy-FabricApplication.ps1'.
27>Time elapsed: 00:07:06.2424789
27>The PowerShell script failed to execute.
========== Build: 26 succeeded, 0 failed, 19 up-to-date, 0 skipped ==========
========== Publish: 0 succeeded, 1 failed, 0 skipped ==========

oanapl commented 6 years ago

@alexgman, can you give us more details about your scenario? As it is, we don't have enough information. Is this a local cluster (1 node or 5 nodes), or Azure? What parameters do you pass for the upgrade - what is the UD timeout and the global timeout?

You said "I've got no healthy nodes" - did you mean unhealthy?

This thread is about upgrading an application where services are unhealthy before upgrade. Is this your case? If not, consider starting a new thread.

alexgman commented 6 years ago

Thank you very much for your amazingly-fast response.

I meant to say that we've got no unhealthy nodes, but in fact I am wrong: we have 1 node that is issuing a warning.

I just added these two values to the publish profile and got it working:

HealthCheckWaitDurationSec="0" HealthCheckStableDurationSec="0"

This is a 3-node cluster, on premises.

oanapl commented 6 years ago

The node in Warning will not affect the application upgrade, as the app upgrade health checks only look at the application and its children (services, partitions, etc).

Glad you got this working! Based on your findings, it looks like you were hitting the upgrade timeout, and reducing the health stable duration avoided this. Keep in mind that by reducing the stable duration you may miss unhealthy reports and miss rollbacks for faulty application versions. In production, you need to make sure that you prevent buggy application versions from being deployed. Another way to avoid the upgrade rollback is to increase the upgrade timeout.
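
If you'd rather keep the health checks as they are and give the upgrade more time instead, a sketch of what that could look like with the cmdlet (the values are illustrative, and the same settings should also be expressible as attributes on the publish profile's Parameters element):

# Keep the health checks, but allow more time per UD and overall before the upgrade is declared failed
Start-ServiceFabricApplicationUpgrade -ApplicationName "fabric:/MyApp" `
    -ApplicationTypeVersion "2.0.0" -Monitored -FailureAction Rollback `
    -HealthCheckRetryTimeoutSec 600 -UpgradeDomainTimeoutSec 1200 -UpgradeTimeoutSec 3600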

alexgman commented 6 years ago

Here's my publish profile:

<ApplicationParameterFile Path="..\ApplicationParameters\DEV01-SF.xml" />
<UpgradeDeployment Mode="Monitored" Enabled="true">
  <Parameters FailureAction="Rollback" UpgradeReplicaSetCheckTimeoutSec="0" Force="True" ForceRestart="True"
              HealthCheckWaitDurationSec="0" HealthCheckStableDurationSec="0" />
</UpgradeDeployment>
<CopyPackageParameters CopyPackageTimeoutSec="3600" CompressPackage="true" />

I don't think we were hitting the timeout.


oanapl commented 6 years ago

You can try another upgrade with the previous settings; if it fails, check Get-ServiceFabricApplicationUpgrade, which should tell you the rollback reason. If the reason is timeout, then reducing the health parameters is fine (at least in a test environment). If the reason is failed health checks, then it means there is a problem with the new version that is causing an Error report to be sent with the new bits. If that's the case, you need to investigate it.
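
For example (the application name is a placeholder):

# Check why the last upgrade rolled back
Get-ServiceFabricApplicationUpgrade -ApplicationName "fabric:/MyApp" |
    Select-Object UpgradeState, FailureReason, UnhealthyEvaluations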

alexgman commented 6 years ago

Once more, I think I'm having an issue here.

I've used the following:

<ApplicationParameterFile Path="..\ApplicationParameters\DEV01-SF.xml" />
<UpgradeDeployment Mode="Monitored" Enabled="true">
  <Parameters FailureAction="Rollback" UpgradeReplicaSetCheckTimeoutSec="0" Force="True" ForceRestart="True"
              HealthCheckWaitDurationSec="0" HealthCheckStableDurationSec="0" />
</UpgradeDeployment>
<CopyPackageParameters CopyPackageTimeoutSec="3600" CompressPackage="true" />

Per your guidance, I've used this to check the reason for rollback:

Get-ServiceFabricApplicationUpgrade "fabric:/Our.API"

The reason for rollback was HealthCheck, even though I set these to 0: HealthCheckWaitDurationSec="0" HealthCheckStableDurationSec="0"

The output of Get-ServiceFabricApplicationUpgrade:

FailureReason                 : HealthCheck
UpgradeState                  : RollingBackCompleted
UpgradeDuration               : 00:12:04
CurrentUpgradeDomainDuration  : 00:00:00
NextUpgradeDomain             :
UpgradeDomainsStatus          : { "UD0" = "Completed"; "UD1" = "Completed"; "UD2" = "Completed" }
UnhealthyEvaluations          :
    Unhealthy services: 100% (1/1), ServiceType='NotificationDistributionType', MaxPercentUnhealthyServices=0%.
        Unhealthy service: ServiceName='fabric:/Our.API/Notifications.DistributionLists', AggregatedHealthState='Error'.
            Unhealthy partitions: 100% (1/1), MaxPercentUnhealthyPartitionsPerService=0%.
                Unhealthy partition: PartitionId='eb1cd568-1dca-4456-972a-aed7a29ab06e', AggregatedHealthState='Error'.
                    Error event: SourceId='System.FM', Property='State'.
UpgradeKind                   : Rolling
RollingUpgradeMode            : UnmonitoredAuto
ForceRestart                  : True
UpgradeReplicaSetCheckTimeout : 00:20:00


oanapl commented 6 years ago

@alexgman, changing the health settings doesn't prevent the health checks from occurring; it only configures when they run and what to do about the results. HealthCheckWaitDurationSec specifies how long to wait until the health check is performed. HealthCheckStableDurationSec specifies for how long the health check is repeated to ensure the cluster remains healthy before moving to the next upgrade domain.

When the health check is performed, if the app is unhealthy, the upgrade fails. This is as expected. If you want the upgrade to go through regardless of health, you can either perform a non-monitored upgrade or change the health policies to accept unhealthy services. However, the correct thing to do is to verify why the service is unhealthy - it looks like a partition is below its replica target.
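
Using the service and partition from the upgrade output above, drilling down could look like this:

# Walk the health hierarchy down to the replica level to see why the partition is in Error
Get-ServiceFabricServiceHealth -ServiceName "fabric:/Our.API/Notifications.DistributionLists"
Get-ServiceFabricPartitionHealth -PartitionId "eb1cd568-1dca-4456-972a-aed7a29ab06e"
Get-ServiceFabricReplica -PartitionId "eb1cd568-1dca-4456-972a-aed7a29ab06e"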

There is another parameter, HealthCheckRetryTimeoutSec, which configures how long to retry the health check in case of failure. If the first health check fails, the system retries for HealthCheckRetryTimeoutSec or until the first successful health check. By default, this is 2 minutes. If you think your service takes a while to stabilize (open its replicas), you may look into increasing it.