keikoproj / upgrade-manager

Reliable, extensible rolling-upgrades of Autoscaling groups in Kubernetes
Apache License 2.0

Upgrade manager stuck in no InService instances. #324

Closed ameyajoshi99 closed 2 years ago

ameyajoshi99 commented 2 years ago

Is this a BUG REPORT or FEATURE REQUEST?: BUG REPORT

What happened: We are using release v1.0.4 of upgrade-manager. The manager is not able to complete the rollout, especially for clusters with roughly more than 20 nodes.

We've seen a couple of errors in the logs:

  1. failed to set instances to stand-by:
    {"level":"info","ts":1649087216.0083492,"logger":"controllers.RollingUpgrade","msg":"failed to set instances to stand-by","instances":[{"AvailabilityZone":"us-west-2b","HealthStatus":"Healthy","InstanceId":"i-0c485e03bd870299e","InstanceType":"c6i.4xlarge","LaunchConfigurationName":null,"LaunchTemplate":{"LaunchTemplateId":"lt-02f0c454eec658bb0","LaunchTemplateName":"lt-k8s-1-020220120120625129100000006","Version":"2"},"LifecycleState":"InService","ProtectedFromScaleIn":true,"WeightedCapacity":null},{"AvailabilityZone":"us-west-2b","HealthStatus":"Healthy","InstanceId":"i-097374badf1782ccb","InstanceType":"c6i.4xlarge","LaunchConfigurationName":null,"LaunchTemplate":{"LaunchTemplateId":"lt-02f0c454eec658bb0","LaunchTemplateName":"lt-k8s-1-020220120120625129100000006","Version":"2"},"LifecycleState":"Standby","ProtectedFromScaleIn":true,"WeightedCapacity":null}],"message":"ValidationError: The instance i-097374badf1782ccb is not in InService.\n\tstatus code: 400, request id: a477d57c-af4e-44a6-8f3b-f89d711e1f35","name":"upgrade-manager/asg-k8s-1-02022012012062550510000000a"}
  2. no InService instances in the batch:
    {"level":"info","ts":1649160753.7560294,"logger":"controllers.RollingUpgrade","msg":"selecting batch for rotation","batch size":1,"name":"upgrade-manager/asg-stage-k8s-1-420220201085050052200000045"}
    {"level":"info","ts":1649160753.7560575,"logger":"controllers.RollingUpgrade","msg":"rotating batch","instances":["i-0738c1f7e01cf2ce7"],"name":"upgrade-manager/asg-stage-k8s-1-420220201085050052200000045"}
    {"level":"info","ts":1649160753.7560735,"logger":"controllers.RollingUpgrade","msg":"no InService instances in the batch","batch":["i-0738c1f7e01cf2ce7"],"instances(InService)":[],"name":"upgrade-manager/asg-stage-k8s-1-420220201085050052200000045"}

In both cases, these logs repeat until the manager fails, which takes around 1 hour. When we look at the ASG, the nodes in the logs are in Standby state. The timestamps in the upgrade-manager logs for setting nodes to standby match the times in the ASG, so we can say upgrade-manager did set the nodes to standby. However, upgrade-manager keeps emitting the logs above and stays stuck on that error until it fails. The manager shows the same logs even if we manually drain and delete the node.

What you expected to happen: Upgrade manager should roll out the nodes.

Environment:

eytan-avisror commented 2 years ago

Thanks @ameyajoshi99. We've addressed some issues around LaunchTemplate caching in https://github.com/keikoproj/upgrade-manager/pull/322, which is not in the latest release. Could you try out the :master tag and see if that works better? We can create a release with this fix if needed.

Also, you mention that all instances are in StandBy - are there no new instances InService? When an instance is set to standby, a new one should automatically launch and become InService. Can you look at the ASG's activity history to see if there was a failure to launch new instances for some reason?
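For reference, something along these lines can be used to pull the group's recent activity history (a minimal sketch only, assuming the AWS SDK for Go v1; the ASG name is a placeholder):

package main

import (
    "fmt"

    "github.com/aws/aws-sdk-go/aws"
    "github.com/aws/aws-sdk-go/aws/session"
    "github.com/aws/aws-sdk-go/service/autoscaling"
)

func main() {
    svc := autoscaling.New(session.Must(session.NewSession()))

    // Describe the ASG's recent scaling activities; a failed replacement
    // launch would show up here with StatusCode "Failed" and a reason.
    out, err := svc.DescribeScalingActivities(&autoscaling.DescribeScalingActivitiesInput{
        AutoScalingGroupName: aws.String("my-asg-name"), // placeholder
        MaxRecords:           aws.Int64(20),
    })
    if err != nil {
        panic(err)
    }
    for _, a := range out.Activities {
        fmt.Printf("%v  %s  %s\n",
            aws.TimeValue(a.StartTime),
            aws.StringValue(a.StatusCode),
            aws.StringValue(a.Description))
    }
}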

CC @shreyas-badiger

shreyas-badiger commented 2 years ago

failed to set instances to standby

This is happening because of a caching issue. In the latest code on master we flush the EC2 object on every reconcile operation, so you shouldn't see this issue anymore.
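(Roughly, the idea is along these lines; a sketch only, with hypothetical cloudState/refreshCloudState names rather than the actual upgrade-manager types:)

package main

import (
    "github.com/aws/aws-sdk-go/aws/session"
    "github.com/aws/aws-sdk-go/service/autoscaling"
)

// cloudState is an illustrative stand-in for the cached cloud data.
type cloudState struct {
    scalingGroups []*autoscaling.Group
}

// refreshCloudState discards any previously cached data and re-describes the
// scaling groups, so every reconcile sees current instance lifecycle states
// instead of a stale snapshot.
func refreshCloudState(asgClient *autoscaling.AutoScaling) (*cloudState, error) {
    state := &cloudState{}
    err := asgClient.DescribeAutoScalingGroupsPages(
        &autoscaling.DescribeAutoScalingGroupsInput{},
        func(page *autoscaling.DescribeAutoScalingGroupsOutput, lastPage bool) bool {
            state.scalingGroups = append(state.scalingGroups, page.AutoScalingGroups...)
            return true
        })
    return state, err
}

func main() {
    svc := autoscaling.New(session.Must(session.NewSession()))
    if _, err := refreshCloudState(svc); err != nil {
        panic(err)
    }
}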

no InService instances in the batch

This message means that all the instances in the batch (the ones considered for rotation) have already been set to 'StandBy' and are no longer 'InService'. This was done to avoid corner cases where we would end up setting the same instances to 'StandBy' multiple times, for which the AWS APIs return an error. So this isn't a final message or a log message of concern; you should look for the next few lines, "New nodes yet to join" or "new instances yet to join".

Bottom line: the launch template rollout could get stuck, fail, or not start at all because upgrade-manager could be operating on stale data. We have fixed that, and the next release should address this issue.

Please try the latest code from the upgrade-manager master branch, and let us know if you still have the issue.

eytan-avisror commented 2 years ago

It would be good to add a hotfix release with the latest changes.

shreyas-badiger commented 2 years ago

@eytan-avisror agreed. I am OOO today. Will consider doing it tomorrow.

ameyajoshi99 commented 2 years ago

@eytan-avisror / @shreyas-badiger Thanks for the response .. will try out the fix.

ameyajoshi99 commented 2 years ago

I tried out the master branch. It seems that did work for a cluster with 15 nodes. However, I noticed a few errors in the log. These did not fail the job.

{"level":"info","ts":1650889090.4844146,"logger":"controllers.RollingUpgrade","msg":"***Reconciling***"}
{"level":"info","ts":1650889090.4844556,"logger":"controllers.RollingUpgrade","msg":"operating on existing rolling upgrade","scalingGroup":"asg-3","update strategy":{"type":"randomUpdate","mode":"eager","maxUnavailable":"20%","drainTimeout":300},"name":"upgrade-manager/asg-3-220220330103350910100000024"}
{"level":"info","ts":1650889090.540366,"logger":"controllers.RollingUpgrade","msg":"failed to set instances to stand-by","instances":[{"AvailabilityZone":"us-west-2a","HealthStatus":"Healthy","InstanceId":"i-abcd","InstanceType":"m6i.xlarge","LaunchConfigurationName":null,"LaunchTemplate":{"LaunchTemplateId":"lt-3","LaunchTemplateName":"lt-3-120220330103350515900000021","Version":"1"},"LifecycleState":"InService","ProtectedFromScaleIn":true,"WeightedCapacity":null},{"AvailabilityZone":"us-west-2a","HealthStatus":"Healthy","InstanceId":"i-pqrs","InstanceType":"m6i.xlarge","LaunchConfigurationName":null,"LaunchTemplate":{"LaunchTemplateId":"lt-3","LaunchTemplateName":"lt-3-120220330103350515900000021","Version":"1"},"LifecycleState":"Standby","ProtectedFromScaleIn":true,"WeightedCapacity":null}],"message":"ValidationError: The instance i-pqrs is not in InService.\n\tstatus code: 400, request id: 99b4774a-cc2d-4a18-9b09-5ae3855cc904","name":"upgrade-manager/asg-3-120220330103350918000000025"}

shreyas-badiger commented 2 years ago

Looks like an instance that is not InService (either in pending or terminating state) is being attempted for standby. I shall look into it. Can you share the complete logs in a file?

ameyajoshi99 commented 2 years ago

upgrade-manager.log Hi, attached is the log file.

shreyas-badiger commented 2 years ago

@ameyajoshi99 I looked into the logs. Let me explain what is happening. I am explaining everything in detail so that you understand it as well and are encouraged to contribute.

NOTE: This comment explains the process, and next comment talks about the error you are facing.

Process:

Flow charts

Once a CR is admitted, we start processing the nodes for rotation batch by batch. The batch size is determined by maxUnavailable. (In your case, maxUnavailable is set to 20% and the batch size is 2.)

How do we select a batch? A batch is selected either uniformly across all AZs or randomly. Priority is given to instances that are already InProgress, a tag that gets attached to an instance when we process it for the first time. Something like this: {"level":"info","ts":1651060486.7610822,"logger":"controllers.RollingUpgrade","msg":"found in-progress instances","instances":["i-abcdefg","i-pqrstu"]}
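(A simplified, hypothetical sketch of that selection order, not the actual implementation in controllers/upgrade.go:)

// selectBatch is a hypothetical sketch: in-progress instances are taken
// first, then the batch is topped up from the remaining drifted candidates.
// randomUpdate would shuffle the candidates first; uniformAcrossAzUpdate
// would pick them evenly per availability zone.
func selectBatch(inProgress, candidates []string, batchSize int) []string {
    batch := make([]string, 0, batchSize)
    for _, id := range inProgress {
        if len(batch) == batchSize {
            return batch
        }
        batch = append(batch, id)
    }
    for _, id := range candidates {
        if len(batch) == batchSize {
            break
        }
        batch = append(batch, id)
    }
    return batch
}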

These steps are followed when we process a batch (Eager mode; Lazy mode skips the waiting in steps 3, 4, and 5.)

The last step marks the completion of processing a batch. We repeat the above steps until we reach the final state for the CR, i.e. all the instances in the ASG have the same launch config/launch template as the ASG.
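(As an illustration of that end condition, a hedged sketch, assuming the aws and autoscaling packages from the AWS SDK for Go v1 are imported; it is not the controller's actual drift check:)

// allInstancesUpToDate is a simplified, hypothetical version of the end
// condition: every instance must reference the same launch template ID and
// version as the ASG. (Simplified: the ASG version may be "$Latest" or
// "$Default", which a real check would resolve to a number first.)
func allInstancesUpToDate(group *autoscaling.Group) bool {
    want := group.LaunchTemplate
    for _, inst := range group.Instances {
        got := inst.LaunchTemplate
        if want == nil || got == nil {
            return false
        }
        if aws.StringValue(got.LaunchTemplateId) != aws.StringValue(want.LaunchTemplateId) ||
            aws.StringValue(got.Version) != aws.StringValue(want.Version) {
            return false
        }
    }
    return true
}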

shreyas-badiger commented 2 years ago

Now, talking about the error you are facing:

{"level":"info","ts":1651060486.9737825,"logger":"controllers.RollingUpgrade","msg":"failed to set instances to stand-by",
"instances":[{"AvailabilityZone":"us-west-2c","HealthStatus":"Healthy","InstanceId":"i-pqrstu","InstanceType":"m6i.2xlarge",
"LaunchConfigurationName":null,"LaunchTemplate":{"LaunchTemplateId":"lt-abcdef","LaunchTemplateName":"lt-3-22022033010335050860000001e","Version":"3"},
"LifecycleState":"InService","ProtectedFromScaleIn":true,"WeightedCapacity":null},{"AvailabilityZone":"us-west-2c","HealthStatus":"Healthy","InstanceId":"i-abcdefg","InstanceType":"m6i.2xlarge",
"LaunchConfigurationName":null,"LaunchTemplate":{"LaunchTemplateId":"lt-abcdef","LaunchTemplateName":"lt-3-22022033010335050860000001e","Version":"3"},"LifecycleState":"Standby","ProtectedFromScaleIn":true,"WeightedCapacity":null}],"message":"ValidationError: The instance i-abcdefg is not in InService.\n\tstatus code: 400, request id: 54bf7c35-c9e3-438f-90ea-ea6b760afd29","name":"upgrade-manager/asg-3-220220330103350910100000024"}

We added this check to make sure that step 2 mentioned above doesn't hit errors while setting instances to StandBy. (The AWS API to set an instance to StandBy is allowed only on instances that are InService.)

I am doing the right thing by setting the instances in the batch that are InService to InProgress. However, I am not doing the same when I set instances to StandBy.

Basically, the bug is on this line:

....
if err := r.SetBatchStandBy(batchInstanceIDs); err != nil {
...
...

Instead, it should have been:

....
if err := r.SetBatchStandBy(inServiceInstanceIDs); err != nil {
...
...
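(For illustration, a sketch of how the InService IDs could be derived from the batch, assuming the aws and autoscaling packages from the AWS SDK for Go v1 are imported; the inServiceIDs helper name is made up:)

// inServiceIDs keeps only the instances whose lifecycle state is "InService",
// so SetBatchStandBy never asks AWS to move an instance that is already in
// Standby (which triggers the ValidationError shown above).
func inServiceIDs(batch []*autoscaling.Instance) []string {
    ids := make([]string, 0, len(batch))
    for _, inst := range batch {
        if aws.StringValue(inst.LifecycleState) == autoscaling.LifecycleStateInService {
            ids = append(ids, aws.StringValue(inst.InstanceId))
        }
    }
    return ids
}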

Can you make this change, test and send out a PR for this?

ameyajoshi99 commented 2 years ago

@shreyas-badiger Thanks a lot for the detailed explanation .. That was really helpful .. I will make the change you suggested and test it out on the cluster. If it works well, I'll add a PR here ...

About the batch: we have not specified any strategy type, so the if & else-if at https://github.com/keikoproj/upgrade-manager/blob/8e0f67db323ec5b7bea0fd4d9f96d23f499e7e66/controllers/upgrade.go#L419 does not come into the picture .. So the result of CalculateMaxUnavailable is getting used.

We have 6 total nodes in 1 ASG & maxUnavailable is 20%. intstr.GetValueFromIntOrPercent produces 2 from 1.2 since it uses "ceil".
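(A small standalone check of that rounding with k8s.io/apimachinery's intstr package:)

package main

import (
    "fmt"

    "k8s.io/apimachinery/pkg/util/intstr"
)

func main() {
    maxUnavailable := intstr.FromString("20%")
    // roundUp=true makes this ceil(6 * 0.20) = ceil(1.2) = 2, which matches
    // the batch size of 2 seen with 6 nodes and maxUnavailable of 20%.
    batchSize, err := intstr.GetValueFromIntOrPercent(&maxUnavailable, 6, true)
    if err != nil {
        panic(err)
    }
    fmt.Println(batchSize) // prints 2
}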

ameyajoshi99 commented 2 years ago

@shreyas-badiger, I did testing with the suggested change. There are no errors in the background, and the rollout was successful. Here is the PR: https://github.com/keikoproj/upgrade-manager/pull/329 Thanks a lot.

shreyas-badiger commented 2 years ago

@ameyajoshi99 I will merge the PR as soon as the CI is fixed.

shreyas-badiger commented 2 years ago

PR - https://github.com/keikoproj/upgrade-manager/pull/329 @ameyajoshi99 Please fix the DCO.

ameyajoshi99 commented 2 years ago

@shreyas-badiger Done.