Closed: ameyajoshi99 closed this issue 2 years ago.
Thanks @ameyajoshi99. We've addressed some issues around LaunchTemplate caching in https://github.com/keikoproj/upgrade-manager/pull/322, which is not yet in the latest release. Could you try out the :master tag and see if that works better? We can create a release with this fix if needed.
Also, you mention that all instances are in StandBy: are there no new instances InService? When an instance is set to standby, a new one should automatically launch and become InService. Can you look at the ASG's activity history to see if there was a failure to launch new instances for some reason?
CC @shreyas-badiger
failed to set instances to standby
This is happening because of a caching issue. In the latest code on master, we flush the EC2 object on every reconcile operation, so you shouldn't see this issue anymore.
no inService instances in batch
This message means that all the instances in the batch (those considered for rotation) have already been set to StandBy and are no longer InService. This was done to avoid corner cases where we would set the same instances to StandBy multiple times, for which the AWS APIs return an error. So this isn't a final message or a log message of concern; you should look at the next few lines for "New nodes yet to join" or "new instances yet to join".
Bottom line: a launch template rollout could get stuck, fail, or not start at all because upgrade-manager could be operating on stale data. We have fixed that, and the next release should address this issue.
Please try the latest code from the upgrade-manager master branch and let us know if you still have the issue.
It would be good to add a hotfix release with the latest changes.
@eytan-avisror agreed. I am OOO today. Will consider doing it tomorrow.
@eytan-avisror / @shreyas-badiger Thanks for the response .. will try out the fix.
I tried out the master branch. It seems to work for a cluster with 15 nodes. However, I noticed a few errors in the log. They did not fail the job.
{"level":"info","ts":1650889090.4844146,"logger":"controllers.RollingUpgrade","msg":"***Reconciling***"}
{"level":"info","ts":1650889090.4844556,"logger":"controllers.RollingUpgrade","msg":"operating on existing rolling upgrade","scalingGroup":"asg-3","update strategy":{"type":"randomUpdate","mode":"eager","maxUnavailable":"20%","drainTimeout":300},"name":"upgrade-manager/asg-3-220220330103350910100000024"}
{"level":"info","ts":1650889090.540366,"logger":"controllers.RollingUpgrade","msg":"failed to set instances to stand-by","instances":[{"AvailabilityZone":"us-west-2a","HealthStatus":"Healthy","InstanceId":"i-abcd","InstanceType":"m6i.xlarge","LaunchConfigurationName":null,"LaunchTemplate":{"LaunchTemplateId":"lt-3","LaunchTemplateName":"lt-3-120220330103350515900000021","Version":"1"},"LifecycleState":"InService","ProtectedFromScaleIn":true,"WeightedCapacity":null},{"AvailabilityZone":"us-west-2a","HealthStatus":"Healthy","InstanceId":"i-pqrs","InstanceType":"m6i.xlarge","LaunchConfigurationName":null,"LaunchTemplate":{"LaunchTemplateId":"lt-3","LaunchTemplateName":"lt-3-120220330103350515900000021","Version":"1"},"LifecycleState":"Standby","ProtectedFromScaleIn":true,"WeightedCapacity":null}],"message":"ValidationError: The instance i-pqrs is not in InService.\n\tstatus code: 400, request id: 99b4774a-cc2d-4a18-9b09-5ae3855cc904","name":"upgrade-manager/asg-3-120220330103350918000000025"}
It looks like an instance that is not InService (either in a pending or terminating state) is being attempted for standby. I shall look into it. Can you share the complete logs in a file?
upgrade-manager.log Hi, attached is the log file.
@ameyajoshi99 I looked into the logs. Let me explain what is happening. I am explaining everything in detail so that you also understand and are encouraged to contribute.
NOTE: This comment explains the process; the next comment talks about the error you are facing.
Once a CR is admitted, we start processing the nodes for rotation batch by batch. The batch size is determined by maxUnavailable. (In your case, maxUnavailable is set to 20% and the batch size is 2.)
How do we select a batch?
A batch is selected either uniformly across all AZs or randomly. Priority is given to instances that are already InProgress (a tag that gets attached to the instance when we process it for the first time).
Something like this:
{"level":"info","ts":1651060486.7610822,"logger":"controllers.RollingUpgrade","msg":"found in-progress instances","instances":["i-abcdefg","i-pqrstu"]}
These steps are followed when we are processing a batch (Eager mode; Lazy mode skips the waiting in steps 3, 4 and 5):
1. Tag the InService instances in the batch as InProgress.
2. Set the instances in the batch to StandBy.
3. Wait for the replacement instances to become InService.
4. Wait for the new nodes to reach the Ready state.
5. Drain and rotate out the old instances in the batch.
The last step marks completion of processing a batch. We repeat the above steps until we reach the finite state for the CR, i.e. all the instances in the ASG have the same launch config / launch template as the ASG.
Now, talking about the error you are facing:
{"level":"info","ts":1651060486.9737825,"logger":"controllers.RollingUpgrade","msg":"failed to set instances to stand-by",
"instances":[{"AvailabilityZone":"us-west-2c","HealthStatus":"Healthy","InstanceId":"i-pqrstu","InstanceType":"m6i.2xlarge",
"LaunchConfigurationName":null,"LaunchTemplate":{"LaunchTemplateId":"lt-abcdef","LaunchTemplateName":"lt-3-22022033010335050860000001e","Version":"3"},
"LifecycleState":"InService","ProtectedFromScaleIn":true,"WeightedCapacity":null},{"AvailabilityZone":"us-west-2c","HealthStatus":"Healthy","InstanceId":"i-abcdefg","InstanceType":"m6i.2xlarge",
"LaunchConfigurationName":null,"LaunchTemplate":{"LaunchTemplateId":"lt-abcdef","LaunchTemplateName":"lt-3-22022033010335050860000001e","Version":"3"},"LifecycleState":"Standby","ProtectedFromScaleIn":true,"WeightedCapacity":null}],"message":"ValidationError: The instance i-abcdefg is not in InService.\n\tstatus code: 400, request id: 54bf7c35-c9e3-438f-90ea-ea6b760afd29","name":"upgrade-manager/asg-3-220220330103350910100000024"}
We added this check to make sure that step 2 mentioned above doesn't hit errors while setting instances to StandBy. (The AWS API to set an instance to StandBy is allowed only on instances that are InService.)
I am doing the right thing by setting only the instances in the batch that are InService to InProgress. However, I am not applying the same filter when setting instances to StandBy.
Basically the bug is in this line:
....
if err := r.SetBatchStandBy(batchInstanceIDs); err != nil {
...
...
Instead, it should have been:
....
if err := r.SetBatchStandBy(inServiceInstanceIDs); err != nil {
...
...
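For illustration only, here is a minimal, self-contained sketch (not the actual upgrade-manager code) of the filtering that the corrected call relies on: only the IDs of instances currently InService are collected before the standby call, since the standby API is allowed only on InService instances, as noted above. The helper name inServiceInstanceIDs and the direct use of the aws-sdk-go autoscaling types are assumptions made for this example.

package main

import (
	"fmt"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/service/autoscaling"
)

// inServiceInstanceIDs returns the IDs of the batch instances whose
// lifecycle state is InService; only these are safe to move to StandBy.
func inServiceInstanceIDs(batch []*autoscaling.Instance) []string {
	ids := make([]string, 0, len(batch))
	for _, instance := range batch {
		if aws.StringValue(instance.LifecycleState) == autoscaling.LifecycleStateInService {
			ids = append(ids, aws.StringValue(instance.InstanceId))
		}
	}
	return ids
}

func main() {
	// Mirrors the situation in the log above: one instance already in
	// Standby and one still InService.
	batch := []*autoscaling.Instance{
		{InstanceId: aws.String("i-abcdefg"), LifecycleState: aws.String("Standby")},
		{InstanceId: aws.String("i-pqrstu"), LifecycleState: aws.String("InService")},
	}
	fmt.Println(inServiceInstanceIDs(batch)) // only i-pqrstu is passed on
}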
Can you make this change, test it, and send out a PR for it?
@shreyas-badiger Thanks a lot for the detailed explanation .. that was really helpful .. I will make the change you suggested and test it out on the cluster. If it works well, I'll add a PR here ...
About the batch: we have not specified any strategy type, so the if & else-if in https://github.com/keikoproj/upgrade-manager/blob/8e0f67db323ec5b7bea0fd4d9f96d23f499e7e66/controllers/upgrade.go#L419 does not come into the picture, and the result of CalculateMaxUnavailable is used.
We have 6 nodes total in 1 ASG and maxUnavailable is 20%. intstr.GetValueFromIntOrPercent produces 2 rather than 1.2 because it uses "ceil".
@shreyas-badiger, I tested with the suggested change. There are no errors in the background and the rollout was successful. Here is the PR: https://github.com/keikoproj/upgrade-manager/pull/329. Thanks a lot.
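For reference, a minimal sketch of that rounding behaviour, assuming the standard k8s.io/apimachinery intstr helper with roundUp set to true (which is where the "ceil" comes from):

package main

import (
	"fmt"

	"k8s.io/apimachinery/pkg/util/intstr"
)

func main() {
	maxUnavailable := intstr.FromString("20%")
	// 20% of 6 nodes is 1.2; with roundUp=true this is rounded up to 2.
	batchSize, err := intstr.GetValueFromIntOrPercent(&maxUnavailable, 6, true)
	if err != nil {
		panic(err)
	}
	fmt.Println(batchSize) // 2
}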
@ameyajoshi99 I will merge the PR as soon as the CI is fixed.
PR - https://github.com/keikoproj/upgrade-manager/pull/329 @ameyajoshi99 Please fix the DCO.
@shreyas-badiger Done.
Is this a BUG REPORT or FEATURE REQUEST?: BUG REPORT
What happened: We are using release v1.0.4 of upgrade-manager. The manager is not able to complete the rollout, especially for clusters with roughly more than 20 nodes.
We've seen a couple of errors in the logs.
In both cases, these logs repeat until the manager fails; failure happens in around 1 hour. When we look at the ASG, the nodes from the logs are in Standby state. The timestamps in the upgrade-manager logs for setting a node to standby match the times in the ASG, so we can say the upgrade manager did set the nodes to standby. However, the upgrade manager keeps emitting the logs above and is stuck on that error until it fails. The manager shows the same logs even if we manually drain and delete the node.
What you expected to happen: The upgrade manager should roll out the nodes.
Environment: