preflightsiren opened this issue 3 years ago
I have two ideas about where the fault could be:

- `IsScalingGroupDrifted()`: `awsprovider.SelectScalingGroup(r.RollingUpgrade.ScalingGroupName(), r.Cloud.ScalingGroups)` might return an empty `&autoscaling.Group{}`, leaving `for _, instance := range scalingGroup.Instances` to skip without doing any work.
- `IsInstanceDrifted()`: it might not detect any changes due to an error in the AWS request or caching.

Unfortunately there's no more verbose logging I can enable to validate. I'm going to try setting `forceRefresh: true` in the instanceGroups, which might help diagnose which is more likely.
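To make the first hypothesis concrete, here is a minimal Go sketch of that failure mode. The types and function names below are illustrative stand-ins, not upgrade-manager's actual code; it only shows how a lookup that falls back to an empty group makes the drift loop a no-op:

```go
package main

import "fmt"

// Instance and Group are simplified stand-ins for the AWS SDK's
// autoscaling types; they are illustrative only.
type Instance struct {
	InstanceID            string
	LaunchTemplateVersion string
}

type Group struct {
	Name      string
	Instances []*Instance
}

// selectScalingGroup mimics a lookup that returns an empty Group when
// the name is not found in the cached list (hypothesis 1 above).
func selectScalingGroup(name string, groups []*Group) *Group {
	for _, g := range groups {
		if g.Name == name {
			return g
		}
	}
	return &Group{} // empty group, not an error
}

// isScalingGroupDrifted loops over the group's instances; with an empty
// group the loop body never runs and "no drift" is reported.
func isScalingGroupDrifted(g *Group, targetVersion string) bool {
	for _, instance := range g.Instances {
		if instance.LaunchTemplateVersion != targetVersion {
			return true
		}
	}
	return false
}

func main() {
	cached := []*Group{{Name: "other-asg", Instances: []*Instance{{InstanceID: "i-123", LaunchTemplateVersion: "1"}}}}
	g := selectScalingGroup("my-asg", cached)  // stale cache: group not found
	fmt.Println(isScalingGroupDrifted(g, "2")) // prints false, upgrade silently skipped
}
```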
This can be due to caching: does the modification to the launch template happen outside of instance-manager, i.e. manually or through some other process? When instance-manager makes the change (e.g. you change some configuration), caching is not in the picture, since that write invalidates the cached item.
upgrade-manager is flushing the cache before every upgrade, so when a CR is received the cache is clean as far as I know.
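For illustration, a minimal sketch of the two cache behaviours being discussed here: per-key invalidation when the controller itself writes, and a full flush before handling a new CR. This is not the controller's real cache implementation, just the intent:

```go
package main

import (
	"fmt"
	"sync"
)

// cache is a minimal, illustrative key/value cache.
type cache struct {
	mu    sync.Mutex
	items map[string]string
}

func (c *cache) Get(key string) (string, bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	v, ok := c.items[key]
	return v, ok
}

// Invalidate drops one key, e.g. right after the controller itself
// writes the launch template, so caching "is not in the picture".
func (c *cache) Invalidate(key string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	delete(c.items, key)
}

// Flush drops everything, e.g. before processing a new RollingUpgrade CR.
func (c *cache) Flush() {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.items = map[string]string{}
}

func main() {
	c := &cache{items: map[string]string{"lt-version": "1"}}
	c.Invalidate("lt-version") // after an in-controller write
	c.Flush()                  // before handling a new CR
	_, ok := c.Get("lt-version")
	fmt.Println(ok) // false: the next read must hit AWS again
}
```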
@preflightsiren can you add your instance-group YAML and the resulting rolling-upgrade YAML as well?
Interesting. The changes are via instance-manager. We just changed it to use `forceRefresh: true`, which tells me there's something inside `IsInstanceDrifted()` that is acting weird. I'll try and get more details when we have a failure.
Had this happen today during patching. I've attached an example of a working and an affected `InstanceGroup` and `RollingUpgrade` YAML: 309.zip. I think more interesting are the logs for the broken `RollingUpgrade`:
2021-10-18T00:06:14.852Z INFO controllers.RollingUpgrade checking if rolling upgrade is completed {"name": "instance-manager/default-sh-m5-xlarge-us-west-2a-20210908055949-11"}
2021-10-18T00:06:14.852Z INFO controllers.RollingUpgrade node object not found in clusterNodes, skipping this node for now {"instanceID": "i-005a537c368db9bc8", "name": "instance-manager/default-sh-m5-xlarge-us-west-2a-20210908055949-11"}
2021-10-18T00:06:14.852Z INFO controllers.RollingUpgrade node object not found in clusterNodes, skipping this node for now {"instanceID": "i-052105961e93c27df", "name": "instance-manager/default-sh-m5-xlarge-us-west-2a-20210908055949-11"}
2021-10-18T00:06:14.852Z INFO controllers.RollingUpgrade node object not found in clusterNodes, skipping this node for now {"instanceID": "i-056a89fcf2f036606", "name": "instance-manager/default-sh-m5-xlarge-us-west-2a-20210908055949-11"}
2021-10-18T00:06:14.852Z INFO controllers.RollingUpgrade node object not found in clusterNodes, skipping this node for now {"instanceID": "i-0631bae2cdbaded68", "name": "instance-manager/default-sh-m5-xlarge-us-west-2a-20210908055949-11"}
2021-10-18T00:06:14.852Z INFO controllers.RollingUpgrade node object not found in clusterNodes, skipping this node for now {"instanceID": "i-0692b82e4cd26c1d8", "name": "instance-manager/default-sh-m5-xlarge-us-west-2a-20210908055949-11"}
2021-10-18T00:06:14.852Z INFO controllers.RollingUpgrade node object not found in clusterNodes, skipping this node for now {"instanceID": "i-0a6180cc3f1658bb9", "name": "instance-manager/default-sh-m5-xlarge-us-west-2a-20210908055949-11"}
2021-10-18T00:06:14.852Z INFO controllers.RollingUpgrade node object not found in clusterNodes, skipping this node for now {"instanceID": "i-0f7155d519873057f", "name": "instance-manager/default-sh-m5-xlarge-us-west-2a-20210908055949-11"}
2021-10-18T00:06:14.852Z INFO controllers.RollingUpgrade no drift in scaling group {"name": "instance-manager/default-sh-m5-xlarge-us-west-2a-20210908055949-11"}
Looks like upgrade.go#457 is being hit, meaning `kubeprovider.SelectNodeByInstanceID(instanceID, r.Cloud.ClusterNodes)` is failing.
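For reference, a hedged sketch of what that lookup presumably does: matching the EC2 instance ID against the suffix of each node's `spec.providerID`. The function body is an assumption, not the controller's actual code, but it shows why a stale or empty node list makes every instance come back as "not found":

```go
package main

import (
	"fmt"
	"strings"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// selectNodeByInstanceID matches an EC2 instance ID against the suffix of
// each node's spec.providerID (e.g. "aws:///us-west-2b/i-00b789cfd28cd5942").
// Illustrative sketch only; not upgrade-manager's implementation.
func selectNodeByInstanceID(instanceID string, clusterNodes []corev1.Node) *corev1.Node {
	for i := range clusterNodes {
		if strings.HasSuffix(clusterNodes[i].Spec.ProviderID, "/"+instanceID) {
			return &clusterNodes[i]
		}
	}
	return nil // caller logs "node object not found in clusterNodes"
}

func main() {
	nodes := []corev1.Node{{
		ObjectMeta: metav1.ObjectMeta{Name: "ip-172-24-138-175.us-west-2.compute.internal"},
		Spec:       corev1.NodeSpec{ProviderID: "aws:///us-west-2b/i-00b789cfd28cd5942"},
	}}
	fmt.Println(selectNodeByInstanceID("i-00b789cfd28cd5942", nodes) != nil) // true
	fmt.Println(selectNodeByInstanceID("i-00b789cfd28cd5942", nil) != nil)   // false: empty/stale node cache
}
```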
Finally got around to spending some time debugging. It looks like the `r.Cloud` state becomes corrupt / out of date. The raw log:
{"cluster":"uw2p-akp-b3","kubernetes":{"container_id":"docker://97face6b823c2bb6efac2b7a5c3f44f6401031f6327a7d99dc5c90aab93c65a3","container_image":"493203180918.dkr.ecr.us-west-2.amazonaws.com/akp-mirror/docker.io/keikoproj/rolling-upgrade-controller:v1.0.2-seb","container_name":"rolling-upgrade-controller","host":"ip-172-24-138-175.us-west-2.compute.internal","labels":{"app":"rolling-upgrade-controller","pod-template-hash":"76cd7b4ccd"},"namespace_name":"upgrade-manager","pod_id":"bee06a61-5ae5-4141-b27d-dadd0bae0eeb","pod_ip":"100.65.225.107","pod_name":"rolling-upgrade-controller-76cd7b4ccd-rm6tj","pod_owner":"ReplicaSet/rolling-upgrade-controller-76cd7b4ccd"},"log":"2022-02-14T01:18:54.252Z\tINFO\tcontrollers.RollingUpgrade\tNil Node details\t{\"instanceID\": \"i-00b789cfd28cd5942\", \"clusterNodes\": [\"ip-172-24-137-169.us-west-2.compute.internal\", \"ip-172-24-138-30.us-west-2.compute.internal\", \"ip-172-24-137-205.us-west-2.compute.internal\", \"ip-172-24-139-28.us-west-2.compute.internal\", \"ip-172-24-138-231.us-west-2.compute.internal\", \"ip-172-24-139-241.us-west-2.compute.internal\", \"ip-172-24-138-206.us-west-2.compute.internal\", \"ip-172-24-136-136.us-west-2.compute.internal\", \"ip-172-24-141-90.us-west-2.compute.internal\", \"ip-172-24-140-216.us-west-2.compute.internal\", \"ip-172-24-136-181.us-west-2.compute.internal\", \"ip-172-24-137-217.us-west-2.compute.internal\", \"ip-172-24-139-39.us-west-2.compute.internal\", \"ip-172-24-136-125.us-west-2.compute.internal\", \"ip-172-24-137-230.us-west-2.compute.internal\", \"ip-172-24-136-128.us-west-2.compute.internal\"]}","stream":"stderr"}
The node being rotated: ip-172-24-138-175.us-west-2.compute.internal (providerID: aws:///us-west-2b/i-00b789cfd28cd5942). The clusterNodes list contains 15 nodes, while the cluster has 145 nodes present, 21 uncordoned.
I could continue to dig, but it may just be worthwhile to periodically restart upgrade-manager like it's an NT2k program :)
@preflightsiren can you check out #317 and see if that fixes the issue? I believe it should
Maybe. It seems unlikely, as the cache contains the list of nodes in the cluster, not information from EC2. Is there a connection I'm not aware of? I'll pull this patch into my debug branch and see if we still see the error.
The cache contains the launch template / versions; if a new version is created and the controller is unaware of it, it would essentially skip the upgrade.
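A tiny sketch of why a stale cached "latest" launch template version would make an instance look undrifted. This is illustrative only, not the controller's real comparison logic:

```go
package main

import "fmt"

// instanceDrifted compares the instance's launch template version against
// the latest version the controller knows about. If the cache still holds
// the old "latest", both sides match and the instance looks undrifted.
func instanceDrifted(instanceVersion, cachedLatestVersion string) bool {
	return instanceVersion != cachedLatestVersion
}

func main() {
	// Reality: instance-manager created launch template version 5,
	// instances still run version 4.
	fmt.Println(instanceDrifted("4", "5")) // true: upgrade proceeds
	// Stale cache: the controller never saw version 5.
	fmt.Println(instanceDrifted("4", "4")) // false: upgrade is skipped
}
```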
OK, I've deployed the patched version to our environment. 🤞 We'll know with more certainty in about 2-3 weeks.
Even with #317 we still have the issue. I added some debug logging and can see:
Nil Node details {"instanceID": "i-05d7694376d812689", "clusterNodes": []}
The `r.Cloud.ClusterNodes` list is empty.
Not sure how this is the case. @shreyas-badiger any idea? Could this be a bug with how we use event filters here:
Do we ever do an initial list of nodes when the controller starts up? If we don't, it will take the controller 30-60 seconds to fill up the cache from node events initially.
@preflightsiren is it possible that this happens when the controller restarts? Or is it spun up right when an upgrade is submitted?
I just checked, and that's not the case: even with event filters, the watch lists the nodes immediately when it's registered. Let me dig further and see if I can understand why this is happening.
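To illustrate that claim, here is a hedged client-go sketch. The thread above mentions event filters, which suggests controller-runtime, but either way the underlying shared-informer behaviour is the same: Add events are delivered for every pre-existing node once the informer syncs, so a node map fed only by event handlers should still be primed shortly after startup. This is not upgrade-manager's code, just a demonstration of that behaviour:

```go
package main

import (
	"fmt"
	"sync"
	"time"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/cache"
)

func main() {
	cfg, err := rest.InClusterConfig() // assumes in-cluster credentials
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	var mu sync.Mutex
	clusterNodes := map[string]*corev1.Node{}

	factory := informers.NewSharedInformerFactory(client, 10*time.Minute)
	nodeInformer := factory.Core().V1().Nodes().Informer()
	nodeInformer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		// AddFunc fires once per pre-existing node during the initial list,
		// then for every node that joins later.
		AddFunc: func(obj interface{}) {
			node := obj.(*corev1.Node)
			mu.Lock()
			clusterNodes[node.Name] = node
			mu.Unlock()
		},
		DeleteFunc: func(obj interface{}) {
			if node, ok := obj.(*corev1.Node); ok {
				mu.Lock()
				delete(clusterNodes, node.Name)
				mu.Unlock()
			}
		},
	})

	stop := make(chan struct{})
	defer close(stop)
	factory.Start(stop)
	// Block until the initial list has been delivered to the handlers.
	cache.WaitForCacheSync(stop, nodeInformer.HasSynced)

	mu.Lock()
	fmt.Printf("cache primed with %d nodes\n", len(clusterNodes))
	mu.Unlock()
}
```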
"Is it possible that this happens when the controller restarts?" No, the controller was not restarted during the upgrade.
I have an example of a node that joined the cluster long before the `RollingUpgrade` was created, yet was not found in the list of `ClusterNodes`, and the `ClusterNodes` list was not nil (it had 143 nodes in the cluster). Hope this adds some more flavour :)
> Even with #317 we still have the issue. I added some debug logging and can see:
> Nil Node details {"instanceID": "i-05d7694376d812689", "clusterNodes": []}
> The r.Cloud.ClusterNodes list is empty.
@preflightsiren can you confirm if the nodes were "Ready" or "Not Ready" at this point in time? Interested to know whether nodes were already part of the cluster or not.
Sorry for the delay in responding; we've been busy debugging during the upgrade process. During an upgrade there's always a mix of ready and unready nodes, but there are always ready nodes in the cluster servicing requests.
The one change we've applied that seems to have the best result is setting `--max-parallel=1`; since we applied that, every environment and region has completed without issue.
Bump. I've run into this when the new node coming up never joins the cluster (bad AMI update, etc.) but the EC2 instance is a valid member of the ASG. In my case I reverted the bad AMI, but the RollingUpgrade is stuck since the new node never joins, so it can't continue: the cached map of kube nodes and instances is out of whack. Force-removing the instances from the ASG, recycling the controller (to clear the cache map), and redeploying the CRDs works. Love the project, btw. The issue happens every now and then.
Is this a BUG REPORT or FEATURE REQUEST?: Bug
What happened: A `RollingUpgrade` created after a modification to launch templates via instance-manager did not detect any nodes that needed to be recycled. To resolve, the `RollingUpgrade` is deleted and instance-manager is restarted, recreating the `RollingUpgrade`.
What you expected to happen: Node is detected as being out-of-sync and replaced.
How to reproduce it (as minimally and precisely as possible): I can't consistently recreate this, but this happens often during our monthly patching cycle.
Anything else we need to know?:
Environment:
Other debugging information (if applicable):
RollingUpgrade status:
The `RollingUpgrade` has already been replaced. I did see that the state was "completed".
controller logs: