preflightsiren opened this issue 3 years ago
I have two ideas about where the fault could be:

- `IsScalingGroupDrifted()`: `awsprovider.SelectScalingGroup(r.RollingUpgrade.ScalingGroupName(), r.Cloud.ScalingGroups)` might return an empty `&autoscaling.Group{}`, leaving `for _, instance := range scalingGroup.Instances` to skip without doing any work.
- `IsInstanceDrifted()`: it might not detect any changes due to an error in the AWS request or caching.

Unfortunately there's no more verbose logging I can enable to validate. I'm going to try setting `forceRefresh: true` in the instanceGroups, which might help diagnose which is more likely.
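To make the first hypothesis concrete, here is a minimal Go sketch of that failure mode. The types and function names below are illustrative stand-ins, not upgrade-manager's actual code; it only shows how a lookup that falls back to an empty group makes the drift loop a no-op:

```go
package main

import "fmt"

// Instance and Group are simplified stand-ins for the AWS SDK's
// autoscaling types; they are illustrative only.
type Instance struct {
	InstanceID            string
	LaunchTemplateVersion string
}

type Group struct {
	Name      string
	Instances []*Instance
}

// selectScalingGroup mimics a lookup that returns an empty Group when
// the name is not found in the cached list (hypothesis 1 above).
func selectScalingGroup(name string, groups []*Group) *Group {
	for _, g := range groups {
		if g.Name == name {
			return g
		}
	}
	return &Group{} // empty group, not an error
}

// isScalingGroupDrifted loops over the group's instances; with an empty
// group the loop body never runs and "no drift" is reported.
func isScalingGroupDrifted(g *Group, targetVersion string) bool {
	for _, instance := range g.Instances {
		if instance.LaunchTemplateVersion != targetVersion {
			return true
		}
	}
	return false
}

func main() {
	cached := []*Group{{Name: "other-asg", Instances: []*Instance{{InstanceID: "i-123", LaunchTemplateVersion: "1"}}}}
	g := selectScalingGroup("my-asg", cached)  // stale cache: group not found
	fmt.Println(isScalingGroupDrifted(g, "2")) // prints false, upgrade silently skipped
}
```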
This can be due to caching: does the modification to the launch template happen outside of instance-manager, i.e. manually or through some other process? When instance-manager makes the change (e.g. you change some configuration), caching is not in the picture, since that write invalidates the cached item.
upgrade-manager is flushing the cache before every upgrade, so when a CR is received the cache is clean as far as I know.
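For illustration, a minimal sketch of the two cache behaviours being discussed here: per-key invalidation when the controller itself writes, and a full flush before handling a new CR. This is not the controller's real cache implementation, just the intent:

```go
package main

import (
	"fmt"
	"sync"
)

// cache is a minimal, illustrative key/value cache.
type cache struct {
	mu    sync.Mutex
	items map[string]string
}

func (c *cache) Get(key string) (string, bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	v, ok := c.items[key]
	return v, ok
}

// Invalidate drops one key, e.g. right after the controller itself
// writes the launch template, so caching "is not in the picture".
func (c *cache) Invalidate(key string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	delete(c.items, key)
}

// Flush drops everything, e.g. before processing a new RollingUpgrade CR.
func (c *cache) Flush() {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.items = map[string]string{}
}

func main() {
	c := &cache{items: map[string]string{"lt-version": "1"}}
	c.Invalidate("lt-version") // after an in-controller write
	c.Flush()                  // before handling a new CR
	_, ok := c.Get("lt-version")
	fmt.Println(ok) // false: the next read must hit AWS again
}
```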
@preflightsiren can you add your instance-group YAML and the resulting rolling-upgrade YAML as well?
Interesting. The changes are via instance-manager. We just changed it to use `forceRefresh: true`, which tells me there's something inside `IsInstanceDrifted()` that is acting weird. I'll try and get more details when we have a failure.
Had this happen today during patching. I've attached an example of a working and an affected `InstanceGroup` and `RollingUpgrade` YAML: 309.zip. I think more interesting are the logs for the broken `RollingUpgrade`:
2021-10-18T00:06:14.852Z INFO controllers.RollingUpgrade checking if rolling upgrade is completed {"name": "instance-manager/default-sh-m5-xlarge-us-west-2a-20210908055949-11"}
2021-10-18T00:06:14.852Z INFO controllers.RollingUpgrade node object not found in clusterNodes, skipping this node for now {"instanceID": "i-005a537c368db9bc8", "name": "instance-manager/default-sh-m5-xlarge-us-west-2a-20210908055949-11"}
2021-10-18T00:06:14.852Z INFO controllers.RollingUpgrade node object not found in clusterNodes, skipping this node for now {"instanceID": "i-052105961e93c27df", "name": "instance-manager/default-sh-m5-xlarge-us-west-2a-20210908055949-11"}
2021-10-18T00:06:14.852Z INFO controllers.RollingUpgrade node object not found in clusterNodes, skipping this node for now {"instanceID": "i-056a89fcf2f036606", "name": "instance-manager/default-sh-m5-xlarge-us-west-2a-20210908055949-11"}
2021-10-18T00:06:14.852Z INFO controllers.RollingUpgrade node object not found in clusterNodes, skipping this node for now {"instanceID": "i-0631bae2cdbaded68", "name": "instance-manager/default-sh-m5-xlarge-us-west-2a-20210908055949-11"}
2021-10-18T00:06:14.852Z INFO controllers.RollingUpgrade node object not found in clusterNodes, skipping this node for now {"instanceID": "i-0692b82e4cd26c1d8", "name": "instance-manager/default-sh-m5-xlarge-us-west-2a-20210908055949-11"}
2021-10-18T00:06:14.852Z INFO controllers.RollingUpgrade node object not found in clusterNodes, skipping this node for now {"instanceID": "i-0a6180cc3f1658bb9", "name": "instance-manager/default-sh-m5-xlarge-us-west-2a-20210908055949-11"}
2021-10-18T00:06:14.852Z INFO controllers.RollingUpgrade node object not found in clusterNodes, skipping this node for now {"instanceID": "i-0f7155d519873057f", "name": "instance-manager/default-sh-m5-xlarge-us-west-2a-20210908055949-11"}
2021-10-18T00:06:14.852Z INFO controllers.RollingUpgrade no drift in scaling group {"name": "instance-manager/default-sh-m5-xlarge-us-west-2a-20210908055949-11"}
Looks like upgrade.go#457 is being hit, meaning `kubeprovider.SelectNodeByInstanceID(instanceID, r.Cloud.ClusterNodes)` is failing.
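For reference, a hedged sketch of what that lookup presumably does: matching the EC2 instance ID against the suffix of each node's `spec.providerID`. The function body is an assumption, not the controller's actual code, but it shows why a stale or empty node list makes every instance come back as "not found":

```go
package main

import (
	"fmt"
	"strings"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// selectNodeByInstanceID matches an EC2 instance ID against the suffix of
// each node's spec.providerID (e.g. "aws:///us-west-2b/i-00b789cfd28cd5942").
// Illustrative sketch only; not upgrade-manager's implementation.
func selectNodeByInstanceID(instanceID string, clusterNodes []corev1.Node) *corev1.Node {
	for i := range clusterNodes {
		if strings.HasSuffix(clusterNodes[i].Spec.ProviderID, "/"+instanceID) {
			return &clusterNodes[i]
		}
	}
	return nil // caller logs "node object not found in clusterNodes"
}

func main() {
	nodes := []corev1.Node{{
		ObjectMeta: metav1.ObjectMeta{Name: "ip-172-24-138-175.us-west-2.compute.internal"},
		Spec:       corev1.NodeSpec{ProviderID: "aws:///us-west-2b/i-00b789cfd28cd5942"},
	}}
	fmt.Println(selectNodeByInstanceID("i-00b789cfd28cd5942", nodes) != nil) // true
	fmt.Println(selectNodeByInstanceID("i-00b789cfd28cd5942", nil) != nil)   // false: empty/stale node cache
}
```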
Finally got around to spending some time debugging. It looks like the `r.Cloud` state becomes corrupt / out of date. The raw log:
{"cluster":"uw2p-akp-b3","kubernetes":{"container_id":"docker://97face6b823c2bb6efac2b7a5c3f44f6401031f6327a7d99dc5c90aab93c65a3","container_image":"493203180918.dkr.ecr.us-west-2.amazonaws.com/akp-mirror/docker.io/keikoproj/rolling-upgrade-controller:v1.0.2-seb","container_name":"rolling-upgrade-controller","host":"ip-172-24-138-175.us-west-2.compute.internal","labels":{"app":"rolling-upgrade-controller","pod-template-hash":"76cd7b4ccd"},"namespace_name":"upgrade-manager","pod_id":"bee06a61-5ae5-4141-b27d-dadd0bae0eeb","pod_ip":"100.65.225.107","pod_name":"rolling-upgrade-controller-76cd7b4ccd-rm6tj","pod_owner":"ReplicaSet/rolling-upgrade-controller-76cd7b4ccd"},"log":"2022-02-14T01:18:54.252Z\tINFO\tcontrollers.RollingUpgrade\tNil Node details\t{\"instanceID\": \"i-00b789cfd28cd5942\", \"clusterNodes\": [\"ip-172-24-137-169.us-west-2.compute.internal\", \"ip-172-24-138-30.us-west-2.compute.internal\", \"ip-172-24-137-205.us-west-2.compute.internal\", \"ip-172-24-139-28.us-west-2.compute.internal\", \"ip-172-24-138-231.us-west-2.compute.internal\", \"ip-172-24-139-241.us-west-2.compute.internal\", \"ip-172-24-138-206.us-west-2.compute.internal\", \"ip-172-24-136-136.us-west-2.compute.internal\", \"ip-172-24-141-90.us-west-2.compute.internal\", \"ip-172-24-140-216.us-west-2.compute.internal\", \"ip-172-24-136-181.us-west-2.compute.internal\", \"ip-172-24-137-217.us-west-2.compute.internal\", \"ip-172-24-139-39.us-west-2.compute.internal\", \"ip-172-24-136-125.us-west-2.compute.internal\", \"ip-172-24-137-230.us-west-2.compute.internal\", \"ip-172-24-136-128.us-west-2.compute.internal\"]}","stream":"stderr"}
The node being rotated: ip-172-24-138-175.us-west-2.compute.internal (providerID: aws:///us-west-2b/i-00b789cfd28cd5942). The clusterNodes list contains 15 nodes, while the cluster has 145 nodes present, 21 uncordoned.
I could continue to dig, but it may just be worthwhile to periodically restart upgrade-manager like it's an NT2k program :)
@preflightsiren can you check out #317 and see if that fixes the issue? I believe it should
Maybe. It seems unlikely, as the cache contains the list of nodes in the cluster, not information from EC2. Is there a connection I'm not aware of? I'll pull this patch into my debug branch and see if we still see the error.
The cache contains the launch template / versions; if a new version is created and the controller is unaware of it, it would essentially skip the upgrade.
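A tiny sketch of why a stale cached "latest" launch template version would make an instance look undrifted. This is illustrative only, not the controller's real comparison logic:

```go
package main

import "fmt"

// instanceDrifted compares the instance's launch template version against
// the latest version the controller knows about. If the cache still holds
// the old "latest", both sides match and the instance looks undrifted.
func instanceDrifted(instanceVersion, cachedLatestVersion string) bool {
	return instanceVersion != cachedLatestVersion
}

func main() {
	// Reality: instance-manager created launch template version 5,
	// instances still run version 4.
	fmt.Println(instanceDrifted("4", "5")) // true: upgrade proceeds
	// Stale cache: the controller never saw version 5.
	fmt.Println(instanceDrifted("4", "4")) // false: upgrade is skipped
}
```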
OK, I've deployed the patched version to our environment. 🤞 We'll know with more certainty in about 2-3 weeks.
Even with #317 we still have the issue. I added some debug logging and can see:
Nil Node details {"instanceID": "i-05d7694376d812689", "clusterNodes": []}
The `r.Cloud.ClusterNodes` list is empty.
Not sure how this is the case. @shreyas-badiger any idea? Could this be a bug with how we use event filters here:
Do we ever do an initial list of nodes when the controller starts up? If we don't, it will take the controller 30-60 seconds to fill up the cache from node events initially.
@preflightsiren is it possible that this happens when the controller restarts? Or is it spun up right when an upgrade is submitted?
I just checked, and that's not the case: even with event filters, the watch lists the nodes immediately when it's registered. Let me dig further and see if I can understand why this is happening.
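To illustrate that claim, here is a hedged client-go sketch. The thread above mentions event filters, which suggests controller-runtime, but either way the underlying shared-informer behaviour is the same: Add events are delivered for every pre-existing node once the informer syncs, so a node map fed only by event handlers should still be primed shortly after startup. This is not upgrade-manager's code, just a demonstration of that behaviour:

```go
package main

import (
	"fmt"
	"sync"
	"time"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/cache"
)

func main() {
	cfg, err := rest.InClusterConfig() // assumes in-cluster credentials
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	var mu sync.Mutex
	clusterNodes := map[string]*corev1.Node{}

	factory := informers.NewSharedInformerFactory(client, 10*time.Minute)
	nodeInformer := factory.Core().V1().Nodes().Informer()
	nodeInformer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		// AddFunc fires once per pre-existing node during the initial list,
		// then for every node that joins later.
		AddFunc: func(obj interface{}) {
			node := obj.(*corev1.Node)
			mu.Lock()
			clusterNodes[node.Name] = node
			mu.Unlock()
		},
		DeleteFunc: func(obj interface{}) {
			if node, ok := obj.(*corev1.Node); ok {
				mu.Lock()
				delete(clusterNodes, node.Name)
				mu.Unlock()
			}
		},
	})

	stop := make(chan struct{})
	defer close(stop)
	factory.Start(stop)
	// Block until the initial list has been delivered to the handlers.
	cache.WaitForCacheSync(stop, nodeInformer.HasSynced)

	mu.Lock()
	fmt.Printf("cache primed with %d nodes\n", len(clusterNodes))
	mu.Unlock()
}
```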
"Is it possible that this happens when the controller restarts?" No, the controller was not restarted during the upgrade.
I have an example of a node that joined the cluster long before the `RollingUpgrade` was created, yet was not found in the list of `ClusterNodes`, and the `ClusterNodes` list was not nil (it had 143 nodes in the cluster). Hope this adds some more flavour :)
> Even with #317 we still have the issue. I added some debug logging and can see:
> Nil Node details {"instanceID": "i-05d7694376d812689", "clusterNodes": []}
> The r.Cloud.ClusterNodes list is empty.
@preflightsiren can you confirm if the nodes were "Ready" or "Not Ready" at this point in time? Interested to know whether nodes were already part of the cluster or not.
Sorry for the delay in responding; we've been busy debugging during the upgrade process. During an upgrade there's always a mix of ready and unready nodes, but there are always ready nodes in the cluster servicing requests.
The one change we've applied that seems to have the best result is setting `--max-parallel=1`; since we applied that, every environment and region has completed without issue.
Bump. I've run into this when the new node coming up never joins the cluster (bad AMI update, etc.) but the EC2 instance is a valid member of the ASG. In my case I reverted the bad AMI, but the RollingUpgrade is stuck since the new node never joins, so it can't continue: the cached map of kube nodes and instances is out of whack. Force-removing the instances from the ASG, recycling the controller (to clear the cache map), and redeploying the CRDs works. Love the project, btw. The issue happens every now and then.
Is this a BUG REPORT or FEATURE REQUEST?: Bug
What happened: A `RollingUpgrade` created after a modification to launch templates via instance-manager did not detect any nodes that needed to be recycled. To resolve, the `RollingUpgrade` is deleted and instance-manager is restarted, recreating the `RollingUpgrade`.
What you expected to happen: Node is detected as being out-of-sync and replaced.
How to reproduce it (as minimally and precisely as possible): I can't consistently recreate this, but this happens often during our monthly patching cycle.
Anything else we need to know?:
Environment:
Other debugging information (if applicable):
RollingUpgrade status:
The `RollingUpgrade` has already been replaced. I did see that the state was "completed".
controller logs: