keikoproj / upgrade-manager

Reliable, extensible rolling-upgrades of Autoscaling groups in Kubernetes
Apache License 2.0

Upgrades fail with instances not available for some asg #19

Closed: kianjones4 closed this issue 5 years ago

kianjones4 commented 5 years ago

Is this a BUG REPORT or FEATURE REQUEST?: BUG REPORT

What happened: The upgrade failed for a Kafka cluster with ZooKeeper. The controller reported "Instances are not available for update", and the rollup objects were stuck in the error state.

What you expected to happen: Instances in the zk-nodes and kafka-nodes asgs to be upgraded appropriately.

How to reproduce it (as minimally and precisely as possible): Create an ASG named zk-nodes.kafka-test.cluster.k8s.local or kafka-nodes.kafka-test.cluster.k8s.local in AWS, submit a rollup object with spec.AsgName set to that ASG name, then check the logs of the rollup controller pod.

Anything else we need to know?: This doesn't seem to be a problem with all ASG names. My other rollups, foo-bar1 and iks-system, upgraded successfully.

Environment: AWS

Other debugging information (if applicable):

- controller logs:

$ kubectl logs
2019/09/26 13:20:38 error: Instances are not available for update occurred for rollup-kafka-nodes-1.21.0-snapshot-kafka-nodes.kafka-test.cluster.k8s.local-20190926194335
2019/09/26 13:20:38 error: Instances are not available for update occurred for rollup-kafka-nodes-1.21.0-snapshot-kafka-nodes.kafka-test.cluster.k8s.local-20190926194335
2019/09/26 13:20:38 Deleted the entries of ASG kafka-nodes.kafka-test.cluster.k8s.local in the cluster store for rollup-kafka-nodes-1.21.0-snapshot-kafka-nodes.kafka-test.cluster.k8s.local-20190926194335
2019/09/26 13:20:38 Marked object rollup-kafka-nodes-1.21.0-snapshot-kafka-nodes.kafka-test.cluster.k8s.local-20190926194335 as error
2019/09/26 13:20:38 Deleted rollup-kafka-nodes-1.21.0-snapshot-kafka-nodes.kafka-test.cluster.k8s.local-20190926194335 from admission map 0xc0001fd740

shrinandj commented 5 years ago

Note that the bug itself has nothing to do with Kafka. If a cluster has multiple instance groups (ASGs) where the name of one is a substring of one or more of the others, upgrade-manager can get confused and perform a rolling upgrade of instances in a different IG.
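
To illustrate the kind of name collision described above, here is a minimal Go sketch. It is not taken from the upgrade-manager source: the functions findBySubstring and findExact are hypothetical, and the ASG name nodes.kafka-test.cluster.k8s.local is assumed for illustration (it does not appear in the report). It only shows why a substring comparison can select more ASGs than intended, while an exact comparison cannot.

```go
package main

import (
	"fmt"
	"strings"
)

// findBySubstring is a hypothetical lookup that returns every ASG whose
// name contains target. This is the ambiguous behaviour being illustrated.
func findBySubstring(asgNames []string, target string) []string {
	var matches []string
	for _, name := range asgNames {
		if strings.Contains(name, target) {
			matches = append(matches, name)
		}
	}
	return matches
}

// findExact is a hypothetical lookup that returns only ASGs whose name
// is identical to target.
func findExact(asgNames []string, target string) []string {
	var matches []string
	for _, name := range asgNames {
		if name == target {
			matches = append(matches, name)
		}
	}
	return matches
}

func main() {
	// The first name is an assumption made for this example; the other
	// two are the ASG names from the report.
	asgNames := []string{
		"nodes.kafka-test.cluster.k8s.local",
		"kafka-nodes.kafka-test.cluster.k8s.local",
		"zk-nodes.kafka-test.cluster.k8s.local",
	}
	target := "nodes.kafka-test.cluster.k8s.local"

	fmt.Println("substring matches:", findBySubstring(asgNames, target))
	fmt.Println("exact matches:    ", findExact(asgNames, target))
}
```

With these inputs the substring lookup returns all three ASGs, because each name contains the target string, while the exact lookup returns only the one that was requested.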