Open · bothra90 opened 2 years ago
@viveklak: Sorry, I think I should have filed this under https://github.com/pulumi/pulumi-eks/. Can you migrate? Thanks!
While I haven't tried to reproduce this, looking at the code base I don't think this is a complete surprise, given the complex interactions leading to the creation of the singleton aws-auth config map. I wonder if the right thing to do is to mark some of these fields, like skipDefaultNodeGroup etc., as requiring replacement on change?
Do you mean recreating the EKS cluster entirely? Isn't that very disruptive? For context, in our case we create load-balancer services on our EKS cluster that are exported to our customers via AWS VPC endpoint services, and those would become invalid on a recreate. We are also in the early stages of running some stateful services in the EKS cluster, and I presume they would be completely wiped.
Per @flostadler, this scenario may be getting a little easier with the upcoming work on EKS modernization, when using the API authentication mode instead of the aws-auth ConfigMap. We'll have another look here as time permits!
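For reference, a minimal sketch of what opting into API authentication could look like; the authenticationMode option and the AuthenticationMode.Api constant are assumptions based on newer pulumi-eks releases and should be checked against the version in use:

```typescript
import * as eks from "@pulumi/eks";

// Sketch only: use EKS access entries (API authentication) instead of the
// aws-auth ConfigMap, so node and user access no longer depends on that
// singleton ConfigMap being rewritten correctly on every change.
const cluster = new eks.Cluster("my-cluster", {
    authenticationMode: eks.AuthenticationMode.Api,
    // ...other cluster options...
});
```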
What happened?
We were looking to migrate away from the default node group created by EKS Crosswalk and instead use multiple managed node groups as part of our EKS setup. We disabled the default node group by passing skipDefaultNodeGroup: true to the eks.Cluster constructor and attached two managed node groups to the cluster with new eks.ManagedNodeGroup(...); a sketch of the change is shown below.
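A minimal sketch of the change, with hypothetical resource names ("my-cluster", "workers-a") and an illustrative node role; the usual worker-node policy attachments (AmazonEKSWorkerNodePolicy, AmazonEKS_CNI_Policy, AmazonEC2ContainerRegistryReadOnly) and networking options are omitted for brevity:

```typescript
import * as aws from "@pulumi/aws";
import * as eks from "@pulumi/eks";

// IAM role assumed by the worker instances (policy attachments elided).
const nodeRole = new aws.iam.Role("workers-role", {
    assumeRolePolicy: aws.iam.assumeRolePolicyForPrincipal({ Service: "ec2.amazonaws.com" }),
});

// Disable the default node group on the existing cluster.
const cluster = new eks.Cluster("my-cluster", {
    skipDefaultNodeGroup: true,
    instanceRoles: [nodeRole],
    // ...other existing cluster options unchanged...
});

// Attach a managed node group instead of the default node group.
const workers = new eks.ManagedNodeGroup("workers-a", {
    cluster,
    nodeRole,
    instanceTypes: ["m5.large"],
    scalingConfig: { desiredSize: 2, minSize: 1, maxSize: 4 },
});
```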
Unfortunately, we ran into some permission-related issues during the upgrade and decided to revert the changes. On reverting, however, we noticed that our worker nodes were not able to join the cluster.
The nodes themselves thought they were "Ready":
But EKS was reporting them as NotReady:

Eventually, we saw the following in the worker logs:
On the control plane, we then saw:
We confirmed that the ARN was not mapped by looking at the aws-auth configmap:

On a healthy cluster, the output of the above command looks like:
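For context, the entry that appears to go missing is the mapRoles mapping for the workers' instance role. On the Pulumi side, that mapping can also be declared explicitly through the cluster's roleMappings option; a rough sketch with placeholder names, assuming nodeRole is the role the worker instances actually use:

```typescript
import * as aws from "@pulumi/aws";
import * as eks from "@pulumi/eks";

// Placeholder for the instance role actually used by the worker nodes.
const nodeRole = new aws.iam.Role("workers-role", {
    assumeRolePolicy: aws.iam.assumeRolePolicyForPrincipal({ Service: "ec2.amazonaws.com" }),
});

// Declare the aws-auth mapRoles entry for the workers explicitly, rather than
// relying only on the mapping pulumi-eks generates implicitly.
const cluster = new eks.Cluster("my-cluster", {
    roleMappings: [{
        roleArn: nodeRole.arn,
        username: "system:node:{{EC2PrivateDNSName}}",
        groups: ["system:bootstrappers", "system:nodes"],
    }],
});
```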
Steps to reproduce
Create a cluster with the default node group, set skipDefaultNodeGroup: true and attach managed node groups, then revert the change. The aws-auth configmap is not correctly restored.

Expected Behavior
On reverting the changes, the nodes created in the default node group should be able to join the cluster
Actual Behavior
Nodes created in the default node group were not able to join the cluster
Versions used
Additional context
No response
Contributing
Vote on this issue by adding a 👍 reaction. To contribute a fix for this issue, leave a comment (and link to your pull request, if you've opened one already).