pulumi / pulumi-eks

A Pulumi component for easily creating and managing an Amazon EKS Cluster
https://www.pulumi.com/registry/packages/eks/
Apache License 2.0

Default NodeGroup not restored correctly on revert #715

Open bothra90 opened 2 years ago

bothra90 commented 2 years ago

What happened?

We were looking to migrate away from the default node group created by EKS Crosswalk and instead use multiple managed node groups as part of our EKS setup. We disabled the default node group by passing skipDefaultNodeGroup: true to the eks.Cluster constructor and attached two managed node groups to the cluster with new eks.ManagedNodeGroup(...).
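
For reference, here is a minimal TypeScript sketch of the migration described above; the resource names, IAM role, and scaling values are illustrative rather than copied from our actual program:

import * as aws from "@pulumi/aws";
import * as eks from "@pulumi/eks";

// IAM role assumed by the managed node groups' worker nodes (illustrative).
const nodeRole = new aws.iam.Role("mng-node-role", {
    assumeRolePolicy: JSON.stringify({
        Version: "2012-10-17",
        Statement: [{
            Effect: "Allow",
            Principal: { Service: "ec2.amazonaws.com" },
            Action: "sts:AssumeRole",
        }],
    }),
    managedPolicyArns: [
        "arn:aws:iam::aws:policy/AmazonEKSWorkerNodePolicy",
        "arn:aws:iam::aws:policy/AmazonEKS_CNI_Policy",
        "arn:aws:iam::aws:policy/AmazonEC2ContainerRegistryReadOnly",
    ],
});

// Cluster with the default node group disabled.
const cluster = new eks.Cluster("eks-cluster", {
    skipDefaultNodeGroup: true,
    instanceRoles: [nodeRole],
});

// One of the managed node groups attached to the cluster.
const nodeGroupA = new eks.ManagedNodeGroup("mng-a", {
    cluster: cluster,
    nodeRole: nodeRole,
    scalingConfig: { minSize: 1, maxSize: 3, desiredSize: 2 },
});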

Unfortunately, we ran into some permission-related issues during the upgrade and decided to revert the changes. After reverting, however, we noticed that our worker nodes were not able to join the cluster.

The nodes themselves thought they were "Ready":

Events:
  Type    Reason                   Age                From        Message
  ----    ------                   ----               ----        -------
  Normal  Starting                 39m                kubelet     Starting kubelet.
  Normal  NodeHasSufficientMemory  39m (x2 over 39m)  kubelet     Node ip-10-104-156-187.ap-south-1.compute.internal status is now: NodeHasSufficientMemory
  Normal  NodeHasNoDiskPressure    39m (x2 over 39m)  kubelet     Node ip-10-104-156-187.ap-south-1.compute.internal status is now: NodeHasNoDiskPressure
  Normal  NodeHasSufficientPID     39m (x2 over 39m)  kubelet     Node ip-10-104-156-187.ap-south-1.compute.internal status is now: NodeHasSufficientPID
  Normal  NodeAllocatableEnforced  39m                kubelet     Updated Node Allocatable limit across pods
  Normal  Starting                 39m                kube-proxy  Starting kube-proxy.
  Normal  NodeReady                38m                kubelet     Node ip-10-104-156-187.ap-south-1.compute.internal status is now: NodeReady

But EKS was reporting them as NotReady:

NAME                                            STATUS     ROLES    AGE   VERSION
ip-10-104-156-187.ap-south-1.compute.internal   NotReady   <none>   45m   v1.22.6-eks-7d68063
ip-10-104-182-1.ap-south-1.compute.internal     NotReady   <none>   45m   v1.22.6-eks-7d68063
ip-10-104-239-122.ap-south-1.compute.internal   NotReady   <none>   45m   v1.22.6-eks-7d68063

Eventually, we saw the following in the worker logs:

{
    "hostname": "ip-10-104-182-1.ap-south-1.compute.internal",
    "systemd_unit": "kubelet.service",
    "message": "E0526 19:32:27.106811    3090 kubelet_node_status.go:457] \"Unable to update node status\" err=\"update node status exceeds retry count\"",
    "az": "ap-south-1a",
    "ec2_instance_id": "i-0b2e054a340b8eb29"
}

On the control plane, we then saw:

time="2022-05-26T22:30:39Z" level=warning msg="access denied" arn="arn:aws:iam::xxxxxxxxxx:role/eks-cluster-instanceRole-role-xxxxxxx" client="127.0.0.1:38704" error="ARN is not mapped" method=POST path=/authenticate

We confirmed that the ARN was not mapped by looking at the aws-auth configmap:

$ k get cm/aws-auth -n kube-system -o yaml
apiVersion: v1
data:
  mapRoles: |
    null
kind: ConfigMap
metadata:
   ....

On a healthy cluster, the output of the above command looks like:

apiVersion: v1
data:
  mapRoles: |
    - groups:
      - system:bootstrappers
      - system:nodes
      rolearn: arn:aws:iam::xxxxxxxxxx:role/eks-cluster-instanceRole-role-xxxxxxx
      username: system:node:{{EC2PrivateDNSName}}
kind: ConfigMap
metadata:
  ....
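
If the mapping has to be restored by hand, one option (just a sketch, not verified against this component version) is to declare the worker role mapping explicitly on the cluster via roleMappings so the component renders it back into aws-auth. The ARN below is the masked one from above, i.e. a placeholder:

import * as eks from "@pulumi/eks";

// Explicitly map the worker instance role into aws-auth (placeholder ARN).
const cluster = new eks.Cluster("eks-cluster", {
    roleMappings: [{
        roleArn: "arn:aws:iam::xxxxxxxxxx:role/eks-cluster-instanceRole-role-xxxxxxx",
        username: "system:node:{{EC2PrivateDNSName}}",
        groups: ["system:bootstrappers", "system:nodes"],
    }],
});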

Steps to reproduce

  1. Create an EKS cluster with default NodeGroups using https://www.pulumi.com/docs/guides/crosswalk/aws/eks/ (a minimal sketch of this baseline follows the list).
  2. Once the cluster is created, disable the default node group and instead attach managed node groups.
  3. Revert the changes from step 2.
  4. Confirm that after step 3, the aws-auth configmap is not correctly restored.
  5. Initially the nodes might show up as Ready (not sure why), but after about 10 minutes they transition to NotReady (from the control plane's point of view) and then stay there.
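
For step 1, a minimal baseline along the lines of the Crosswalk guide (values illustrative) looks roughly like this:

import * as eks from "@pulumi/eks";

// Baseline cluster: the component creates the default node group itself and
// writes the matching role mapping into the aws-auth config map.
const cluster = new eks.Cluster("eks-cluster", {
    instanceType: "t3.medium",
    desiredCapacity: 3,
    minSize: 3,
    maxSize: 3,
});

export const kubeconfig = cluster.kubeconfig;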

Expected Behavior

On reverting the changes, the nodes created in the default node group should be able to join the cluster

Actual Behavior

Nodes created in the default node group were not able to join the cluster

Versions used

$ pulumi version
v3.33.1

Additional context

No response

Contributing

Vote on this issue by adding a 👍 reaction. To contribute a fix for this issue, leave a comment (and link to your pull request, if you've opened one already).

bothra90 commented 2 years ago

@viveklak: Sorry, I think I should have filed this under https://github.com/pulumi/pulumi-eks/. Can you migrate? Thanks!

viveklak commented 2 years ago

While I haven't tried to reproduce this, looking at the code base, I don't think this is a complete surprise, given the complex interactions leading to the creation of the singleton aws-auth config map. I wonder if the right thing to do is to mark some of these fields like skipDefaultNodeGroup etc. as requiring replacement on change?

bothra90 commented 2 years ago

I wonder if the right thing to do is to mark some of these fields like skipDefaultNodeGroup etc. as requiring replacement on change?

Do you mean recreating the EKS cluster entirely? Isn't that very disruptive? For context, in our case we create load-balancer services on our EKS cluster that are exported to our customers via AWS VPC endpoint services, and those would become invalid on a recreate. We are also in the early stages of running some stateful services in our EKS cluster, and I presume those would be completely wiped.

t0yv0 commented 1 month ago

Per @flostadler, this scenario may get a little easier with the upcoming work on EKS modernization when using API authentication mode instead of the aws-auth ConfigMap. We'll have another look here as time permits!
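
For anyone following along, a rough sketch of what that could look like, assuming a recent pulumi-eks release that exposes authenticationMode (exact property names may differ by version):

import * as eks from "@pulumi/eks";

// Use EKS access entries (API authentication) instead of the aws-auth ConfigMap,
// so node role mappings no longer live in a single shared ConfigMap.
const cluster = new eks.Cluster("eks-cluster", {
    authenticationMode: eks.AuthenticationMode.Api,
    skipDefaultNodeGroup: true,
});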