kubernetes-sigs / cluster-api-provider-aws

Kubernetes Cluster API Provider AWS provides consistent deployment and day 2 operations of "self-managed" and EKS Kubernetes clusters on AWS.
http://cluster-api-aws.sigs.k8s.io/
Apache License 2.0

EC2 machine failed to join EKS cluster as a node using ubuntu ami #5064

Open t4i5m6 opened 2 months ago

t4i5m6 commented 2 months ago

/kind bug

What steps did you take and what happened: When I used the latest Amazon Ubuntu AMI, "ami-07edf4c2ac90845dc", for my worker nodes, they failed to join the cluster. The status shows "MachineDeployment/ False Warning WaitingForAvailableMachines Minimum availability requires 2 replicas, current 0 available" indefinitely.

I then logged in to the EC2 machine. The cloud-init log contains the following error:

024-07-21 00:55:14,881 - util.py[WARNING]: failed stage init
failed run of stage init
------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/cloudinit/url_helper.py", line 78, in read_file_or_url
    with open(file_path, "rb") as fp:
FileNotFoundError: [Errno 2] No such file or directory: '/etc/secret-userdata.txt'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/cloudinit/user_data.py", line 237, in _do_include
    resp = read_file_or_url(
  File "/usr/lib/python3/dist-packages/cloudinit/url_helper.py", line 84, in read_file_or_url
    raise UrlError(cause=e, code=code, headers=None, url=url) from e
cloudinit.url_helper.UrlError: [Errno 2] No such file or directory: '/etc/secret-userdata.txt'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/cloudinit/cmd/main.py", line 781, in status_wrapper
    ret = functor(name, args)
  File "/usr/lib/python3/dist-packages/cloudinit/cmd/main.py", line 463, in main_init
    init.update()
  File "/usr/lib/python3/dist-packages/cloudinit/stages.py", line 511, in update
    self._store_processeddata(self.datasource.get_userdata(), "userdata")
  File "/usr/lib/python3/dist-packages/cloudinit/sources/__init__.py", line 595, in get_userdata
    self.userdata = self.ud_proc.process(self.get_userdata_raw())
  File "/usr/lib/python3/dist-packages/cloudinit/user_data.py", line 87, in process
    self._process_msg(convert_string(blob), accumulating_msg)
  File "/usr/lib/python3/dist-packages/cloudinit/user_data.py", line 158, in _process_msg
    self._do_include(payload, append_msg)
  File "/usr/lib/python3/dist-packages/cloudinit/user_data.py", line 263, in _do_include
    _handle_error(message, urle)
  File "/usr/lib/python3/dist-packages/cloudinit/user_data.py", line 71, in _handle_error
    raise RuntimeError(error_message) from source_exception
RuntimeError: [Errno 2] No such file or directory: '/etc/secret-userdata.txt' for url: file:///etc/secret-userdata.txt
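
The traceback shows cloud-init failing to resolve an include of file:///etc/secret-userdata.txt. As far as I understand, CAPA by default stores the bootstrap data in a secrets backend and relies on a boot hook in the AMI's cloud-init run to write that file before the include is processed. One option that might be worth testing is to skip the secrets backend so the bootstrap data is passed directly as EC2 user data. The sketch below reuses the AWSMachineTemplate from this report and adds `cloudInit.insecureSkipSecretsManager`. Whether this resolves the failure on this particular AMI is an assumption, and it exposes the bootstrap data in plain instance user data, so it is probably only suitable for debugging.

```yaml
apiVersion: infrastructure.cluster.x-k8s.io/v1beta2
kind: AWSMachineTemplate
metadata:
  name: capi-eks-quickstart-md-0
  namespace: default
spec:
  template:
    spec:
      ami:
        id: ami-07edf4c2ac90845dc
      iamInstanceProfile: nodes.cluster-api-provider-aws.sigs.k8s.io
      instanceType: t3.large
      sshKeyName: tim-test-0718
      cloudInit:
        # Assumption: with this set, the bootstrap data is delivered as plain
        # EC2 user data, so cloud-init no longer needs /etc/secret-userdata.txt.
        # The data is then readable from the instance metadata service.
        insecureSkipSecretsManager: true
```

Machine templates are generally treated as immutable, so trying this would likely mean creating a template under a new name and pointing the MachineDeployment at it.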

After some investigation, I suspected the problem was in the bootstrapping step, so I used AWS Systems Manager (SSM) to log in to the EC2 machine and ran /etc/eks/bootstrap.sh manually. After that, the EC2 machine successfully joined the cluster.

However, when I set bootstrapCommandOverride in the EKSConfigTemplate object, it does not take effect.

I also tried the CAPI Ubuntu image. It did not produce the cloud-init error and the user data was present, but there is no /etc/eks/bootstrap.sh on that image.

Any advice on creating an EKS cluster with an Ubuntu AMI?

What did you expect to happen: The EC2 machine joins the cluster.

Anything else you would like to add: The YAML file I used to create the EKS cluster:

apiVersion: cluster.x-k8s.io/v1beta1
kind: Cluster
metadata:
  name: capi-eks-quickstart
  namespace: default
spec:
  clusterNetwork:
    pods:
      cidrBlocks:
      - 192.168.0.0/16
  controlPlaneRef:
    apiVersion: controlplane.cluster.x-k8s.io/v1beta2
    kind: AWSManagedControlPlane
    name: capi-eks-quickstart-control-plane
  infrastructureRef:
    apiVersion: infrastructure.cluster.x-k8s.io/v1beta2
    kind: AWSManagedCluster
    name: capi-eks-quickstart
---
apiVersion: infrastructure.cluster.x-k8s.io/v1beta2
kind: AWSManagedCluster
metadata:
  name: capi-eks-quickstart
  namespace: default
spec: {}
---
apiVersion: controlplane.cluster.x-k8s.io/v1beta2
kind: AWSManagedControlPlane
metadata:
  name: capi-eks-quickstart-control-plane
  namespace: default
spec:
  region: us-west-2
  sshKeyName: tim-test-0718
  version: v1.29.0
---
apiVersion: cluster.x-k8s.io/v1beta1
kind: MachineDeployment
metadata:
  name: capi-eks-quickstart-md-0
  namespace: default
spec:
  clusterName: capi-eks-quickstart
  replicas: 3
  selector:
    matchLabels: null
  template:
    spec:
      bootstrap:
        dataSecretName: capi-eks-quickstart-kubeconfig
        configRef:
          apiVersion: bootstrap.cluster.x-k8s.io/v1beta2
          kind: EKSConfigTemplate
          name: capi-eks-quickstart-md-0
      clusterName: capi-eks-quickstart
      infrastructureRef:
        apiVersion: infrastructure.cluster.x-k8s.io/v1beta2
        kind: AWSMachineTemplate
        name: capi-eks-quickstart-md-0
      version: v1.29.0
---
apiVersion: infrastructure.cluster.x-k8s.io/v1beta2
kind: AWSMachineTemplate
metadata:
  name: capi-eks-quickstart-md-0
  namespace: default
spec:
  template:
    spec:
      ami:
        id: ami-07edf4c2ac90845dc
      iamInstanceProfile: nodes.cluster-api-provider-aws.sigs.k8s.io
      instanceType: t3.large
      sshKeyName: tim-test-0718
---
apiVersion: bootstrap.cluster.x-k8s.io/v1beta2
kind: EKSConfigTemplate
metadata:
  name: capi-eks-quickstart-md-0
  namespace: default
spec:
  template:
    spec:
      containerRuntime: containerd
      boostrapCommandOverride: |
        #!/bin/bash
        /etc/eks/bootstrap.sh capi-eks-quickstart
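
One detail that may explain why the override has no effect: the manifest above spells the field `boostrapCommandOverride`, while the EKSConfig API field is, to my knowledge, `bootstrapCommandOverride`. A misspelled key would most likely be dropped or rejected rather than applied. A minimal sketch of the same template with the corrected key follows. It also assumes that the value is meant to be only the path of the bootstrap command, with the provider appending the cluster name and other arguments when it renders the user data, so a multi-line script may not be what the field expects. Both points are worth checking against the EKSConfig API reference for the installed CAPA version.

```yaml
apiVersion: bootstrap.cluster.x-k8s.io/v1beta2
kind: EKSConfigTemplate
metadata:
  name: capi-eks-quickstart-md-0
  namespace: default
spec:
  template:
    spec:
      containerRuntime: containerd
      # Spelled "bootstrap...", not "boostrap..." as in the manifest above.
      # Assumption: the value is the command path only; the cluster name and
      # remaining arguments are appended by the provider.
      bootstrapCommandOverride: /etc/eks/bootstrap.sh
```

This only helps on an AMI that actually ships /etc/eks/bootstrap.sh, such as the Ubuntu EKS AMI used here, and not on the CAPI Ubuntu image mentioned earlier.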

Environment:

k8s-ci-robot commented 2 months ago

This issue is currently awaiting triage.

If CAPA/CAPI contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes-sigs/prow](https://github.com/kubernetes-sigs/prow/issues/new?title=Prow%20issue:) repository.