k0sproject / k0smotron
https://docs.k0smotron.io/

Worker node is not registered in kubernetes #665

Closed eromanova closed 1 week ago

eromanova commented 1 month ago

Attached logs: k0smotron.logs.txt, k0s.logs.txt

Versions

Steps to reproduce

  1. Deploy a cluster with k0s as the bootstrap provider and AWS as the infrastructure provider, using k0smotron and Cluster API (a simple test setup: 1 control plane, 1 worker). All cluster objects are provided below; a rough sketch of the commands is included after this list.
  2. Wait for the cluster to be provisioned.
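
A minimal sketch of how the objects were applied and checked (the file name ekaz-test.yaml is hypothetical; the management cluster is assumed to already run CAPI, the AWS provider, and k0smotron):

# ekaz-test.yaml is a hypothetical file containing the cluster objects listed below
kubectl apply -n hmc-system -f ekaz-test.yaml

# wait for provisioning, then inspect the CAPI objects and the workload cluster
clusterctl describe cluster ekaz-test -n hmc-system --show-conditions all
clusterctl get kubeconfig ekaz-test -n hmc-system > ekaz/kubeconfig
kubectl --kubeconfig ekaz/kubeconfig get nodes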

Expected result

The cluster is deployed successfully.

Actual result

The worker node is not registered in Kubernetes, while the control plane node is registered successfully.

eromanova@Ekaterina-Romanova-MacBook-Pro-13-inch-M1-2020- hmc % k --kubeconfig ekaz/kubeconfig get no
NAME            STATUS   ROLES           AGE   VERSION
ip-10-0-7-150   Ready    control-plane   38m   v1.30.2+k0s

eromanova@Ekaterina-Romanova-MacBook-Pro-13-inch-M1-2020- hmc % k -n hmc-system get machines
NAME                       CLUSTER     NODENAME        PROVIDERID                              PHASE         AGE   VERSION
ekaz-test-cp-0             ekaz-test   ip-10-0-7-150   aws:///us-east-2a/i-03e46480d55105180   Running       46m   v1.30.2
ekaz-test-md-hvrcs-j4hg2   ekaz-test                   aws:///us-east-2a/i-073226730b49a726e   Provisioned   46m

eromanova@Ekaterina-Romanova-MacBook-Pro-13-inch-M1-2020- hmc % clusterctl describe cluster ekaz-test -n hmc-system --show-conditions all
NAME                                                                READY  SEVERITY  REASON                       SINCE  MESSAGE
Cluster/ekaz-test                                                   True                                          40m
│           ├─ControlPlaneInitialized                               True                                          40m
│           ├─ControlPlaneReady                                     True                                          40m
│           └─InfrastructureReady                                   True                                          42m
├─ClusterInfrastructure - AWSCluster/ekaz-test                      True                                          42m
│             ├─ClusterSecurityGroupsReady                          True                                          43m
│             ├─InternetGatewayReady                                True                                          45m
│             ├─LoadBalancerReady                                   True                                          42m
│             ├─NatGatewaysReady                                    True                                          43m
│             ├─RouteTablesReady                                    True                                          43m
│             ├─SubnetsReady                                        True                                          45m
│             └─VpcReady                                            True                                          45m
├─ControlPlane - K0sControlPlane/ekaz-test-cp
│ └─Machine/ekaz-test-cp-0                                          True                                          39m
│   │           ├─BootstrapReady                                    True                                          39m
│   │           ├─InfrastructureReady                               True                                          39m
│   │           └─NodeHealthy                                       True                                          36m
│   └─BootstrapConfig - K0sControllerConfig/ekaz-test-cp-0
└─Workers
  └─MachineDeployment/ekaz-test-md                                  False  Warning   WaitingForAvailableMachines  45m    Minimum availability requires 1 replicas, current 0 available
    │           ├─Available                                         False  Warning   WaitingForAvailableMachines  45m    Minimum availability requires 1 replicas, current 0 available
    │           └─MachineSetReady                                   False  Warning   ScalingUp                    45m    Scaling up MachineSet to 1 replicas (actual 0)
    └─Machine/ekaz-test-md-hvrcs-j4hg2                              True                                          34m
      │           ├─BootstrapReady                                  True                                          34m
      │           ├─InfrastructureReady                             True                                          34m
      │           └─NodeHealthy                                     False  Warning   NodeProvisioning             34m
      └─BootstrapConfig - K0sWorkerConfig/ekaz-test-md-hvrcs-j4hg2

k0smotron logs contain the following:

2024-08-06T15:42:39Z    INFO    waiting for node to be available for machine ekaz-test-md-hvrcs-j4hg2   {"controller": "machine", "controllerGroup": "cluster.x-k8s.io", "controllerKind": "Machine", "Machine": {"name":"ekaz-test-md-hvrcs-j4hg2","namespace":"hmc-system"}, "namespace": "hmc-system", "name": "ekaz-test-md-hvrcs-j4hg2", "reconcileID": "f1b19799-640d-4b31-8571-6d2635b71577", "providerID": {"name":"ekaz-test-md-hvrcs-j4hg2","namespace":"hmc-system"}}
2024-08-06T15:42:49Z    INFO    Reconciling machine's ProviderID    {"controller": "machine", "controllerGroup": "cluster.x-k8s.io", "controllerKind": "Machine", "Machine": {"name":"ekaz-test-md-hvrcs-j4hg2","namespace":"hmc-system"}, "namespace": "hmc-system", "name": "ekaz-test-md-hvrcs-j4hg2", "reconcileID": "ccd99d84-be09-4c28-add8-e97a044dd8da", "providerID": {"name":"ekaz-test-md-hvrcs-j4hg2","namespace":"hmc-system"}}
2024-08-06T15:42:51Z    ERROR   Reconciler error    {"controller": "machine", "controllerGroup": "cluster.x-k8s.io", "controllerKind": "Machine", "Machine": {"name":"ekaz-test-md-hvrcs-j4hg2","namespace":"hmc-system"}, "namespace": "hmc-system", "name": "ekaz-test-md-hvrcs-j4hg2", "reconcileID": "ccd99d84-be09-4c28-add8-e97a044dd8da", "error": "failed to update node 'ip-10-0-12-1' with providerID: Operation cannot be fulfilled on nodes \"ip-10-0-12-1\": StorageError: invalid object, Code: 4, Key: /registry/minions/ip-10-0-12-1, ResourceVersion: 0, AdditionalErrorMsg: Precondition failed: UID in precondition: 51288d99-69b2-4e07-944f-a5588860dbb2, UID in object meta: "}
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
    /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.16.5/pkg/internal/controller/controller.go:329
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
    /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.16.5/pkg/internal/controller/controller.go:266
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
    /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.16.5/pkg/internal/controller/controller.go:227
2024-08-06T15:42:51Z    INFO    Reconciling machine's ProviderID    {"controller": "machine", "controllerGroup": "cluster.x-k8s.io", "controllerKind": "Machine", "Machine": {"name":"ekaz-test-md-hvrcs-j4hg2","namespace":"hmc-system"}, "namespace": "hmc-system", "name": "ekaz-test-md-hvrcs-j4hg2", "reconcileID": "01498519-026f-4bcd-ac97-1bae3b289fd9", "providerID": {"name":"ekaz-test-md-hvrcs-j4hg2","namespace":"hmc-system"}}
2024-08-06T15:42:51Z    INFO    waiting for node to be available for machine ekaz-test-md-hvrcs-j4hg2   {"controller": "machine", "controllerGroup": "cluster.x-k8s.io", "controllerKind": "Machine", "Machine": {"name":"ekaz-test-md-hvrcs-j4hg2","namespace":"hmc-system"}, "namespace": "hmc-system", "name": "ekaz-test-md-hvrcs-j4hg2", "reconcileID": "01498519-026f-4bcd-ac97-1bae3b289fd9", "providerID": {"name":"ekaz-test-md-hvrcs-j4hg2","namespace":"hmc-system"}}
2024-08-06T15:43:01Z    INFO    Reconciling machine's ProviderID    {"controller": "machine", "controllerGroup": "cluster.x-k8s.io", "controllerKind": "Machine", "Machine": {"name":"ekaz-test-md-hvrcs-j4hg2","namespace":"hmc-system"}, "namespace": "hmc-system", "name": "ekaz-test-md-hvrcs-j4hg2", "reconcileID": "f957bff0-ea3d-4fb6-870d-c826e8e9e198", "providerID": {"name":"ekaz-test-md-hvrcs-j4hg2","namespace":"hmc-system"}}

The k0s logs don't contain anything interesting. It looks like the kubelet successfully registered the node, but then it starts complaining about its absence:

Aug  6 15:42:41 ip-10-0-12-1 cloud-init[1246]: Cloud-init v. 22.2-0ubuntu1~22.04.1 running 'modules:final' at Tue, 06 Aug 2024 15:42:41 +0000. Up 22.42 seconds.
Aug  6 15:42:42 ip-10-0-12-1 cloud-init[1246]: Downloading k0s from URL: https://github.com/k0sproject/k0s/releases/download/v1.30.2+k0s.0/k0s-v1.30.2+k0s.0-amd64
Aug  6 15:42:43 ip-10-0-12-1 cloud-init[1246]: k0s is now executable in /usr/local/bin
...
Aug  6 15:42:46 ip-10-0-12-1 k0s[1324]: time="2024-08-06 15:42:46" level=info msg="I0806 15:42:46.310871    1364 kubelet_node_status.go:73] \"Attempting to register node\" node=\"ip-10-0-12-1\"" component=kubelet stream=stderr
Aug  6 15:42:46 ip-10-0-12-1 k0s[1324]: time="2024-08-06 15:42:46" level=info msg="I0806 15:42:46.319580    1364 kubelet_node_status.go:76] \"Successfully registered node\" node=\"ip-10-0-12-1\"" component=kubelet stream=stderr
Aug  6 15:42:46 ip-10-0-12-1 k0s[1324]: time="2024-08-06 15:42:46" level=info msg="I0806 15:42:46.323162    1364 kubelet_network_linux.go:50] \"Initialized iptables rules.\" protocol=\"IPv4\"" component=kubelet stream=stderr
Aug  6 15:42:46 ip-10-0-12-1 k0s[1324]: time="2024-08-06 15:42:46" level=info msg="I0806 15:42:46.324850    1364 kubelet_network_linux.go:50] \"Initialized iptables rules.\" protocol=\"IPv6\"" component=kubelet stream=stderr
Aug  6 15:42:46 ip-10-0-12-1 k0s[1324]: time="2024-08-06 15:42:46" level=info msg="I0806 15:42:46.324874    1364 status_manager.go:217] \"Starting to sync pod status with apiserver\"" component=kubelet stream=stderr
...
Aug  6 15:42:54 ip-10-0-12-1 k0s[1324]: time="2024-08-06 15:42:54" level=info msg="E0806 15:42:54.484051    1364 kubelet_node_status.go:462] \"Error getting the current node from lister\" err=\"node \\\"ip-10-0-12-1\\\" not found\"" component=kubelet stream=stderr
Aug  6 15:42:54 ip-10-0-12-1 k0s[1324]: time="2024-08-06 15:42:54" level=info msg="E0806 15:42:54.584557    1364 kubelet_node_status.go:462] \"Error getting the current node from lister\" err=\"node \\\"ip-10-0-12-1\\\" not found\"" component=kubelet stream=stderr
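
The registration-then-"not found" sequence above was pulled straight from the node's syslog, roughly like this (a sketch, assuming the default Ubuntu AMI where the k0s service output ends up in /var/log/syslog):

# kubelet node-status lines on the worker instance
grep kubelet_node_status /var/log/syslog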

While trying to reproduce this issue, I noticed that the node was initially present in the cluster, but after about 10 seconds it was removed from the list. The error in the k0smotron controller is probably related, judging by the sequence of events in the k0smotron logs above.
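
The disappearance is easy to observe by watching the node list via the workload kubeconfig while the worker boots (just a sketch):

kubectl --kubeconfig ekaz/kubeconfig get nodes -w
# the worker node appears briefly and is gone roughly 10 seconds later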

UPD: AWS CCM logs contain the following:

I0812 13:45:27.449688       1 node_controller.go:427] Initializing node ip-10-0-12-106 with cloud provider
E0812 13:45:27.543636       1 node_controller.go:236] error syncing 'ip-10-0-12-106': failed to get provider ID for node ip-10-0-12-106 at cloudprovider: failed to get instance ID from cloud provider: instance not found, requeuing
I0812 13:45:27.543700       1 node_controller.go:427] Initializing node ip-10-0-12-106 with cloud provider
E0812 13:45:27.609352       1 node_controller.go:236] error syncing 'ip-10-0-12-106': failed to get provider ID for node ip-10-0-12-106 at cloudprovider: failed to get instance ID from cloud provider: instance not found, requeuing
I0812 13:45:27.609382       1 node_controller.go:427] Initializing node ip-10-0-12-106 with cloud provider
E0812 13:45:27.696292       1 node_controller.go:236] error syncing 'ip-10-0-12-106': failed to get provider ID for node ip-10-0-12-106 at cloudprovider: failed to get instance ID from cloud provider: instance not found, requeuing
I0812 13:45:27.696323       1 node_controller.go:427] Initializing node ip-10-0-12-106 with cloud provider
E0812 13:45:27.865615       1 node_controller.go:236] error syncing 'ip-10-0-12-106': failed to get provider ID for node ip-10-0-12-106 at cloudprovider: failed to get instance ID from cloud provider: instance not found, requeuing
I0812 13:45:27.865644       1 node_controller.go:427] Initializing node ip-10-0-12-106 with cloud provider
E0812 13:45:27.935005       1 node_controller.go:236] error syncing 'ip-10-0-12-106': failed to get provider ID for node ip-10-0-12-106 at cloudprovider: failed to get instance ID from cloud provider: instance not found, requeuing
I0812 13:45:27.935037       1 node_controller.go:427] Initializing node ip-10-0-12-106 with cloud provider
E0812 13:45:28.026238       1 node_controller.go:236] error syncing 'ip-10-0-12-106': failed to get provider ID for node ip-10-0-12-106 at cloudprovider: failed to get instance ID from cloud provider: instance not found, requeuing
I0812 13:45:28.026272       1 node_controller.go:427] Initializing node ip-10-0-12-106 with cloud provider
E0812 13:45:28.094123       1 node_controller.go:236] error syncing 'ip-10-0-12-106': failed to get provider ID for node ip-10-0-12-106 at cloudprovider: failed to get instance ID from cloud provider: instance not found, requeuing
I0812 13:45:28.144522       1 node_lifecycle_controller.go:164] deleting node since it is no longer present in cloud provider: ip-10-0-12-106
I0812 13:45:28.146612       1 event.go:307] "Event occurred" object="ip-10-0-12-106" fieldPath="" kind="Node" apiVersion="" type="Normal" reason="DeletingNode" message="Deleting node ip-10-0-12-106 because it does not exist in the cloud provider"
I0812 13:45:28.174162       1 controller.go:695] Syncing backends for all LB services.

So it looks like the AWS CCM deleted the node: it repeatedly failed to resolve the node to an EC2 instance ("instance not found"), and its node lifecycle controller then removed the Node object because it was "no longer present in cloud provider".
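
One way to confirm this side of it (a rough sketch; the instance filter is an assumption based on the private DNS naming used here):

# while the node still exists, its spec.providerID should still be empty
# (k0smotron never manages to set it before the deletion)
kubectl --kubeconfig ekaz/kubeconfig get node ip-10-0-12-106 -o jsonpath='{.spec.providerID}'

# check whether the instance actually exists in the region the CCM queries
aws ec2 describe-instances --region us-east-2 \
  --filters "Name=private-dns-name,Values=ip-10-0-12-106.us-east-2.compute.internal"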

Please let me know if more details are needed.

My cluster objects:

---
# Source: aws-standalone-cp/templates/awscluster.yaml
apiVersion: infrastructure.cluster.x-k8s.io/v1beta2
kind: AWSCluster
metadata:
  name: ekaz-test
spec:
  region: us-east-2
  # identityRef:
    # kind: AWSClusterStaticIdentity
    # name: aws-identity-name
  controlPlaneLoadBalancer:
    healthCheckProtocol: TCP
---
# Source: aws-standalone-cp/templates/awsmachinetemplate-controlplane.yaml
apiVersion: infrastructure.cluster.x-k8s.io/v1beta2
kind: AWSMachineTemplate
metadata:
  name: ekaz-test-cp-mt
spec:
  template:
    spec:
      ami:
        id: ami-02f3416038bdb17fb
      instanceType: t3.small
      # Instance Profile created by `clusterawsadm bootstrap iam create-cloudformation-stack`
      iamInstanceProfile: control-plane.cluster-api-provider-aws.sigs.k8s.io
      cloudInit:
        # Makes CAPA use k0s bootstrap cloud-init directly and not via SSM
        # Simplifies the VPC setup as we do not need custom SSM endpoints etc.
        insecureSkipSecretsManager: true
      sshKeyName: "ekaz"
      publicIP: true
---
# Source: aws-standalone-cp/templates/awsmachinetemplate-worker.yaml
apiVersion: infrastructure.cluster.x-k8s.io/v1beta2
kind: AWSMachineTemplate
metadata:
  name: ekaz-test-worker-mt
spec:
  template:
    spec:
      ami:
        id: ami-02f3416038bdb17fb
      instanceType: t3.small
      # Instance Profile created by `clusterawsadm bootstrap iam create-cloudformation-stack`
      iamInstanceProfile: nodes.cluster-api-provider-aws.sigs.k8s.io
      cloudInit:
        # Makes CAPA use k0s bootstrap cloud-init directly and not via SSM
        # Simplifies the VPC setup as we do not need custom SSM endpoints etc.
        insecureSkipSecretsManager: true
      sshKeyName: "ekaz"
      publicIP: true
---
# Source: aws-standalone-cp/templates/cluster.yaml
apiVersion: cluster.x-k8s.io/v1beta1
kind: Cluster
metadata:
  name: ekaz-test
spec:
  clusterNetwork:
    pods:
      cidrBlocks:
      - 10.244.0.0/16
    services:
      cidrBlocks:
      - 10.96.0.0/12
  controlPlaneRef:
    apiVersion: controlplane.cluster.x-k8s.io/v1beta1
    kind: K0sControlPlane
    name: ekaz-test-cp
  infrastructureRef:
    apiVersion: infrastructure.cluster.x-k8s.io/v1beta2
    kind: AWSCluster
    name: ekaz-test
---
# Source: aws-standalone-cp/templates/k0scontrolplane.yaml
apiVersion: controlplane.cluster.x-k8s.io/v1beta1
kind: K0sControlPlane
metadata:
  name: ekaz-test-cp
spec:
  replicas: 1
  version: v1.30.2+k0s.0
  k0sConfigSpec:
    args:
      - --enable-worker
      - --enable-cloud-provider
      - --kubelet-extra-args="--cloud-provider=external"
      - --disable-components=konnectivity-server
    k0s:
      apiVersion: k0s.k0sproject.io/v1beta1
      kind: ClusterConfig
      metadata:
        name: ekaz-test-k0sconfig
        namespace: default
      spec:
        api:
          extraArgs:
            anonymous-auth: "true"
        network:
          provider: calico
          calico:
            mode: ipip
        extensions:
          helm:
            repositories:
              - name: aws-cloud-controller-manager
                url: https://kubernetes.github.io/cloud-provider-aws
              - name: aws-ebs-csi-driver
                url: https://kubernetes-sigs.github.io/aws-ebs-csi-driver
            charts:
              - name: aws-cloud-controller-manager
                namespace: kube-system
                chartname: aws-cloud-controller-manager/aws-cloud-controller-manager
                version: "0.0.8"
                values: |
                  nodeSelector:
                    node-role.kubernetes.io/control-plane: "true"
                  args:
                    - --v=2
                    - --cloud-provider=aws
                    - --cluster-cidr=10.244.0.0/16
                    - --allocate-node-cidrs=true
                    - --cluster-name=ekaz-test
              - name: aws-ebs-csi-driver
                namespace: kube-system
                chartname: aws-ebs-csi-driver/aws-ebs-csi-driver
                version: 2.33.0
                values: |
                  defaultStorageClass:
                    enabled: true
                  node:
                    kubeletPath: /var/lib/k0s/kubelet
  machineTemplate:
    infrastructureRef:
      apiVersion: infrastructure.cluster.x-k8s.io/v1beta2
      kind: AWSMachineTemplate
      name: ekaz-test-cp-mt
      namespace: default
---
# Source: aws-standalone-cp/templates/k0sworkerconfigtemplate.yaml
apiVersion: bootstrap.cluster.x-k8s.io/v1beta1
kind: K0sWorkerConfigTemplate
metadata:
  name: ekaz-test-machine-config
spec:
  template:
    spec:
      version: v1.30.2+k0s.0
      args:
      - --enable-cloud-provider
      - --kubelet-extra-args="--cloud-provider=external"
---
# Source: aws-standalone-cp/templates/machinedeployment.yaml
apiVersion: cluster.x-k8s.io/v1beta1
kind: MachineDeployment
metadata:
  name: ekaz-test-md
spec:
  clusterName: ekaz-test
  replicas: 1
  selector:
    matchLabels:
      cluster.x-k8s.io/cluster-name: ekaz-test
  template:
    metadata:
      labels:
        cluster.x-k8s.io/cluster-name: ekaz-test
    spec:
      clusterName: ekaz-test
      bootstrap:
        configRef:
          apiVersion: bootstrap.cluster.x-k8s.io/v1beta1
          kind: K0sWorkerConfigTemplate
          name: ekaz-test-machine-config
      infrastructureRef:
        apiVersion: infrastructure.cluster.x-k8s.io/v1beta2
        kind: AWSMachineTemplate
        name: ekaz-test-worker-mt
eromanova commented 1 month ago

UPD: Initially, we used a rather old AWS CCM image (v1.27.1). The issue is no longer present after bumping the AWS CCM to v1.30.3. I think it would be good to mention in the docs the minimal AWS CCM version that k0smotron v1.0.2 requires.
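
A quick way to check which CCM image the workload cluster actually runs before and after the bump (a sketch; the DaemonSet name aws-cloud-controller-manager is an assumption based on the chart's defaults):

kubectl --kubeconfig ekaz/kubeconfig -n kube-system get daemonset aws-cloud-controller-manager \
  -o jsonpath='{.spec.template.spec.containers[0].image}'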