kubernetes / kops

Kubernetes Operations (kOps) - Production Grade k8s Installation, Upgrades and Management
https://kops.sigs.k8s.io/
Apache License 2.0

need help on getting my node join my cluster using kops #16665

Open maxime202400 opened 2 months ago

maxime202400 commented 2 months ago

I recently wanted to upgrade my cluster from version 1.28.10, and during the upgrade some of the nodes are not joining the cluster.

This is the error that I am seeing when I run kops validate:

Validating cluster

INSTANCE GROUPS
NAME                    ROLE            MACHINETYPE     MIN     MAX     SUBNETS
master-us-east-2a       ControlPlane    m7a.large       1       1       us-east-2a
master-us-east-2b       ControlPlane    m7a.large       1       1       us-east-2b
master-us-east-2c       ControlPlane    m7a.large       1       1       us-east-2c
nodes                   Node            m6a.large       3       18      us-east-2a,us-east-2b,us-east-2c

NODE STATUS
NAME    ROLE    READY
        node    True
        node    True
        node    True

VALIDATION ERRORS
KIND            NAME    MESSAGE
Machine                 machine "" has not yet joined cluster
Machine                 machine "" has not yet joined cluster
Machine                 machine "" has not yet joined cluster

Validation Failed
Error: validation failed: cluster not yet healthy

And when I check the kubelet log on the problematic node, I see:

Failed to contact API server when waiting for CSINode publishing: Get "https://127.0.0.1/apis/storage.k8s.io/v1/csinodes/i-": dial tcp 127.0.0.1:443: connect: connection refused
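Since kubelet is dialing https://127.0.0.1:443 and getting connection refused, the API endpoint it expects on that node is not listening. A few generic kubelet-side checks can help narrow this down; the commands below are only a sketch assuming default kOps paths (adjust for your setup):

sudo systemctl status kubelet                    # is kubelet itself running?
sudo grep server /var/lib/kubelet/kubeconfig     # which API endpoint is kubelet configured to dial?
sudo ss -tlnp | grep ':443'                      # is anything actually listening on that port locally?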

kundan2707 commented 2 months ago

/kind support

SohamChakraborty commented 1 month ago

Hi, I am facing this exact problem.

kops version:

Client version: 1.29.2 (git-v1.29.2)

k8s version:

1.24.16

The error that I am seeing in /var/log/syslog of the master node is this:

Aug  8 19:26:20 ip-a-b-c-d kubelet[3103]: I0808 19:26:20.157295    3103 csi_plugin.go:1021] Failed to contact API server when waiting for CSINode publishing: Get "https://127.0.0.1/apis/storage.k8s.io/v1/csinodes/i-xxxx": dial tcp 127.0.0.1:443: connect: connection refused
Aug  8 19:26:20 ip-a-b-c-d kubelet[3103]: E0808 19:26:20.220919    3103 kubelet.go:2427] "Error getting node" err="node \"i-xxxx\" not found"
Aug  8 19:26:20 ip-a-b-c-d kubelet[3103]: I0808 19:26:20.228123    3103 kubelet_node_status.go:352] "Setting node annotation to enable volume controller attach/detach"
Aug  8 19:26:20 ip-a-b-c-d kubelet[3103]: I0808 19:26:20.234155    3103 kubelet_node_status.go:563] "Recording event message for node" node="i-xxxx" event="NodeHasSufficientMemory"
Aug  8 19:26:20 ip-a-b-c-d kubelet[3103]: I0808 19:26:20.234212    3103 kubelet_node_status.go:563] "Recording event message for node" node="i-xxxx" event="NodeHasNoDiskPressure"
Aug  8 19:26:20 ip-a-b-c-d kubelet[3103]: I0808 19:26:20.234233    3103 kubelet_node_status.go:563] "Recording event message for node" node="i-xxxx" event="NodeHasSufficientPID"
Aug  8 19:26:20 ip-a-b-c-d kubelet[3103]: I0808 19:26:20.234266    3103 kubelet_node_status.go:70] "Attempting to register node" node="i-xxxx"
Aug  8 19:26:20 ip-a-b-c-d kubelet[3103]: E0808 19:26:20.235047    3103 kubelet_node_status.go:92] "Unable to register node with API server" err="Post \"https://127.0.0.1/api/v1/nodes\": dial tcp 127.0.0.1:443: connect: connection refused" node="i-xxxx"
Aug  8 19:26:20 ip-a-b-c-d kubelet[3103]: E0808 19:26:20.321805    3103 kubelet.go:2427] "Error getting node" err="node \"i-xxxx\" not found"
Aug  8 19:26:20 ip-a-b-c-d kubelet[3103]: E0808 19:26:20.422894    3103 kubelet.go:2427] "Error getting node" err="node \"i-xxxx\" not found"
Aug  8 19:26:20 ip-a-b-c-d kubelet[3103]: E0808 19:26:20.523918    3103 kubelet.go:2427] "Error getting node" err="node \"i-xxxx\" not found"
Aug  8 19:26:20 ip-a-b-c-d kubelet[3103]: E0808 19:26:20.624965    3103 kubelet.go:2427] "Error getting node" err="node \"i-xxxx\" not found"
Aug  8 19:26:20 ip-a-b-c-d kubelet[3103]: E0808 19:26:20.725940    3103 kubelet.go:2427] "Error getting node" err="node \"i-xxxx\" not found"
Aug  8 19:26:20 ip-a-b-c-d kubelet[3103]: E0808 19:26:20.826678    3103 kubelet.go:2427] "Error getting node" err="node \"i-xxxx\" not found"
Aug  8 19:26:20 ip-a-b-c-d kubelet[3103]: E0808 19:26:20.927813    3103 kubelet.go:2427] "Error getting node" err="node \"i-xxxx\" not found"
Aug  8 19:26:21 ip-a-b-c-d kubelet[3103]: E0808 19:26:21.028995    3103 kubelet.go:2427] "Error getting node" err="node \"i-xxxx\" not found"

I am using kops version 1.29.2 because I need to use the wildcard namespace feature for IRSA.

The cluster spec is here:

apiVersion: kops.k8s.io/v1alpha2
kind: Cluster
metadata:
  creationTimestamp: null
  generation: 1
  name: k8s-124.foo.bar.com
spec:
  additionalPolicies:
    master: |
      [
        {
          "Effect": "Allow",
          "Action": ["ec2:ModifyInstanceAttribute"],
          "Resource": ["*"]
        }
      ]
  api:
    loadBalancer:
      class: Network
      type: Public
  authorization:
    rbac: {}
  certManager:
    enabled: true
  channel: stable
  cloudLabels:
    App: k8s-124
    Env: foo
    Region: eu-west-1
  cloudProvider: aws
  clusterAutoscaler:
    awsUseStaticInstanceList: false
    balanceSimilarNodeGroups: false
    cpuRequest: 100m
    enabled: true
    expander: least-waste
    memoryRequest: 300Mi
    newPodScaleUpDelay: 0s
    scaleDownDelayAfterAdd: 10m0s
    scaleDownUnneededTime: 5m0s
    scaleDownUnreadyTime: 10m0s
    scaleDownUtilizationThreshold: "0.6"
    skipNodesWithLocalStorage: true
    skipNodesWithSystemPods: true
  configBase: s3://my-bucket/prefix
  dnsZone: xxxx
  etcdClusters:
  - cpuRequest: 200m
    etcdMembers:
    - encryptedVolume: true
      instanceGroup: master-eu-west-1a
      name: a
    memoryRequest: 100Mi
    name: main
  - cpuRequest: 100m
    etcdMembers:
    - encryptedVolume: true
      instanceGroup: master-eu-west-1a
      name: a
    memoryRequest: 100Mi
    name: events
  externalPolicies:
    master:
    - arn:aws:iam::aws:policy/AmazonEC2ReadOnlyAccess
    node:
    - arn:aws:iam::aws:policy/AmazonEC2ReadOnlyAccess
  fileAssets:
  - content: |
      apiVersion: audit.k8s.io/v1
      kind: Policy
      rules:
      - level: Metadata
    name: audit-policy-config
    path: /srv/kubernetes/kube-apiserver/audit/policy-config.yaml
    roles:
    - Master
  - content: |
      apiVersion: v1
      kind: Config
      clusters:
      - name: bar
        cluster:
          server: https://audit-logs-receiver-endpoint/some-token
      contexts:
      - context:
          cluster: bar
          user: ""
        name: default-context
      current-context: default-context
      preferences: {}
      users: []
    name: audit-webhook-config
    path: /var/log/audit/webhook-config.yaml
    roles:
    - Master
  iam:
    allowContainerRegistry: true
    legacy: false
    serviceAccountExternalPermissions:
    - aws:
        inlinePolicy: |-
          [
            {
              "Effect": "Allow",
              "Action": [
                "S3:*"
              ],
              "Resource": [
                "*"
              ]
            }
          ]
      name: s3perm
      namespace: '*'
  kubeAPIServer:
    auditLogMaxAge: 10
    auditLogMaxBackups: 1
    auditLogMaxSize: 100
    auditLogPath: /var/log/kube-apiserver-audit.log
    auditPolicyFile: /srv/kubernetes/kube-apiserver/audit/policy-config.yaml
    auditWebhookBatchMaxWait: 5s
    auditWebhookConfigFile: /srv/kubernetes/kube-apiserver/audit/webhook-config.yaml
  kubeDNS:
    provider: CoreDNS
  kubelet:
    anonymousAuth: false
    authenticationTokenWebhook: true
    authorizationMode: Webhook
    maxPods: 150
    shutdownGracePeriod: 1m0s
    shutdownGracePeriodCriticalPods: 30s
  kubernetesApiAccess:
  - 0.0.0.0/0
  kubernetesVersion: 1.24.16
  masterPublicName: api.k8s-124.foo.bar.com
  networkCIDR: 10.8.0.0/16
  networkID: vpc-xxxx
  networking:
    cilium:
      hubble:
        enabled: true
  nonMasqueradeCIDR: 100.64.0.0/10
  podIdentityWebhook:
    enabled: true
  rollingUpdate:
    maxSurge: 4
  serviceAccountIssuerDiscovery:
    discoveryStore: s3://oidc-bucket/k8s-1-24-2
    enableAWSOIDCProvider: true
  sshAccess:
  - 0.0.0.0/0
  sshKeyName: kops
  subnets:
  - cidr: 1.2.3.4/19
    id: subnet-xx
    name: eu-west-1a
    type: Private
    zone: eu-west-1a
  - cidr: 4.3.2.1/22
    id: subnet-yy
    name: utility-eu-west-1a
    type: Utility
    zone: eu-west-1a
  topology:
    dns:
      type: Private

---

apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: "2024-08-08T19:50:22Z"
  labels:
    kops.k8s.io/cluster: k8s-124.foo.bar.com
  name: master-eu-west-1a
spec:
  image: ubuntu/ubuntu/images/hvm-ssd/ubuntu-jammy-22.04-amd64-server-20240411
  instanceMetadata:
    httpPutResponseHopLimit: 2
    httpTokens: required
  machineType: t3a.medium
  maxSize: 1
  minSize: 1
  nodeLabels:
    kops.k8s.io/instancegroup: master-eu-west-1a
  role: Master
  rootVolumeEncryption: true
  rootVolumeSize: 30
  subnets:
  - eu-west-1a

---

apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: "2024-08-08T19:50:23Z"
  labels:
    kops.k8s.io/cluster: k8s-124.foo.bar.com
  name: nodes-eu-west-1a
spec:
  additionalUserData:
  - content: |
      apt-get update
      apt-get install -y qemu-user-static
    name: 0prereqs.sh
    type: text/x-shellscript
  cloudLabels:
    k8s.io/cluster-autoscaler/enabled: ""
    k8s.io/cluster-autoscaler/k8s-124.foo.bar.com: ""
  image: ubuntu/ubuntu/images/hvm-ssd/ubuntu-jammy-22.04-amd64-server-20240411
  instanceMetadata:
    httpPutResponseHopLimit: 2
    httpTokens: required
  machineType: t3a.medium
  maxSize: 1
  minSize: 1
  nodeLabels:
    kops.k8s.io/instancegroup: nodes-eu-west-1a
  role: Node
  rootVolumeEncryption: true
  rootVolumeSize: 200
  subnets:
  - eu-west-1a

hakman commented 1 month ago

@SohamChakraborty could you check the kube-apiserver.log file for hints on the issue?
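On a kOps control-plane node that log can usually be read directly; roughly (a sketch, assuming the default log location and a containerd runtime):

sudo tail -n 200 /var/log/kube-apiserver.log        # default kOps location on control-plane nodes
# if the log is empty or missing, check whether the static pod started at all:
sudo crictl ps -a | grep kube-apiserver
sudo crictl logs <kube-apiserver-container-id>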

SohamChakraborty commented 3 weeks ago

Hi @hakman, I have identified my issue. The API server was having some sort of problem with the audit policy and audit webhook config files.
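For anyone comparing against the spec above: the audit-webhook-config fileAsset is written to /var/log/audit/webhook-config.yaml, while kubeAPIServer.auditWebhookConfigFile points at /srv/kubernetes/kube-apiserver/audit/webhook-config.yaml. If that mismatch is what kept kube-apiserver from starting, aligning the fileAsset path with the flag would look roughly like this (a sketch, not the confirmed fix from this thread):

  fileAssets:
  - content: |
      # ...audit webhook kubeconfig content as shown above...
    name: audit-webhook-config
    path: /srv/kubernetes/kube-apiserver/audit/webhook-config.yaml  # keep in sync with auditWebhookConfigFile
    roles:
    - Master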