outscale / cluster-api-provider-outscale

BSD 3-Clause "New" or "Revised" License

[Bug]: impossible update to 1.23.8 #305

Closed pierreozoux closed 8 months ago

pierreozoux commented 9 months ago

What happened

I tried to upgrade from 1.22 to 1.23.8.

I got this error:

cannot upgrade to a Kubernetes/kubeadm version which is using the old default registry. Please use a newer Kubernetes patch release which is using the new default registry > 1.23.15

Issue is described here: https://github.com/kubernetes/kubeadm/issues/2671
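
For context, you can check which default registry a given kubeadm release would pull from by listing its images (a hedged example, not taken from the report; per the error above, 1.23 patch releases older than 1.23.15 still point at the old registry):

kubeadm config images list --kubernetes-version v1.23.8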

Steps to reproduce

Use the provided omi, which is 1.23.8, to upgrade from 1.22 to 1.23.

Expected to happen

I should be able to upgrade.

Add anything

I tried to build it myself without success.

https://github.com/outscale/cluster-api-provider-outscale/issues/304

If you could create an omi for me in 1.23.17, that would be amazing :)

Thanks for your help!

cluster-api output


Environment

- Kubernetes version (use `kubectl version`):
- OS (e.g. from `/etc/os-release`):
- Kernel (e.g. `uname -a`):
- cluster-api-provider-outscale version:
- cluster-api version: 
- Install tools:
- Kubernetes Distribution:
- Kubernetes Distribution version:
ghost commented 9 months ago

Hi @pierreozoux ,

A new image ubuntu-2204-2204-kubernetes-v1.23.17-2024-01-05 is now available on cloudgouv-eu-west-1.

outscale@ip-10-9-39-194:~/test-caposc/cluster-api-provider-outscale/example$ kubectl get machine -A
NAMESPACE   NAME                                CLUSTER          NODENAME                                             PROVIDERID                               PHASE      AGE     VERSION
default     cluster-api-control-plane-76dwz     cluster-api      ip-10-0-4-74.cloudgouv-eu-west-1.compute.internal    aws:///cloudgouv-eu-west-1a/i-a2aa852e   Running    8m54s   v1.23.17
default     cluster-api-md-0-5ff5c57d5d-9qqn2   cluster-api      ip-10-0-3-173.cloudgouv-eu-west-1.compute.internal   aws:///cloudgouv-eu-west-1a/i-a80c0037   Running    9m7s    v1.23.17

apiVersion: cluster.x-k8s.io/v1beta1
kind: Cluster
metadata:
  name: cluster-api
  namespace: default
  labels:
    cni: "cluster-api-crs-cni"
    ccm: "cluster-api-crs-ccm"
spec:
  clusterNetwork:
    pods:
      cidrBlocks: ["10.42.0.0/16"]
  infrastructureRef:
    apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
    kind: OscCluster
    name: cluster-api
    namespace: default
  controlPlaneRef:
    kind: KubeadmControlPlane
    apiVersion: controlplane.cluster.x-k8s.io/v1beta1
    name: "cluster-api-control-plane"
    namespace: default
---
apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
kind: OscCluster
metadata:
  name: cluster-api
  namespace: default
spec:
  network:
    clusterName: cluster-api
    subregionName: cloudgouv-eu-west-1a
    loadBalancer:
      loadbalancername: osc-k8s
      loadbalancertype: internet-facing
      subnetname: cluster-api-subnet-public
      securitygroupname: cluster-api-securitygroup-lb
      clusterName: cluster-api
    net:
      name: cluster-api-net
      clusterName: cluster-api
      ipRange: "10.0.0.0/16"
    subnets:
      - name: cluster-api-subnet-kcp
        ipSubnetRange: "10.0.4.0/24"
      - name: cluster-api-subnet-kw
        ipSubnetRange: "10.0.3.0/24"
      - name: cluster-api-subnet-public
        ipSubnetRange: "10.0.2.0/24"
    publicIps:
      - name: cluster-api-publicip-nat
    internetService:
      clusterName: cluster-api
      name: cluster-api-internetservice
    natService:
      clusterName: cluster-api
      name: cluster-api-natservice
      publicipname: cluster-api-publicip-nat
      subnetname: cluster-api-subnet-public
    bastion:
      clusterName: cluster-api
      enable: false
    routeTables:
      - name: cluster-api-routetable-kw
        subnets:
          - cluster-api-subnet-kw
        routes:
          - name: cluster-api-routes-kw
            targetName: cluster-api-natservice
            targetType: nat
            destination: "0.0.0.0/0"
      - name: cluster-api-routetable-kcp
        subnets:
          - cluster-api-subnet-kcp
        routes:
          - name: cluster-api-routes-kcp
            targetName: cluster-api-natservice
            targetType: nat
            destination: "0.0.0.0/0"
      - name: cluster-api-routetable-public
        subnets:
          - cluster-api-subnet-public
        routes:
          - name: cluster-api-routes-public
            targetName: cluster-api-internetservice
            targetType: gateway
            destination: "0.0.0.0/0"
    securityGroups:
      - name: cluster-api-securitygroups-kw
        description: Security Group with cluster-api   
        securityGroupRules:
          - name: cluster-api-securitygrouprule-api-kubelet-kw
            flow: Inbound
            ipProtocol: tcp
# IpRange to authorize access to kubernetes endpoints (kube-apiserver), you must keep it and change it with a CIDR that best suits with your environment.
            ipRange: "10.0.3.0/24"
            fromPortRange: 10250
            toPortRange: 10250
          - name: cluster-api-securitygrouprule-api-kubelet-kcp
            flow: Inbound
            ipProtocol: tcp
# IpRange to authorize access to kubernetes endpoints (kube-apiserver), you must keep it and change it with a CIDR that best suits with your environment.
            ipRange: "10.0.4.0/24"
            fromPortRange: 10250
            toPortRange: 10250
          - name: cluster-api-securitygrouprule-kcp-nodeip-kw
            flow: Inbound
            ipProtocol: tcp
# IpRange to authorize access to kubernetes endpoints (kube-apiserver), you must keep it and change it with a CIDR that best suits with your environment.
            ipRange: "10.0.3.0/24"
            fromPortRange: 30000
            toPortRange: 32767
          - name: cluster-api-securitygrouprule-kcp-nodeip-kcp
            flow: Inbound
            ipProtocol: tcp
# IpRange to authorize access to kubernetes endpoints (kube-apiserver), you must keep it and change it with a CIDR that best suits with your environment.
            ipRange: "10.0.4.0/24"
            fromPortRange: 30000
            toPortRange: 32767
          - name: cluster-api-securitygrouprule-kw-bgp
            flow: Inbound
            ipProtocol: tcp
            ipRange: "10.0.0.0/16"
            fromPortRange: 179
            toPortRange: 179
      - name: cluster-api-securitygroups-kcp
        description: Security Group with cluster-api
        securityGroupRules:
          - name: cluster-api-securitygrouprule-api-kw
            flow: Inbound
            ipProtocol: tcp
# IpRange to authorize access to kubernetes endpoints (kube-apiserver), you must keep it and change it with a CIDR that best suits with your environment.
            ipRange: "10.0.3.0/24"
            fromPortRange: 6443
            toPortRange: 6443
          - name: cluster-api-securitygrouprule-api-kcp
            flow: Inbound
            ipProtocol: tcp
# IpRange to authorize access to kubernetes endpoints (kube-apiserver), you must keep it and change it with a CIDR that best suits with your environment.
            ipRange: "10.0.4.0/24"
            fromPortRange: 6443
            toPortRange: 6443
          - name: cluster-api-securitygrouprule-etcd
            flow: Inbound
            ipProtocol: tcp
# IpRange to authorize access to kubernetes endpoints (kube-apiserver), you must keep it and change it with a CIDR that best suits with your environment.
            ipRange: "10.0.4.0/24"
            fromPortRange: 2378
            toPortRange: 2379
          - name: cluster-api-securitygrouprule-kubelet-kcp
            flow: Inbound
            ipProtocol: tcp
# IpRange to authorize access to kubernetes endpoints (kube-apiserver), you must keep it and change it with a CIDR that best suits with your environment.
            ipRange: "10.0.4.0/24"
            fromPortRange: 10250
            toPortRange: 10252
          - name: cluster-api-securitygrouprule-kcp-bgp
            flow: Inbound
            ipProtocol: tcp
            ipRange: "10.0.0.0/16"
            fromPortRange: 179
            toPortRange: 179
          - name: cluster-api-securitygrouprule-kw-nodeip-kw
            flow: Inbound
            ipProtocol: tcp
# IpRange to authorize access to kubernetes endpoints (kube-apiserver), you must keep it and change it with a CIDR that best suits with your environment.
            ipRange: "10.0.3.0/24"
            fromPortRange: 30000
            toPortRange: 32767
          - name: cluster-api-securitygrouprule-kw-nodeip-kcp
            flow: Inbound
            ipProtocol: tcp
# IpRange to authorize access to kubernetes endpoints (kube-apiserver), you must keep it and change it with a CIDR that best suits with your environment.
            ipRange: "10.0.4.0/24"
            fromPortRange: 30000
            toPortRange: 32767
      - name: cluster-api-securitygroup-lb
        description: Security Group lb with cluster-api
        securityGroupRules:
          - name: cluste-api-securitygrouprule-lb
            flow: Inbound
            ipProtocol: tcp
# IpRange to authorize access to kubernetes endpoints (kube-apiserver), you must keep it and change it with a CIDR that best suits with your environment.
            ipRange: "0.0.0.0/0"
            fromPortRange: 6443
            toPortRange: 6443
      - name: cluster-api-securitygroups-node
        description: Security Group node with cluster-api
        tag: OscK8sMainSG
        securityGroupRules:
          - name: cluster-api-securitygrouprule-calico-vxlan
            flow: Inbound
            ipProtocol: udp
            ipRange: "10.0.0.0/16"
            fromPortRange: 4789
            toPortRange: 4789
          - name: cluster-api-securitygrouprule-calico-typha
            flow: Inbound
            ipProtocol: udp
            ipRange: "10.0.0.0/16"
            fromPortRange: 5473
            toPortRange: 5473
          - name: cluster-api-securitygrouprule-calico-wireguard
            flow: Inbound
            ipProtocol: udp
            ipRange: "10.0.0.0/16"
            fromPortRange: 51820
            toPortRange: 51820
          - name: cluster-api-securitygrouprule-calico-wireguard-ipv6
            flow: Inbound
            ipProtocol: udp
            ipRange: "10.0.0.0/16"
            fromPortRange: 51821
            toPortRange: 51821
          - name: cluster-api-securitygrouprule-flannel
            flow: Inbound
            ipProtocol: udp
            ipRange: "10.0.0.0/16"
            fromPortRange: 4789
            toPortRange: 4789
          - name: cluster-api-securitygrouperule-flannel-udp
            flow: Inbound
            ipProtocol: udp
            ipRange: "10.0.0.0/16"
            fromPortRange: 8285
            toPortRange: 8285
          - name: cluster-api-securitygroup-flannel-vxlan
            flow: Inbound
            ipProtocol: udp
            ipRange: "10.0.0.0/16"
            fromPortRange: 8472
            toPortRange: 8472
---
apiVersion: cluster.x-k8s.io/v1beta1
kind: MachineDeployment
metadata:
  name: "cluster-api-md-0"
  namespace: default
spec:
  clusterName: "cluster-api"
  replicas: 1
  selector:
    matchLabels:
  template:
    spec:
      clusterName: "cluster-api"
      version: "1.23.17"
      bootstrap:
        configRef:
          name: "cluster-api-md-0"
          apiVersion: bootstrap.cluster.x-k8s.io/v1beta1
          kind: KubeadmConfigTemplate
          namespace: default
      infrastructureRef:
        name: "cluster-api-md-0"
        apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
        kind: OscMachineTemplate
        namespace: default
---
apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
kind: OscMachineTemplate
metadata:
  name: "cluster-api-md-0"
  namespace: default
spec:
  template:
    spec:
      node:
        clusterName: cluster-api
        image:
          name: ubuntu-2204-2204-kubernetes-v1.23.17-2024-01-05
        keypair:
          name: cluster-api
          deleteKeypair: false
        vm:
          clusterName: cluster-api
          name: cluster-api-vm-kw
          keypairName: cluster-api
          deviceName: /dev/sda1
          rootDisk:
            rootDiskSize: 30
            rootDiskIops: 1500
            rootDiskType: gp2
          subnetName: cluster-api-subnet-kw
          subregionName: cloudgouv-eu-west-1a
          securityGroupNames:
            - name: cluster-api-securitygroups-kw
            - name: cluster-api-securitygroups-node
          vmType: "tinav6.c4r8p2"
---
apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
kind: OscMachineTemplate
metadata:
  name: "cluster-api-control-plane"
  namespace: default
spec:
  template:
    spec:
      node:
        clusterName: cluster-api
        image:
         name: ubuntu-2204-2204-kubernetes-v1.23.17-2024-01-05
        keypair:
          name: cluster-api
          deleteKeypair: false
        vm:
          clusterName: cluster-api
          name: cluster-api-vm-kcp
          keypairName: cluster-api
          rootDisk:
            rootDiskSize: 30
            rootDiskIops: 1500
            rootDiskType: gp2
          deviceName: /dev/sda1
          subregionName: cloudgouv-eu-west-1a
          subnetName: cluster-api-subnet-kcp
          role: controlplane
          loadBalancerName: osc-k8s
          securityGroupNames:
            - name: cluster-api-securitygroups-kcp
            - name: cluster-api-securitygroups-node
          vmType: "tinav6.c4r8p1"
---
apiVersion: bootstrap.cluster.x-k8s.io/v1beta1
kind: KubeadmConfigTemplate
metadata:
  name: "cluster-api-md-0"
  namespace: default
spec:
  template:
    spec:
      files:
      - content: |
          #!/bin/sh

          curl https://github.com/opencontainers/runc/releases/download/v1.1.1/runc.amd64 -Lo /tmp/runc.amd64
          chmod +x /tmp/runc.amd64
          cp -f /tmp/runc.amd64 /usr/local/sbin/runc
        owner: root:root
        path: /tmp/set_runc.sh
        permissions: "0744"
      joinConfiguration:
        nodeRegistration:
          name: "{{ ds.meta_data.local_hostname }}"
          kubeletExtraArgs:
            cloud-provider: external
            provider-id: aws:///'{{ ds.meta_data.placement.availability_zone }}'/'{{ ds.meta_data.instance_id }}'
      preKubeadmCommands:
        - sh /tmp/set_runc.sh
---
kind: KubeadmControlPlane
apiVersion: controlplane.cluster.x-k8s.io/v1beta1
metadata:
  name: "cluster-api-control-plane"
spec:
  replicas: 1
  machineTemplate:
    infrastructureRef:
      kind: OscMachineTemplate
      apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
      name: "cluster-api-control-plane"
      namespace: default
  kubeadmConfigSpec:
    initConfiguration:
      nodeRegistration:
        kubeletExtraArgs:
          cloud-provider: external
          provider-id: aws:///'{{ ds.meta_data.placement.availability_zone }}'/'{{ ds.meta_data.instance_id }}'
        name: '{{ ds.meta_data.local_hostname }}'
    files:
    - content: |
        #!/bin/sh
        curl https://github.com/opencontainers/runc/releases/download/v1.1.1/runc.amd64 -Lo /tmp/runc.amd64
        chmod +x /tmp/runc.amd64    
        cp -f /tmp/runc.amd64 /usr/local/sbin/runc
      owner: root:root
      path: /tmp/set_runc.sh
      permissions: "0744"
    joinConfiguration:
      nodeRegistration:
        kubeletExtraArgs:
          cloud-provider: external
          provider-id: aws:///'{{ ds.meta_data.placement.availability_zone }}'/'{{ ds.meta_data.instance_id }}'
    preKubeadmCommands:
      - sh /tmp/set_runc.sh
  version: "1.23.17"
ghost commented 9 months ago

A new image ubuntu-2204-2204-kubernetes-v1.24.16-2024-01-05 is now available on cloud-gouv.

ghost commented 8 months ago

New images ubuntu-2204-2204-kubernetes-v1.28.5-2024-01-08, ubuntu-2204-2204-kubernetes-v1.27.9-2024-01-08, ubuntu-2204-2204-kubernetes-v1.26.12-2024-01-08, ubuntu-2204-2204-kubernetes-v1.25.16-2024-01-08 are now available on cloud-gouv

pierreozoux commented 8 months ago

Thanks a lot, I really appreciate your help!

Currently waiting to test; waiting for capacity in cloudgouv :)

pierreozoux commented 8 months ago

Ok, I just tried to upgrade from 1.22 to 1.23.17 with your new image, and I got this log in the UI console output of the VM:

[  253.407041] cloud-init[953]: [2024-01-09 10:08:08] [preflight] Running pre-flight checks before initializing the new control plane instance
[  253.408232] cloud-init[953]: [2024-01-09 10:08:08] [preflight] Pulling images required for setting up a Kubernetes cluster
[  253.409308] cloud-init[953]: [2024-01-09 10:08:08] [preflight] This might take a minute or two, depending on the speed of your internet connection
[  253.410544] cloud-init[953]: [2024-01-09 10:08:08] [preflight] You can also perform this action in beforehand using 'kubeadm config images pull'
[  253.411766] cloud-init[953]: [2024-01-09 10:09:53] error execution phase preflight: [preflight] Some fatal errors occurred:
[  253.412866] cloud-init[953]: [2024-01-09 10:09:53]   [ERROR ImagePull]: failed to pull image k8s.gcr.io/coredns:v1.8.6: output: E0109 10:09:53.627558    1136 remote_image.go:238] "PullImage from image service failed" err="rpc error: code = NotFound desc = failed to pull and unpack image \"k8s.gcr.io/coredns:v1.8.6\": failed to resolve reference \"k8s.gcr.io/coredns:v1.8.6\": k8s.gcr.io/coredns:v1.8.6: not found" image="k8s.gcr.io/coredns:v1.8.6"
[  253.416342] cloud-init[953]: [2024-01-09 10:09:53] time="2024-01-09T10:09:53Z" level=fatal msg="pulling image: rpc error: code = NotFound desc = failed to pull and unpack image \"k8s.gcr.io/coredns:v1.8.6\": failed to resolve reference \"k8s.gcr.io/coredns:v1.8.6\": k8s.gcr.io/coredns:v1.8.6: not found"
[  253.418727] cloud-init[953]: [2024-01-09 10:09:53] , error: exit status 1
[  253.419421] cloud-init[953]: [2024-01-09 10:09:53] [preflight] If you know what you are doing, you can make a check non-fatal with `--ignore-preflight-errors=...`
[  253.420780] cloud-init[953]: [2024-01-09 10:09:53] To see the stack trace of this error execute with --v=5 or higher
[  253.421787] cloud-init[953]: [2024-01-09 10:09:53] 2024-01-09 10:09:53,634 - cc_scripts_user.py[WARNING]: Failed to run module scripts-user (scripts in /var/lib/cloud/instance/scripts)
[  253.423304] cloud-init[953]: [2024-01-09 10:09:53] 2024-01-09 10:09:53,635 - util.py[WARNING]: Running module scripts-user (<module 'cloudinit.config.cc_scripts_user' from '/usr/lib/python3/dist-packages/cloudinit/config/cc_scripts_user.py'>) failed
[  253.425287] cloud-init[953]: [2024-01-09 10:09:55] Cloud-init v. 22.2-144-g3e35fb84-1~bddeb finished at Tue, 09 Jan 2024 10:09:55 +0000. Datasource DataSourceOutscale.  Up 253.33 seconds

I guess we can change this version somewhere? Is it more at the cluster level, or at the omi level?

Thanks again for your help!

ghost commented 8 months ago

Hi, I found this https://github.com/kubernetes/kubeadm/issues/2761

ghost commented 8 months ago

I can recreate a new omi and switch to k8s.gcr.io as described in the doc. It seems that image-builder switched to registry.k8s.io: https://github.com/kubernetes-sigs/image-builder/commit/1b26ed7b1c5fa0b71a00c130d07db5cf6e026f74

pierreozoux commented 8 months ago

This would be amazing :) thanks for your help!

ghost commented 8 months ago

Hi @pierreozoux, I created a new omi ubuntu-2204-2204-kubernetes-v1.23.17-2024-01-12 based on the current omi ubuntu-2204-2204-kubernetes-v1.23.17-2024-01-05.

cat /etc/kubeadm.yml 
apiVersion: kubeadm.k8s.io/v1beta3
kind: ClusterConfiguration
imageRepository: k8s.gcr.io
kubernetesVersion: v1.23.17
dns:
  imageRepository: registry.k8s.io/coredns
---
apiVersion: kubeadm.k8s.io/v1beta3
kind: InitConfiguration
nodeRegistration:
  criSocket: /var/run/containerd/containerd.sock
kubeadm config images pull --config /etc/kubeadm.yml --cri-socket /var/run/containerd/containerd.sock
[config/images] Pulled k8s.gcr.io/kube-apiserver:v1.23.17
[config/images] Pulled k8s.gcr.io/kube-controller-manager:v1.23.17
[config/images] Pulled k8s.gcr.io/kube-scheduler:v1.23.17
[config/images] Pulled k8s.gcr.io/kube-proxy:v1.23.17
[config/images] Pulled k8s.gcr.io/pause:3.6
[config/images] Pulled k8s.gcr.io/etcd:3.5.6-0
[config/images] Pulled registry.k8s.io/coredns/coredns:v1.8.6
root@ip-10-8-0-100:/home/outscale# crictl images
IMAGE                                     TAG                 IMAGE ID            SIZE
k8s.gcr.io/etcd                           3.5.6-0             fce326961ae2d       103MB
registry.k8s.io/etcd                      3.5.6-0             fce326961ae2d       103MB
k8s.gcr.io/kube-apiserver                 v1.23.17            62bc5d8258d67       33.8MB
registry.k8s.io/kube-apiserver            v1.23.17            62bc5d8258d67       33.8MB
k8s.gcr.io/kube-controller-manager        v1.23.17            1dab4fc7b6e0d       31.2MB
registry.k8s.io/kube-controller-manager   v1.23.17            1dab4fc7b6e0d       31.2MB
k8s.gcr.io/kube-proxy                     v1.23.17            f21c8d21558c8       39.7MB
registry.k8s.io/kube-proxy                v1.23.17            f21c8d21558c8       39.7MB
k8s.gcr.io/kube-scheduler                 v1.23.17            bc6794cb54ac5       15.7MB
registry.k8s.io/kube-scheduler            v1.23.17            bc6794cb54ac5       15.7MB
k8s.gcr.io/pause                          3.6                 6270bb605e12e       302kB
registry.k8s.io/pause                     3.6                 6270bb605e12e       302kB
registry.k8s.io/coredns/coredns           v1.8.6              a4ca41631cc7a       13.6MB

Which version of clusterctl did you use?

(For example, I used the old version v1.3.10 of clusterctl, based on the compatibility matrix.)

Can you try the upgrade with this omi?

pierreozoux commented 8 months ago

Testing, but I think this image is not available in cloudgouv.

ghost commented 8 months ago

Sorry, it is now a public omi on cloudgouv. I changed its privacy.

pierreozoux commented 8 months ago

I tried all day yesterday, I kinda managed, but I still have an error.

First, the clusterctl version, as requested:

clusterctl version
clusterctl version: &version.Info{Major:"1", Minor:"3", GitVersion:"v1.3.10", GitCommit:"3f7ccb8089aab19ff13db88e5f0897cfe1dee355", GitTreeState:"clean", BuildDate:"2023-07-25T15:47:44Z", GoVersion:"go1.19.6", Compiler:"gc", Platform:"linux/amd64"}
bootstrap-kubeadm         capi-kubeadm-bootstrap-system          BootstrapProvider        v1.3.10
control-plane-kubeadm     capi-kubeadm-control-plane-system      ControlPlaneProvider     v1.3.10
cluster-api               capi-system                            CoreProvider             v1.3.10
infrastructure-outscale   cluster-api-provider-outscale-system   InfrastructureProvider   v0.1.4

For the upgrade, I used `k edit kubeadmcontrolplane` to change the following fields (machineTemplate name and version):

  machineTemplate:
    infrastructureRef:
      apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
      kind: OscMachineTemplate
      name: test-upgrade-control-plane-1-23
      namespace: test-upgrade
    metadata: {}
  replicas: 3
  rolloutStrategy:
    rollingUpdate:
      maxSurge: 1
    type: RollingUpdate
  version: v1.23.17

I also edited the ConfigMap in the workload cluster to add the imageRepository for CoreDNS, as you mentioned:

k get cm kubeadm-config -o yaml
apiVersion: v1
data:
  ClusterConfiguration: |
    apiServer:
      extraArgs:
        authorization-mode: Node,RBAC
      timeoutForControlPlane: 4m0s
    apiVersion: kubeadm.k8s.io/v1beta3
    certificatesDir: /etc/kubernetes/pki
    clusterName: test-upgrade
    controlPlaneEndpoint: test-upgrade-k8s-xxx.cloudgouv-eu-west-1.lbu.outscale.com:6443
    controllerManager: {}
    dns:
      imageRepository: registry.k8s.io/coredns

I didn't know where to configure this part you mentioned:

apiVersion: kubeadm.k8s.io/v1beta3
kind: InitConfiguration
nodeRegistration:
  criSocket: /var/run/containerd/containerd.sock
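
(For reference, a sketch of where this would typically live in a Cluster API manifest, assuming the controlplane v1beta1 API already used in this thread: kubeadm's InitConfiguration fields are nested under the KubeadmControlPlane's kubeadmConfigSpec.)

apiVersion: controlplane.cluster.x-k8s.io/v1beta1
kind: KubeadmControlPlane
spec:
  kubeadmConfigSpec:
    initConfiguration:
      nodeRegistration:
        # criSocket as in the snippet above; other fields omitted
        criSocket: /var/run/containerd/containerd.sock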

After editing the kubeadmcontrolplane, it started a new VM and everything was going nicely: the node joined etcd and so on, and was marked Ready in the Kubernetes API. A node was deleted and a new node was created; again it went nicely and the node was marked Ready. But then the first updated node was marked NotReady. I repeated the same operation several times: I deleted the NotReady node, it was recreated, and each time the node before the last one would be marked NotReady after a while. Then I rebooted one of the NotReady nodes to inspect it. It showed the same behavior: it started and, after a short while, became NotReady again. SSH became unavailable as well.

Here are the logs I found. First, it looks like cloud-init was in a kind of loop:

Jan 16 11:17:32 ip-10-0-0-36 audit: PROCTITLE proctitle="/usr/local/bin/containerd"
Jan 16 11:17:32 ip-10-0-0-36 audit[657]: SYSCALL arch=c000003e syscall=257 success=yes exit=93 a0=ffffffffffffff9c a1=c000d416e0 a2=80000 a3=0 items=1 ppid=1 pid=657 auid=4294967295 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsgid=0 tty=(none) ses=4294967295 comm="containerd" exe="/usr/local/bin/containerd" subj=unconfined key="containerd"
Jan 16 11:17:32 ip-10-0-0-36 audit: CWD cwd="/"
Jan 16 11:17:32 ip-10-0-0-36 audit: PATH item=0 name="/var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/998/fs/usr/share" inode=785775 dev=fc:01 mode=040755 ouid=0 ogid=0 rdev=00:00 nametype=NORMAL cap_fp=0 cap_fi=0 cap_fe=0 cap_fver=0 cap_frootid=0
Jan 16 11:17:32 ip-10-0-0-36 audit: PROCTITLE proctitle="/usr/local/bin/containerd"
Jan 16 11:17:32 ip-10-0-0-36 audit[657]: SYSCALL arch=c000003e syscall=257 success=yes exit=93 a0=ffffffffffffff9c a1=c000d542a0 a2=80000 a3=0 items=1 ppid=1 pid=657 auid=4294967295 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsgid=0 tty=(none) ses=4294967295 comm="containerd" exe="/usr/local/bin/containerd" subj=unconfined key="containerd"
Jan 16 11:17:32 ip-10-0-0-36 audit: CWD cwd="/"
Jan 16 11:17:32 ip-10-0-0-36 audit: PATH item=0 name="/var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/998/fs/usr/share/ca-certificates" inode=785776 dev=fc:01 mode=040755 ouid=0 ogid=0 rdev=00:00 nametype=NORMAL cap_fp=0 cap_fi=0 cap_fe=0 cap_fver=0 cap_frootid=0
Jan 16 11:17:32 ip-10-0-0-36 audit: PROCTITLE proctitle="/usr/local/bin/containerd"
Jan 16 11:17:35 ip-10-0-0-36 kubelet[644]: I0116 11:17:35.535025     644 log.go:198] http: superfluous response.WriteHeader call from k8s.io/kubernetes/vendor/github.com/emicklei/go-restful.(*Response).WriteHeader (response.go:220)
Jan 16 11:17:41 ip-10-0-0-36 cloud-init-hotplugd[2589]: 2024-01-16 11:17:41,273 - hotplug_hook.py[ERROR]: Received fatal exception handling hotplug!
Jan 16 11:17:41 ip-10-0-0-36 cloud-init-hotplugd[2589]: Traceback (most recent call last):
Jan 16 11:17:41 ip-10-0-0-36 cloud-init-hotplugd[2589]:   File "/usr/lib/python3/dist-packages/cloudinit/cmd/devel/hotplug_hook.py", line 277, in handle_args
Jan 16 11:17:41 ip-10-0-0-36 cloud-init-hotplugd[2589]:     handle_hotplug(
Jan 16 11:17:41 ip-10-0-0-36 cloud-init-hotplugd[2589]:   File "/usr/lib/python3/dist-packages/cloudinit/cmd/devel/hotplug_hook.py", line 235, in handle_hotplug
Jan 16 11:17:41 ip-10-0-0-36 cloud-init-hotplugd[2589]:     raise last_exception
Jan 16 11:17:41 ip-10-0-0-36 cloud-init-hotplugd[2589]:   File "/usr/lib/python3/dist-packages/cloudinit/cmd/devel/hotplug_hook.py", line 224, in handle_hotplug
Jan 16 11:17:41 ip-10-0-0-36 cloud-init-hotplugd[2589]:     event_handler.detect_hotplugged_device()
Jan 16 11:17:41 ip-10-0-0-36 cloud-init-hotplugd[2589]:   File "/usr/lib/python3/dist-packages/cloudinit/cmd/devel/hotplug_hook.py", line 104, in detect_hotplugged_device
Jan 16 11:17:41 ip-10-0-0-36 cloud-init-hotplugd[2589]:     raise RuntimeError(
Jan 16 11:17:41 ip-10-0-0-36 cloud-init-hotplugd[2589]: RuntimeError: Failed to detect aa:17:dc:e4:6a:8d in updated metadata
Jan 16 11:17:41 ip-10-0-0-36 cloud-init[2589]: [CLOUDINIT]2024-01-16 11:17:41,273 - hotplug_hook.py[ERROR]: Received fatal exception handling hotplug!
                                               Traceback (most recent call last):
                                                 File "/usr/lib/python3/dist-packages/cloudinit/cmd/devel/hotplug_hook.py", line 277, in handle_args
                                                   handle_hotplug(
                                                 File "/usr/lib/python3/dist-packages/cloudinit/cmd/devel/hotplug_hook.py", line 235, in handle_hotplug
                                                   raise last_exception
                                                 File "/usr/lib/python3/dist-packages/cloudinit/cmd/devel/hotplug_hook.py", line 224, in handle_hotplug
                                                   event_handler.detect_hotplugged_device()
                                                 File "/usr/lib/python3/dist-packages/cloudinit/cmd/devel/hotplug_hook.py", line 104, in detect_hotplugged_device
                                                   raise RuntimeError(
                                               RuntimeError: Failed to detect aa:17:dc:e4:6a:8d in updated metadata
Jan 16 11:17:41 ip-10-0-0-36 cloud-init[2589]: [CLOUDINIT]2024-01-16 11:17:41,274 - handlers.py[DEBUG]: finish: hotplug-hook: FAIL: Handle reconfiguration on hotplug events.
Jan 16 11:17:41 ip-10-0-0-36 cloud-init[2589]: [CLOUDINIT]2024-01-16 11:17:41,274 - util.py[DEBUG]: Reading from /proc/uptime (quiet=False)
Jan 16 11:17:41 ip-10-0-0-36 cloud-init[2589]: [CLOUDINIT]2024-01-16 11:17:41,274 - util.py[DEBUG]: Read 14 bytes from /proc/uptime
Jan 16 11:17:41 ip-10-0-0-36 cloud-init[2589]: [CLOUDINIT]2024-01-16 11:17:41,274 - util.py[DEBUG]: cloud-init mode 'hotplug-hook' took 76.643 seconds (76.64)
Jan 16 11:17:41 ip-10-0-0-36 cloud-init-hotplugd[2589]: Traceback (most recent call last):
Jan 16 11:17:41 ip-10-0-0-36 cloud-init-hotplugd[2589]:   File "/usr/bin/cloud-init", line 11, in <module>
Jan 16 11:17:41 ip-10-0-0-36 cloud-init-hotplugd[2589]:     load_entry_point('cloud-init==22.2', 'console_scripts', 'cloud-init')()
Jan 16 11:17:41 ip-10-0-0-36 cloud-init-hotplugd[2589]:   File "/usr/lib/python3/dist-packages/cloudinit/cmd/main.py", line 1088, in main
Jan 16 11:17:41 ip-10-0-0-36 cloud-init-hotplugd[2589]:     retval = util.log_time(
Jan 16 11:17:41 ip-10-0-0-36 cloud-init-hotplugd[2589]:   File "/usr/lib/python3/dist-packages/cloudinit/util.py", line 2621, in log_time
Jan 16 11:17:41 ip-10-0-0-36 cloud-init-hotplugd[2589]:     ret = func(*args, **kwargs)
Jan 16 11:17:41 ip-10-0-0-36 cloud-init-hotplugd[2589]:   File "/usr/lib/python3/dist-packages/cloudinit/cmd/devel/hotplug_hook.py", line 277, in handle_args
Jan 16 11:17:41 ip-10-0-0-36 cloud-init-hotplugd[2589]:     handle_hotplug(
Jan 16 11:17:41 ip-10-0-0-36 cloud-init-hotplugd[2589]:   File "/usr/lib/python3/dist-packages/cloudinit/cmd/devel/hotplug_hook.py", line 235, in handle_hotplug
Jan 16 11:17:41 ip-10-0-0-36 cloud-init-hotplugd[2589]:     raise last_exception
Jan 16 11:17:41 ip-10-0-0-36 cloud-init-hotplugd[2589]:   File "/usr/lib/python3/dist-packages/cloudinit/cmd/devel/hotplug_hook.py", line 224, in handle_hotplug
Jan 16 11:17:41 ip-10-0-0-36 cloud-init-hotplugd[2589]:     event_handler.detect_hotplugged_device()
Jan 16 11:17:41 ip-10-0-0-36 cloud-init-hotplugd[2589]:   File "/usr/lib/python3/dist-packages/cloudinit/cmd/devel/hotplug_hook.py", line 104, in detect_hotplugged_device
Jan 16 11:17:41 ip-10-0-0-36 cloud-init-hotplugd[2589]:     raise RuntimeError(
Jan 16 11:17:41 ip-10-0-0-36 cloud-init-hotplugd[2589]: RuntimeError: Failed to detect aa:17:dc:e4:6a:8d in updated metadata
Jan 16 11:17:41 ip-10-0-0-36 systemd[1]: cloud-init-hotplugd.service: Main process exited, code=exited, status=1/FAILURE
Jan 16 11:17:41 ip-10-0-0-36 systemd[1]: cloud-init-hotplugd.service: Failed with result 'exit-code'.
Jan 16 11:17:41 ip-10-0-0-36 audit[1]: SERVICE_STOP pid=1 uid=0 auid=4294967295 ses=4294967295 subj=unconfined msg='unit=cloud-init-hotplugd comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=failed'
Jan 16 11:17:41 ip-10-0-0-36 audit[1]: SERVICE_START pid=1 uid=0 auid=4294967295 ses=4294967295 subj=unconfined msg='unit=cloud-init-hotplugd comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
Jan 16 11:17:41 ip-10-0-0-36 systemd[1]: Started cloud-init hotplug hook daemon.
Jan 16 11:17:41 ip-10-0-0-36 cloud-init-hotplugd[2606]: args=--subsystem=net handle --devpath=/devices/virtual/net/cilium_vxlan --udevaction=add
Jan 16 11:17:41 ip-10-0-0-36 bash[2606]: [CLOUDINIT]2024-01-16 11:17:41,659 - hotplug_hook.py[DEBUG]: hotplug-hook called with the following arguments: {hotplug_action: handle, subsystem: net, udevaction: add, devpath: /devices/virtual/net/cilium_vxlan}
Jan 16 11:17:41 ip-10-0-0-36 bash[2606]: [CLOUDINIT]2024-01-16 11:17:41,659 - handlers.py[DEBUG]: start: hotplug-hook: Handle reconfiguration on hotplug events.
Jan 16 11:17:41 ip-10-0-0-36 bash[2606]: [CLOUDINIT]2024-01-16 11:17:41,659 - hotplug_hook.py[DEBUG]: Fetching datasource
Jan 16 11:17:41 ip-10-0-0-36 bash[2606]: [CLOUDINIT]2024-01-16 11:17:41,660 - handlers.py[DEBUG]: start: hotplug-hook/check-cache: attempting to read from cache [trust]
Jan 16 11:17:41 ip-10-0-0-36 bash[2606]: [CLOUDINIT]2024-01-16 11:17:41,660 - util.py[DEBUG]: Reading from /var/lib/cloud/instance/obj.pkl (quiet=False)
Jan 16 11:17:41 ip-10-0-0-36 bash[2606]: [CLOUDINIT]2024-01-16 11:17:41,660 - util.py[DEBUG]: Read 50731 bytes from /var/lib/cloud/instance/obj.pkl
Jan 16 11:17:41 ip-10-0-0-36 bash[2606]: [CLOUDINIT]2024-01-16 11:17:41,662 - util.py[DEBUG]: Reading from /run/cloud-init/.instance-id (quiet=False)
Jan 16 11:17:41 ip-10-0-0-36 bash[2606]: [CLOUDINIT]2024-01-16 11:17:41,663 - util.py[DEBUG]: Read 11 bytes from /run/cloud-init/.instance-id
Jan 16 11:17:41 ip-10-0-0-36 bash[2606]: [CLOUDINIT]2024-01-16 11:17:41,663 - stages.py[DEBUG]: restored from cache with run check: DataSourceOutscale
Jan 16 11:17:41 ip-10-0-0-36 bash[2606]: [CLOUDINIT]2024-01-16 11:17:41,663 - handlers.py[DEBUG]: finish: hotplug-hook/check-cache: SUCCESS: restored from cache with run check: DataSourceOutscale
Jan 16 11:17:41 ip-10-0-0-36 bash[2606]: [CLOUDINIT]2024-01-16 11:17:41,663 - util.py[DEBUG]: Reading from /etc/cloud/cloud.cfg (quiet=False)
Jan 16 11:17:41 ip-10-0-0-36 bash[2606]: [CLOUDINIT]2024-01-16 11:17:41,663 - util.py[DEBUG]: Read 3774 bytes from /etc/cloud/cloud.cfg
Jan 16 11:17:41 ip-10-0-0-36 bash[2606]: [CLOUDINIT]2024-01-16 11:17:41,663 - util.py[DEBUG]: Attempting to load yaml from string of length 3774 with allowed root types (<class 'dict'>,)
Jan 16 11:17:41 ip-10-0-0-36 bash[2606]: [CLOUDINIT]2024-01-16 11:17:41,671 - util.py[DEBUG]: Reading from /etc/cloud/cloud.cfg.d/99_metadata.cfg (quiet=False)
Jan 16 11:17:41 ip-10-0-0-36 bash[2606]: [CLOUDINIT]2024-01-16 11:17:41,671 - util.py[DEBUG]: Read 59 bytes from /etc/cloud/cloud.cfg.d/99_metadata.cfg
Jan 16 11:17:41 ip-10-0-0-36 bash[2606]: [CLOUDINIT]2024-01-16 11:17:41,671 - util.py[DEBUG]: Attempting to load yaml from string of length 59 with allowed root types (<class 'dict'>,)
Jan 16 11:17:41 ip-10-0-0-36 bash[2606]: [CLOUDINIT]2024-01-16 11:17:41,671 - util.py[DEBUG]: Reading from /etc/cloud/cloud.cfg.d/90_dpkg.cfg (quiet=False)
Jan 16 11:17:41 ip-10-0-0-36 bash[2606]: [CLOUDINIT]2024-01-16 11:17:41,671 - util.py[DEBUG]: Read 320 bytes from /etc/cloud/cloud.cfg.d/90_dpkg.cfg
Jan 16 11:17:41 ip-10-0-0-36 bash[2606]: [CLOUDINIT]2024-01-16 11:17:41,671 - util.py[DEBUG]: Attempting to load yaml from string of length 320 with allowed root types (<class 'dict'>,)
Jan 16 11:17:41 ip-10-0-0-36 bash[2606]: [CLOUDINIT]2024-01-16 11:17:41,673 - util.py[DEBUG]: Reading from /etc/cloud/cloud.cfg.d/20_user.cfg (quiet=False)
Jan 16 11:17:41 ip-10-0-0-36 bash[2606]: [CLOUDINIT]2024-01-16 11:17:41,673 - util.py[DEBUG]: Read 48 bytes from /etc/cloud/cloud.cfg.d/20_user.cfg
Jan 16 11:17:41 ip-10-0-0-36 bash[2606]: [CLOUDINIT]2024-01-16 11:17:41,673 - util.py[DEBUG]: Attempting to load yaml from string of length 48 with allowed root types (<class 'dict'>,)
Jan 16 11:17:41 ip-10-0-0-36 bash[2606]: [CLOUDINIT]2024-01-16 11:17:41,673 - util.py[DEBUG]: Reading from /etc/cloud/cloud.cfg.d/06_hotplug.cfg (quiet=False)
Jan 16 11:17:41 ip-10-0-0-36 bash[2606]: [CLOUDINIT]2024-01-16 11:17:41,673 - util.py[DEBUG]: Read 50 bytes from /etc/cloud/cloud.cfg.d/06_hotplug.cfg
Jan 16 11:17:41 ip-10-0-0-36 bash[2606]: [CLOUDINIT]2024-01-16 11:17:41,673 - util.py[DEBUG]: Attempting to load yaml from string of length 50 with allowed root types (<class 'dict'>,)
Jan 16 11:17:41 ip-10-0-0-36 bash[2606]: [CLOUDINIT]2024-01-16 11:17:41,673 - util.py[DEBUG]: Reading from /etc/cloud/cloud.cfg.d/05_logging.cfg (quiet=False)
Jan 16 11:17:41 ip-10-0-0-36 bash[2606]: [CLOUDINIT]2024-01-16 11:17:41,673 - util.py[DEBUG]: Read 2109 bytes from /etc/cloud/cloud.cfg.d/05_logging.cfg
Jan 16 11:17:41 ip-10-0-0-36 bash[2606]: [CLOUDINIT]2024-01-16 11:17:41,674 - util.py[DEBUG]: Attempting to load yaml from string of length 2109 with allowed root types (<class 'dict'>,)
Jan 16 11:17:41 ip-10-0-0-36 bash[2606]: [CLOUDINIT]2024-01-16 11:17:41,676 - util.py[DEBUG]: Reading from /run/cloud-init/cloud.cfg (quiet=False)
Jan 16 11:17:41 ip-10-0-0-36 bash[2606]: [CLOUDINIT]2024-01-16 11:17:41,676 - util.py[DEBUG]: Read 36 bytes from /run/cloud-init/cloud.cfg
Jan 16 11:17:41 ip-10-0-0-36 bash[2606]: [CLOUDINIT]2024-01-16 11:17:41,676 - util.py[DEBUG]: Attempting to load yaml from string of length 36 with allowed root types (<class 'dict'>,)
Jan 16 11:17:41 ip-10-0-0-36 bash[2606]: [CLOUDINIT]2024-01-16 11:17:41,677 - util.py[DEBUG]: Attempting to load yaml from string of length 0 with allowed root types (<class 'dict'>,)
Jan 16 11:17:41 ip-10-0-0-36 bash[2606]: [CLOUDINIT]2024-01-16 11:17:41,677 - util.py[DEBUG]: loaded blob returned None, returning default.
Jan 16 11:17:41 ip-10-0-0-36 bash[2606]: [CLOUDINIT]2024-01-16 11:17:41,677 - util.py[DEBUG]: Reading from /var/lib/cloud/instance/cloud-config.txt (quiet=False)
Jan 16 11:17:41 ip-10-0-0-36 bash[2606]: [CLOUDINIT]2024-01-16 11:17:41,677 - util.py[DEBUG]: Read 14222 bytes from /var/lib/cloud/instance/cloud-config.txt
Jan 16 11:17:41 ip-10-0-0-36 bash[2606]: [CLOUDINIT]2024-01-16 11:17:41,677 - util.py[DEBUG]: Attempting to load yaml from string of length 14222 with allowed root types (<class 'dict'>,)

And here are the last logs:

Jan 16 11:21:40 ip-10-0-0-36 cloud-init[2640]: [CLOUDINIT]2024-01-16 11:21:40,327 - subp.py[DEBUG]: Running command ['systemctl', 'is-enabled', 'NetworkManager.service'] with allowed return codes [0] (shell=False, capture=True)
Jan 16 11:21:40 ip-10-0-0-36 cloud-init[2640]: [CLOUDINIT]2024-01-16 11:21:40,331 - activators.py[DEBUG]: Using selected activator: <class 'cloudinit.net.activators.NetplanActivator'> from priority: ['netplan', 'eni', 'network-manager', 'networkd']
Jan 16 11:21:40 ip-10-0-0-36 cloud-init[2640]: [CLOUDINIT]2024-01-16 11:21:40,331 - activators.py[DEBUG]: Calling 'netplan apply' rather than altering individual interfaces
Jan 16 11:21:40 ip-10-0-0-36 cloud-init[2640]: [CLOUDINIT]2024-01-16 11:21:40,331 - activators.py[DEBUG]: Attempting command ['netplan', 'apply'] for device all
Jan 16 11:21:40 ip-10-0-0-36 cloud-init[2640]: [CLOUDINIT]2024-01-16 11:21:40,331 - subp.py[DEBUG]: Running command ['netplan', 'apply'] with allowed return codes [0] (shell=False, capture=True)
Jan 16 11:21:40 ip-10-0-0-36 systemd[1]: Reloading.
Jan 16 11:21:40 ip-10-0-0-36 audit: BPF prog-id=18 op=UNLOAD
Jan 16 11:21:40 ip-10-0-0-36 audit: BPF prog-id=50 op=UNLOAD
Jan 16 11:21:40 ip-10-0-0-36 audit: BPF prog-id=12 op=UNLOAD
Jan 16 11:21:40 ip-10-0-0-36 audit: BPF prog-id=26 op=UNLOAD
Jan 16 11:21:40 ip-10-0-0-36 audit: BPF prog-id=29 op=UNLOAD
Jan 16 11:21:40 ip-10-0-0-36 audit: BPF prog-id=10 op=UNLOAD
Jan 16 11:21:40 ip-10-0-0-36 audit: BPF prog-id=38 op=UNLOAD
Jan 16 11:21:40 ip-10-0-0-36 audit: BPF prog-id=54 op=UNLOAD
Jan 16 11:21:40 ip-10-0-0-36 audit: BPF prog-id=6 op=UNLOAD
Jan 16 11:21:40 ip-10-0-0-36 audit: BPF prog-id=3 op=UNLOAD
Jan 16 11:21:40 ip-10-0-0-36 audit: BPF prog-id=952 op=UNLOAD
Jan 16 11:21:40 ip-10-0-0-36 audit: BPF prog-id=58 op=UNLOAD
Jan 16 11:21:40 ip-10-0-0-36 audit: BPF prog-id=9 op=UNLOAD
Jan 16 11:21:40 ip-10-0-0-36 audit: BPF prog-id=11 op=UNLOAD
Jan 16 11:21:40 ip-10-0-0-36 audit: BPF prog-id=46 op=UNLOAD
Jan 16 11:21:40 ip-10-0-0-36 audit: BPF prog-id=42 op=UNLOAD
Jan 16 11:21:40 ip-10-0-0-36 audit: BPF prog-id=22 op=UNLOAD
Jan 16 11:21:40 ip-10-0-0-36 audit: BPF prog-id=84 op=UNLOAD
Jan 16 11:21:40 ip-10-0-0-36 audit: BPF prog-id=34 op=UNLOAD
Jan 16 11:21:40 ip-10-0-0-36 systemd[1]: Configuration file /etc/systemd/system/cloud-config.service.d/boot-order.conf is marked executable. Please remove executable permission bits. Proceeding anyway.
Jan 16 11:21:40 ip-10-0-0-36 systemd[1]: Configuration file /etc/systemd/system/cloud-final.service.d/boot-order.conf is marked executable. Please remove executable permission bits. Proceeding anyway.
Jan 16 11:21:40 ip-10-0-0-36 audit: BPF prog-id=956 op=LOAD
Jan 16 11:21:40 ip-10-0-0-36 audit: BPF prog-id=957 op=LOAD
Jan 16 11:21:40 ip-10-0-0-36 audit: BPF prog-id=958 op=LOAD
Jan 16 11:21:40 ip-10-0-0-36 audit: BPF prog-id=959 op=LOAD
Jan 16 11:21:40 ip-10-0-0-36 audit: BPF prog-id=960 op=LOAD
Jan 16 11:21:40 ip-10-0-0-36 audit: BPF prog-id=13 op=UNLOAD
Jan 16 11:21:40 ip-10-0-0-36 audit: BPF prog-id=14 op=UNLOAD
Jan 16 11:21:40 ip-10-0-0-36 audit: BPF prog-id=961 op=LOAD
Jan 16 11:21:40 ip-10-0-0-36 audit: BPF prog-id=962 op=LOAD
Jan 16 11:21:40 ip-10-0-0-36 audit: BPF prog-id=963 op=LOAD
Jan 16 11:21:40 ip-10-0-0-36 audit: BPF prog-id=964 op=LOAD
Jan 16 11:21:40 ip-10-0-0-36 audit: BPF prog-id=965 op=LOAD
Jan 16 11:21:40 ip-10-0-0-36 audit: BPF prog-id=966 op=LOAD
Jan 16 11:21:40 ip-10-0-0-36 audit: BPF prog-id=967 op=LOAD
Jan 16 11:21:40 ip-10-0-0-36 audit: BPF prog-id=968 op=LOAD
Jan 16 11:21:40 ip-10-0-0-36 audit: BPF prog-id=7 op=UNLOAD
Jan 16 11:21:40 ip-10-0-0-36 audit: BPF prog-id=8 op=UNLOAD
Jan 16 11:21:40 ip-10-0-0-36 audit: BPF prog-id=969 op=LOAD
Jan 16 11:21:40 ip-10-0-0-36 audit: BPF prog-id=970 op=LOAD
Jan 16 11:21:40 ip-10-0-0-36 audit: BPF prog-id=971 op=LOAD
Jan 16 11:21:40 ip-10-0-0-36 audit: BPF prog-id=4 op=UNLOAD
Jan 16 11:21:40 ip-10-0-0-36 audit: BPF prog-id=5 op=UNLOAD
Jan 16 11:21:40 ip-10-0-0-36 audit: BPF prog-id=972 op=LOAD
Jan 16 11:21:40 ip-10-0-0-36 audit: BPF prog-id=973 op=LOAD
Jan 16 11:21:40 ip-10-0-0-36 audit: BPF prog-id=974 op=LOAD
Jan 16 11:21:40 ip-10-0-0-36 audit: BPF prog-id=975 op=LOAD
Jan 16 11:21:40 ip-10-0-0-36 audit: BPF prog-id=976 op=LOAD
Jan 16 11:21:40 ip-10-0-0-36 audit: BPF prog-id=977 op=LOAD
Jan 16 11:21:40 ip-10-0-0-36 audit: BPF prog-id=978 op=LOAD
Jan 16 11:21:40 ip-10-0-0-36 audit: BPF prog-id=979 op=LOAD
Jan 16 11:21:40 ip-10-0-0-36 audit: BPF prog-id=980 op=LOAD
Jan 16 11:21:40 ip-10-0-0-36 systemd-udevd[405]: Network interface NamePolicy= disabled on kernel command line, ignoring.
Jan 16 11:21:40 ip-10-0-0-36 systemd-udevd[405]: Configuration file /etc/udev/rules.d/90-etcd-tuning.rules is marked executable. Please remove executable permission bits. Proceeding anyway.
Jan 16 11:21:40 ip-10-0-0-36 systemd-udevd[2704]: Using default interface naming scheme 'v249'.
Jan 16 11:21:40 ip-10-0-0-36 systemd-udevd[2709]: Using default interface naming scheme 'v249'.

Do you have an idea of what is happening?

Do you know where I should put this config:

---
apiVersion: kubeadm.k8s.io/v1beta3
kind: InitConfiguration
nodeRegistration:
  criSocket: /var/run/containerd/containerd.sock

And do you think it is related to my issue?

Thanks for your help!

ghost commented 8 months ago

@pierreozoux I will soon update the omis on us-east-2, eu-west-2 and cloudgouv (1.25.x, ..., 1.28.x). I will also create a doc about upgrading from 1.24 to 1.28, including the update of the cluster-api controllers.

pierreozoux commented 8 months ago

And from 1.22 ? :see_no_evil:

ghost commented 8 months ago

You are right, I will add 1.22 for Ubuntu 22.04.

ghost commented 8 months ago

Hi, thanks for the logs.

First, if you see your node as NotReady, that is already a first step.

ghost commented 8 months ago

From your logs, it seems you have trouble with hotplug.

So I will create a new omi with 1.23.17 on Ubuntu 22.04 based on the current omi, deactivate hotplug, and try to recreate a cluster with the Cilium CNI.

Remove hotplug from the `when` array in /etc/cloud/cloud.cfg.d/06_hotplug.cfg:

updates:
  network:
    when: ["boot"] 
ghost commented 8 months ago

By the way, a useful way to debug cloud-init without a reboot: https://stackoverflow.com/questions/23065673/how-to-re-run-cloud-init-without-reboot
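
A minimal sketch of what that boils down to on a booted instance (assuming a recent cloud-init, as shipped in these images):

# remove cloud-init's per-instance state and logs so it runs again from scratch
sudo cloud-init clean --logs
# re-run the init stage, then the config and final module stages, without rebooting
sudo cloud-init init
sudo cloud-init modules --mode config
sudo cloud-init modules --mode final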

ghost commented 8 months ago

Hi @pierreozoux, how are you? I created a new omi with Kubernetes 1.23.17 and without hotplug on Ubuntu 22.04, and it works with Cilium (I guess you use Cilium, right?). It is now a public omi on cloud-gouv (ubuntu-2204-2204-kubernetes-v1.23.17-2024-01-14). Can you try with this omi? I will create new omis for Ubuntu 22.04 from 1.22 to 1.28.

outscale@ip-10-9-39-194:~/test-cluster/calico/osc-k8s-rke-cluster/addons/ccm$ kubectl get nodes -A
NAME                                                 STATUS   ROLES                  AGE   VERSION
ip-10-0-3-4.cloudgouv-eu-west-1.compute.internal     Ready    <none>                 15h   v1.23.17
ip-10-0-4-149.cloudgouv-eu-west-1.compute.internal   Ready    control-plane,master   15h   v1.23.17
outscale@ip-10-9-39-194:~/test-cluster/calico/osc-k8s-rke-cluster/addons/ccm$ kubectl get pod -A
NAMESPACE     NAME                                                                         READY   STATUS    RESTARTS   AGE
kube-system   cilium-n9nwg                                                                 1/1     Running   0          4m52s
kube-system   cilium-operator-69b9cd7d88-nv2vx                                             1/1     Running   0          4m52s
kube-system   cilium-q62xf                                                                 1/1     Running   0          4m52s
kube-system   coredns-bd6b6df9f-nnhxf                                                      1/1     Running   0          15h
kube-system   coredns-bd6b6df9f-sngv4                                                      1/1     Running   0          15h
kube-system   etcd-ip-10-0-4-149.cloudgouv-eu-west-1.compute.internal                      1/1     Running   0          15h
kube-system   kube-apiserver-ip-10-0-4-149.cloudgouv-eu-west-1.compute.internal            1/1     Running   0          15h
kube-system   kube-controller-manager-ip-10-0-4-149.cloudgouv-eu-west-1.compute.internal   1/1     Running   0          15h
kube-system   kube-proxy-fszrb                                                             1/1     Running   0          15h
kube-system   kube-proxy-l24tq                                                             1/1     Running   0          15h
kube-system   kube-scheduler-ip-10-0-4-149.cloudgouv-eu-west-1.compute.internal            1/1     Running   0          15h
kube-system   osc-cloud-controller-manager-bdmkw                                           1/1     Running   0          20s
kube-system   osc-cloud-controller-manager-bvq45                                           1/1     Running   0          19s
ghost commented 8 months ago

Hi @pierreozoux, I created the following omi on cloudgouv with hotplug deactivated:

Can you try your upgrade using this omi?

pierreozoux commented 8 months ago

Yes, we use Cilium, thanks, I will test this right away!

pierreozoux commented 8 months ago

I managed to upgrade the control plane to 1.23.17, and it seems stable \o/ I'll try to finish my upgrade during the day and report here!

Thanks again for your dedicated support!

ghost commented 8 months ago

Hi @pierreozoux, you're welcome.

By the way, I added a doc about upgrading (especially for changing the cluster-api controllers version): https://github.com/outscale/cluster-api-provider-outscale/blob/main/docs/src/topics/upgrade-cluster.md

pierreozoux commented 8 months ago

Ok, I also managed to update the nodes to 1.23.17.

Then I had an issue upgrading to 1.24; I edited the kubeadm-config ConfigMap (`k edit cm kubeadm-config`) to set:

    imageRepository: registry.k8s.io

And then it worked. I'll continue my testing to 1.25 and 1.26 and report here to let you know!
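
For reference, a sketch of the relevant part of the kubeadm-config ConfigMap after that edit (same ClusterConfiguration layout as shown earlier in this thread; other fields and exact values will differ per cluster):

apiVersion: v1
kind: ConfigMap
metadata:
  name: kubeadm-config
  namespace: kube-system
data:
  ClusterConfiguration: |
    apiVersion: kubeadm.k8s.io/v1beta3
    kind: ClusterConfiguration
    imageRepository: registry.k8s.io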

pierreozoux commented 8 months ago

Ok, I managed to update to 1.26.

On the update to 1.26, there was a brief downtime of the control plane, but after a few minutes it was back online. I'd say that is good enough.

I'm now planning the production upgrade.

I'll come here to report how it goes!

Thanks again for your help!

I'll close the ticket.

ghost commented 8 months ago

@pierreozoux Great news 👍

pierreozoux commented 8 months ago

I updated prod to 1.26.12, really happy! Thanks!

ghost commented 8 months ago

Hi @pierreozoux, congrats. By the way, how do you use Cilium, and which mode do you use?

pierreozoux commented 8 months ago

Everything is documented here: https://gitlab.mim-libre.fr/rizomo/cluster-prod/-/blob/main/cluster/README.md

For Cilium we use: https://raw.githubusercontent.com/syself/cluster-api-provider-hetzner/main/templates/cilium/cilium.yaml