sergelogvinov / proxmox-cloud-controller-manager

Kubernetes cloud controller manager for Proxmox
Apache License 2.0

CCM does not labelize nodes #63

Closed LeoShivas closed 1 year ago

LeoShivas commented 1 year ago

Bug Report

Description

Since I removed kube-proxy and let Cilium handle routing, my CCM no longer labels the nodes. I did a fresh install.

Logs

Here are some logs from my proxmox-cloud-controller-manager pod:

I1021 20:09:44.245894       1 serving.go:348] Generated self-signed cert in-memory
I1021 20:09:44.836589       1 serving.go:348] Generated self-signed cert in-memory
W1021 20:09:44.836642       1 client_config.go:618] Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.
I1021 20:09:45.281997       1 requestheader_controller.go:244] Loaded a new request header values for RequestHeaderAuthRequestController
I1021 20:09:45.283103       1 controllermanager.go:168] Version: v0.2.0
I1021 20:09:45.634437       1 requestheader_controller.go:169] Starting RequestHeaderAuthRequestController
I1021 20:09:45.634500       1 shared_informer.go:311] Waiting for caches to sync for RequestHeaderAuthRequestController
I1021 20:09:45.634529       1 configmap_cafile_content.go:202] "Starting controller" name="client-ca::kube-system::extension-apiserver-authentication::client-ca-file"
I1021 20:09:45.634561       1 shared_informer.go:311] Waiting for caches to sync for client-ca::kube-system::extension-apiserver-authentication::client-ca-file
I1021 20:09:45.634578       1 configmap_cafile_content.go:202] "Starting controller" name="client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file"
I1021 20:09:45.634584       1 shared_informer.go:311] Waiting for caches to sync for client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file
I1021 20:09:45.635263       1 tlsconfig.go:200] "Loaded serving cert" certName="Generated self signed cert" certDetail="\"localhost@1697918984\" [serving] validServingFor=[127.0.0.1,localhost,localhost] issuer=\"localhost-ca@1697918983\" (2023-10-21 19:09:43 +0000 UTC to 2024-10-20 19:09:43 +0000 UTC (now=2023-10-21 20:09:45.635180211 +0000 UTC))"
I1021 20:09:45.636389       1 named_certificates.go:53] "Loaded SNI cert" index=0 certName="self-signed loopback" certDetail="\"apiserver-loopback-client@1697918985\" [serving] validServingFor=[apiserver-loopback-client] issuer=\"apiserver-loopback-client-ca@1697918984\" (2023-10-21 19:09:44 +0000 UTC to 2024-10-20 19:09:44 +0000 UTC (now=2023-10-21 20:09:45.636344007 +0000 UTC))"
I1021 20:09:45.636684       1 secure_serving.go:210] Serving securely on :10258
I1021 20:09:45.637374       1 leaderelection.go:250] attempting to acquire leader lease kube-system/cloud-controller-manager-proxmox...
I1021 20:09:45.636900       1 tlsconfig.go:240] "Starting DynamicServingCertificateController"
I1021 20:09:45.734941       1 shared_informer.go:318] Caches are synced for RequestHeaderAuthRequestController
I1021 20:09:45.734939       1 shared_informer.go:318] Caches are synced for client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file
I1021 20:09:45.735291       1 shared_informer.go:318] Caches are synced for client-ca::kube-system::extension-apiserver-authentication::client-ca-file
I1021 20:09:45.735365       1 tlsconfig.go:178] "Loaded client CA" index=0 certName="client-ca::kube-system::extension-apiserver-authentication::client-ca-file,client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file" certDetail="\"front-proxy-ca\" [] validServingFor=[front-proxy-ca] issuer=\"<self>\" (2023-10-21 12:28:31 +0000 UTC to 2033-10-18 12:28:31 +0000 UTC (now=2023-10-21 20:09:45.73533554 +0000 UTC))"
I1021 20:09:45.735995       1 tlsconfig.go:200] "Loaded serving cert" certName="Generated self signed cert" certDetail="\"localhost@1697918984\" [serving] validServingFor=[127.0.0.1,localhost,localhost] issuer=\"localhost-ca@1697918983\" (2023-10-21 19:09:43 +0000 UTC to 2024-10-20 19:09:43 +0000 UTC (now=2023-10-21 20:09:45.73597482 +0000 UTC))"
I1021 20:09:45.736573       1 named_certificates.go:53] "Loaded SNI cert" index=0 certName="self-signed loopback" certDetail="\"apiserver-loopback-client@1697918985\" [serving] validServingFor=[apiserver-loopback-client] issuer=\"apiserver-loopback-client-ca@1697918984\" (2023-10-21 19:09:44 +0000 UTC to 2024-10-20 19:09:44 +0000 UTC (now=2023-10-21 20:09:45.736536011 +0000 UTC))"
I1021 20:09:45.736688       1 tlsconfig.go:178] "Loaded client CA" index=0 certName="client-ca::kube-system::extension-apiserver-authentication::client-ca-file,client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file" certDetail="\"kubernetes\" [] validServingFor=[kubernetes] issuer=\"<self>\" (2023-10-21 12:28:30 +0000 UTC to 2033-10-18 12:28:30 +0000 UTC (now=2023-10-21 20:09:45.736668497 +0000 UTC))"
I1021 20:09:45.736887       1 tlsconfig.go:178] "Loaded client CA" index=1 certName="client-ca::kube-system::extension-apiserver-authentication::client-ca-file,client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file" certDetail="\"front-proxy-ca\" [] validServingFor=[front-proxy-ca] issuer=\"<self>\" (2023-10-21 12:28:31 +0000 UTC to 2033-10-18 12:28:31 +0000 UTC (now=2023-10-21 20:09:45.736866132 +0000 UTC))"
I1021 20:09:45.738574       1 tlsconfig.go:200] "Loaded serving cert" certName="Generated self signed cert" certDetail="\"localhost@1697918984\" [serving] validServingFor=[127.0.0.1,localhost,localhost] issuer=\"localhost-ca@1697918983\" (2023-10-21 19:09:43 +0000 UTC to 2024-10-20 19:09:43 +0000 UTC (now=2023-10-21 20:09:45.73855163 +0000 UTC))"
I1021 20:09:45.739395       1 named_certificates.go:53] "Loaded SNI cert" index=0 certName="self-signed loopback" certDetail="\"apiserver-loopback-client@1697918985\" [serving] validServingFor=[apiserver-loopback-client] issuer=\"apiserver-loopback-client-ca@1697918984\" (2023-10-21 19:09:44 +0000 UTC to 2024-10-20 19:09:44 +0000 UTC (now=2023-10-21 20:09:45.739372846 +0000 UTC))"
I1021 20:10:02.179596       1 leaderelection.go:260] successfully acquired lease kube-system/cloud-controller-manager-proxmox
I1021 20:10:02.179991       1 event.go:307] "Event occurred" object="kube-system/cloud-controller-manager-proxmox" fieldPath="" kind="Lease" apiVersion="coordination.k8s.io/v1" type="Normal" reason="LeaderElection" message="proxmox-cloud-controller-manager-85d56dfbd6-449h4_94ace4d2-576d-44dc-8428-f77aa93d5c80 became leader"
I1021 20:10:03.124143       1 cloud.go:64] clientset initialized
I1021 20:10:03.156349       1 cloud.go:83] proxmox initialized
W1021 20:10:03.156373       1 controllermanager.go:314] "node-route-controller" is disabled
W1021 20:10:03.156381       1 controllermanager.go:314] "cloud-node-controller" is disabled
I1021 20:10:03.156387       1 controllermanager.go:318] Starting "cloud-node-lifecycle-controller"
I1021 20:10:03.160296       1 controllermanager.go:337] Started "cloud-node-lifecycle-controller"
W1021 20:10:03.160317       1 controllermanager.go:314] "service-lb-controller" is disabled
I1021 20:10:03.160511       1 node_lifecycle_controller.go:113] Sending events to api server
E1021 20:34:05.883135       1 leaderelection.go:332] error retrieving resource lock kube-system/cloud-controller-manager-proxmox: Get "https://10.96.0.1:443/apis/coordination.k8s.io/v1/namespaces/kube-system/leases/cloud-controller-manager-proxmox?timeout=5s": net/http: request canceled (Client.Timeout exceeded while awaiting headers)

Environment

Additional information

I've created my K8S cluster with these Ansible steps:

    - name: Create init conf file (for adding serverTLSBootstrap option)
      copy:
        dest: /etc/kubernetes/kubeadm-init.yaml
        content: |
          apiVersion: kubeadm.k8s.io/v1beta3
          kind: ClusterConfiguration
          controlPlaneEndpoint: "{{ kube_endpoint }}:6443"
          ---
          apiVersion: kubelet.config.k8s.io/v1beta1
          kind: KubeletConfiguration
          serverTLSBootstrap: true
        mode: 0644
    - name: Initialize Kubernetes cluster
      shell: kubeadm init --upload-certs --skip-phases=addon/kube-proxy --config /etc/kubernetes/kubeadm-init.yaml | tee kubeadm-init-`date '+%Y-%m-%d_%H-%M-%S'`.out

I've installed Cilium with these Ansible steps:

  - name: Add cilium helm repository
    kubernetes.core.helm_repository:
      name: cilium
      repo_url: "https://helm.cilium.io/"

  - name: Retrieve cilium CNI version
    shell: curl -s https://raw.githubusercontent.com/cilium/cilium/main/stable.txt
    register: cilium_version
    changed_when: false

  - name: Install cilium chart
    kubernetes.core.helm:
      name: cilium
      namespace: kube-system
      chart_ref: cilium/cilium
      chart_version: "{{ cilium_version.stdout }}"
      values:
        kubeProxyReplacement: true
        bpf.masquerade: true
        k8sServiceHost: "{{ kube_endpoint }}"
        k8sServicePort: 6443

Here are the Ansible steps I used for deploying the CCM and CSI plugin:

  - name: Install Proxmox CCM chart
    kubernetes.core.helm:
      name: proxmox-cloud-controller-manager
      namespace: kube-system
      chart_ref: oci://ghcr.io/sergelogvinov/charts/proxmox-cloud-controller-manager
      values:
        config:
          clusters:
            - url: "{{ proxmox_url }}"
              insecure: false
              token_id: "kubernetes@pve!ccm"
              token_secret: "xxxxxxxxxxxxxxxxx"
              region: main
        enabledControllers:
          - cloud-node-lifecycle
        nodeSelector:
          node-role.kubernetes.io/control-plane: ""
        tolerations:
          - key: node-role.kubernetes.io/control-plane
            effect: NoSchedule

  - name: Create Proxmox CSI namespace
    kubernetes.core.k8s:
      state: present
      definition:
        api_version: v1
        kind: Namespace
        metadata:
          name: csi-proxmox
          labels:
            app.kubernetes.io/managed-by: Helm
            pod-security.kubernetes.io/enforce: privileged
          annotations:
            meta.helm.sh/release-name: proxmox-csi-plugin
            meta.helm.sh/release-namespace: csi-proxmox

  - name: Install Proxmox CSI chart
    kubernetes.core.helm:
      name: proxmox-csi-plugin
      namespace: csi-proxmox
      chart_ref: oci://ghcr.io/sergelogvinov/charts/proxmox-csi-plugin
      values:
        config:
          clusters:
            - url: "{{ proxmox_url }}"
              insecure: false
              token_id: "kubernetes-csi@pve!csi"
              token_secret: "yyyyyyyyyyyyyyyyyyy"
              region: main
        node:
          nodeSelector:
          tolerations:
            - operator: Exists
        nodeSelector:
          node-role.kubernetes.io/control-plane: ""
        tolerations:
          - key: node-role.kubernetes.io/control-plane
            effect: NoSchedule
        storageClass:
          - name: proxmox-data
            storage: local
            reclaimPolicy: Delete
            fstype: ext4
            cache: none

Here is one of my worker nodes:

apiVersion: v1
kind: Node
metadata:
  annotations:
    kubeadm.alpha.kubernetes.io/cri-socket: unix:///var/run/containerd/containerd.sock
    node.alpha.kubernetes.io/ttl: "0"
    volumes.kubernetes.io/controller-managed-attach-detach: "true"
  creationTimestamp: "2023-10-21T12:35:47Z"
  labels:
    beta.kubernetes.io/arch: amd64
    beta.kubernetes.io/os: linux
    kubernetes.io/arch: amd64
    kubernetes.io/hostname: kube-wk-1
    kubernetes.io/os: linux
    node-role.kubernetes.io/worker: ""
  name: kube-wk-1
  resourceVersion: "103848"
  uid: 7b112df8-f186-48c9-921b-eec6aa1c0833
spec: {}
status:
  addresses:
  - address: 192.168.1.104
    type: InternalIP
  - address: kube-wk-1
    type: Hostname
  allocatable:
    cpu: "2"
    ephemeral-storage: "47265173836"
    hugepages-1Gi: "0"
    hugepages-2Mi: "0"
    memory: 1705900Ki
    pods: "110"
  capacity:
    cpu: "2"
    ephemeral-storage: 51285996Ki
    hugepages-1Gi: "0"
    hugepages-2Mi: "0"
    memory: 1808300Ki
    pods: "110"
  conditions:
  - lastHeartbeatTime: "2023-10-21T12:44:17Z"
    lastTransitionTime: "2023-10-21T12:44:17Z"
    message: Cilium is running on this node
    reason: CiliumIsUp
    status: "False"
    type: NetworkUnavailable
  - lastHeartbeatTime: "2023-10-21T20:48:06Z"
    lastTransitionTime: "2023-10-21T12:35:47Z"
    message: kubelet has sufficient memory available
    reason: KubeletHasSufficientMemory
    status: "False"
    type: MemoryPressure
  - lastHeartbeatTime: "2023-10-21T20:48:06Z"
    lastTransitionTime: "2023-10-21T12:35:47Z"
    message: kubelet has no disk pressure
    reason: KubeletHasNoDiskPressure
    status: "False"
    type: DiskPressure
  - lastHeartbeatTime: "2023-10-21T20:53:12Z"
    lastTransitionTime: "2023-10-21T12:35:47Z"
    message: kubelet has sufficient PID available
    reason: KubeletHasSufficientPID
    status: "False"
    type: PIDPressure
  - lastHeartbeatTime: "2023-10-21T20:53:12Z"
    lastTransitionTime: "2023-10-21T12:43:21Z"
    message: kubelet is posting ready status
    reason: KubeletReady
    status: "True"
    type: Ready
  daemonEndpoints:
    kubeletEndpoint:
      Port: 10250
  images:
  - names:
    - quay.io/cilium/cilium@sha256:e5ca22526e01469f8d10c14e2339a82a13ad70d9a359b879024715540eef4ace
    sizeBytes: 184477234
  - names:
    - ghcr.io/sergelogvinov/proxmox-csi-node@sha256:3ebc0d60f1d664dd1550ec1779c8929936a3eb700e6da643e50a43b8118be4d9
    - ghcr.io/sergelogvinov/proxmox-csi-node:v0.3.0
    sizeBytes: 28682577
  - names:
    - ghcr.io/postfinance/kubelet-csr-approver@sha256:bafbb479878906bfb69d635144027b90d42e58fb52d12b0663f77a58dd4fe416
    - ghcr.io/postfinance/kubelet-csr-approver:v1.0.5
    sizeBytes: 27379324
  - names:
    - quay.io/cilium/operator-generic@sha256:c9613277b72103ed36e9c0d16b9a17cafd507461d59340e432e3e9c23468b5e2
    sizeBytes: 25151466
  - names:
    - ghcr.io/sergelogvinov/proxmox-cloud-controller-manager@sha256:954ec00288cbed35afc9a6abf8b0bb28f812942b5df7b636e31ad8a38933d15d
    - ghcr.io/sergelogvinov/proxmox-cloud-controller-manager:v0.2.0
    sizeBytes: 17003810
  - names:
    - registry.k8s.io/coredns/coredns@sha256:a0ead06651cf580044aeb0a0feba63591858fb2e43ade8c9dea45a6a89ae7e5e
    - registry.k8s.io/coredns/coredns:v1.10.1
    sizeBytes: 16190758
  - names:
    - registry.k8s.io/sig-storage/csi-node-driver-registrar@sha256:cd21e19cd8bbd5bc56f1b4f1398a436e7897da2995d6d036c9729be3f4e456e6
    - registry.k8s.io/sig-storage/csi-node-driver-registrar:v2.9.0
    sizeBytes: 10755934
  - names:
    - registry.k8s.io/sig-storage/livenessprobe@sha256:82adbebdf5d5a1f40f246aef8ddbee7f89dea190652aefe83336008e69f9a89f
    - registry.k8s.io/sig-storage/livenessprobe:v2.11.0
    sizeBytes: 9696710
  - names:
    - registry.k8s.io/pause@sha256:3d380ca8864549e74af4b29c10f9cb0956236dfb01c40ca076fb6c37253234db
    - registry.k8s.io/pause:3.6
    sizeBytes: 301773
  nodeInfo:
    architecture: amd64
    bootID: 3af35a66-2cf4-4b27-b041-6a4d44b2b52f
    containerRuntimeVersion: containerd://1.6.24
    kernelVersion: 5.14.0-284.30.1.el9_2.x86_64
    kubeProxyVersion: v1.27.3
    kubeletVersion: v1.27.3
    machineID: fb1a9895e56540c4b29f780679468485
    operatingSystem: linux
    osImage: Rocky Linux 9.2 (Blue Onyx)
    systemUUID: fb1a9895-e565-40c4-b29f-780679468485
sergelogvinov commented 1 year ago

Hello,

  1. The CCM cannot connect to the Kubernetes API:
error retrieving resource lock kube-system/cloud-controller-manager-proxmox: Get "https://10.96.0.1:443/apis/coordination.k8s.io/v1/namespaces/kube-system/leases/cloud-controller-manager-proxmox?timeout=5s": net/http: request canceled (Client.Timeout exceeded while awaiting headers)

Check the cilium logs (see the quick checks sketched after the snippet below).

  2. enabledControllers

cloud-node - initializes (labels) the nodes.
cloud-node-lifecycle - only deletes the node resource if it was deleted in Proxmox.

So use both:

        enabledControllers:
          - cloud-node
          - cloud-node-lifecycle
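
A couple of quick checks for both points above (a sketch: it assumes the standard cilium DaemonSet in kube-system and that the cloud-node controller applies the usual providerID/topology metadata):

# 1. With kube-proxy removed, in-cluster access to https://10.96.0.1:443 relies on
#    Cilium's kube-proxy replacement, so check the agent logs and its status.
kubectl -n kube-system logs ds/cilium --tail=100
kubectl -n kube-system exec ds/cilium -- cilium status | grep -i kubeproxy

# 2. Once both controllers work, nodes should get a providerID and the standard
#    topology labels from the cloud-node controller.
kubectl get nodes -o custom-columns=NAME:.metadata.name,PROVIDER:.spec.providerID
kubectl get nodes --show-labels | grep -E 'topology\.kubernetes\.io/(region|zone)'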
sergelogvinov commented 1 year ago

PS, my cilium config - https://github.com/sergelogvinov/terraform-talos/blob/main/_deployments/vars/cilium.yaml

LeoShivas commented 1 year ago

Since my last message, I've re-enabled the kube-proxy.

I've also corrected the enabledControllers and set it as follow :

  - name: Install Proxmox CCM chart
    kubernetes.core.helm:
      name: proxmox-cloud-controller-manager
      namespace: kube-system
      chart_ref: oci://ghcr.io/sergelogvinov/charts/proxmox-cloud-controller-manager
      values:
        config:
          clusters:
            - url: "{{ proxmox_url }}"
              insecure: false
              token_id: "kubernetes@pve!ccm"
              token_secret: "xxxxxxxxxxxxxxxxxx"
              region: main
        enabledControllers:
          - cloud-node
          - cloud-node-lifecycle
        nodeSelector:
          node-role.kubernetes.io/control-plane: ""
        tolerations:
          - key: node-role.kubernetes.io/control-plane
            effect: NoSchedule

Here are the logs I encountered in the kube-system/proxmox-cloud-controller-manager-7b85484c94 pod:

E1023 07:35:40.972347       1 leaderelection.go:332] error retrieving resource lock kube-system/cloud-controller-manager-proxmox: Get "https://10.96.0.1:443/apis/coordination.k8s.io/v1/namespaces/kube-system/leases/cloud-controller-manager-proxmox?timeout=5s": net/http: request canceled (Client.Timeout exceeded while awaiting headers)
I1023 07:36:38.058263       1 instances.go:159] instances.InstanceMetadata() is kubelet has --cloud-provider=external on the node kube-cp-1?
I1023 07:36:38.058542       1 instances.go:159] instances.InstanceMetadata() is kubelet has --cloud-provider=external on the node kube-cp-2?
I1023 07:36:38.058650       1 instances.go:159] instances.InstanceMetadata() is kubelet has --cloud-provider=external on the node kube-wk-1?
I1023 07:36:38.058697       1 instances.go:159] instances.InstanceMetadata() is kubelet has --cloud-provider=external on the node kube-wk-2?
I1023 07:36:38.058780       1 instances.go:159] instances.InstanceMetadata() is kubelet has --cloud-provider=external on the node kube-wk-3?
I1023 07:36:38.097378       1 node_controller.go:267] Update 5 nodes status took 299.025382ms.

I will have a look at your Cilium configuration.

sergelogvinov commented 1 year ago

You need to have the --cloud-provider parameter on the kubelet daemon. Without it, the nodes will initialize themselves.

LeoShivas commented 1 year ago

You need to have the --cloud-provider parameter on the kubelet daemon. Without it, the nodes will initialize themselves.

But as the official kubelet documentation states:

--cloud-provider string
The provider for cloud services. Set to empty string for running with no cloud provider. If set, the cloud provider 
determines the name of the node (consult cloud provider documentation to determine if and how the hostname 
is used). (DEPRECATED: will be removed in 1.24 or later, in favor of removing cloud provider code from kubelet.)

Wait! As I'm writing this, I see that they "Undeprecated kubelet cloud-provider flag" 4 days ago!

sergelogvinov commented 1 year ago

In most cases, DEPRECATED means you need to use the kubelet config.yaml :)

LeoShivas commented 1 year ago

Yes, you're surely right.

But I can't find the --cloud-provider equivalent option for the KubeletConfiguration object: https://kubernetes.io/docs/reference/config-api/kubelet-config.v1beta1/#kubelet-config-k8s-io-v1beta1-KubeletConfiguration

Here's mine:

          apiVersion: kubelet.config.k8s.io/v1beta1
          kind: KubeletConfiguration
          serverTLSBootstrap: true
          providerID: "proxmox://mycluster/mypvenode"
sergelogvinov commented 1 year ago

The Proxmox CCM can set the providerID for you... (if the node name == the VM name in Proxmox)
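
A quick way to check (just a sketch; the node name is an example from this thread):

# Shows the providerID the CCM has set on the node; empty output means the
# cloud-node controller has not initialized it yet.
kubectl get node kube-wk-1 -o jsonpath='{.spec.providerID}{"\n"}'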

LeoShivas commented 1 year ago

I've manually added the --cloud-provider option to the kubelet service's command line:

[root@kube-cp-1 ~]# systemctl cat kubelet.service
# /usr/lib/systemd/system/kubelet.service
[Unit]
Description=kubelet: The Kubernetes Node Agent
Documentation=https://kubernetes.io/docs/
Wants=network-online.target
After=network-online.target

[Service]
ExecStart=/usr/bin/kubelet
Restart=always
StartLimitInterval=0
RestartSec=10

[Install]
WantedBy=multi-user.target

# /usr/lib/systemd/system/kubelet.service.d/10-kubeadm.conf
# Note: This dropin only works with kubeadm and kubelet v1.11+
[Service]
Environment="KUBELET_KUBECONFIG_ARGS=--bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf --kubeconfig=/etc/kubernetes/kubelet.conf"
Environment="KUBELET_CONFIG_ARGS=--config=/var/lib/kubelet/config.yaml"
# This is a file that "kubeadm init" and "kubeadm join" generates at runtime, populating the KUBELET_KUBEADM_ARGS variable dynamically
EnvironmentFile=-/var/lib/kubelet/kubeadm-flags.env
# This is a file that the user can use for overrides of the kubelet args as a last resort. Preferably, the user should use
# the .NodeRegistration.KubeletExtraArgs object in the configuration files instead. KUBELET_EXTRA_ARGS should be sourced from this file.
EnvironmentFile=-/etc/sysconfig/kubelet
ExecStart=
ExecStart=/usr/bin/kubelet --cloud-provider=external $KUBELET_KUBECONFIG_ARGS $KUBELET_CONFIG_ARGS $KUBELET_KUBEADM_ARGS $KUBELET_EXTRA_ARGS

I've run the following commands:

systemctl daemon-reload
systemctl restart kubelet.service

I've deleted the kube-system/proxmox-cloud-controller-manager- pod and waited a while.

Here are the logs of my kubelet service:

[root@kube-cp-1 ~]# journalctl -u kubelet -f
Oct 23 11:02:05 kube-cp-1 kubelet[47031]:   "Metadata": null
Oct 23 11:02:05 kube-cp-1 kubelet[47031]: }. Err: connection error: desc = "transport: Error while dialing dial unix /var/lib/kubelet/plugins_registry/csi.proxmox.sinextra.dev-reg.sock: connect: connection refused"
Oct 23 11:02:08 kube-cp-1 kubelet[47031]: W1023 11:02:08.103331   47031 logging.go:59] [core] [Channel #60 SubChannel #61] grpc: addrConn.createTransport failed to connect to {
Oct 23 11:02:08 kube-cp-1 kubelet[47031]:   "Addr": "/var/lib/kubelet/plugins_registry/csi.proxmox.sinextra.dev-reg.sock",
Oct 23 11:02:08 kube-cp-1 kubelet[47031]:   "ServerName": "/var/lib/kubelet/plugins_registry/csi.proxmox.sinextra.dev-reg.sock",
Oct 23 11:02:08 kube-cp-1 kubelet[47031]:   "Attributes": null,
Oct 23 11:02:08 kube-cp-1 kubelet[47031]:   "BalancerAttributes": null,
Oct 23 11:02:08 kube-cp-1 kubelet[47031]:   "Type": 0,
Oct 23 11:02:08 kube-cp-1 kubelet[47031]:   "Metadata": null
Oct 23 11:02:08 kube-cp-1 kubelet[47031]: }. Err: connection error: desc = "transport: Error while dialing dial unix /var/lib/kubelet/plugins_registry/csi.proxmox.sinextra.dev-reg.sock: connect: connection refused"
Oct 23 11:02:11 kube-cp-1 kubelet[47031]: W1023 11:02:11.775421   47031 logging.go:59] [core] [Channel #60 SubChannel #61] grpc: addrConn.createTransport failed to connect to {
Oct 23 11:02:11 kube-cp-1 kubelet[47031]:   "Addr": "/var/lib/kubelet/plugins_registry/csi.proxmox.sinextra.dev-reg.sock",
Oct 23 11:02:11 kube-cp-1 kubelet[47031]:   "ServerName": "/var/lib/kubelet/plugins_registry/csi.proxmox.sinextra.dev-reg.sock",
Oct 23 11:02:11 kube-cp-1 kubelet[47031]:   "Attributes": null,
Oct 23 11:02:11 kube-cp-1 kubelet[47031]:   "BalancerAttributes": null,
Oct 23 11:02:11 kube-cp-1 kubelet[47031]:   "Type": 0,
Oct 23 11:02:11 kube-cp-1 kubelet[47031]:   "Metadata": null
Oct 23 11:02:11 kube-cp-1 kubelet[47031]: }. Err: connection error: desc = "transport: Error while dialing dial unix /var/lib/kubelet/plugins_registry/csi.proxmox.sinextra.dev-reg.sock: connect: connection refused"
Oct 23 11:02:12 kube-cp-1 kubelet[47031]: E1023 11:02:12.487082   47031 goroutinemap.go:150] Operation for "/var/lib/kubelet/plugins_registry/csi.proxmox.sinextra.dev-reg.sock" failed. No retries permitted until 2023-10-23 11:02:20.487055393 +0200 CEST m=+213.991348269 (durationBeforeRetry 8s). Error: RegisterPlugin error -- dial failed at socket /var/lib/kubelet/plugins_registry/csi.proxmox.sinextra.dev-reg.sock, err: failed to dial socket /var/lib/kubelet/plugins_registry/csi.proxmox.sinextra.dev-reg.sock, err: context deadline exceeded
Oct 23 11:02:13 kube-cp-1 kubelet[47031]: I1023 11:02:13.868747   47031 scope.go:115] "RemoveContainer" containerID="6c85953efd6b2a685899bd9c7c7fcaf758ee61c6a2efb3ad109483f33fa95b24"
Oct 23 11:02:13 kube-cp-1 kubelet[47031]: E1023 11:02:13.869277   47031 pod_workers.go:1294] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"csi-node-driver-registrar\" with CrashLoopBackOff: \"back-off 1m20s restarting failed container=csi-node-driver-registrar pod=proxmox-csi-plugin-node-bmn4q_csi-proxmox(3a4674a9-5e87-4d30-bb0f-de83ecafa20b)\"" pod="csi-proxmox/proxmox-csi-plugin-node-bmn4q" podUID=3a4674a9-5e87-4d30-bb0f-de83ecafa20b

I haven't deployed the CSI plugin yet.

Proxmox CCM can set providerID for you... (if name of the node == name VM in proxmox)

Yeah, I know it's supposed to do it :-) but it seems my CCM doesn't work... :-(

LeoShivas commented 1 year ago

I've reinstalled everything from scratch (as usual :-) ). If you want (and have time to do so), you can have a look at my Ansible playbook deployment: https://github.com/LeoShivas/GitOps/blob/main/ansible/playbooks/kubernetes/playbook-kube-install.yml

I've updated the kubelet service by adding the --cloud-provider option on one node (kube-cp-1) and rebooted it.

Here are the kube-system/proxmox-cloud-controller-manager-xxxxx pod logs:

I1023 11:17:10.406986       1 instances.go:159] instances.InstanceMetadata() is kubelet has --cloud-provider=external on the node kube-cp-1?
I1023 11:17:10.407037       1 instances.go:159] instances.InstanceMetadata() is kubelet has --cloud-provider=external on the node kube-cp-2?
I1023 11:17:10.407057       1 instances.go:159] instances.InstanceMetadata() is kubelet has --cloud-provider=external on the node kube-wk-1?
I1023 11:17:10.407280       1 instances.go:159] instances.InstanceMetadata() is kubelet has --cloud-provider=external on the node kube-wk-2?
I1023 11:17:10.407315       1 instances.go:159] instances.InstanceMetadata() is kubelet has --cloud-provider=external on the node kube-wk-3?
I1023 11:17:10.407464       1 node_controller.go:267] Update 5 nodes status took 569.596µs.

Here are the kubelet systemd service logs (after the reboot, when I deleted the CCM pod):

Oct 23 13:26:43 kube-cp-1 kubelet[908]: I1023 13:26:43.241837     908 scope.go:115] "RemoveContainer" containerID="00e3711823c388eee774b6a9d2c6e8a9ebe00ee8b8f24243fabecfa7f74222e3"
Oct 23 13:26:43 kube-cp-1 kubelet[908]: I1023 13:26:43.317350     908 reconciler_common.go:172] "operationExecutor.UnmountVolume started for volume \"cloud-config\" (UniqueName: \"kubernetes.io/secret/d0390647-30eb-41ca-93cc-5726172d86c8-cloud-config\") pod \"d0390647-30eb-41ca-93cc-5726172d86c8\" (UID: \"d0390647-30eb-41ca-93cc-5726172d86c8\") "
Oct 23 13:26:43 kube-cp-1 kubelet[908]: I1023 13:26:43.317434     908 reconciler_common.go:172] "operationExecutor.UnmountVolume started for volume \"kube-api-access-2ghs9\" (UniqueName: \"kubernetes.io/projected/d0390647-30eb-41ca-93cc-5726172d86c8-kube-api-access-2ghs9\") pod \"d0390647-30eb-41ca-93cc-5726172d86c8\" (UID: \"d0390647-30eb-41ca-93cc-5726172d86c8\") "
Oct 23 13:26:43 kube-cp-1 kubelet[908]: I1023 13:26:43.339730     908 operation_generator.go:878] UnmountVolume.TearDown succeeded for volume "kubernetes.io/secret/d0390647-30eb-41ca-93cc-5726172d86c8-cloud-config" (OuterVolumeSpecName: "cloud-config") pod "d0390647-30eb-41ca-93cc-5726172d86c8" (UID: "d0390647-30eb-41ca-93cc-5726172d86c8"). InnerVolumeSpecName "cloud-config". PluginName "kubernetes.io/secret", VolumeGidValue ""
Oct 23 13:26:43 kube-cp-1 kubelet[908]: I1023 13:26:43.343287     908 operation_generator.go:878] UnmountVolume.TearDown succeeded for volume "kubernetes.io/projected/d0390647-30eb-41ca-93cc-5726172d86c8-kube-api-access-2ghs9" (OuterVolumeSpecName: "kube-api-access-2ghs9") pod "d0390647-30eb-41ca-93cc-5726172d86c8" (UID: "d0390647-30eb-41ca-93cc-5726172d86c8"). InnerVolumeSpecName "kube-api-access-2ghs9". PluginName "kubernetes.io/projected", VolumeGidValue ""
Oct 23 13:26:43 kube-cp-1 kubelet[908]: I1023 13:26:43.418646     908 reconciler_common.go:300] "Volume detached for volume \"kube-api-access-2ghs9\" (UniqueName: \"kubernetes.io/projected/d0390647-30eb-41ca-93cc-5726172d86c8-kube-api-access-2ghs9\") on node \"kube-cp-1\" DevicePath \"\""
Oct 23 13:26:43 kube-cp-1 kubelet[908]: I1023 13:26:43.418696     908 reconciler_common.go:300] "Volume detached for volume \"cloud-config\" (UniqueName: \"kubernetes.io/secret/d0390647-30eb-41ca-93cc-5726172d86c8-cloud-config\") on node \"kube-cp-1\" DevicePath \"\""
Oct 23 13:26:43 kube-cp-1 kubelet[908]: I1023 13:26:43.670881     908 scope.go:115] "RemoveContainer" containerID="8a483e45dc67a5df029d1a8473cabadaa418ab8427d1694c87bacb61da53503e"
Oct 23 13:26:43 kube-cp-1 kubelet[908]: I1023 13:26:43.704316     908 topology_manager.go:212] "Topology Admit Handler"
Oct 23 13:26:43 kube-cp-1 kubelet[908]: E1023 13:26:43.704617     908 cpu_manager.go:395] "RemoveStaleState: removing container" podUID="d0390647-30eb-41ca-93cc-5726172d86c8" containerName="proxmox-cloud-controller-manager"
Oct 23 13:26:43 kube-cp-1 kubelet[908]: E1023 13:26:43.704775     908 cpu_manager.go:395] "RemoveStaleState: removing container" podUID="d0390647-30eb-41ca-93cc-5726172d86c8" containerName="proxmox-cloud-controller-manager"
Oct 23 13:26:43 kube-cp-1 kubelet[908]: I1023 13:26:43.704917     908 memory_manager.go:346] "RemoveStaleState removing state" podUID="d0390647-30eb-41ca-93cc-5726172d86c8" containerName="proxmox-cloud-controller-manager"
Oct 23 13:26:43 kube-cp-1 kubelet[908]: I1023 13:26:43.705007     908 memory_manager.go:346] "RemoveStaleState removing state" podUID="d0390647-30eb-41ca-93cc-5726172d86c8" containerName="proxmox-cloud-controller-manager"
Oct 23 13:26:43 kube-cp-1 kubelet[908]: I1023 13:26:43.767041     908 scope.go:115] "RemoveContainer" containerID="00e3711823c388eee774b6a9d2c6e8a9ebe00ee8b8f24243fabecfa7f74222e3"
Oct 23 13:26:43 kube-cp-1 kubelet[908]: E1023 13:26:43.767874     908 remote_runtime.go:415] "ContainerStatus from runtime service failed" err="rpc error: code = NotFound desc = an error occurred when try to find container \"00e3711823c388eee774b6a9d2c6e8a9ebe00ee8b8f24243fabecfa7f74222e3\": not found" containerID="00e3711823c388eee774b6a9d2c6e8a9ebe00ee8b8f24243fabecfa7f74222e3"
Oct 23 13:26:43 kube-cp-1 kubelet[908]: I1023 13:26:43.767980     908 pod_container_deletor.go:53] "DeleteContainer returned error" containerID={Type:containerd ID:00e3711823c388eee774b6a9d2c6e8a9ebe00ee8b8f24243fabecfa7f74222e3} err="failed to get container status \"00e3711823c388eee774b6a9d2c6e8a9ebe00ee8b8f24243fabecfa7f74222e3\": rpc error: code = NotFound desc = an error occurred when try to find container \"00e3711823c388eee774b6a9d2c6e8a9ebe00ee8b8f24243fabecfa7f74222e3\": not found"
Oct 23 13:26:43 kube-cp-1 kubelet[908]: I1023 13:26:43.768010     908 scope.go:115] "RemoveContainer" containerID="8a483e45dc67a5df029d1a8473cabadaa418ab8427d1694c87bacb61da53503e"
Oct 23 13:26:43 kube-cp-1 kubelet[908]: E1023 13:26:43.768488     908 remote_runtime.go:415] "ContainerStatus from runtime service failed" err="rpc error: code = NotFound desc = an error occurred when try to find container \"8a483e45dc67a5df029d1a8473cabadaa418ab8427d1694c87bacb61da53503e\": not found" containerID="8a483e45dc67a5df029d1a8473cabadaa418ab8427d1694c87bacb61da53503e"
Oct 23 13:26:43 kube-cp-1 kubelet[908]: I1023 13:26:43.768523     908 pod_container_deletor.go:53] "DeleteContainer returned error" containerID={Type:containerd ID:8a483e45dc67a5df029d1a8473cabadaa418ab8427d1694c87bacb61da53503e} err="failed to get container status \"8a483e45dc67a5df029d1a8473cabadaa418ab8427d1694c87bacb61da53503e\": rpc error: code = NotFound desc = an error occurred when try to find container \"8a483e45dc67a5df029d1a8473cabadaa418ab8427d1694c87bacb61da53503e\": not found"
Oct 23 13:26:43 kube-cp-1 kubelet[908]: I1023 13:26:43.819850     908 reconciler_common.go:258] "operationExecutor.VerifyControllerAttachedVolume started for volume \"cloud-config\" (UniqueName: \"kubernetes.io/secret/86a021a4-5f5c-4b6a-b86e-4b28fa9f06c4-cloud-config\") pod \"proxmox-cloud-controller-manager-7b85484c94-dc6ll\" (UID: \"86a021a4-5f5c-4b6a-b86e-4b28fa9f06c4\") " pod="kube-system/proxmox-cloud-controller-manager-7b85484c94-dc6ll"
Oct 23 13:26:43 kube-cp-1 kubelet[908]: I1023 13:26:43.819910     908 reconciler_common.go:258] "operationExecutor.VerifyControllerAttachedVolume started for volume \"kube-api-access-zlkp4\" (UniqueName: \"kubernetes.io/projected/86a021a4-5f5c-4b6a-b86e-4b28fa9f06c4-kube-api-access-zlkp4\") pod \"proxmox-cloud-controller-manager-7b85484c94-dc6ll\" (UID: \"86a021a4-5f5c-4b6a-b86e-4b28fa9f06c4\") " pod="kube-system/proxmox-cloud-controller-manager-7b85484c94-dc6ll"
Oct 23 13:26:46 kube-cp-1 kubelet[908]: I1023 13:26:46.187057     908 kubelet_volumes.go:161] "Cleaned up orphaned pod volumes dir" podUID=d0390647-30eb-41ca-93cc-5726172d86c8 path="/var/lib/kubelet/pods/d0390647-30eb-41ca-93cc-5726172d86c8/volumes"
LeoShivas commented 1 year ago

I've updated my init step by adding a .nodeRegistration.kubeletExtraArgs entry (a join-side sketch follows the snippet):

    - name: Create init conf file (for adding serverTLSBootstrap option)
      copy:
        dest: /etc/kubernetes/kubeadm-init.yaml
        content: |
          apiVersion: kubeadm.k8s.io/v1beta3
          kind: ClusterConfiguration
          controlPlaneEndpoint: "{{ kube_endpoint }}:6443"
          ---
          apiVersion: kubelet.config.k8s.io/v1beta1
          kind: KubeletConfiguration
          serverTLSBootstrap: true
          ---
          apiVersion: kubeadm.k8s.io/v1beta3
          kind: InitConfiguration
          nodeRegistration:
            kubeletExtraArgs:
              cloud-provider: "external"
        mode: 0644
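
For the nodes that join afterwards, the same flag has to reach their kubelet too. A minimal, hypothetical sketch (placeholder endpoint and token) using kubeadm's JoinConfiguration:

# Hypothetical join-side config; JoinConfiguration supports
# nodeRegistration.kubeletExtraArgs just like InitConfiguration, so joining
# nodes also start their kubelet with --cloud-provider=external.
cat > /etc/kubernetes/kubeadm-join.yaml <<'EOF'
apiVersion: kubeadm.k8s.io/v1beta3
kind: JoinConfiguration
discovery:
  bootstrapToken:
    apiServerEndpoint: "kube-endpoint.example:6443"  # placeholder
    token: "abcdef.0123456789abcdef"                 # placeholder
    unsafeSkipCAVerification: true                   # sketch only; prefer caCertHashes
nodeRegistration:
  kubeletExtraArgs:
    cloud-provider: "external"
EOF
kubeadm join --config /etc/kubernetes/kubeadm-join.yaml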

My kubelet service now starts with the --cloud-provider=external option:

[root@kube-cp-1 ~]# systemctl status kubelet.service
● kubelet.service - kubelet: The Kubernetes Node Agent
     Loaded: loaded (/usr/lib/systemd/system/kubelet.service; enabled; preset: disabled)
    Drop-In: /usr/lib/systemd/system/kubelet.service.d
             └─10-kubeadm.conf
     Active: active (running) since Mon 2023-10-23 13:58:18 CEST; 21min ago
       Docs: https://kubernetes.io/docs/
   Main PID: 8726 (kubelet)
      Tasks: 14 (limit: 10842)
     Memory: 102.6M
        CPU: 22.736s
     CGroup: /system.slice/kubelet.service
             └─8726 /usr/bin/kubelet --bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf --kubeconfig=/etc/kubernetes/kubelet.conf --config=/var/lib/kubelet/config.yaml --cloud-provider=external >

The errors are still present in the CCM. I don't know what to try next.

sergelogvinov commented 1 year ago

I think you need to delete the node resource first and then restart the kubelet, because the node was already initialized... Also try to set --node-ip=${INTERFACE_IP} in the kubelet params (in case you have more than one IP).
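
A minimal sketch of that re-initialization (the node name is just an example):

# Remove the already-initialized node object so the CCM can initialize it again.
kubectl delete node kube-wk-1

# Then, on the node itself, restart the kubelet so it re-registers with the new flags.
systemctl daemon-reload
systemctl restart kubelet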

LeoShivas commented 1 year ago

I think you need to delete the node resource first and then restart the kubelet, because the node was already initialized... Also try to set --node-ip=${INTERFACE_IP} in the kubelet params (in case you have more than one IP).

My last comment is the result after destroying and recreating the VMs.

I only have a single "public/private" IP per node:

[root@kube-cp-1 ~]# ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
    link/ether e2:4d:3e:3a:e5:e0 brd ff:ff:ff:ff:ff:ff
    altname enp0s18
    altname ens18
    inet 192.168.1.105/24 brd 192.168.1.255 scope global dynamic noprefixroute eth0
       valid_lft 6501sec preferred_lft 6501sec
    inet6 fe80::e04d:3eff:fe3a:e5e0/64 scope link noprefixroute
       valid_lft forever preferred_lft forever
3: cilium_net@cilium_host: <BROADCAST,MULTICAST,NOARP,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
    link/ether 46:5b:12:f5:f9:57 brd ff:ff:ff:ff:ff:ff
    inet6 fe80::445b:12ff:fef5:f957/64 scope link
       valid_lft forever preferred_lft forever
4: cilium_host@cilium_net: <BROADCAST,MULTICAST,NOARP,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
    link/ether 2e:ca:5f:4c:50:20 brd ff:ff:ff:ff:ff:ff
    inet 10.0.1.219/32 scope global cilium_host
       valid_lft forever preferred_lft forever
    inet6 fe80::2cca:5fff:fe4c:5020/64 scope link
       valid_lft forever preferred_lft forever
5: cilium_vxlan: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN group default qlen 1000
    link/ether e6:7b:07:d4:36:07 brd ff:ff:ff:ff:ff:ff
    inet6 fe80::e47b:7ff:fed4:3607/64 scope link
       valid_lft forever preferred_lft forever
7: lxc_health@if6: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
    link/ether 8a:ca:4d:2f:cf:07 brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet6 fe80::88ca:4dff:fe2f:cf07/64 scope link
       valid_lft forever preferred_lft forever
sergelogvinov commented 1 year ago

Can you show the output of kubectl describe node kube-cp-1?

LeoShivas commented 1 year ago

Yes, sure:

Name:               kube-cp-1
Roles:              control-plane
Labels:             beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/os=linux
                    kubernetes.io/arch=amd64
                    kubernetes.io/hostname=kube-cp-1
                    kubernetes.io/os=linux
                    node-role.kubernetes.io/control-plane=
                    node.kubernetes.io/exclude-from-external-load-balancers=
Annotations:        kubeadm.alpha.kubernetes.io/cri-socket: unix:///var/run/containerd/containerd.sock
                    node.alpha.kubernetes.io/ttl: 0
                    volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp:  Mon, 23 Oct 2023 13:58:07 +0200
Taints:             node-role.kubernetes.io/control-plane:NoSchedule
Unschedulable:      false
Lease:
  HolderIdentity:  kube-cp-1
  AcquireTime:     <unset>
  RenewTime:       Mon, 23 Oct 2023 19:13:57 +0200
Conditions:
  Type                 Status  LastHeartbeatTime                 LastTransitionTime                Reason                       Message
  ----                 ------  -----------------                 ------------------                ------                       -------
  NetworkUnavailable   False   Mon, 23 Oct 2023 14:10:44 +0200   Mon, 23 Oct 2023 14:10:44 +0200   CiliumIsUp                   Cilium is running on this node
  MemoryPressure       False   Mon, 23 Oct 2023 19:11:15 +0200   Mon, 23 Oct 2023 13:58:07 +0200   KubeletHasSufficientMemory   kubelet has sufficient memory available
  DiskPressure         False   Mon, 23 Oct 2023 19:11:15 +0200   Mon, 23 Oct 2023 13:58:07 +0200   KubeletHasNoDiskPressure     kubelet has no disk pressure
  PIDPressure          False   Mon, 23 Oct 2023 19:11:15 +0200   Mon, 23 Oct 2023 13:58:07 +0200   KubeletHasSufficientPID      kubelet has sufficient PID available
  Ready                True    Mon, 23 Oct 2023 19:11:15 +0200   Mon, 23 Oct 2023 14:09:56 +0200   KubeletReady                 kubelet is posting ready status
Addresses:
  InternalIP:  192.168.1.105
  Hostname:    kube-cp-1
Capacity:
  cpu:                2
  ephemeral-storage:  51285996Ki
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             1808300Ki
  pods:               110
Allocatable:
  cpu:                2
  ephemeral-storage:  47265173836
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             1705900Ki
  pods:               110
System Info:
  Machine ID:                 d359f13c93ac46b5b2f1bac975e4cbae
  System UUID:                d359f13c-93ac-46b5-b2f1-bac975e4cbae
  Boot ID:                    57aa57fe-79c3-408b-b5cb-7d32ed57b8f2
  Kernel Version:             5.14.0-284.30.1.el9_2.x86_64
  OS Image:                   Rocky Linux 9.2 (Blue Onyx)
  Operating System:           linux
  Architecture:               amd64
  Container Runtime Version:  containerd://1.6.24
  Kubelet Version:            v1.27.3
  Kube-Proxy Version:         v1.27.3
Non-terminated Pods:          (6 in total)
  Namespace                   Name                                 CPU Requests  CPU Limits  Memory Requests  Memory Limits  Age
  ---------                   ----                                 ------------  ----------  ---------------  -------------  ---
  kube-system                 cilium-2brfz                         100m (5%)     0 (0%)      100Mi (6%)       0 (0%)         5h9m
  kube-system                 etcd-kube-cp-1                       100m (5%)     0 (0%)      100Mi (6%)       0 (0%)         5h15m
  kube-system                 kube-apiserver-kube-cp-1             250m (12%)    0 (0%)      0 (0%)           0 (0%)         5h15m
  kube-system                 kube-controller-manager-kube-cp-1    200m (10%)    0 (0%)      0 (0%)           0 (0%)         5h15m
  kube-system                 kube-proxy-j276n                     0 (0%)        0 (0%)      0 (0%)           0 (0%)         5h15m
  kube-system                 kube-scheduler-kube-cp-1             100m (5%)     0 (0%)      0 (0%)           0 (0%)         5h15m
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource           Requests     Limits
  --------           --------     ------
  cpu                750m (37%)   0 (0%)
  memory             200Mi (12%)  0 (0%)
  ephemeral-storage  0 (0%)       0 (0%)
  hugepages-1Gi      0 (0%)       0 (0%)
  hugepages-2Mi      0 (0%)       0 (0%)
Events:              <none>
sergelogvinov commented 1 year ago

Oh, it does not have the alpha.kubernetes.io/provided-node-ip annotation. Check the kubelet params again (ps axfwww); it has to have --cloud-provider=external.

Maybe you need to run systemctl daemon-reload too.
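
Two quick checks for that (a sketch; the node name is an example):

# Confirm the kubelet process is really running with --cloud-provider=external.
ps axfwww | grep '[k]ubelet'

# Confirm the kubelet has set the provided-node-ip annotation on the node object.
kubectl get node kube-cp-1 -o yaml | grep provided-node-ip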

LeoShivas commented 1 year ago

I've progressed.

Thanks a lot for all your work and the time you give to me.

In my requirements Ansible scripts, I've added the following step:

- name: Add node IP in kubelet config
  lineinfile:
    path: /etc/sysconfig/kubelet
    regexp: '^KUBELET_EXTRA_ARGS='
    line: KUBELET_EXTRA_ARGS=--node-ip={{ ansible_default_ipv4.address }} --cloud-provider=external
    state: present
    create: yes
    mode: a+r

So, all my nodes now have the alpha.kubernetes.io/provided-node-ip annotation.

But they now have the node.cloudprovider.kubernetes.io/uninitialized=true:NoSchedule taint.
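
A quick way to see which nodes still carry it:

# Lists the taints per node; initialized nodes should no longer show the
# node.cloudprovider.kubernetes.io/uninitialized taint.
kubectl get nodes -o custom-columns=NAME:.metadata.name,TAINTS:.spec.taints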

Here is my Helm install step:

  - name: Install Proxmox CCM chart
    kubernetes.core.helm:
      name: proxmox-cloud-controller-manager
      namespace: kube-system
      chart_ref: oci://ghcr.io/sergelogvinov/charts/proxmox-cloud-controller-manager
      values:
        config:
          clusters:
            - url: "{{ proxmox_url }}"
              insecure: false
              token_id: "kubernetes@pve!ccm"
              token_secret: "xxxxxxxxxxxxxxxxxxx"
              region: main
        enabledControllers:
          - cloud-node
          - cloud-node-lifecycle
        nodeSelector:
          node-role.kubernetes.io/control-plane: ""

Very strange behavior: the CCM can't interact with my nodes because coredns is in a Pending state.

When I edit the coredns deployment by adding this toleration:

      - effect: NoSchedule
        key: node.cloudprovider.kubernetes.io/uninitialized
        value: "true"

the CCM removes the node.cloudprovider.kubernetes.io/uninitialized=true:NoSchedule taint and all the pods go into a Running state.

Is it normal that I had to update the coredns deployment to make it work?

sergelogvinov commented 1 year ago

Congratulations!!! You've got it 👍

Yep, coredns should have this toleration. Or you can run the CCM as a DaemonSet with host network... But we do not patch the network - it doesn't make sense.
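
One imperative way to add that toleration (a sketch; it assumes the Deployment already defines a tolerations list, as the default kubeadm coredns does):

# Appends the "uninitialized" toleration to the coredns Deployment so its pods
# can be scheduled before the CCM has removed the taint from the nodes.
kubectl -n kube-system patch deployment coredns --type=json -p '[
  {"op": "add",
   "path": "/spec/template/spec/tolerations/-",
   "value": {"key": "node.cloudprovider.kubernetes.io/uninitialized",
             "operator": "Exists",
             "effect": "NoSchedule"}}
]'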

LeoShivas commented 1 year ago

I've finally succeeded in making it work!

The main parts are:

Set the node-ip and cloud-provider options for the kubelet

- name: Add node IP in kubelet config
  lineinfile:
    path: /etc/sysconfig/kubelet
    regexp: '^KUBELET_EXTRA_ARGS='
    line: KUBELET_EXTRA_ARGS=--node-ip={{ ansible_default_ipv4.address }} --cloud-provider=external
    state: present
    create: yes
    mode: a+r

Patch the coredns deployment in order to let the CCM interact with the nodes

- name: Patch coredns tolerations
  kubernetes.core.k8s:
    kind: Deployment
    name: coredns
    namespace: kube-system
    definition:
      spec:
        template:
          spec:
            tolerations:
            - key: node.cloudprovider.kubernetes.io/uninitialized
              effect: NoSchedule
              operator: Exists
  become: no

Install the CCM

  - name: Install Proxmox CCM chart
    kubernetes.core.helm:
      name: proxmox-cloud-controller-manager
      namespace: kube-system
      chart_ref: oci://ghcr.io/sergelogvinov/charts/proxmox-cloud-controller-manager
      values:
        config:
          clusters:
            - url: "{{ proxmox_url }}"
              insecure: false
              token_id: "kubernetes@pve!ccm"
              token_secret: "xxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
              region: main
        enabledControllers:
          - cloud-node
          - cloud-node-lifecycle
        nodeSelector:
          node-role.kubernetes.io/control-plane: ""

Install the CSI plugin

  - name: Create Proxmox CSI namespace
    kubernetes.core.k8s:
      state: present
      definition:
        api_version: v1
        kind: Namespace
        metadata:
          name: csi-proxmox
          labels:
            app.kubernetes.io/managed-by: Helm
            pod-security.kubernetes.io/enforce: privileged
          annotations:
            meta.helm.sh/release-name: proxmox-csi-plugin
            meta.helm.sh/release-namespace: csi-proxmox

  - name: Install Proxmox CSI chart
    kubernetes.core.helm:
      name: proxmox-csi-plugin
      namespace: csi-proxmox
      chart_ref: oci://ghcr.io/sergelogvinov/charts/proxmox-csi-plugin
      values:
        config:
          clusters:
            - url: "{{ proxmox_url }}"
              insecure: false
              token_id: "kubernetes-csi@pve!csi"
              token_secret: "yyyyyyyyyyyyyyyyyyyyyyyyyyyy"
              region: main
        node:
          nodeSelector:
          tolerations:
            - operator: Exists
        nodeSelector:
          node-role.kubernetes.io/control-plane: ""
        tolerations:
          - key: node-role.kubernetes.io/control-plane
            effect: NoSchedule
        storageClass:
          - name: proxmox-data
            storage: local
            reclaimPolicy: Delete
            fstype: ext4
            cache: none

Next step: replace kube-proxy with Cilium

Many thanks again for your work!

Maybe some clarifications could be made in the documentation. I may fork your repo and open a PR.

morsik commented 1 month ago

@sergelogvinov the info about the need for --node-ip and/or alpha.kubernetes.io/provided-node-ip should be the first thing in the install docs :/ I spent a few hours trying to find out why this simply didn't work, until I found this issue and facepalmed, as no other custom cloud controller I've used before required this option.

sergelogvinov commented 1 month ago

@morsik I’m truly sorry to hear that.

I've updated the documentation https://github.com/sergelogvinov/proxmox-cloud-controller-manager/blob/main/docs/install.md#requirements

Thank you for contributing to the project!

morsik commented 1 month ago

@sergelogvinov thank you! It works great after I discovered this simple change, but it was a real pain to understand why I was getting node IP errors until I dug into the source code and found this issue.

BTW, this is not true at all:

If your node has multiple IP addresses

You explicitly look for that annotation in your source code! I had a single IP address and it still didn't work for that very reason!

sergelogvinov commented 1 month ago

The kubelet sets the value of node.ObjectMeta.Annotations[cloudproviderapi.AnnotationAlphaProvidedIPAddr] during the cluster join process. It can be one or two IPs from different stacks. There are many cases where the IPs may fluctuate after a restart... So setting --node-ip is the recommended approach.

This list of IPs is set as NodeInternalIP in the Kubernetes node resource.


The node resource contains many immutable values that the CCM cannot modify after initialization. If you run the kubelet without the --cloud-provider=external flag initially and then enable it later, the CCM will not make any changes because the node has already been initialized by the kubelet.

Therefore, if you need to change certain kubelet flags, it’s recommended to delete the node resource first to ensure the changes take effect.

morsik commented 1 month ago

@sergelogvinov interesting... I've never seen such an annotation at all.

I just installed a fresh 1.31.1 cluster yesterday; I also previously installed fresh 1.29 and 1.30 clusters and never saw such an annotation, even though I had a single network interface with a single IP.

Regarding "the CCM will not make any changes because the node has already been initialized by the kubelet" - I've already explained in another discussion how to fix this and retrigger initialization ;)