hetznercloud / hcloud-cloud-controller-manager

Kubernetes cloud-controller-manager for Hetzner Cloud
Apache License 2.0

Controller does not properly add node metadata #620

Closed gecube closed 5 months ago

gecube commented 9 months ago

TL;DR

I set up a talos.dev cluster on Hetzner Cloud and expect HCCM to populate the Node objects with metadata so that I can order load balancers.

Expected behavior

The Node objects are populated with metadata. The load balancers are created. There are no errors in the HCCM logs.

Observed behavior

I set up the cluster according to the instructions here: https://www.talos.dev/v1.6/talos-guides/install/cloud-platforms/hetzner/

I introduced several changes. First, I created the virtual machines with a private network attached. Then I prepared a Talos patch file that looks like this:

$ cat patch.yaml 
cluster:
  network:
    cni:
      name: none
    podSubnets:
      - 100.64.0.0/16
    serviceSubnets:
      - 100.96.0.0/16
  proxy:
    disabled: true
  etcd:
    advertisedSubnets:
      - 10.0.0.0/8

machine:
  kubelet:
    extraArgs:
      cloud-provider: external
    nodeIP:
      validSubnets:
        - 10.0.0.0/8

and applied it when creating the cluster. The idea was to use the private subnet to join the cluster nodes and avoid using the public network for cluster connectivity.

Minimal working example

No response

Log output

nodes
kubectl get nodes -owide
NAME                    STATUS   ROLES           AGE    VERSION   INTERNAL-IP   EXTERNAL-IP   OS-IMAGE         KERNEL-VERSION   CONTAINER-RUNTIME
talos-control-plane-1   Ready    control-plane   52m    v1.29.1   10.0.0.2      <none>        Talos (v1.6.3)   6.1.74-talos     containerd://1.7.11
talos-control-plane-2   Ready    control-plane   4d4h   v1.29.1   10.0.0.3      <none>        Talos (v1.6.3)   6.1.74-talos     containerd://1.7.11
talos-control-plane-3   Ready    control-plane   4d4h   v1.29.1   10.0.0.4      <none>        Talos (v1.6.3)   6.1.74-talos     containerd://1.7.11
talos-worker-1          Ready    <none>          51m    v1.29.1   10.0.0.5      <none>        Talos (v1.6.3)   6.1.74-talos     containerd://1.7.11
talos-worker-2          Ready    <none>          50m    v1.29.1   10.0.0.6      <none>        Talos (v1.6.3)   6.1.74-talos     containerd://1.7.11

The HCCM logs:

I0215 21:44:34.638664       1 node_controller.go:431] Initializing node talos-worker-2 with cloud provider
--- Request:
GET /v1/servers?name=talos-worker-2 HTTP/1.1
Host: api.hetzner.cloud
User-Agent: hcloud-cloud-controller/v1.19.0 hcloud-go/2.4.0
Authorization: REDACTED
Accept-Encoding: gzip

--- Response:
HTTP/2.0 200 OK
Content-Length: 5787
Access-Control-Allow-Credentials: true
Access-Control-Allow-Headers: DNT,Keep-Alive,User-Agent,X-Requested-With,If-Modified-Since,Cache-Control,Content-Type,Range,Authorization
Access-Control-Allow-Methods: GET, PUT, POST, DELETE, PATCH, OPTIONS
Access-Control-Allow-Origin: *
Access-Control-Max-Age: 1728000
Content-Type: application/json
Date: Thu, 15 Feb 2024 21:44:34 GMT
Link: <https://api.hetzner.cloud/v1/servers?name=talos-worker-2&page=1>; rel=last
Ratelimit-Limit: 3600
Ratelimit-Remaining: 3569
Ratelimit-Reset: 1708033505
Strict-Transport-Security: max-age=31536000; includeSubDomains
X-Correlation-Id: 8eff8b5832344a5e

{
  "servers": [
    {
      "id": 43268742,
      "name": "talos-worker-2",
      "status": "running",
      "created": "2024-02-11T17:21:28+00:00",
      "public_net": {
        "ipv4": {
          "ip": "65.108.90.3",
          "blocked": false,
          "dns_ptr": "static.3.90.108.65.clients.your-server.de",
          "id": 51533159
        },
        "ipv6": {
          "ip": "2a01:4f9:c011:bd8c::/64",
          "blocked": false,
          "dns_ptr": [],
          "id": 51533160
        },
        "floating_ips": [],
        "firewalls": []
      },
      "private_net": [
        {
          "network": 3866040,
          "ip": "10.0.0.6",
          "alias_ips": [],
          "mac_address": "86:00:00:77:de:83"
        }
      ],
      "server_type": {
        "id": 98,
        "name": "ccx33",
        "description": "CCX33 Dedicated CPU",
        "cores": 8,
        "memory": 32.0,
        "disk": 240,
        "deprecated": false,
        "prices": [
          {
            "location": "fsn1",
            "price_hourly": {
              "net": "0.0769000000",
              "gross": "0.0769000000000000"
            },
            "price_monthly": {
              "net": "47.9900000000",
              "gross": "47.9900000000000000"
            }
          },
          {
            "location": "nbg1",
            "price_hourly": {
              "net": "0.0769000000",
              "gross": "0.0769000000000000"
            },
            "price_monthly": {
              "net": "47.9900000000",
              "gross": "47.9900000000000000"
            }
          },
          {
            "location": "hel1",
            "price_hourly": {
              "net": "0.0769000000",
              "gross": "0.0769000000000000"
            },
            "price_monthly": {
              "net": "47.9900000000",
              "gross": "47.9900000000000000"
            }
          },
          {
            "location": "ash",
            "price_hourly": {
              "net": "0.0769000000",
              "gross": "0.0769000000000000"
            },
            "price_monthly": {
              "net": "47.9900000000",
              "gross": "47.9900000000000000"
            }
          },
          {
            "location": "hil",
            "price_hourly": {
              "net": "0.0769000000",
              "gross": "0.0769000000000000"
            },
            "price_monthly": {
              "net": "47.9900000000",
              "gross": "47.9900000000000000"
            }
          }
        ],
        "storage_type": "local",
        "cpu_type": "dedicated",
        "architecture": "x86",
        "included_traffic": 32985348833280,
        "deprecation": null
      },
      "datacenter": {
        "id": 3,
        "name": "hel1-dc2",
        "description": "Helsinki 1 virtual DC 2",
        "location": {
          "id": 3,
          "name": "hel1",
          "description": "Helsinki DC Park 1",
          "country": "FI",
          "city": "Helsinki",
          "latitude": 60.169855,
          "longitude": 24.938379,
          "network_zone": "eu-central"
        },
        "server_types": {
          "supported": [
            1,
            3,
            5,
            7,
            9,
            22,
            23,
            24,
            25,
            26,
            45,
            93,
            94,
            95,
            96,
            97,
            98,
            99,
            100,
            101
          ],
          "available": [
            1,
            3,
            5,
            7,
            9,
            22,
            23,
            24,
            25,
            26,
            45,
            93,
            94,
            95,
            96,
            97,
            98,
            99,
            100,
            101
          ],
          "available_for_migration": [
            1,
            3,
            5,
            7,
            9,
            22,
            23,
            24,
            25,
            26,
            45,
            93,
            94,
            95,
            96,
            97,
            98,
            99,
            100,
            101,
            102,
            103
          ]
        }
      },
      "image": {
        "id": 148619575,
        "type": "snapshot",
        "status": "available",
        "name": null,
        "description": "talos system disk - amd64 - v1.6.3",
        "image_size": 0.2891603486328125,
        "disk_size": 20,
        "created": "2024-02-08T12:26:41+00:00",
        "created_from": {
          "id": 43135373,
          "name": "packer-65c4c7e6-96b2-8b71-a041-16c6cc71e1a0"
        },
        "bound_to": null,
        "os_flavor": "debian",
        "os_version": null,
        "rapid_deploy": false,
        "protection": {
          "delete": false
        },
        "deprecated": null,
        "labels": {
          "os": "talos",
          "arch": "amd64",
          "type": "infra",
          "version": "v1.6.3"
        },
        "deleted": null,
        "architecture": "x86"
      },
      "iso": null,
      "rescue_enabled": false,
      "locked": false,
      "backup_window": null,
      "outgoing_traffic": 54605000,
      "ingoing_traffic": 13552325000,
      "included_traffic": 32985348833280,
      "protection": {
        "delete": false,
        "rebuild": false
      },
      "labels": {
        "type": "worker"
      },
      "volumes": [
        100380895
      ],
      "load_balancers": [],
      "primary_disk_size": 240,
      "placement_group": null
    }
  ],
  "meta": {
    "pagination": {
      "page": 1,
      "per_page": 25,
      "previous_page": null,
      "next_page": null,
      "last_page": 1,
      "total_entries": 1
    }
  }
}

E0215 21:44:34.851551       1 node_controller.go:240] error syncing 'talos-worker-2': failed to get node modifiers from cloud provider: provided node ip for node "talos-worker-2" is not valid: failed to get node address from cloud provider that matches ip: 10.0.0.6, requeuing
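The failing check can be illustrated with a minimal sketch (simplified Python, not the actual Go code in kubernetes/cloud-provider): the IP the kubelet provided for the node must appear among the addresses the cloud provider returns, otherwise the sync errors and requeues.

```python
import ipaddress

def find_matching_address(provided_ip, cloud_addresses):
    """Simplified sketch of the check behind the error
    'failed to get node address from cloud provider that matches ip'."""
    want = ipaddress.ip_address(provided_ip)
    for addr_type, addr in cloud_addresses:
        if ipaddress.ip_address(addr) == want:
            return addr_type
    return None

# Without HCLOUD_NETWORK configured, HCCM returns only the public address:
returned = [("ExternalIP", "65.108.90.3")]
print(find_matching_address("10.0.0.6", returned))  # None -> sync fails, requeue
```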

Additional information

No response

gecube commented 9 months ago

In short: it looks like when nodes have ONLY internal IPs from a private Hetzner network, HCCM cannot match them for some reason.

gecube commented 9 months ago

When running the cluster on nodes with only public addresses, there are no issues:

kubectl get nodes -owide
NAME                    STATUS   ROLES           AGE   VERSION   INTERNAL-IP   EXTERNAL-IP      OS-IMAGE         KERNEL-VERSION   CONTAINER-RUNTIME
talos-control-plane-1   Ready    control-plane   23h   v1.29.1   <none>        37.27.38.153     Talos (v1.6.3)   6.1.74-talos     containerd://1.7.11
talos-control-plane-2   Ready    control-plane   23h   v1.29.1   <none>        168.119.189.58   Talos (v1.6.3)   6.1.74-talos     containerd://1.7.11
talos-control-plane-3   Ready    control-plane   23h   v1.29.1   <none>        94.130.150.142   Talos (v1.6.3)   6.1.74-talos     containerd://1.7.11
talos-worker-1          Ready    <none>          23h   v1.29.1   <none>        65.108.90.3      Talos (v1.6.3)   6.1.74-talos     containerd://1.7.11
talos-worker-2          Ready    <none>          23h   v1.29.1   <none>        65.21.152.91     Talos (v1.6.3)   6.1.74-talos     containerd://1.7.11
kubectl get pods -n kube-system
NAME                                               READY   STATUS    RESTARTS      AGE
cilium-9mmcj                                       1/1     Running   0             23h
cilium-lr87f                                       1/1     Running   0             23h
cilium-nn795                                       1/1     Running   0             23h
cilium-operator-6d6fb6b85f-2n2g6                   1/1     Running   0             23h
cilium-operator-6d6fb6b85f-tt5d2                   1/1     Running   0             23h
cilium-rp9w6                                       1/1     Running   0             23h
cilium-xwt47                                       1/1     Running   0             23h
coredns-85b955d87b-tm47c                           1/1     Running   0             23h
coredns-85b955d87b-vx9zg                           1/1     Running   0             23h
hcloud-cloud-controller-manager-584f6fc4f4-w6zk2   1/1     Running   0             22h
hcloud-csi-controller-68f987547f-cz9cz             5/5     Running   0             22h
hcloud-csi-node-75pps                              3/3     Running   0             22h
hcloud-csi-node-85xlm                              3/3     Running   0             22h
hcloud-csi-node-927pf                              3/3     Running   0             22h
hcloud-csi-node-9w5sz                              3/3     Running   0             22h
hcloud-csi-node-nl94s                              3/3     Running   0             22h
kube-apiserver-talos-control-plane-1               1/1     Running   0             23h
kube-apiserver-talos-control-plane-2               1/1     Running   0             23h
kube-apiserver-talos-control-plane-3               1/1     Running   0             23h
kube-controller-manager-talos-control-plane-1      1/1     Running   2 (23h ago)   23h
kube-controller-manager-talos-control-plane-2      1/1     Running   0             23h
kube-controller-manager-talos-control-plane-3      1/1     Running   1 (23h ago)   23h
kube-scheduler-talos-control-plane-1               1/1     Running   2 (23h ago)   23h
kube-scheduler-talos-control-plane-2               1/1     Running   0             23h
kube-scheduler-talos-control-plane-3               1/1     Running   1 (23h ago)   23h
apricote commented 9 months ago

Hey @gecube,

The error happens because HCCM reports a different set of addresses for the node than the node currently has. From the error message, the API requests (thanks for including them!) and the kubectl output, the addresses are: the Node object already has InternalIP 10.0.0.6 (set by the kubelet), while HCCM returns only ExternalIP 65.108.90.3.

This causes a conflict, because the library we use (kubernetes/cloud-provider) expects HCCM to return all addresses that are already specified on the Node: no removals allowed.

HCCM only returns the ExternalIP because IPs from a network are only returned if you specify the ID or Name of the network in your configuration. We need this because a server might be attached to multiple networks, and only one InternalIP makes sense here.

You can do this by setting the HCLOUD_NETWORK environment variable to the ID or Name of the Network your nodes are attached to.
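The effect can be sketched like this (an illustration only; HCCM itself is written in Go): the private IP from the API response is surfaced as an InternalIP only when the attached network matches the configured one.

```python
def node_addresses(server, configured_network=None):
    """Simplified model: InternalIP appears only when the node's attached
    network matches the network configured via HCLOUD_NETWORK."""
    addrs = [("ExternalIP", server["public_net"]["ipv4"]["ip"])]
    for net in server["private_net"]:
        if configured_network is not None and net["network"] == configured_network:
            addrs.append(("InternalIP", net["ip"]))
    return addrs

# Trimmed-down server object from the API response above:
server = {
    "public_net": {"ipv4": {"ip": "65.108.90.3"}},
    "private_net": [{"network": 3866040, "ip": "10.0.0.6"}],
}
print(node_addresses(server))           # only the ExternalIP
print(node_addresses(server, 3866040))  # ExternalIP plus InternalIP 10.0.0.6
```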

If you want to run a cluster without public network access, you will need some additional configuration, as this means that your nodes will not be able to pull images or access the Hetzner Cloud API. If you only want your intra-cluster communication to go through the private network, that should be enough.

If you also want to use the Routing functionality, you will need additional configuration in your CNI and the HCCM manifests. See https://github.com/hetznercloud/hcloud-cloud-controller-manager/blob/main/docs/deploy_with_networks.md

gecube commented 9 months ago

@apricote Hi! Thanks for your considerations. So the only possible reason is that I forgot HCLOUD_NETWORK? That is a little odd, as I am sure I set it in the secret... and I don't remember any relevant error messages in the logs. I will run one more experiment to check.

apricote commented 9 months ago

Not sure how you installed HCCM (YAML manifests, Helm chart, ...), but this is the related excerpt from the README:

If you manage the network yourself it might still be required to let the CCM know about private networks. You can do this by adding the environment variable with the network name/ID in the CCM deployment.

         env:
           - name: HCLOUD_NETWORK
             valueFrom:
               secretKeyRef:
                 name: hcloud
                 key: network

You also need to add the network name/ID to the secret: kubectl -n kube-system create secret generic hcloud --from-literal=token=<hcloud API token> --from-literal=network=<hcloud Network_ID_or_Name>.

As far as I remember there is no error message, as it's an optional configuration value and nodes may or may not be attached to a network that should be used for in-cluster communication. But maybe the attached network is also used for another service, a proxy, etc., so logging whenever no network is configured but the Node has one could spam the logs.

We could add a log that is only emitted once: when no network is configured but a node with a network is processed, log a warning and set some internal variable to "silence" it until the process is restarted.
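The log-once idea could be sketched like this (a Python illustration of the pattern only; HCCM itself is Go, and the message text is made up):

```python
class WarnOnce:
    """Fire a warning the first time only; stay silent until restart."""

    def __init__(self):
        self._warned = False

    def __call__(self, log_fn, message):
        if self._warned:
            return False  # already warned once, stay silent
        self._warned = True
        log_fn(message)
        return True

warn_no_network = WarnOnce()
for node in ["talos-worker-1", "talos-worker-2"]:
    # Only the first iteration actually logs:
    warn_no_network(print, "node has a private network, but HCLOUD_NETWORK is not set")
```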

github-actions[bot] commented 6 months ago

This issue has been marked as stale because it has not had recent activity. The bot will close the issue if no further action occurs.

paprickar commented 3 months ago

@apricote @gecube I am facing the exact same issue, and I verified that the environment variables are set correctly.

HCCM version: 1.20.0

Environment variables:

k3s 1.29.5 Cloud Servers with dedicated server connection over vSwitch.

I manually set the hrobot providerID, but I still get:

error syncing 'SERVER_NAME': failed to get node modifiers from cloud provider: provided node ip for node "SERVER_NAME" is not valid: failed to get node address from cloud provider that matches ip: x.x.x.x, requeuing

Any advice on how to fix this?

gecube commented 3 months ago

@paprickar Hi! Thanks for your report. We will check on our side what has changed since the last observations. Also, I want to mention that I am not affiliated with Hetzner in any way; I am an independent engineer.

klinch0 commented 3 months ago

This was useful for me:

         env:
           - name: HCLOUD_NETWORK
             valueFrom:
               secretKeyRef:
                 name: hcloud
                 key: network
klin@asus:~/stav$ kgn -owide
NAME                           STATUS   ROLES           AGE   VERSION   INTERNAL-IP   EXTERNAL-IP      OS-IMAGE         KERNEL-VERSION   CONTAINER-RUNTIME
preprod-talos-controlplane-1   Ready    control-plane   10h   v1.30.3   10.0.2.2      XXX   Talos (v1.7.6)   6.6.43-talos     containerd://1.7.18
preprod-talos-controlplane-2   Ready    control-plane   10h   v1.30.3   10.0.2.3      XXX    Talos (v1.7.6)   6.6.43-talos     containerd://1.7.18
preprod-talos-controlplane-3   Ready    control-plane   10h   v1.30.3   10.0.2.4      XXX     Talos (v1.7.6)   6.6.43-talos     containerd://1.7.18
preprod-talos-ingress-1        Ready    <none>          10h   v1.30.3   10.0.2.6      XXX    Talos (v1.7.6)   6.6.43-talos     containerd://1.7.18
preprod-talos-ingress-2        Ready    <none>          10h   v1.30.3   10.0.2.5      XXX      Talos (v1.7.6)   6.6.43-talos     containerd://1.7.18
preprod-talos-worker-1         Ready    <none>          10h   v1.30.3   10.0.2.7      XXX     Talos (v1.7.6)   6.6.43-talos     containerd://1.7.18
maxpain commented 1 month ago

I'm getting the same error on my bare metal cluster (I don't use VMs):

NAME   STATUS   ROLES           AGE   VERSION   INTERNAL-IP       EXTERNAL-IP   OS-IMAGE         KERNEL-VERSION   CONTAINER-RUNTIME
s1     Ready    control-plane   28m   v1.31.1   fd00:10:201::11   <none>        Talos (v1.7.6)   6.6.43-talos     containerd://1.7.18
s2     Ready    control-plane   28m   v1.31.1   fd00:10:201::12   <none>        Talos (v1.7.6)   6.6.43-talos     containerd://1.7.18
s3     Ready    control-plane   28m   v1.31.1   fd00:10:201::13   <none>        Talos (v1.7.6)   6.6.43-talos     containerd://1.7.18
I0922 08:02:26.480689       1 node_controller.go:425] Initializing node s1 with cloud provider
I0922 08:02:26.627216       1 node_controller.go:229] error syncing 's1': failed to get node modifiers from cloud provider: provided node ip for node "s1" is not valid: failed to get node address from cloud provider that matches ip: fd00:10:201::11, requeuing
E0922 08:02:26.627235       1 node_controller.go:240] error syncing 's1': failed to get node modifiers from cloud provider: provided node ip for node "s1" is not valid: failed to get node address from cloud provider that matches ip: fd00:10:201::11, requeuing

Any ideas on how to fix that?

My bare metal servers are connected via vSwitch and communicate with each other over private subnets. They also have public IP addresses, but I don't include them in the kubelet's nodeIP field. I use the following Talos configuration:

machine:
  kubelet:
    nodeIP:
      validSubnets:
        - fd00:10:201::/64
        - 10.201.0.0/24
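The validSubnets filter above can be modeled like this (a simplified Python illustration of the kubelet behavior, not the kubelet's actual code; 198.51.100.7 is a placeholder public address from the RFC 5737 documentation range):

```python
import ipaddress

def filter_node_ips(candidate_ips, valid_subnets):
    """Keep only candidate IPs that fall inside one of the valid subnets."""
    nets = [ipaddress.ip_network(s) for s in valid_subnets]
    return [ip for ip in candidate_ips
            if any(ipaddress.ip_address(ip) in n for n in nets)]

# Public address is filtered out; vSwitch addresses remain:
candidates = ["198.51.100.7", "fd00:10:201::11", "10.201.0.11"]
print(filter_node_ips(candidates, ["fd00:10:201::/64", "10.201.0.0/24"]))
# -> ['fd00:10:201::11', '10.201.0.11']
```

This is why the Node ends up with only the private fd00:10:201::11 / 10.201.0.11 addresses, which HCCM then cannot match against the addresses it derives from the API.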
maxpain commented 1 month ago

I'm confused.

https://github.com/hetznercloud/hcloud-cloud-controller-manager/blob/main/docs/robot.md

This article says this:

When a new Node joins the cluster, we first need to figure out which Robot (or Cloud) Server matches this node. We primarily try to match this through the Node Name and the Name of the server in Robot. If you use Kubeadm, the Node Name by default is the Hostname of the server.

This means that by default, your Hostname needs to be the name of the server in Robot. If this does not match, we can not properly match the two entities. Once we have made this connection, we save the Robot Server Number to the field spec.providerId on the Node, and use this identifier for any further processing.

Why does it use IP addresses, then? In my case, the Kubernetes nodes have the same names as in Robot.

gecube commented 1 month ago

@apricote Hi! Could we reopen the issue? It looks like there are still some problems with configuring the CCM.

gecube commented 1 month ago

@maxpain Could you kindly show the kubectl describe output for your nodes, and the settings for the CCM? As @klinch0 reported, the proper network should be configured in the CCM's configuration.

maxpain commented 1 month ago

CCM helm chart configuration:

network:
  enabled: false

robot:
  enabled: true

kubectl describe node s4


Name:               s4
Roles:              <none>
Labels:             beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/os=linux
                    kubernetes.io/arch=amd64
                    kubernetes.io/hostname=s4
                    kubernetes.io/os=linux
Annotations:        alpha.kubernetes.io/provided-node-ip: 10.201.0.14
                    node.alpha.kubernetes.io/ttl: 0
                    volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp:  Sun, 22 Sep 2024 13:20:44 +0300
Taints:             node.cloudprovider.kubernetes.io/uninitialized=true:NoSchedule
Unschedulable:      false
Lease:
  HolderIdentity:  s4
  AcquireTime:     <unset>
  RenewTime:       Sun, 22 Sep 2024 13:21:25 +0300
Conditions:
  Type                 Status  LastHeartbeatTime                 LastTransitionTime                Reason                       Message
  ----                 ------  -----------------                 ------------------                ------                       -------
  NetworkUnavailable   False   Sun, 22 Sep 2024 13:21:02 +0300   Sun, 22 Sep 2024 13:21:02 +0300   CiliumIsUp                   Cilium is running on this node
  MemoryPressure       False   Sun, 22 Sep 2024 13:20:44 +0300   Sun, 22 Sep 2024 13:20:44 +0300   KubeletHasSufficientMemory   kubelet has sufficient memory available
  DiskPressure         False   Sun, 22 Sep 2024 13:20:44 +0300   Sun, 22 Sep 2024 13:20:44 +0300   KubeletHasNoDiskPressure     kubelet has no disk pressure
  PIDPressure          False   Sun, 22 Sep 2024 13:20:44 +0300   Sun, 22 Sep 2024 13:20:44 +0300   KubeletHasSufficientPID      kubelet has sufficient PID available
  Ready                True    Sun, 22 Sep 2024 13:20:44 +0300   Sun, 22 Sep 2024 13:20:44 +0300   KubeletReady                 kubelet is posting ready status
Addresses:
  InternalIP:  10.201.0.14
  Hostname:    s4
Capacity:
  cpu:                96
  ephemeral-storage:  935981540Ki
  hugepages-2Mi:      0
  memory:             131561828Ki
  pods:               512
Allocatable:
  cpu:                95950m
  ephemeral-storage:  862332150380
  hugepages-2Mi:      0
  memory:             131262820Ki
  pods:               512
System Info:
  Machine ID:                 88f2f7cf81cabf70d0b2fed44d265090
  System UUID:                664ee4d4-bfde-11d3-01a6-9c6b004ef106
  Boot ID:                    bea627da-695f-4b36-8b2a-733bfa117f1d
  Kernel Version:             6.6.43-talos
  OS Image:                   Talos (v1.7.6)
  Operating System:           linux
  Architecture:               amd64
  Container Runtime Version:  containerd://1.7.18
  Kubelet Version:            v1.31.1
  Kube-Proxy Version:         v1.31.1
PodCIDR:                      10.244.4.0/24
PodCIDRs:                     10.244.4.0/24
Non-terminated Pods:          (2 in total)
  Namespace                   Name                  CPU Requests  CPU Limits  Memory Requests  Memory Limits  Age
  ---------                   ----                  ------------  ----------  ---------------  -------------  ---
  kube-system                 cilium-envoy-cbbcb    0 (0%)        0 (0%)      0 (0%)           0 (0%)         45s
  kube-system                 cilium-lphsr          100m (0%)     0 (0%)      10Mi (0%)        0 (0%)         45s
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource           Requests   Limits
  --------           --------   ------
  cpu                100m (0%)  0 (0%)
  memory             10Mi (0%)  0 (0%)
  ephemeral-storage  0 (0%)     0 (0%)
  hugepages-2Mi      0 (0%)     0 (0%)
Events:
  Type     Reason                   Age                  From             Message
  ----     ------                   ----                 ----             -------
  Normal   Starting                 4m10s                kubelet          Starting kubelet.
  Warning  InvalidDiskCapacity      4m10s                kubelet          invalid capacity 0 on image filesystem
  Normal   NodeHasSufficientMemory  4m9s (x2 over 4m9s)  kubelet          Node s4 status is now: NodeHasSufficientMemory
  Normal   NodeHasNoDiskPressure    4m9s (x2 over 4m9s)  kubelet          Node s4 status is now: NodeHasNoDiskPressure
  Normal   NodeHasSufficientPID     4m9s (x2 over 4m9s)  kubelet          Node s4 status is now: NodeHasSufficientPID
  Normal   NodeAllocatableEnforced  4m9s                 kubelet          Updated Node Allocatable limit across pods
  Normal   NodeReady                3m48s                kubelet          Node s4 status is now: NodeReady
  Normal   NodeAllocatableEnforced  102s                 kubelet          Updated Node Allocatable limit across pods
  Warning  InvalidDiskCapacity      102s                 kubelet          invalid capacity 0 on image filesystem
  Normal   Starting                 102s                 kubelet          Starting kubelet.
  Normal   NodeHasSufficientPID     77s (x7 over 102s)   kubelet          Node s4 status is now: NodeHasSufficientPID
  Normal   NodeHasSufficientMemory  77s (x8 over 102s)   kubelet          Node s4 status is now: NodeHasSufficientMemory
  Normal   NodeHasNoDiskPressure    77s (x7 over 102s)   kubelet          Node s4 status is now: NodeHasNoDiskPressure
  Normal   Starting                 45s                  kubelet          Starting kubelet.
  Warning  InvalidDiskCapacity      45s                  kubelet          invalid capacity 0 on image filesystem
  Normal   NodeHasSufficientMemory  45s (x2 over 45s)    kubelet          Node s4 status is now: NodeHasSufficientMemory
  Normal   NodeHasNoDiskPressure    45s (x2 over 45s)    kubelet          Node s4 status is now: NodeHasNoDiskPressure
  Normal   NodeHasSufficientPID     45s (x2 over 45s)    kubelet          Node s4 status is now: NodeHasSufficientPID
  Normal   NodeAllocatableEnforced  45s                  kubelet          Updated Node Allocatable limit across pods
  Normal   NodeReady                45s                  kubelet          Node s4 status is now: NodeReady
  Normal   RegisteredNode           40s                  node-controller  Node s4 event: Registered Node s4 in Controller
maxpain commented 1 month ago

Well, I want LoadBalancer Services to work on top of bare metal servers using private IP addresses via vSwitch, instead of the servers' external IP addresses.

apricote commented 1 month ago

We do not officially support Robot Servers on vSwitches / with private networks in HCCM at this time.

As the vSwitches are operating at Layer 2 and do not provide info through an API, it is not possible (or at the very least very hard) to get the information (like the IP addresses of servers connected to the vSwitch) we need.

The initial matching of the Kubernetes Nodes with the servers in the Robot/Cloud APIs happens through the hostname. In a follow-up step, https://github.com/kubernetes/cloud-provider errors if the NodeAddresses we return do not match the addresses that are already configured on the Node object. This is an upstream issue and out of our direct control.


Unofficially, you can skip/disable the node controller (which logs the error) and set the provider ID, node addresses and annotations yourself. The other controllers will then still work with the data that was provided to them.
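As a sketch of that unofficial route, setting the provider ID yourself amounts to a merge patch on the Node (the server number 1234567 is hypothetical, and the hrobot:// prefix follows the providerID paprickar mentioned setting above):

```python
import json

# Hypothetical Robot server number; substitute your own.
server_number = 1234567
patch = {"spec": {"providerID": f"hrobot://{server_number}"}}
print(json.dumps(patch))
# The patch could then be applied with, for example:
#   kubectl patch node s4 --type=merge -p '{"spec":{"providerID":"hrobot://1234567"}}'
```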

gecube commented 1 month ago

@apricote Hi! Thanks for the answer. Are there any plans to introduce such support?