hetznercloud / hcloud-cloud-controller-manager

Kubernetes cloud-controller-manager for Hetzner Cloud
Apache License 2.0

Calico and HCC #641

Open medicol69 opened 3 months ago

medicol69 commented 3 months ago

TL;DR

This is more of an inquiry, since it's not entirely clear from the documentation: does the Hetzner cloud controller work with the Calico CNI when using the private interfaces on Hetzner? Thanks

Expected behavior

This is an inquiry about the documentation.

apricote commented 3 months ago

When you use the private networks from Hetzner Cloud with hcloud-cloud-controller-manager and enable the routes-controller (default), then you should be able to use Calico without any additional overlay networks. You can configure this in Calico with CALICO_NETWORKING_BACKEND=none

I have never personally tested this configuration though.
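As a sketch, the setting mentioned above would land in the calico-node DaemonSet env (assuming a manifest-based Calico install; this fragment is illustrative, not a full container spec):

```yaml
# Hedged sketch: fragment of the calico-node container spec, disabling
# Calico's own overlay so hccm's routes-controller handles inter-node routing.
env:
  - name: CALICO_NETWORKING_BACKEND
    value: "none"
```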

simonostendorf commented 2 months ago

I am also interested in this topic; if you have any knowledge @medicol69 please let me know :)

DeprecatedLuke commented 2 months ago

Yes, it works fine with Calico. To run a quick test, use hetzner-k3s.

Important warning when running cloud servers together with baremetal servers on private networking: Calico requires a /24 VLAN address block per node, which means when you're creating a subnet, make sure the VLAN subnet is at minimum a /23 (1 node max per half) or ideally a /17 (127 nodes max), allocating the first half to cloud instances and the second half to baremetal instances.
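The sizing above can be sanity-checked with a quick shell calculation. This is a sketch: it assumes an even cloud/baremetal split of the parent prefix into /24 node blocks, and does not account for any addresses Hetzner reserves.

```shell
# Count how many /24 node blocks fit into a given parent prefix,
# then split them evenly between the cloud and baremetal halves.
PREFIX=17
BLOCKS=$(( 1 << (24 - PREFIX) ))   # /24 blocks inside the /17
PER_HALF=$(( BLOCKS / 2 ))         # blocks per half
echo "/$PREFIX holds $BLOCKS /24 blocks, $PER_HALF per half"
```

For a /23 the same arithmetic yields 2 blocks total, i.e. a single /24 per half, matching the "1 node max" case above.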

medicol69 commented 2 months ago

Thanks, but I don't think the Hetzner private network interfaces are stable enough to use in production. If anyone has gotten them to work and can share an example of a production setup, I'm all ears.

DeprecatedLuke commented 2 months ago

I am currently running it just fine with Calico and even have Ceph working over the VLAN with pretty good performance. You cannot advertise the node IP on the internal interface, so define a HostEndpoint instead so that metrics and etcd are protected. Load balancers also require you to use the public network in this case.
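A hedged sketch of the HostEndpoint approach mentioned above (node name, interface name, IP, and labels are placeholders; Calico only applies host policy to a node once a HostEndpoint exists for it):

```yaml
# Hypothetical Calico HostEndpoint covering a node's private interface,
# so GlobalNetworkPolicy can protect host services like metrics and etcd.
apiVersion: projectcalico.org/v3
kind: HostEndpoint
metadata:
  name: node1-private
  labels:
    role: k8s-node          # select this label in a GlobalNetworkPolicy
spec:
  node: node1               # must match the Kubernetes node name
  interfaceName: enp7s0     # the Hetzner private network interface
  expectedIPs:
    - 10.0.0.2              # the node's private IP
```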

simonostendorf commented 2 months ago

I am using Calico without encapsulation and HCCM with routes enabled. Calico uses eBPF and replaces kube-proxy.

I think this works well, but I haven't tested it enough to be 100% sure.

If you have any feedback on this configuration, I would love to discuss it :)

calico-tigera-operator-values.yaml

installation:
  cni:
    type: Calico
    ipam:
      type: HostLocal # use podCIDR assigned by kube-controller-manager, that is also used by route-controller in hcloud-cloud-controller-manager
  calicoNetwork:
    bgp: Enabled
    linuxDataplane: BPF
    hostPorts: Disabled
    ipPools:
      - name: default-ipv4
        cidr: 10.0.0.0/16
        encapsulation: None
        blockSize: 24
        natOutgoing: Enabled
        nodeSelector: all()
defaultFelixConfiguration:
  enabled: true
  bpfEnabled: true
  bpfExternalServiceMode: DSR
  bpfKubeProxyIptablesCleanupEnabled: true
kubernetesServiceEndpoint:
  host: api.my-cluster.domain.tld
  port: 6443
DeprecatedLuke commented 2 months ago

I am not sure why, but when using hetzner-k3s the internal network works just fine; however, a manually bootstrapped cluster has an issue where the cloud controller does not recognize the internal IP address, so the taint never gets removed and the labels never get added.

I spent a few hours trying to figure out why, without being able to find any difference between the two configurations. My only guess is that it is some internal ordering of configuration where the metadata/private network endpoints are not being parsed in order.

So to recap: allocate at least a /16 VLAN range and do not use the hcloud controller (you will not be able to use the load balancer or resolve labels automatically).

simonostendorf commented 2 months ago

I am not sure why, but when using hetzner-k3s the internal network works just fine, however, a manually bootstrapped cluster has an issue with the cloud controller where it does not recognize the internal ip address so it never gets the taint removed and the labels added.

What Kubernetes version do you use? Kubernetes 1.29 introduced a change where the node IP is left empty if cloud-provider is set to external and --node-ip is not set manually. Maybe that is the case here.

From CHANGELOG-1.29: kubelet, when using --cloud-provider=external, will now initialize the node addresses with the value of --node-ip, if it exists, or waits for the cloud provider to assign the addresses. (https://github.com/kubernetes/kubernetes/pull/121028, @aojea)

medicol69 commented 2 months ago

I am currently running it just fine with calico and even have ceph working over vlan with pretty good performance. You cannot advertise nodeip with internal so define hostendpoint instead for metrics and etcd to be protected. Load balancers also require you to use public net in this case.

I was thinking of private networking on Hetzner; if anyone is doing that in production, please share your config and your experiences.

simonostendorf commented 2 months ago

I was thinking on private networking on hetzner, if anyone is doing that in production please share your config, and what are your experiences.

I am currently testing this. You can see my calico values above. HCCM configuration is normal with networks enabled.

DeprecatedLuke commented 2 months ago

I am not sure why, but when using hetzner-k3s the internal network works just fine, however, a manually bootstrapped cluster has an issue with the cloud controller where it does not recognize the internal ip address so it never gets the taint removed and the labels added.

What kubernetes version do you use? Kubernetes 1.29 had a change that the node ip will be left empty if cloud-provider is set to external and --node-ip is not set manually. Maybe this is the case here.

From CHANGELOG-1.29: kubelet , when using --cloud-provider=external, will now initialize the node addresses with the value of --node-ip , if it exists, or waits for the cloud provider to assign the addresses. (https://github.com/kubernetes/kubernetes/pull/121028, [@aojea](https://github.com/aojea))

I tried both 1.29 and 1.30, here's my init script:

k3sup install --host $SERVER_HOST --ip $PUBLIC_IP --user root --ssh-key=~/.ssh/id_ed25519 --cluster --local-path ~/.kube/config --merge --context $CLUSTER --no-extras --k3s-channel latest --k3s-extra-args "\
--disable local-storage \
--disable metrics-server \
--disable-cloud-controller \
--kubelet-arg='provider-id=hcloud://$PROVIDER_ID' \
--kubelet-arg='cloud-provider=external' \
--flannel-backend=none \
--disable-network-policy \
--write-kubeconfig-mode=644 \
--cluster-domain=$CLUSTER_DOMAIN \
--cluster-cidr=$CLUSTER_CIDR \
--service-cidr=$CLUSTER_SERVICE_CIDR \
--cluster-dns=$CLUSTER_DNS \
--node-name=$SERVER_HOSTNAME \
--node-ip=$PRIVATE_IP \
--node-external-ip=$PUBLIC_IP \
--tls-san=$CLUSTER_LB \
--tls-san=$PRIVATE_IP \
--tls-san=$PUBLIC_IP \
--tls-san=$CLUSTER_DOMAIN \
--node-taint=CriticalAddonsOnly=true:NoExecute \
--etcd-expose-metrics='true' \
--kube-controller-manager-arg='bind-address=0.0.0.0' \
--kube-proxy-arg='metrics-bind-address=0.0.0.0' \
--kube-scheduler-arg='bind-address=0.0.0.0' \
" --print-command

EDIT: added --node-ip=$PRIVATE_IP; the configuration before that is what I am currently using to work around the issue.

I am currently running it just fine with calico and even have ceph working over vlan with pretty good performance. You cannot advertise nodeip with internal so define hostendpoint instead for metrics and etcd to be protected. Load balancers also require you to use public net in this case.

I was thinking on private networking on hetzner, if anyone is doing that in production please share your config, and what are your experiences.

Yes, it does work, including networking and routes, out of the box when using the hetzner-k3s tool. But I had issues getting HCCM to recognize the nodes when defining an internal IP as the node IP while bootstrapping the cluster manually. However, using the public IP works fine (and routes are still created for internal communication). Robot does not support networking from HCCM.

simonostendorf commented 2 months ago

Yes, it does work including networking and routes out of the box when using hetzner-k3s tool. But I had issues with getting HCCM to recognize the nodes when defining an internal ip as the node network when attempting to bootstrap the cluster manually. However, using the public ip works fine (and routes are still created for internal communication). Robot does not support networking from HCCM.

I am using kubeadm only on hcloud nodes (currently no dedicated / Robot nodes; maybe I will add them later) and this works fine.

DeprecatedLuke commented 2 months ago

Alright, here's the full guide to replicate the issue: init_master.sh

#!/bin/bash

CLUSTER=$1
CLUSTER_DOMAIN=$2
SERVER_HOST=$3
CLUSTER_PRIVATE_NET=$4
CLUSTER_CIDR=$5
CLUSTER_SERVICE_CIDR=$6
CLUSTER_DNS=$7
CLUSTER_LB=$8

PUBLIC_IP=$(ssh $SERVER_HOST "curl checkip.amazonaws.com")
PRIVATE_IP=$(ssh $SERVER_HOST "ip route get $CLUSTER_PRIVATE_NET | awk '{print \$7}'")
PROVIDER_ID=$(ssh $SERVER_HOST "curl http://169.254.169.254/hetzner/v1/metadata/instance-id")

echo "Public IP: $PUBLIC_IP Private IP: $PRIVATE_IP"

kubectl config delete-cluster $CLUSTER
kubectl config delete-user $CLUSTER

SERVER_HOSTNAME=$(echo $SERVER_HOST | cut -d'.' -f1)

ssh -y $SERVER_HOST "curl https://packages.hetzner.com/hcloud/deb/hc-utils_0.0.4-1_all.deb -o /tmp/hc-utils_0.0.4-1_all.deb -s && apt -y install /tmp/hc-utils_0.0.4-1_all.deb"

k3sup install --host $SERVER_HOST --ip $PUBLIC_IP --user root --ssh-key=~/.ssh/id_ed25519 --cluster --local-path ~/.kube/config --merge --context $CLUSTER --no-extras --k3s-channel latest --k3s-extra-args "\
--disable local-storage \
--disable metrics-server \
--disable-cloud-controller \
--kubelet-arg='provider-id=hcloud://$PROVIDER_ID' \
--kubelet-arg='cloud-provider=external' \
--flannel-backend=none \
--disable-network-policy \
--write-kubeconfig-mode=644 \
--cluster-domain=$CLUSTER_DOMAIN \
--cluster-cidr=$CLUSTER_CIDR \
--service-cidr=$CLUSTER_SERVICE_CIDR \
--cluster-dns=$CLUSTER_DNS \
--node-name=$SERVER_HOSTNAME \
--node-ip=$PRIVATE_IP \
--node-external-ip=$PUBLIC_IP \
--tls-san=$CLUSTER_LB \
--tls-san=$PRIVATE_IP \
--tls-san=$PUBLIC_IP \
--tls-san=$CLUSTER_DOMAIN \
--node-taint=CriticalAddonsOnly=true:NoExecute \
--etcd-expose-metrics='true' \
--kube-controller-manager-arg='bind-address=0.0.0.0' \
--kube-proxy-arg='metrics-bind-address=0.0.0.0' \
--kube-scheduler-arg='bind-address=0.0.0.0' \
" --print-command

kubectl config set-cluster $CLUSTER --server=https://$CLUSTER_LB:6443
k3sup ready --context $CLUSTER # will fail since no CNI is installed yet

bash init_master.sh test-cluster cluster.local IP_ADDRESS 10.224.0.0 10.222.0.0/16 10.223.0.0/16 10.223.0.10 IP_ADDRESS

kubectl config set-context test-cluster

Install Calico:

helm repo add tigera https://docs.tigera.io/calico/charts
helm repo update tigera
helm install cni tigera/tigera-operator -n tigera-operator

Create the HCCM secret with the network CIDR and hcloud token.

Install HCCM:

helm repo add hcloud https://charts.hetzner.cloud
helm repo update hcloud
helm install hccm hcloud/hcloud-cloud-controller-manager -n kube-system --values values.yaml

nodeSelector:
  node-role.kubernetes.io/control-plane: "true"

Observe the following error:

error syncing '*node*': failed to get node modifiers from cloud provider: provided node ip for node "*node*" is not valid: failed to get node address from cloud provider that matches ip: 10.224.0.2, requeuing

edit: the actual hostname doesn't matter since the provider ID is specified; usually the hostname would be a domain matching the name of the node, and the Calico step is optional.
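One fragile spot in the script above is the hard-coded awk field (`$7`) when extracting the private IP from `ip route get`: the field position shifts with the output format. A sketch of a position-independent variant (the sample line is illustrative, not captured from a real server):

```shell
# Extract the address following the "src" token instead of assuming field 7.
LINE="10.224.0.0 via 10.0.0.1 dev enp7s0 src 10.0.0.2 uid 0"
PRIVATE_IP=$(echo "$LINE" | awk '{for (i = 1; i < NF; i++) if ($i == "src") { print $(i + 1); exit }}')
echo "$PRIVATE_IP"
```

In the script itself this would replace the `awk '{print \$7}'` part of the PRIVATE_IP assignment.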

simonostendorf commented 2 months ago

If you see failed to get node address from cloud provider that matches ip: 10.x.x.x, requeuing you have to enable routes-controller with network.enabled=true.
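For completeness, a hedged sketch of the Helm values for this (the exact key name may differ between chart versions; verify against the chart's values.yaml before relying on it):

```yaml
# Hypothetical values.yaml fragment enabling the routes-controller in the
# hcloud-cloud-controller-manager Helm chart.
networking:
  enabled: true
```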

DeprecatedLuke commented 2 months ago

If you see failed to get node address from cloud provider that matches ip: 10.x.x.x, requeuing you have to enable routes-controller with network.enabled=true.

Ah, that makes sense! You can't enable Robot & network at the same time (HCCM refuses to start). However, if you change the label to get it to load, it does work fine, so it's still a weird limitation.

simonostendorf commented 2 months ago

What needs to be done to enable route controllers with robot support?

Is this generally supported by the underlying network and does the support need to be implemented in the hccm or are there any changes required to the Hetzner Cloud network?

(see https://github.com/hetznercloud/hcloud-cloud-controller-manager/blob/main/docs/robot.md#unsupported)

Edit: We can move this to a new issue if needed, I am interested in this feature and could try to implement (parts of) it.

DeprecatedLuke commented 2 months ago

What needs to be done to enable route controllers with robot support?

Is this generally supported by the underlying network and does the support need to be implemented in the hccm or are there any changes required to the Hetzner Cloud network?

(see https://github.com/hetznercloud/hcloud-cloud-controller-manager/blob/main/docs/robot.md#unsupported)

Edit: We can move this to a new issue if needed, I am interested in this feature and could try to implement (parts of) it.

As far as I know the routes table in the network configuration is not compatible with vSwitch.

simonostendorf commented 2 months ago

As far as I know the routes table in the network configuration is not compatible with vSwitch.

But I think it should be possible to use private IP addresses for the nodes (which currently requires the route controller to be enabled) and a vSwitch WITHOUT CIDR routing.

DeprecatedLuke commented 2 months ago

As far as I know the routes table in the network configuration is not compatible with vSwitch.

But I think it should be possible to use private ip addresses for the nodes (so this currently needs route controller enabled) and vswitch WITHOUT cidr routing.

Yep, it's possible (with Calico at least, in the VXLANCrossSubnet configuration). I've hacked it to recognize the nodes by setting the alpha.kubernetes.io/provided-node-ip annotation, which was working for a short while before it got updated to the real one and broke pod scheduling.
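A hedged sketch of an IPPool in the VXLANCrossSubnet mode mentioned above (pool name and CIDR are placeholders). CrossSubnet mode only encapsulates traffic that crosses subnet boundaries; same-subnet traffic is routed natively.

```yaml
# Hypothetical Calico IPPool using VXLAN only across subnet boundaries.
apiVersion: projectcalico.org/v3
kind: IPPool
metadata:
  name: default-ipv4
spec:
  cidr: 10.0.0.0/16
  vxlanMode: CrossSubnet   # native routing within a subnet, VXLAN across
  natOutgoing: true
```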

simonostendorf commented 2 months ago

As far as I know the routes table in the network configuration is not compatible with vSwitch.

I found the following configuration: https://registry.terraform.io/providers/hetznercloud/hcloud/latest/docs/resources/network#expose_routes_to_vswitch. This tells me that routes should be possible with vSwitch connected servers.

DeprecatedLuke commented 2 months ago

As far as I know the routes table in the network configuration is not compatible with vSwitch.

I found the following configuration: https://registry.terraform.io/providers/hetznercloud/hcloud/latest/docs/resources/network#expose_routes_to_vswitch. This tells me that routes should be possible with vSwitch connected servers.

It's the option here: https://luk.cat/24/9L1WVG.png, but the routes are not assignable, which is required for CNIs to function: https://luk.cat/24/LJtyTr.png

apricote commented 2 weeks ago

The main problem with Robot & Routing is that there is no way to get the private IPs of a Robot server through the API (see #676 for an example).

IIUC there is also no way to have a Route with the Gateway being a private IP of a Robot server behind the vswitch.


It is possible to get the private IP info on the Cloud Servers without using the Routes feature. You need to set HCLOUD_NETWORK_ROUTES_ENABLED=false in the env variables. This will also work when enabling the Robot support.
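A hedged sketch of how that env variable might be set on the deployment (a plain manifest fragment; when installing via Helm, the equivalent would go through the chart's values):

```yaml
# Fragment of the hcloud-cloud-controller-manager container spec:
env:
  - name: HCLOUD_NETWORK_ROUTES_ENABLED
    value: "false"   # keep private-IP discovery, skip the routes feature
  - name: ROBOT_ENABLED
    value: "true"    # per the comment above, this also works with Robot support
```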

olexiyb commented 1 week ago

But is it possible to skip the check for Robot nodes? I do have HCLOUD_NETWORK_ROUTES_ENABLED=false.

These errors are very annoying

2024-08-14T14:38:04.532676333Z E0814 14:38:04.532366       1 node_controller.go:389] Failed to update node addresses for node "scd1": failed to get node address from cloud provider that matches ip: 10.100.0.2
2024-08-14T14:38:04.533375857Z E0814 14:38:04.533291       1 node_controller.go:389] Failed to update node addresses for node "scd2": failed to get node address from cloud provider that matches ip: 10.100.0.3
2024-08-14T14:38:04.534489902Z E0814 14:38:04.534347       1 node_controller.go:389] Failed to update node addresses for node "scd3": failed to get node address from cloud provider that matches ip: 10.100.0.4
2024-08-14T14:43:07.024394731Z E0814 14:43:07.022342       1 node_controller.go:389] Failed to update node addresses for node "scd2": failed to get node address from cloud provider that matches ip: 10.100.0.3
2024-08-14T14:43:07.024800413Z E0814 14:43:07.024527       1 node_controller.go:389] Failed to update node addresses for node "scd3": failed to get node address from cloud provider that matches ip: 10.100.0.4
apricote commented 1 week ago

But is it possible to skip check for robot nodes?

Which check are you talking about? Do you have Robot nodes in your cluster and robot.enabled: true (Helm) or ROBOT_ENABLED=true (Env Var) set in your deployment?