hetznercloud / hcloud-cloud-controller-manager

Kubernetes cloud-controller-manager for Hetzner Cloud
Apache License 2.0

Cloud-Controller w/network (native routing) does not create correct routes #115

Closed ByteAlex closed 3 years ago

ByteAlex commented 3 years ago

Hello,

I've been playing around with kubernetes 1.19 on hcloud for a bit now. Since the documentation about this is pretty old, I've mostly been trying to figure it out on my own.

So my current setup:
1x Network 10.0.0.0/8
1x LB (for a later HA setup of the control planes, 10.0.0.5 here)
1x CPX11 (control plane)
2x CPX11 (worker nodes)

Using kubeadm to setup the kubernetes cluster:

kubeadm init --ignore-preflight-errors=NumCPU --apiserver-cert-extra-sans $API_SERVER_CERT_EXTRA_SANS --control-plane-endpoint "$CONTROL_PLANE_LB" \
  --upload-certs --kubernetes-version=$KUBE_VERSION --pod-network-cidr=$POD_NETWORK_CIDR

with the following variables:
API_SERVER_CERT_EXTRA_SANS=10.0.0.1
CONTROL_PLANE_LB=10.0.0.5
KUBE_VERSION=v1.19.0
POD_NETWORK_CIDR=10.224.0.0/16

After that I copy the kube config and create the secrets for the Hetzner CCM like this:

apiVersion: v1
kind: Secret
metadata:
  name: hcloud
  namespace: kube-system
stringData:
  token: "<hetzner_api_token>"
  network: "<hetzner_network_id>"
---
apiVersion: v1
kind: Secret
metadata:
  name: hcloud-csi
  namespace: kube-system
stringData:
  token: "<hetzner_api_token>"

After that I deploy the CCM with networks support:

kubectl apply -f https://raw.githubusercontent.com/hetznercloud/hcloud-cloud-controller-manager/master/deploy/ccm-networks.yaml

The cloud controller becomes ready, and the nodes show the hcloud://<server-id> provider ID in their describe output.
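
(For reference, a quick way to double-check what the CCM sees per node; the column names are just examples:)

kubectl get nodes -o custom-columns=NAME:.metadata.name,PROVIDER-ID:.spec.providerID,POD-CIDR:.spec.podCIDR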

Now I deploy the latest Cilium with a few tweaked parameters:

wget https://raw.githubusercontent.com/cilium/cilium/1.9.0/install/kubernetes/quick-install.yaml

Edit quick-install.yaml and ensure the following parameters are set:

tunnel: disabled
masquerade: "true"
enable-endpoint-routes: "true"
native-routing-cidr: "10.0.0.0/8"
cluster-pool-ipv4-cidr: "10.224.0.0/16"

Apply the deployment file.

Now the CNI is installed, coredns should start getting scheduled, and the CCM creates routes for the nodes. So far so good, yet the created routes seem wrong to me.

[Screenshot: route table in the Hetzner Cloud console]

Seeing here:

10.224.0.0/24 routes to 10.0.0.2 (master-01)
10.224.1.0/24 routes to 10.0.0.3 (worker-01)
10.224.2.0/24 routes to 10.0.0.4 (worker-02)
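
(A hedged way to cross-check these routes against what Kubernetes assigned; this assumes the hcloud CLI is set up, and <network-id> is a placeholder:)

# PodCIDR per node as seen by Kubernetes (this is what the CCM turns into routes)
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.podCIDR}{"\n"}{end}'
# routes currently configured on the Hetzner network
hcloud network describe <network-id>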

Yet kubectl get pods -A -owide shows a different IP distribution:

root@test-cluster-master-01:~# k get pods -A -owide
NAMESPACE     NAME                                              READY   STATUS    RESTARTS   AGE   IP               NODE                     NOMINATED NODE   READINESS GATES
kube-system   cilium-dkwbl                                      1/1     Running   0          46m   10.0.0.3         test-cluster-worker-01   <none>           <none>
kube-system   cilium-g7whv                                      1/1     Running   0          46m   10.0.0.2         test-cluster-master-01   <none>           <none>
kube-system   cilium-k4tww                                      1/1     Running   0          46m   10.0.0.4         test-cluster-worker-02   <none>           <none>
kube-system   coredns-f9fd979d6-6l8nx                           0/1     Running   0          48m   10.224.0.101     test-cluster-worker-01   <none>           <none>
kube-system   coredns-f9fd979d6-q7dz4                           0/1     Running   0          48m   10.224.1.157     test-cluster-worker-02   <none>           <none>

Where you can see:

10.224.0.101 is scheduled on test-cluster-worker-01 which, according to the routes in cloud console, should have 10.224.1.0/24.
10.224.1.157 is scheduled on test-cluster-worker-02 which should have 10.224.2.0/24

Can someone please point me in the right direction for resolving this issue?

MatthiasLohr commented 3 years ago

Crazy, I haven't seen that yet. You can still a) use another network plugin (for me, flannel works fine) or b) completely ignore the Hetzner network routing and use e.g. IPIP/VXLAN encapsulation (I didn't try that on hcloud yet).

Not sure if related: when I used Cilium in hcloud environments, it showed a huge number of restarts (a couple of hundred per week). That's the reason why I moved away from Cilium; I could not figure out whether it was a Cilium or an hcloud issue.

ByteAlex commented 3 years ago

Just clarifying: in your issue #112 you were referring to the ccm-networks documentation, which fully depends on and makes use of the Hetzner routing, yet you still use another tunneling CNI?

Someone correct me if I am wrong, but if you use a tunneling CNI you can just go with the normal CCM without networks support; then you do not have to mess around with Cilium anyway if you're successfully using Flannel.

I think the issue is on hcloud's end, since someone at Cilium already looked into it and said the routing magic in hcloud might be wrong.

MatthiasLohr commented 3 years ago

You can basically do these different flavors:

I'm doing the third variant. You can do that with different plugins, e.g. cilium (native-routing-cidr), flannel (backend type "alloc"), cilium (no IPIP, no vxlan).

ByteAlex commented 3 years ago

Okay, got that. Anyway, I'm trying to figure out how to configure cilium/hcloud to create the correct routing tables in the cloud console. 👀

LKaemmerling commented 3 years ago

The cloud controller gets the values from k8s (or rather from the controlling CNI plugin). The cloud controller just adds these routes.

I'm currently just on my mobile phone, so I can't point you to the exact location, but if I remember correctly you need to set a specific Cilium configuration option, something like "blacklisted-routes" (I'll try to find it, give me a few minutes).

Edit: found the specific configuration lines: https://github.com/hetznercloud/hcloud-cloud-controller-manager/issues/44#issuecomment-652246804

The problem is that we don't recommend or support a specific CNI. We just implement the spec given by k8s.

MatthiasLohr commented 3 years ago

@LKaemmerling I guess you mean blacklist-conflicting-routes: "false"?

Edit: never mind, you just edited your comment.
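
(For reference, that option is a key in the cilium-config ConfigMap, roughly like this; per the later comments it should only be needed on Cilium versions older than 1.9:)

apiVersion: v1
kind: ConfigMap
metadata:
  name: cilium-config
  namespace: kube-system
data:
  # merge into the existing cilium-config; only the relevant key shown
  blacklist-conflicting-routes: "false"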

MatthiasLohr commented 3 years ago

I guess that many people will have similar questions to these... to have a more persistent solution than just tickets and comments:

https://gitlab.com/MatthiasLohr/hcloud-cloud-controller-manager-helm-chart/-/blob/master/docs/networking.md

@LKaemmerling, completely off topic: maybe it's worth thinking about linking this somewhere here.

ByteAlex commented 3 years ago

@LKaemmerling This sounds about right. I see routes being created as the Cilium nodes allocate subnets, yet the routing seems to be wrong.

See this:

Spec:
  Addresses:
    Ip:    138.201.94.89
    Type:  ExternalIP
    Ip:    10.0.0.2
    Type:  InternalIP
    Ip:    10.224.2.153
    Type:  CiliumInternalIP
  Azure:
  Encryption:
  Eni:
  Health:
    ipv4:  10.224.2.92
  Ipam:
    Pod CID Rs:
      10.224.2.0/24

10.224.2.0/24 should be routed to 10.0.0.2, but points to 10.0.0.4 in the hcloud console.

Clarification: the IP allocation from Cilium works fine and the Cilium configuration seems fine; the only problem seems to be that the routing is wrong.

[Screenshot: route table in the Hetzner Cloud console]
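
(One way to see the mismatch side by side; the CiliumNode field path matches the describe output above:)

# what Cilium's IPAM allocated per node
kubectl get ciliumnodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.ipam.podCIDRs}{"\n"}{end}'
# what Kubernetes (and therefore the CCM) thinks each node owns
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.podCIDR}{"\n"}{end}'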

LKaemmerling commented 3 years ago

@ByteAlex could you try to use this cilium config? https://github.com/hetznercloud/hcloud-cloud-controller-manager/blob/master/e2etests/templates/cilium.yml

We use this config within our e2e tests to test the functionality of the whole networks feature. I just created a new setup with this config and it works fine:

root@srv-local-2580907693548082956:~# k describe node srv-local-2580907693548082956
Name:               srv-local-2580907693548082956
Roles:              <none>
Labels:             beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/instance-type=cpx21
                    beta.kubernetes.io/os=linux
                    failure-domain.beta.kubernetes.io/region=fsn1
                    failure-domain.beta.kubernetes.io/zone=fsn1-dc14
                    kubernetes.io/arch=amd64
                    kubernetes.io/hostname=srv-local-2580907693548082956
                    kubernetes.io/os=linux
                    node.kubernetes.io/instance-type=cpx21
                    topology.kubernetes.io/region=fsn1
                    topology.kubernetes.io/zone=fsn1-dc14
Annotations:        io.cilium.network.ipv4-cilium-host: 10.244.0.123
                    io.cilium.network.ipv4-health-ip: 10.244.0.47
                    io.cilium.network.ipv4-pod-cidr: 10.244.0.0/24
CreationTimestamp:  Thu, 12 Nov 2020 07:16:38 +0100
Taints:             <none>
Unschedulable:      false
Addresses:
  Hostname:    srv-local-2580907693548082956
  ExternalIP:  168.119.154.106
  InternalIP:  10.0.0.2
System Info:
  Kernel Version:             5.4.0-52-generic
  OS Image:                   Ubuntu 20.04.1 LTS
  Operating System:           linux
  Architecture:               amd64
  Container Runtime Version:  docker://19.3.8
  Kubelet Version:            v1.19.4
  Kube-Proxy Version:         v1.19.4
PodCIDR:                      10.244.0.0/24
PodCIDRs:                     10.244.0.0/24
ProviderID:                   hcloud://8493791

You can find the cluster configuration here: https://github.com/hetznercloud/hcloud-cloud-controller-manager/blob/master/e2etests/templates/cloudinit.txt.tpl#L13

ByteAlex commented 3 years ago

I just compared the cloudinit.txt to my setup and it seems to be about the same. At the end of this script I just apply the ccm-networks.yaml for the hcloud CCM, then the Cilium CNI.

When I try to apply the template file you sent I get an error:

root@test-cluster-master-01:~# k apply -f https://github.com/hetznercloud/hcloud-cloud-controller-manager/blob/master/e2etests/templates/cilium.yml
error: error parsing https://github.com/hetznercloud/hcloud-cloud-controller-manager/blob/master/e2etests/templates/cilium.yml: error converting YAML to JSON: yaml: line 147: mapping values are not allowed in this context

Anyway, in my latest test I got the correct routes (by accident?):

Addresses:
  Hostname:    test-cluster-master-01
  ExternalIP:  138.201.94.89
  InternalIP:  10.0.0.2
PodCIDR:                      10.224.0.0/24
PodCIDRs:                     10.224.0.0/24
ProviderID:                   hcloud://8446735
Addresses:
  Hostname:    test-cluster-worker-01
  ExternalIP:  49.12.44.179
  InternalIP:  10.0.0.3
PodCIDR:                      10.224.1.0/24
PodCIDRs:                     10.224.1.0/24
ProviderID:                   hcloud://8443725
Addresses:
  Hostname:    test-cluster-worker-02
  ExternalIP:  138.201.93.167
  InternalIP:  10.0.0.4
PodCIDR:                      10.224.2.0/24
PodCIDRs:                     10.224.2.0/24
ProviderID:                   hcloud://8443726

[Screenshot: routes in the Hetzner Cloud console]

Yet something in the routing is still wrong when deploying the latest quick-install.yaml from 1.9 with a modified configuration. I restarted the coredns and api-server pods, yet coredns still does not become ready, as it can't reach the kubernetes service in the cluster.

root@test-cluster-master-01:~# k get pods -A -owide
NAMESPACE     NAME                                              READY   STATUS    RESTARTS   AGE     IP               NODE                     NOMINATED NODE   READINESS GATES
kube-system   cilium-db2wd                                      1/1     Running   0          2m28s   10.0.0.4         test-cluster-worker-02   <none>           <none>
kube-system   cilium-operator-5d8498fc44-lkmsg                  1/1     Running   0          2m28s   10.0.0.4         test-cluster-worker-02   <none>           <none>
kube-system   cilium-operator-5d8498fc44-skvdl                  1/1     Running   0          2m28s   10.0.0.3         test-cluster-worker-01   <none>           <none>
kube-system   cilium-shkdm                                      1/1     Running   0          2m28s   10.0.0.3         test-cluster-worker-01   <none>           <none>
kube-system   cilium-vwrxm                                      1/1     Running   0          2m28s   10.0.0.2         test-cluster-master-01   <none>           <none>
kube-system   coredns-f9fd979d6-n6ls6                           0/1     Running   0          67s     10.224.1.218     test-cluster-worker-01   <none>           <none>
kube-system   coredns-f9fd979d6-r6gpr                           0/1     Running   0          67s     10.224.0.217     test-cluster-worker-02   <none>           <none>
kube-system   etcd-test-cluster-master-01                       1/1     Running   0          19m     138.201.94.89    test-cluster-master-01   <none>           <none>
kube-system   hcloud-cloud-controller-manager-cb9c6698d-mmd97   1/1     Running   0          19m     138.201.94.89    test-cluster-master-01   <none>           <none>
kube-system   kube-apiserver-test-cluster-master-01             1/1     Running   0          18s     138.201.94.89    test-cluster-master-01   <none>           <none>
kube-system   kube-controller-manager-test-cluster-master-01    1/1     Running   0          19m     138.201.94.89    test-cluster-master-01   <none>           <none>
kube-system   kube-proxy-8bwc6                                  1/1     Running   0          17m     138.201.93.167   test-cluster-worker-02   <none>           <none>
kube-system   kube-proxy-tvs4h                                  1/1     Running   0          19m     138.201.94.89    test-cluster-master-01   <none>           <none>
kube-system   kube-proxy-wg54p                                  1/1     Running   0          17m     49.12.44.179     test-cluster-worker-01   <none>           <none>
kube-system   kube-scheduler-test-cluster-master-01             1/1     Running   0          19m     138.201.94.89    test-cluster-master-01   <none>           <none>

Though from worker-01 I can reach the kubernetes service via curl.

Any idea what could cause this?

Additional information: coredns can reach the kubernetes service when it's scheduled on the same node as the control plane (master-01).
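
(In case it helps narrow this down, the cross-node datapath can be checked from inside one of the Cilium agent pods; the pod selection here is just an example:)

CILIUM_POD=$(kubectl -n kube-system get pods -l k8s-app=cilium -o name | head -n 1)
kubectl -n kube-system exec "$CILIUM_POD" -- cilium status --brief
kubectl -n kube-system exec "$CILIUM_POD" -- cilium-health status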

LKaemmerling commented 3 years ago

@ByteAlex I gave you the link to the GitHub-rendered file, not the raw file, which is why it failed. I talked about this with our DevOps and we both think that you misconfigured Cilium. Have a look at the config file (now the raw, applyable file): https://raw.githubusercontent.com/hetznercloud/hcloud-cloud-controller-manager/master/e2etests/templates/cilium.yml
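
(i.e. applied from the raw URL:)

kubectl apply -f https://raw.githubusercontent.com/hetznercloud/hcloud-cloud-controller-manager/master/e2etests/templates/cilium.yml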

ByteAlex commented 3 years ago

@LKaemmerling thank you for clarifying! I was checking with the Cilium team before and they claimed everything is alright on their end and that it would most likely be an issue on Hetzner's side. Anyway, I was not home for the past few days and will test the configuration you provided tonight or tomorrow.

The configuration I am/was using is this:

root@test-cluster-master-01:/opt/k8s/provisioning# cat cilium.yaml
---
# Source: cilium/templates/cilium-agent-serviceaccount.yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: cilium
  namespace: kube-system
---
# Source: cilium/templates/cilium-operator-serviceaccount.yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: cilium-operator
  namespace: kube-system
---
# Source: cilium/templates/cilium-configmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: cilium-config
  namespace: kube-system
data:

  # Identity allocation mode selects how identities are shared between cilium
  # nodes by setting how they are stored. The options are "crd" or "kvstore".
  # - "crd" stores identities in kubernetes as CRDs (custom resource definition).
  #   These can be queried with:
  #     kubectl get ciliumid
  # - "kvstore" stores identities in a kvstore, etcd or consul, that is
  #   configured below. Cilium versions before 1.6 supported only the kvstore
  #   backend. Upgrades from these older cilium versions should continue using
  #   the kvstore by commenting out the identity-allocation-mode below, or
  #   setting it to "kvstore".
  identity-allocation-mode: crd
  cilium-endpoint-gc-interval: "5m0s"

  # If you want to run cilium in debug mode change this value to true
  debug: "false"

  # Enable IPv4 addressing. If enabled, all endpoints are allocated an IPv4
  # address.
  enable-ipv4: "true"

  # Enable IPv6 addressing. If enabled, all endpoints are allocated an IPv6
  # address.
  enable-ipv6: "false"
  # Users who wish to specify their own custom CNI configuration file must set
  # custom-cni-conf to "true", otherwise Cilium may overwrite the configuration.
  custom-cni-conf: "false"
  enable-bpf-clock-probe: "true"
  # If you want cilium monitor to aggregate tracing for packets, set this level
  # to "low", "medium", or "maximum". The higher the level, the less packets
  # that will be seen in monitor output.
  monitor-aggregation: medium

  # The monitor aggregation interval governs the typical time between monitor
  # notification events for each allowed connection.
  #
  # Only effective when monitor aggregation is set to "medium" or higher.
  monitor-aggregation-interval: 5s

  # The monitor aggregation flags determine which TCP flags which, upon the
  # first observation, cause monitor notifications to be generated.
  #
  # Only effective when monitor aggregation is set to "medium" or higher.
  monitor-aggregation-flags: all
  # Specifies the ratio (0.0-1.0) of total system memory to use for dynamic
  # sizing of the TCP CT, non-TCP CT, NAT and policy BPF maps.
  bpf-map-dynamic-size-ratio: "0.0025"
  # bpf-policy-map-max specifies the maximum number of entries in endpoint
  # policy map (per endpoint)
  bpf-policy-map-max: "16384"
  # bpf-lb-map-max specifies the maximum number of entries in bpf lb service,
  # backend and affinity maps.
  bpf-lb-map-max: "65536"
  # Pre-allocation of map entries allows per-packet latency to be reduced, at
  # the expense of up-front memory allocation for the entries in the maps. The
  # default value below will minimize memory usage in the default installation;
  # users who are sensitive to latency may consider setting this to "true".
  #
  # This option was introduced in Cilium 1.4. Cilium 1.3 and earlier ignore
  # this option and behave as though it is set to "true".
  #
  # If this value is modified, then during the next Cilium startup the restore
  # of existing endpoints and tracking of ongoing connections may be disrupted.
  # As a result, reply packets may be dropped and the load-balancing decisions
  # for established connections may change.
  #
  # If this option is set to "false" during an upgrade from 1.3 or earlier to
  # 1.4 or later, then it may cause one-time disruptions during the upgrade.
  preallocate-bpf-maps: "false"

  # Regular expression matching compatible Istio sidecar istio-proxy
  # container image names
  sidecar-istio-proxy-image: "cilium/istio_proxy"

  # Encapsulation mode for communication between nodes
  # Possible values:
  #   - disabled
  #   - vxlan (default)
  #   - geneve
  tunnel: disabled

  # Name of the cluster. Only relevant when building a mesh of clusters.
  cluster-name: default
  # Enables L7 proxy for L7 policy enforcement and visibility
  enable-l7-proxy: "true"

  # wait-bpf-mount makes init container wait until bpf filesystem is mounted
  wait-bpf-mount: "false"

  masquerade: "true"
  enable-bpf-masquerade: "true"

  enable-xt-socket-fallback: "true"
  install-iptables-rules: "true"

  auto-direct-node-routes: "false"
  enable-bandwidth-manager: "false"
  enable-local-redirect-policy: "false"
  kube-proxy-replacement:  "probe"
  kube-proxy-replacement-healthz-bind-address: ""
  enable-health-check-nodeport: "true"
  node-port-bind-protection: "true"
  enable-auto-protect-node-port-range: "true"
  enable-session-affinity: "true"
  enable-endpoint-health-checking: "true"
  enable-health-checking: "true"
  enable-well-known-identities: "false"
  enable-remote-node-identity: "true"
  operator-api-serve-addr: "127.0.0.1:9234"
  # Enable Hubble gRPC service.
  enable-hubble: "true"
  # UNIX domain socket for Hubble server to listen to.
  hubble-socket-path:  "/var/run/cilium/hubble.sock"
  ipam: "cluster-pool"
  cluster-pool-ipv4-cidr: 10.224.0.0/16
  cluster-pool-ipv4-mask-size: "24"
  disable-cnp-status-updates: "true"

  #Inserted Configuration
  native-routing-cidr: 10.0.0.0/8
  enable-endpoint-routes: "true"
---
# Source: cilium/templates/cilium-agent-clusterrole.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: cilium
rules:
- apiGroups:
  - networking.k8s.io
  resources:
  - networkpolicies
  verbs:
  - get
  - list
  - watch
- apiGroups:
  - discovery.k8s.io
  resources:
  - endpointslices
  verbs:
  - get
  - list
  - watch
- apiGroups:
  - ""
  resources:
  - namespaces
  - services
  - nodes
  - endpoints
  verbs:
  - get
  - list
  - watch
- apiGroups:
  - ""
  resources:
  - pods
  - pods/finalizers
  verbs:
  - get
  - list
  - watch
  - update
  - delete
- apiGroups:
  - ""
  resources:
  - nodes
  verbs:
  - get
  - list
  - watch
  - update
- apiGroups:
  - ""
  resources:
  - nodes
  - nodes/status
  verbs:
  - patch
- apiGroups:
  - apiextensions.k8s.io
  resources:
  - customresourcedefinitions
  verbs:
  # Deprecated for removal in v1.10
  - create
  - list
  - watch
  - update

  # This is used when validating policies in preflight. This will need to stay
  # until we figure out how to avoid "get" inside the preflight, and then
  # should be removed ideally.
  - get
- apiGroups:
  - cilium.io
  resources:
  - ciliumnetworkpolicies
  - ciliumnetworkpolicies/status
  - ciliumnetworkpolicies/finalizers
  - ciliumclusterwidenetworkpolicies
  - ciliumclusterwidenetworkpolicies/status
  - ciliumclusterwidenetworkpolicies/finalizers
  - ciliumendpoints
  - ciliumendpoints/status
  - ciliumendpoints/finalizers
  - ciliumnodes
  - ciliumnodes/status
  - ciliumnodes/finalizers
  - ciliumidentities
  - ciliumidentities/finalizers
  - ciliumlocalredirectpolicies
  - ciliumlocalredirectpolicies/status
  - ciliumlocalredirectpolicies/finalizers
  verbs:
  - '*'
---
# Source: cilium/templates/cilium-operator-clusterrole.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: cilium-operator
rules:
- apiGroups:
  - ""
  resources:
  # to automatically delete [core|kube]dns pods so that are starting to being
  # managed by Cilium
  - pods
  verbs:
  - get
  - list
  - watch
  - delete
- apiGroups:
  - discovery.k8s.io
  resources:
  - endpointslices
  verbs:
  - get
  - list
  - watch
- apiGroups:
  - ""
  resources:
  # to perform the translation of a CNP that contains `ToGroup` to its endpoints
  - services
  - endpoints
  # to check apiserver connectivity
  - namespaces
  verbs:
  - get
  - list
  - watch
- apiGroups:
  - cilium.io
  resources:
  - ciliumnetworkpolicies
  - ciliumnetworkpolicies/status
  - ciliumnetworkpolicies/finalizers
  - ciliumclusterwidenetworkpolicies
  - ciliumclusterwidenetworkpolicies/status
  - ciliumclusterwidenetworkpolicies/finalizers
  - ciliumendpoints
  - ciliumendpoints/status
  - ciliumendpoints/finalizers
  - ciliumnodes
  - ciliumnodes/status
  - ciliumnodes/finalizers
  - ciliumidentities
  - ciliumidentities/status
  - ciliumidentities/finalizers
  - ciliumlocalredirectpolicies
  - ciliumlocalredirectpolicies/status
  - ciliumlocalredirectpolicies/finalizers
  verbs:
  - '*'
- apiGroups:
  - apiextensions.k8s.io
  resources:
  - customresourcedefinitions
  verbs:
  - create
  - get
  - list
  - update
  - watch
# For cilium-operator running in HA mode.
#
# Cilium operator running in HA mode requires the use of ResourceLock for Leader Election
# between mulitple running instances.
# The preferred way of doing this is to use LeasesResourceLock as edits to Leases are less
# common and fewer objects in the cluster watch "all Leases".
# The support for leases was introduced in coordination.k8s.io/v1 during Kubernetes 1.14 release.
# In Cilium we currently don't support HA mode for K8s version < 1.14. This condition make sure
# that we only authorize access to leases resources in supported K8s versions.
- apiGroups:
  - coordination.k8s.io
  resources:
  - leases
  verbs:
  - create
  - get
  - update
---
# Source: cilium/templates/cilium-agent-clusterrolebinding.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: cilium
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: cilium
subjects:
- kind: ServiceAccount
  name: cilium
  namespace: kube-system
---
# Source: cilium/templates/cilium-operator-clusterrolebinding.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: cilium-operator
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: cilium-operator
subjects:
- kind: ServiceAccount
  name: cilium-operator
  namespace: kube-system
---
# Source: cilium/templates/cilium-agent-daemonset.yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  labels:
    k8s-app: cilium
  name: cilium
  namespace: kube-system
spec:
  selector:
    matchLabels:
      k8s-app: cilium
  updateStrategy:
    rollingUpdate:
      maxUnavailable: 2
    type: RollingUpdate
  template:
    metadata:
      annotations:
        # This annotation plus the CriticalAddonsOnly toleration makes
        # cilium to be a critical pod in the cluster, which ensures cilium
        # gets priority scheduling.
        # https://kubernetes.io/docs/tasks/administer-cluster/guaranteed-scheduling-critical-addon-pods/
        scheduler.alpha.kubernetes.io/critical-pod: ""
      labels:
        k8s-app: cilium
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: k8s-app
                operator: In
                values:
                - cilium
            topologyKey: kubernetes.io/hostname
      containers:
      - args:
        - --config-dir=/tmp/cilium/config-map
        command:
        - cilium-agent
        livenessProbe:
          httpGet:
            host: '127.0.0.1'
            path: /healthz
            port: 9876
            scheme: HTTP
            httpHeaders:
            - name: "brief"
              value: "true"
          failureThreshold: 10
          # The initial delay for the liveness probe is intentionally large to
          # avoid an endless kill & restart cycle if in the event that the initial
          # bootstrapping takes longer than expected.
          initialDelaySeconds: 120
          periodSeconds: 30
          successThreshold: 1
          timeoutSeconds: 5
        readinessProbe:
          httpGet:
            host: '127.0.0.1'
            path: /healthz
            port: 9876
            scheme: HTTP
            httpHeaders:
            - name: "brief"
              value: "true"
          failureThreshold: 3
          initialDelaySeconds: 5
          periodSeconds: 30
          successThreshold: 1
          timeoutSeconds: 5
        env:
        - name: K8S_NODE_NAME
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: spec.nodeName
        - name: CILIUM_K8S_NAMESPACE
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: metadata.namespace
        - name: CILIUM_FLANNEL_MASTER_DEVICE
          valueFrom:
            configMapKeyRef:
              key: flannel-master-device
              name: cilium-config
              optional: true
        - name: CILIUM_FLANNEL_UNINSTALL_ON_EXIT
          valueFrom:
            configMapKeyRef:
              key: flannel-uninstall-on-exit
              name: cilium-config
              optional: true
        - name: CILIUM_CLUSTERMESH_CONFIG
          value: /var/lib/cilium/clustermesh/
        - name: CILIUM_CNI_CHAINING_MODE
          valueFrom:
            configMapKeyRef:
              key: cni-chaining-mode
              name: cilium-config
              optional: true
        - name: CILIUM_CUSTOM_CNI_CONF
          valueFrom:
            configMapKeyRef:
              key: custom-cni-conf
              name: cilium-config
              optional: true
        image: quay.io/cilium/cilium:v1.9.0
        imagePullPolicy: IfNotPresent
        lifecycle:
          postStart:
            exec:
              command:
              - "/cni-install.sh"
              - "--enable-debug=false"
          preStop:
            exec:
              command:
              - /cni-uninstall.sh
        name: cilium-agent
        securityContext:
          capabilities:
            add:
            - NET_ADMIN
            - SYS_MODULE
          privileged: true
        volumeMounts:
        - mountPath: /sys/fs/bpf
          name: bpf-maps
        - mountPath: /var/run/cilium
          name: cilium-run
        - mountPath: /host/opt/cni/bin
          name: cni-path
        - mountPath: /host/etc/cni/net.d
          name: etc-cni-netd
        - mountPath: /var/lib/cilium/clustermesh
          name: clustermesh-secrets
          readOnly: true
        - mountPath: /tmp/cilium/config-map
          name: cilium-config-path
          readOnly: true
          # Needed to be able to load kernel modules
        - mountPath: /lib/modules
          name: lib-modules
          readOnly: true
        - mountPath: /run/xtables.lock
          name: xtables-lock
      hostNetwork: true
      initContainers:
      - command:
        - /init-container.sh
        env:
        - name: CILIUM_ALL_STATE
          valueFrom:
            configMapKeyRef:
              key: clean-cilium-state
              name: cilium-config
              optional: true
        - name: CILIUM_BPF_STATE
          valueFrom:
            configMapKeyRef:
              key: clean-cilium-bpf-state
              name: cilium-config
              optional: true
        - name: CILIUM_WAIT_BPF_MOUNT
          valueFrom:
            configMapKeyRef:
              key: wait-bpf-mount
              name: cilium-config
              optional: true
        image: quay.io/cilium/cilium:v1.9.0
        imagePullPolicy: IfNotPresent
        name: clean-cilium-state
        securityContext:
          capabilities:
            add:
            - NET_ADMIN
          privileged: true
        volumeMounts:
        - mountPath: /sys/fs/bpf
          name: bpf-maps
          mountPropagation: HostToContainer
        - mountPath: /var/run/cilium
          name: cilium-run
        resources:
          requests:
            cpu: 100m
            memory: 100Mi
      restartPolicy: Always
      priorityClassName: system-node-critical
      serviceAccount: cilium
      serviceAccountName: cilium
      terminationGracePeriodSeconds: 1
      tolerations:
      - operator: Exists
      volumes:
        # To keep state between restarts / upgrades
      - hostPath:
          path: /var/run/cilium
          type: DirectoryOrCreate
        name: cilium-run
        # To keep state between restarts / upgrades for bpf maps
      - hostPath:
          path: /sys/fs/bpf
          type: DirectoryOrCreate
        name: bpf-maps
      # To install cilium cni plugin in the host
      - hostPath:
          path:  /opt/cni/bin
          type: DirectoryOrCreate
        name: cni-path
        # To install cilium cni configuration in the host
      - hostPath:
          path: /etc/cni/net.d
          type: DirectoryOrCreate
        name: etc-cni-netd
        # To be able to load kernel modules
      - hostPath:
          path: /lib/modules
        name: lib-modules
        # To access iptables concurrently with other processes (e.g. kube-proxy)
      - hostPath:
          path: /run/xtables.lock
          type: FileOrCreate
        name: xtables-lock
        # To read the clustermesh configuration
      - name: clustermesh-secrets
        secret:
          defaultMode: 420
          optional: true
          secretName: cilium-clustermesh
        # To read the configuration from the config map
      - configMap:
          name: cilium-config
        name: cilium-config-path
---
# Source: cilium/templates/cilium-operator-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    io.cilium/app: operator
    name: cilium-operator
  name: cilium-operator
  namespace: kube-system
spec:
  # We support HA mode only for Kubernetes version > 1.14
  # See docs on ServerCapabilities.LeasesResourceLock in file pkg/k8s/version/version.go
  # for more details.
  replicas: 2
  selector:
    matchLabels:
      io.cilium/app: operator
      name: cilium-operator
  strategy:
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 1
    type: RollingUpdate
  template:
    metadata:
      annotations:
      labels:
        io.cilium/app: operator
        name: cilium-operator
    spec:
      # In HA mode, cilium-operator pods must not be scheduled on the same
      # node as they will clash with each other.
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: io.cilium/app
                operator: In
                values:
                - operator
            topologyKey: kubernetes.io/hostname
      containers:
      - args:
        - --config-dir=/tmp/cilium/config-map
        - --debug=$(CILIUM_DEBUG)
        command:
        - cilium-operator-generic
        env:
        - name: K8S_NODE_NAME
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: spec.nodeName
        - name: CILIUM_K8S_NAMESPACE
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: metadata.namespace
        - name: CILIUM_DEBUG
          valueFrom:
            configMapKeyRef:
              key: debug
              name: cilium-config
              optional: true
        image: quay.io/cilium/operator-generic:v1.9.0
        imagePullPolicy: IfNotPresent
        name: cilium-operator
        livenessProbe:
          httpGet:
            host: '127.0.0.1'
            path: /healthz
            port: 9234
            scheme: HTTP
          initialDelaySeconds: 60
          periodSeconds: 10
          timeoutSeconds: 3
        volumeMounts:
        - mountPath: /tmp/cilium/config-map
          name: cilium-config-path
          readOnly: true
      hostNetwork: true
      restartPolicy: Always
      priorityClassName: system-cluster-critical
      serviceAccount: cilium-operator
      serviceAccountName: cilium-operator
      tolerations:
      - operator: Exists
      volumes:
        # To read the configuration from the config map
      - configMap:
          name: cilium-config
        name: cilium-config-path

ByteAlex commented 3 years ago

I've tried the e2e config you linked and it does indeed work, yet the latest Cilium v1.9.0 quick-install does not.

The following steps were done:

  1. Download https://raw.githubusercontent.com/cilium/cilium/1.9.0/install/kubernetes/quick-install.yaml
  2. tunnel: disabled
  3. masquerade: "true"
  4. enable-endpoint-routes: "true"
  5. auto-direct-node-routes: "false"
  6. native-routing-cidr: "10.0.0.0/8" as it is my hetzner network
  7. cluster-pool-ipv4-cidr: "10.224.0.0/16" so Cilium's IPAM assigns addresses from the kubernetes pod range properly
  8. blacklist-conflicting-routes: "false" even though that should no longer be needed in v1.9.0+

Which results in the following difference: https://www.diffchecker.com/WWRn81qq

I also tried re-adding the deleted entries from L141-151, but that also didn't change the result: v1.9.0 does not work with Hetzner using the above configuration. Do you have another idea what could have gone wrong there?
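
(To rule out an apply problem on my side, one thing I can check is which options actually landed in the live ConfigMap; note that the agents only read it at startup, so they need a restart after changes:)

kubectl -n kube-system get configmap cilium-config -o yaml | grep -E 'tunnel|masquerade|native-routing-cidr|cluster-pool-ipv4-cidr|enable-endpoint-routes'
# agents read the config at startup, so restart them after editing the ConfigMap
kubectl -n kube-system rollout restart daemonset/cilium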

LKaemmerling commented 3 years ago

@ByteAlex as I said, it should be your configuration. Cilium should not assign the addresses; k8s does that. Cilium is basically just the one that puts the data "on the wire" (and maybe encrypts it before).

ByteAlex commented 3 years ago

The CNI plugs into Kubernetes IPAM (https://github.com/containernetworking/cni/blob/spec-v0.4.0/SPEC.md). When I try to start v1.9.0 without IPAM, the cilium containers don't start and stay in an error state.

Also, I might have misstated the "new" issue. The routes were created correctly (at least this time), but I can't establish a successful connection between pods on different nodes.

But accessing a kubernetes service address works when doing it from the host.

The same curl from inside a kubernetes pod fails.

ByteAlex commented 3 years ago

@LKaemmerling If you could spare a few more minutes on this: could you please try deploying Cilium v1.9.0 to check whether it works for you too? I still can't get this working. Sorry for bothering you.

nupplaphil commented 3 years ago

@ByteAlex - if you haven't found the solution yet:

Sources:

ByteAlex commented 3 years ago

@nupplaphil Yeah, I figured out that the config LKaemmerling posted works, but not on the latest version. That's why I asked again. But I think I can go ahead and close this issue. Thanks for your assistance!

AlexMe99 commented 3 years ago

@ByteAlex I went through this topic when I wanted to init a k8s cluster (v1.20.0) on Hetzner Cloud with cilium (1.9.4) and faced similar issues. What solved them for me was (1) taking care to set the appropriate --node-ip (internal network IP) on each node (master + worker) as a kubelet start argument (via a kubelet.service.d drop-in with KUBELET_EXTRA_ARGS, sketched below), and (2) carving separate subsets out of the general Hetzner network for the subnet and for the pod and service networks. Something like this: Network: 10.0.0.0/8; Subnet: 10.1.0.0/16; Pod-Net: 10.2.0.0/16; Srv-Net: 10.3.0.0/16. Especially the appropriate setting of the networks was important. I'm not a networking person, but the masquerading seems to kill every attempt to separate the subnet and the pod/service net into different address ranges (like 192... or similar).
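
A minimal sketch of the kubelet drop-in described above; the file name and IP are only examples and must match your node's own private IP:

cat <<'EOF' >/etc/systemd/system/kubelet.service.d/20-node-ip.conf
[Service]
Environment="KUBELET_EXTRA_ARGS=--node-ip=10.1.0.2"
EOF
systemctl daemon-reload
systemctl restart kubelet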

philipp1992 commented 3 years ago

@AlexMe99

Thanks for the info. Could you share the exact settings? How did you create the hcloud network, which arguments did you pass to the k3s master and workers, and which cilium deployment did you use?

Kind regards, Philipp

nupplaphil commented 3 years ago

If you're still curious, I solved it too.

This is my working cilium-file with cilium 1.9.5: https://github.com/nupplaphil/hcloud-k8s/blob/stable/roles/kube-master/files/cilium.yaml

But you have to keep an eye on your CIDRs in other places too (as @AlexMe99 already said), like:
https://github.com/nupplaphil/hcloud-k8s/blob/stable/roles/kube-master/files/hcloud-controller.yaml
https://github.com/nupplaphil/hcloud-k8s/blob/f8ee5f18319ad3957a052603c84c4627d23a14e1/roles/kube-master/tasks/tasks.yaml#L6

mysticaltech commented 3 years ago

@ByteAlex I guess you finally understood that the pod CIDR allocation done by Cilium for each node is random! For instance, on the first deployment node 1 gets 10.244.0.0/24, but on the second deployment, it gets 10.244.2.0/24. It doesn't matter. And the resulting routes picked up by Hetzner CCM are indeed correct.

mysticaltech commented 3 years ago

Btw, for those like me using k3s, the default cluster CIDR is 10.42.0.0/16, so that's what we pass to --cluster-cidr in the CCM config.

As for the service network, the default range is 10.43.0.0/16 for k3s, but from my understanding it's a virtual IP range, so no config is needed there; please correct me if I'm wrong.
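
(A hedged sketch of the corresponding flags on the hcloud CCM Deployment for k3s defaults; the flag names follow the upstream ccm-networks.yaml, so double-check against your copy:)

# excerpt from the hcloud-cloud-controller-manager container spec
command:
  - "/bin/hcloud-cloud-controller-manager"
  - "--cloud-provider=hcloud"
  - "--allocate-node-cidrs=true"
  - "--cluster-cidr=10.42.0.0/16"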

kiwinesian commented 3 years ago

@mysticaltech may I ask which Linux distro you deployed it to? I was able to run it successfully on Ubuntu 18.04.

However, I can't get it running on Debian 10 or Ubuntu 20.04: hcloud-csi-controller-0 keeps crashing while everything else looks good.

It is the csi-provisioner that keeps crashing.

Still new to k8s, so any direction will be much appreciated! 🙏

kiwinesian commented 3 years ago

Hi @LKaemmerling, would you mind shedding some light on my struggle above? :( It seems to work fine on 18.04, but the hcloud-csi-controller won't work on Debian 10 or Ubuntu 20.04.

mysticaltech commented 3 years ago

@kiwinesian Only yesterday, after reading the Cilium docs really well and checking with "cilium status", did I realize that Cilium was using host networking in legacy mode, i.e. iptables, not BPF. It turns out even the most recent version of Ubuntu is still on kernel 5.4, but Cilium needs kernels > 5.10 to activate its latest goodness, especially BPF host networking.

So I will be using Fedora 34, which has kernel 5.12.

There is also the fact that in my current sub-optimal setup I can't reach services (using k3s). I figure that is probably because kube-proxy is not completely replaced, so I will be following the instructions here too: https://docs.cilium.io/en/v1.10/gettingstarted/kubeproxy-free/.
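
(The relevant Helm values from that guide look roughly like this; the API endpoint is a placeholder:)

helm repo add cilium https://helm.cilium.io/
helm install cilium cilium/cilium --namespace kube-system \
  --set kubeProxyReplacement=strict \
  --set k8sServiceHost=<control-plane-ip> \
  --set k8sServicePort=6443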

I will post updates when I'm done, but please do the same, @kiwinesian, if you get to work on this before me, and don't hesitate if you have any questions. Even on my old setup, Hetzner CSI was starting fine with the following:

Network: 10.0.0.0/8
Subnetwork: 10.0.0.0/16
cluster CIDR / pod CIDR / ipv4 pool CIDR (in cilium): 10.42.0.0/16 (for k3s)
native routing CIDR (in cilium): 10.0.0.0/8
ipam (in cilium): remove mention of it, so that it chooses the default of cluster-scope

kiwinesian commented 3 years ago

hi @mysticaltech

Thanks so much for looking into this! The interesting part is that Ubuntu 18.04 works completely fine with iptables, but the csi-provisioner crashes on Ubuntu 20.04 (and Debian 10) with the exact same cilium.yaml config.

I can try Fedora 34 and see if I can get it going. I will have to rewrite the Ansible script to deploy all of this; I'll report back, maybe after the weekend. :)

mysticaltech commented 3 years ago

Awesome @kiwinesian, yes, it would be better to use Fedora because it always has the latest and greatest kernel, and cilium seems to rely on that for a lot of things. It's also better because the whole advantage of eBPF is to bypass iptables and such; it also does XDP acceleration at the network interface layer (don't ask me what that is exactly, haha).

To deploy I use Terraform, inspired by https://github.com/StarpTech/k-andy. Honestly, as an Ansible user too, Terraform is a lot easier to use for deployment, even though the two technologies overlap a lot. What I like is that Hetzner maintains their own Terraform provider.

For sure will share too if and when I get a decent enough setup, and probably before that even. Let's keep ourselves mutually posted ✌️

kiwinesian commented 3 years ago

Hi @mysticaltech ,

Yes, for sure! Out of curiosity, have you looked into Calico?

Looks like they also have eBPF in the latest release.

mysticaltech commented 3 years ago

Been thinking about that too as plan B, especially if they support native routing!

kiwinesian commented 3 years ago

Performance-wise, it looks like Calico is a bit lighter on resource consumption.

kiwinesian commented 3 years ago

@mysticaltech I tested without kube-proxy. Unfortunately, without installing kube-proxy, no routes get created, so many pods fail to run or get created, at least not automatically.

mysticaltech commented 3 years ago

Thanks for sharing, I'm at the same place. It seems doable, but we're probably missing something in the config.

mysticaltech commented 3 years ago

@kiwinesian I created a project for this: https://github.com/mysticaltech/kube-hetzner. All works well, including full kube-proxy replacement. However, even though everything was set up with Cilium for native routing, I had to use tunnel: geneve (see https://github.com/mysticaltech/kube-hetzner/blob/master/manifests/helm/cilium/values.yaml) to make everything really stable; somehow pure native routing did not make the Hetzner CSI happy (maybe more debugging is needed in the future). The geneve tunnel overhead is really low.

So thanks to Cilium in combination with Fedora, we now have full BPF support and full kube-proxy replacement, with the improvements that brings.
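
(For reference, the relevant part of the linked values.yaml boils down to roughly this; exact key names depend on the chart version:)

# cilium Helm values (sketch)
tunnel: geneve
kubeProxyReplacement: strict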

kiwinesian commented 3 years ago

Ohh, thanks for sharing @mysticaltech! I might find a weekend to spin the cluster up using your configuration ;)

I just managed to get the k8s cluster going, but there are a few things that I would like to validate with you, to see if they make sense / are ideal:

  1. I can't use Cilium IPAM with native routing. Instead, I have to set ipam=kubernetes.
  2. I also had the same problem with tunnel=disabled and setting nativeRoutingCIDR=x.x.x.x/8. I ended up leaving the tunnel enabled, and that seems to keep it happy, even though the reference says to disable it: https://github.com/hetznercloud/hcloud-cloud-controller-manager/blob/master/docs/deploy_with_networks.md.
  3. Ubuntu 20.04 is compatible with kernel 5.11, but I can't go any higher than 5.11.0 or it breaks due to dependencies.

I'm wondering if you are aware of the impact of leaving #1 and #2 as they are? Should I try tunnel=geneve?

mysticaltech commented 3 years ago

I guess you now know the answer @kiwinesian, but for others reading this, yes using tunnel=geneve does work pretty well.

philipp1992 commented 3 years ago

Could someone explain what the advantage of using hcloud native routing would be? Thanks.

kiwinesian commented 3 years ago

Ha, I'm not quite sure what the actual advantage is, @philipp1992. I always wonder myself if we are missing some benefits by not having native routing.

ByteAlex commented 3 years ago

@philipp1992 @kiwinesian Native routing saves one layer of tunneling/VXLAN encapsulation, which means less per-packet overhead. You should probably know why that is an advantage.

joriatyBen commented 2 years ago

@AlexMe99 yours did it for me. Thanks. Network: 10.0.0.0/8; Subnet-Master: 10.2.0.0/24; Subnet-Worker: 10.3.0.0/24; Pod-Net: 10.4.0.0/16; Srv-Net: 10.5.0.0/16. Installed Cilium via Helm.