hetznercloud / hcloud-cloud-controller-manager

Kubernetes cloud-controller-manager for Hetzner Cloud
Apache License 2.0

Cilium tests failed. #429

Closed wondertalik closed 1 year ago

wondertalik commented 1 year ago

Hello,

I have a problem with Cilium in a new cluster. Is this a problem with Cilium or something related to Hetzner? Interestingly, another of my k8s clusters passes all tests without errors.

📋 Test Report
❌ 2/39 tests failed (20/374 actions), 5 tests skipped, 1 scenarios skipped:
Test [allow-all-except-world]:
  ❌ allow-all-except-world/pod-to-host/ping-ipv4-3: cilium-test/client-6f6788d7cc-pljtp (10.0.5.49) -> 10.98.0.7 (10.98.0.7:0)
  ❌ allow-all-except-world/pod-to-host/ping-ipv4-5: cilium-test/client-6f6788d7cc-pljtp (10.0.5.49) -> 10.98.0.5 (10.98.0.5:0)
  ❌ allow-all-except-world/pod-to-host/ping-ipv4-7: cilium-test/client-6f6788d7cc-pljtp (10.0.5.49) -> 10.98.0.4 (10.98.0.4:0)
  ❌ allow-all-except-world/pod-to-host/ping-ipv4-9: cilium-test/client-6f6788d7cc-pljtp (10.0.5.49) -> 10.98.0.6 (10.98.0.6:0)
  ❌ allow-all-except-world/pod-to-host/ping-ipv4-11: cilium-test/client-6f6788d7cc-pljtp (10.0.5.49) -> 10.98.0.9 (10.98.0.9:0)
  ❌ allow-all-except-world/pod-to-host/ping-ipv4-13: cilium-test/client2-bc59f56d5-hgx99 (10.0.5.58) -> 10.98.0.7 (10.98.0.7:0)
  ❌ allow-all-except-world/pod-to-host/ping-ipv4-15: cilium-test/client2-bc59f56d5-hgx99 (10.0.5.58) -> 10.98.0.5 (10.98.0.5:0)
  ❌ allow-all-except-world/pod-to-host/ping-ipv4-17: cilium-test/client2-bc59f56d5-hgx99 (10.0.5.58) -> 10.98.0.4 (10.98.0.4:0)
  ❌ allow-all-except-world/pod-to-host/ping-ipv4-19: cilium-test/client2-bc59f56d5-hgx99 (10.0.5.58) -> 10.98.0.6 (10.98.0.6:0)
  ❌ allow-all-except-world/pod-to-host/ping-ipv4-21: cilium-test/client2-bc59f56d5-hgx99 (10.0.5.58) -> 10.98.0.9 (10.98.0.9:0)
Test [host-entity]:
  ❌ host-entity/pod-to-host/ping-ipv4-1: cilium-test/client-6f6788d7cc-pljtp (10.0.5.49) -> 10.98.0.7 (10.98.0.7:0)
  ❌ host-entity/pod-to-host/ping-ipv4-3: cilium-test/client-6f6788d7cc-pljtp (10.0.5.49) -> 10.98.0.5 (10.98.0.5:0)
  ❌ host-entity/pod-to-host/ping-ipv4-5: cilium-test/client-6f6788d7cc-pljtp (10.0.5.49) -> 10.98.0.4 (10.98.0.4:0)
  ❌ host-entity/pod-to-host/ping-ipv4-7: cilium-test/client-6f6788d7cc-pljtp (10.0.5.49) -> 10.98.0.6 (10.98.0.6:0)
  ❌ host-entity/pod-to-host/ping-ipv4-9: cilium-test/client-6f6788d7cc-pljtp (10.0.5.49) -> 10.98.0.9 (10.98.0.9:0)
  ❌ host-entity/pod-to-host/ping-ipv4-13: cilium-test/client2-bc59f56d5-hgx99 (10.0.5.58) -> 10.98.0.4 (10.98.0.4:0)
  ❌ host-entity/pod-to-host/ping-ipv4-15: cilium-test/client2-bc59f56d5-hgx99 (10.0.5.58) -> 10.98.0.6 (10.98.0.6:0)
  ❌ host-entity/pod-to-host/ping-ipv4-17: cilium-test/client2-bc59f56d5-hgx99 (10.0.5.58) -> 10.98.0.9 (10.98.0.9:0)
  ❌ host-entity/pod-to-host/ping-ipv4-21: cilium-test/client2-bc59f56d5-hgx99 (10.0.5.58) -> 10.98.0.7 (10.98.0.7:0)
  ❌ host-entity/pod-to-host/ping-ipv4-23: cilium-test/client2-bc59f56d5-hgx99 (10.0.5.58) -> 10.98.0.5 (10.98.0.5:0)
connectivity test failed: 2 tests failed
suc@entrance-nbg1:~$ cilium status --wait
    /¯¯\
 /¯¯\__/¯¯\    Cilium:          OK
 \__/¯¯\__/    Operator:        OK
 /¯¯\__/¯¯\    Hubble Relay:    OK
 \__/¯¯\__/    ClusterMesh:     disabled
    \__/

Deployment        hubble-relay       Desired: 1, Ready: 1/1, Available: 1/1
DaemonSet         cilium             Desired: 6, Ready: 6/6, Available: 6/6
Deployment        hubble-ui          Desired: 1, Ready: 1/1, Available: 1/1
Deployment        cilium-operator    Desired: 3, Ready: 3/3, Available: 3/3
Containers:       cilium-operator    Running: 3
                  hubble-relay       Running: 1
                  cilium             Running: 6
                  hubble-ui          Running: 1
Cluster Pods:     13/13 managed by Cilium
Image versions    hubble-ui          quay.io/cilium/hubble-ui:v0.11.0@sha256:bcb369c47cada2d4257d63d3749f7f87c91dde32e010b223597306de95d1ecc8: 1
                  hubble-ui          quay.io/cilium/hubble-ui-backend:v0.11.0@sha256:14c04d11f78da5c363f88592abae8d2ecee3cbe009f443ef11df6ac5f692d839: 1
                  cilium-operator    quay.io/cilium/operator-generic:v1.13.2@sha256:a1982c0a22297aaac3563e428c330e17668305a41865a842dec53d241c5490ab: 3
                  hubble-relay       quay.io/cilium/hubble-relay:v1.13.2@sha256:51b772cab0724511583c3da3286439791dc67d7c35077fa30eaba3b5d555f8f4: 1
                  cilium             quay.io/cilium/cilium:v1.13.2@sha256:85708b11d45647c35b9288e0de0706d24a5ce8a378166cadc700f756cc1a38d6: 6
apricote commented 1 year ago

Hey @wondertalik,

Without any additional info on how you set up your cluster or installed Cilium, I cannot help you. You can take a look at how we set up Cilium for our e2e tests: https://github.com/hetznercloud/hcloud-cloud-controller-manager/blob/892c20e199e1aee64e731aa9f05e9320dfc5d1bd/hack/dev-up.sh#L145-L148

In general I did not have problems just running `cilium install` with the default configuration on Hetzner Cloud, but this always depends on how you set up your Kubernetes cluster. In any case, this is outside of what we (the cloud provider) provide support for.

wondertalik commented 1 year ago

Hey @apricote

You can find the details here. This cluster I installed with Helm; the old one just with `cilium install`.

I also tried `cilium install` with Terraform, but the result was the same: the same errors that I posted in the first message.

apricote commented 1 year ago

I noticed you are using the Network Support from hccm to set up Routes for Pod Subnets in Hetzner Cloud, but your Cilium is not configured to make use of this. Take a look at the three `--set` flags I set above to learn how you can configure Cilium to make use of this. This may be one source of issues.
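For reference, a minimal sketch of those flags applied via Helm (the chart repo shown here and the CIDR value are assumptions; substitute your cluster's CIDR):

```bash
# Sketch: run Cilium in native-routing mode so it uses the routes that hccm
# creates in the Hetzner Cloud network. The CIDR is only an example value.
helm install cilium cilium --repo https://helm.cilium.io/ --namespace kube-system \
   --set tunnel=disabled \
   --set ipv4NativeRoutingCIDR=10.244.0.0/16 \
   --set ipam.mode=kubernetes
```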

wondertalik commented 1 year ago

@apricote thanks, the `--set` flags were helpful.

I just figured out that after installing Cilium and hccm you need to restart the kubelet service on each node, because the pods were using external IPs. After restarting, all pods switched to CIDR IPs.
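For completeness, the restart itself is just the usual systemd command on every node (assuming the kubelet is managed by systemd):

```bash
# Run on each node after Cilium and hccm are installed
sudo systemctl restart kubelet
```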

But now I have another problem:

```bash
   /¯¯\
 /¯¯\__/¯¯\    Cilium:          6 errors
 \__/¯¯\__/    Operator:        OK
 /¯¯\__/¯¯\    Hubble Relay:    disabled
 \__/¯¯\__/    ClusterMesh:     disabled
    \__/

DaemonSet         cilium             Desired: 5, Ready: 5/5, Available: 5/5
Deployment        cilium-operator    Desired: 2, Ready: 2/2, Available: 2/2
Containers:       cilium             Running: 5
                  cilium-operator    Running: 2
Cluster Pods:     6/6 managed by Cilium
Image versions    cilium             quay.io/cilium/cilium:v1.13.2: 5
                  cilium-operator    quay.io/cilium/operator-generic:v1.13.2: 2
Errors:           cilium             cilium-nxpmf    controller sync-to-k8s-ciliumendpoint (802) is failing since 10s (16x): endpoint sync cannot take ownership of CEP that is not local ("external-ip-1")
                  cilium             cilium-wlgpp    controller sync-to-k8s-ciliumendpoint (1790) is failing since 4s (17x): endpoint sync cannot take ownership of CEP that is not local ("external-ip-1")
                  cilium             cilium-wlgpp    controller sync-to-k8s-ciliumendpoint (113) is failing since 4s (17x): endpoint sync cannot take ownership of CEP that is not local ("external-ip-3")
                  cilium             cilium-wlgpp    controller sync-to-k8s-ciliumendpoint (4) is failing since 4s (17x): endpoint sync cannot take ownership of CEP that is not local ("external-ip-3")
                  cilium             cilium-z9lll    controller sync-to-k8s-ciliumendpoint (276) is failing since 5s (17x): endpoint sync cannot take ownership of CEP that is not local ("external-ip-4")
                  cilium             cilium-z9lll    controller sync-to-k8s-ciliumendpoint (1531) is failing since 5s (17x): endpoint sync cannot take ownership of CEP that is not local ("external-ip-4")

```

Is this a bug in Cilium or some bad configuration of my cluster?

apricote commented 1 year ago

Hey @wondertalik,

so far this does not look like a bug in hcloud-cloud-controller-manager. We do not provide support for anything else here, so I will close this issue now.

Good luck with your Cilium config. It might be worth recreating the cluster, in case some components are still running with a bad (cached) config.

wondertalik commented 1 year ago

helm install cilium cilium --repo https://helm.cilium.io/ -n kube-system --version 1.13.1 \
   --set tunnel=disabled \
   --set ipv4NativeRoutingCIDR=$cluster_cidr \
   --set ipam.mode=kubernetes

With these options, Cilium stopped passing any tests and CoreDNS was stuck in Pending. Maybe there are some specific options that are missing from the docs?

Logs of CoreDNS:

[WARNING] plugin/kubernetes: Kubernetes API connection failure: Get "https://10.96.0.1:443/version": dial tcp 10.96.0.1:443: i/o timeout
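(A check that may be worth doing here, purely as a suggestion on my part: verify that the `ipv4NativeRoutingCIDR` value actually covers the pod CIDRs that Kubernetes assigned to the nodes.)

```bash
# Print each node together with its assigned pod CIDR;
# ipv4NativeRoutingCIDR should cover these ranges
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.podCIDR}{"\n"}{end}'
```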

Using the following values, everything works like a charm and all tests pass.

hccm

helm upgrade --install hccm charts/hccm/src/hcloud-cloud-controller-manager -f charts/hccm/values.yaml \
   --namespace kube-system \
   --set networking.enabled=true \
   --set networking.clusterCIDR=$POD_NETWORK_CIDR

cilium

helm upgrade --install cilium charts/cilium/src/cilium -f charts/cilium/values.yaml \
   --namespace kube-system \
   --set operator.replicas=3 \
   --set hubble.relay.enabled=true \
   --set hubble.ui.enabled=true
wondertalik commented 6 months ago

Hello, @apricote. A follow-up after upgrading to 1.15 with network support.

Cilium's Helm chart 1.15.4 with these options:

helm upgrade --install cilium charts/cilium/src/cilium -f charts/cilium/values.yaml --reuse-values \
   --namespace kube-system \
   --set kubeProxyReplacement=true \
   --set k8sServiceHost=10.0.1.2 \
   --set k8sServicePort=6443 \
   --set operator.replicas=1 \
   --set hubble.relay.enabled=true \
   --set hubble.ui.enabled=true \
   --set hubble.ui.ingress.enabled=false \
   --set routingMode=native \
   --set ipv4NativeRoutingCIDR=10.0.16.0/20 \
   --set ipam.mode=kubernetes \
   --set k8s.requireIPv4PodCIDR=true

1. A small detail in the manual that I found under "Deployment with Networks support":

> When deploying Cilium, make sure that you have set tunnel: disabled and nativeRoutingCIDR to your clusters subnet CIDR. If you are using Cilium < 1.9.0 you also have to set blacklist-conflicting-routes: false.

And the Upgrade Guide to Cilium 1.15 now says this:

> The tunnel option (deprecated in Cilium 1.14) has been removed. To enable native-routing mode, set routing-mode=native (previously tunnel=disabled). To configure the tunneling protocol, set tunnel-protocol=vxlan|geneve (previously tunnel=vxlan|geneve).
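In Helm terms, the old `tunnel=disabled` from the hccm docs therefore becomes `routingMode=native`, roughly like this (a sketch reusing the example CIDR from the command above):

```bash
# Cilium >= 1.15: native routing is selected via routingMode instead of tunnel=disabled
helm upgrade --install cilium cilium --repo https://helm.cilium.io/ --namespace kube-system \
   --set routingMode=native \
   --set ipv4NativeRoutingCIDR=10.0.16.0/20 \
   --set ipam.mode=kubernetes
```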

2. I have only one failure in the cilium connectivity test:

time="2024-05-02T08:19:51Z" level=error msg="Error while inserting service in LB map" error="Unable to upsert service [real ip ipv6]:80 as IPv6 is disabled" k8sNamespace=ingress-nginx k8sSvcName=ingress-nginx-controller subsys=k8s-watcher (1 occurrences)

k get svc -o yaml -n ingress-nginx ingress-nginx-controller
apiVersion: v1
kind: Service
metadata:
  annotations:
    load-balancer.hetzner.cloud/location: nbg1
    load-balancer.hetzner.cloud/name: load-balancer-ingreses
    load-balancer.hetzner.cloud/type: lb11
    load-balancer.hetzner.cloud/use-private-ip: "true"
    meta.helm.sh/release-name: ingress-nginx
    meta.helm.sh/release-namespace: ingress-nginx
  creationTimestamp: "2024-05-02T08:17:01Z"
  finalizers:
  - service.kubernetes.io/load-balancer-cleanup
  labels:
    app.kubernetes.io/component: controller
    app.kubernetes.io/instance: ingress-nginx
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: ingress-nginx
    app.kubernetes.io/part-of: ingress-nginx
    app.kubernetes.io/version: 1.10.1
    helm.sh/chart: ingress-nginx-4.10.1
  name: ingress-nginx-controller
  namespace: ingress-nginx
  resourceVersion: "2012"
  uid: 641b8039-4895-48f8-8971-9b2e90ecb00a
spec:
  allocateLoadBalancerNodePorts: true
  clusterIP: 10.103.255.105
  clusterIPs:
  - 10.103.255.105
  externalTrafficPolicy: Cluster
  internalTrafficPolicy: Cluster
  ipFamilies:
  - IPv4
  ipFamilyPolicy: SingleStack
  ports:
  - appProtocol: http
    name: http
    nodePort: 30646
    port: 80
    protocol: TCP
    targetPort: http
  - appProtocol: https
    name: https
    nodePort: 30560
    port: 443
    protocol: TCP
    targetPort: https
  selector:
    app.kubernetes.io/component: controller
    app.kubernetes.io/instance: ingress-nginx
    app.kubernetes.io/name: ingress-nginx
  sessionAffinity: None
  type: LoadBalancer
status:
  loadBalancer:
    ingress:
    - ip: [real ip ipv4]
    - ip: [real ip ipv6]
    - ip: 10.0.1.7

I tried to use the annotation load-balancer.hetzner.cloud/ipv6-disabled: true but the result is the same. As I understand it, IPv6 is still enabled on the load balancer. So is this an error in hcloud or in my configuration?
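For reference, this is roughly how I would apply that annotation to the Service with kubectl (annotation values are always strings; whether this makes hccm drop the IPv6 ingress IP and silences the Cilium warning is exactly the open question above):

```bash
# Sketch: set the hcloud LB annotation mentioned above on the ingress-nginx Service
kubectl annotate service ingress-nginx-controller -n ingress-nginx \
  'load-balancer.hetzner.cloud/ipv6-disabled=true' --overwrite
```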