tinkerbell / charts


System unavailable after deployment #93

Closed: rgruyters closed this issue 5 months ago

rgruyters commented 6 months ago

After deploying the Tinkerbell chart in my K3s cluster, my SSH session hangs and I get a client_loop: send disconnect: Broken pipe message. When checking the interface configuration via the console, I noticed that my IP configuration had been removed from the primary interface.
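
(For reference, this is roughly how I checked the interface configuration from the console; eno1 is the primary interface on my host:)

    ip -4 addr show eno1

The primary IP was gone from the output after the deployment.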

Expected Behaviour

Server stays online and reachable

Current Behaviour

Server is unreachable

I noticed the following messages in the kube-vip logs:

time="2024-05-14T11:53:08Z" level=info msg="(svcs) adding VIP [[10.20.30.40]] for [kube-system/traefik]"
time="2024-05-14T11:53:08Z" level=info msg="[service] synchronised in 7ms"
time="2024-05-14T11:53:08Z" level=warning msg="(svcs) already found existing address [10.20.30.40] on adapter [eno1]"
time="2024-05-14T11:53:11Z" level=warning msg="Re-applying the VIP configuration [10.20.30.40] to the interface [eno1]"

Possible Solution

It might be possible to use stack.lbClass, but I don't know which class to use.

Steps to Reproduce (for bugs)

  1. trusted_proxies=$(kubectl get nodes -o jsonpath='{.items[*].spec.podCIDR}' | tr ' ' ',')
  2. LB_IP=10.20.30.41
  3. helm install tink-stack charts/tinkerbell/charts/stack --create-namespace --namespace tink-system --wait --set "smee.trustedProxies={${trusted_proxies}}" --set "hegel.trustedProxies={${trusted_proxies}}" --set "stack.loadBalancerIP=$LB_IP" --set "smee.publicIP=$LB_IP"
  4. the system becomes unreachable

Context

I can no longer use the Kubernetes cluster or the Tinkerbell environment.

Your Environment

jacobweinstock commented 6 months ago

Hey @rgruyters. It appears that you might have used an IP for your load balancer that is already in use. The load balancer IP needs to be a free, unused IP in the same layer 2 network as the eno1 interface. Reference
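
A quick sanity check before installing is to verify that nothing already answers on the candidate IP, for example:

    # no replies expected if 10.20.30.41 is actually free
    ping -c 3 10.20.30.41
    # layer 2 check from another host on the same segment (iputils arping)
    arping -c 3 -I eno1 10.20.30.41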

rgruyters commented 6 months ago

I'm not using an IP address that is already in use. The interface eno1 has an IP of 10.20.30.40 and the load balancer IP is 10.20.30.41, although both are within the same subnet. Somehow kube-vip sees the interface and registers the primary IP address as well, which it shouldn't. That is probably because Traefik uses it within the cluster. (K3s exposes Traefik via ServiceLB.)

I have checked the YAML output of kube-vip, but I can only find the load balancer IP (which is as it should be).
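
(I inspected it with something along these lines; the DaemonSet name may differ per chart version:)

    kubectl -n tink-system get ds kube-vip -o yaml | grep -iE 'address|vip'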

For now, I have disabled the ServiceLB service within K3s, but I'm wondering if I can use ServiceLB instead of kube-vip.
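
(For anyone else: ServiceLB is disabled by passing --disable servicelb to the k3s server process, e.g. in the systemd unit or install script arguments.)

    k3s server --disable servicelb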

jacobweinstock commented 6 months ago

Ah, OK. Yeah, it looks like kube-vip + Traefik is requesting the same IP as the host: (svcs) adding VIP [[10.20.30.40]] for [kube-system/traefik]

You can disable kube-vip in the chart and use any other load balancer. Set stack.kubevip.enabled: false and set stack.lbClass to the class of your existing load balancer.
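
Untested sketch, reusing the values from your repro steps; substitute your load balancer's class for <lb-class>:

    helm upgrade --install tink-stack charts/tinkerbell/charts/stack \
      --namespace tink-system \
      --set "stack.kubevip.enabled=false" \
      --set "stack.lbClass=<lb-class>" \
      --set "stack.loadBalancerIP=$LB_IP" \
      --set "smee.publicIP=$LB_IP"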

rgruyters commented 6 months ago

I tried setting stack.lbClass to kube-system/traefik or even traefik, but that didn't work. The external IP stays in the pending state.

Is there a way to find out which load balancer classes are available?
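
(One way to inspect which classes existing Services declare, though controllers like ServiceLB may not set one at all:)

    kubectl get svc -A -o custom-columns='NAMESPACE:.metadata.namespace,NAME:.metadata.name,CLASS:.spec.loadBalancerClass'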

jacobweinstock commented 5 months ago

I tried a few things using k3d, and setting --set "stack.kubevip.enabled=false" together with --set "stack.lbClass=" seems to be working for me. Let me know if that helps.
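
With kube-vip disabled and lbClass left empty, the default load balancer (ServiceLB on K3s) should pick the Service up; you can watch the external IP get assigned with:

    kubectl -n tink-system get svc -w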

rgruyters commented 5 months ago

That works! Thanks!

MartinLoeper commented 5 months ago

I had a similar issue yesterday, also running the Tinkerbell chart on K3s. The control plane did not complain, but my worker nodes running k3s-agent lost connectivity on their primary enp4s0 interface. I figured out that the kube-vip DaemonSet is somehow causing the issue.

My solution (probably a workaround) was to add node affinity to the Tinkerbell chart so that the kube-vip DaemonSet only runs on the control-plane nodes, as suggested in: https://kube-vip.io/docs/installation/daemonset/#example-arp-manifest
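
(Roughly what I applied, expressed as a patch; the DaemonSet name and namespace may differ in your deployment:)

    kubectl -n tink-system patch ds kube-vip --type merge \
      -p '{"spec":{"template":{"spec":{"nodeSelector":{"node-role.kubernetes.io/control-plane":"true"}}}}}'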

MartinLoeper commented 5 months ago

I just replaced kube-vip entirely by setting --set "stack.kubevip.enabled=false" and --set "stack.lbClass=null", falling back to the K3s built-in load balancer implementation, ServiceLB (aka Klipper).
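
(You can confirm ServiceLB took over by looking for the svclb-* pods it creates for the Service:)

    kubectl get pods -A | grep svclb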

There seems to be some kube-vip issue on K3s...

see: https://github.com/kube-vip/kube-vip/issues/798