techno-tim / k3s-ansible

The easiest way to bootstrap a self-hosted High Availability Kubernetes cluster. A fully automated HA k3s etcd install with kube-vip, MetalLB, and more. Build. Destroy. Repeat.
https://technotim.live/posts/k3s-etcd-ansible/
Apache License 2.0

Add support for API servers on IPv6 addresses #48

Closed · sleiner closed this 2 years ago

sleiner commented 2 years ago

Proposed Changes

When a port is appended, IPv6 addresses must be wrapped in square brackets to be parsed correctly. This patch adds support for that, using Ansible's ipwrap filter.
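
For reference, a minimal sketch of how the ipwrap filter behaves, assuming the ansible.utils collection is installed; the task and variable value are purely illustrative:

# Illustrative only: ipwrap adds square brackets around IPv6 addresses so a
# port can be appended safely, and leaves IPv4 addresses and hostnames as-is.
- name: Show the effect of ipwrap on the API server endpoint
  ansible.builtin.debug:
    msg: "https://{{ apiserver_endpoint | ansible.utils.ipwrap }}:6443"
  vars:
    apiserver_endpoint: "2001:db8::1"  # renders as https://[2001:db8::1]:6443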

Checklist

timothystewart6 commented 2 years ago

Can you give an example of what your group variables would look like using IPv6?

sleiner commented 2 years ago

Sure 😊

 # apiserver_endpoint is virtual ip-address which will be configured on each master
-apiserver_endpoint: "192.168.30.222"
+apiserver_endpoint: "2001:db8::1"

By the way, I have also tested IPv6 addresses for the MetalLB range - and it just works ™️

For example:

 # metallb ip range for load balancer
-metal_lb_ip_range: "192.168.30.80-192.168.30.90"
+metal_lb_ip_range: "2001:db8::1b:0/112"

sleiner commented 2 years ago

I have also tested IPv6 addresses for the MetalLB range - and it just works ™️

Turns out that is not true. I will take a closer look here...

sleiner commented 2 years ago

So when reducing my setup to a minimal working example, I'm setting these variables differently from the example:

group vars:

apiserver_endpoint: 2001:db8::1

metal_lb_ip_range:
  - 2001:db8::1b:0/112
  - 10.10.10.0/24

extra_server_args: >-
  --no-deploy servicelb
  --no-deploy traefik
  --disable-network-policy
  --service-cidr=10.43.0.0/16,2001:db8:43::/112
  --cluster-cidr=10.42.0.0/16,2001:db8:42::/64
  --node-ip={{ k3s_node_ip }}

extra_agent_args: >-
  --node-ip={{ k3s_node_ip }}

host vars (for each server and agent):

k3s_node_ip: 10.10.3.1,2001:db8::de:1

so:

timothystewart6 commented 2 years ago

As much as I'd love to merge this, this does break the API contract by changing the metallb IP range from a string to an array. Not certain how I feel about this yet. I may close this and use it as an example for someone who wants to support IPv6.

sleiner commented 2 years ago

this does break the API contract by changing the metallb IP range from a string to an array.

That is not the case :-) If you take a look at the metallb.ipaddresspool.j2 template, you will see that the current use case of specifying the range as a string is handled gracefully. So we get no breaking change.

Still, I decided to adapt all examples in the repo since it improves the discoverability of specifying multiple ranges.
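
For illustration, a template along these lines would accept both forms; this is a rough sketch under that assumption, not necessarily the exact contents of metallb.ipaddresspool.j2:

# Sketch: accept metal_lb_ip_range either as a single string (the existing
# interface) or as a list of ranges (the new option).
apiVersion: metallb.io/v1beta1
kind: IPAddressPool
metadata:
  name: first-pool
  namespace: metallb-system
spec:
  addresses:
{% if metal_lb_ip_range is string %}
    - {{ metal_lb_ip_range }}
{% else %}
{% for range in metal_lb_ip_range %}
    - {{ range }}
{% endfor %}
{% endif %}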


Not certain how I feel about this yet.

Just to be sure: are you talking about breaking changes, or about modelling the ranges as a list?

timothystewart6 commented 2 years ago

If we can get https://github.com/techno-tim/k3s-ansible/pull/57 working, I would feel much more confident merging this in.

sleiner commented 2 years ago

@timothystewart6 is this the race condition you were talking about?

timothystewart6 commented 2 years ago

@timothystewart6 is this the race condition you were talking about?

It is. I have some ideas, I will fix it tonight!

timothystewart6 commented 2 years ago

I know this might be a late ask but rather than change metal_lb_ip_range to an array, how about just adding another var like metal_lb_ip_range_ipv6 which is also a string? This way the existing interface does not change and then you don't have to check the type and loop over it. Thoughts? It also preserves the type going forward. Open to other ideas but I want to be sure that strings work too and that we test both use cases somehow. Just asking the question, doesn't mean it needs to change! 🙂
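
For comparison, group vars under that alternative might look like this; metal_lb_ip_range_ipv6 is the hypothetical name from the comment above, not a variable that exists in the repo:

# Hypothetical alternative: two separate string variables instead of one list.
metal_lb_ip_range: "192.168.30.80-192.168.30.90"
metal_lb_ip_range_ipv6: "2001:db8::1b:0/112"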

sleiner commented 2 years ago

@timothystewart6 applying the MetalLB CRs failed again (one of three times) - should we just add a retry here? :/

  TASK [k3s/post : Apply metallb CRs] ********************************************
  failed: [control1] (item=control1) => {"ansible_loop_var": "item", "changed": false, "cmd": ["k3s", "kubectl", "apply", "-f", "/tmp/k3s/metallb-crs.yaml"], "delta": "0:00:08.483325", "end": "2022-09-04 13:15:31.104057", "item": "control1", "msg": "non-zero return code", "rc": 1, "start": "2022-09-04 13:15:22.620732", "stderr": "Error from server (InternalError): error when creating \"/tmp/k3s/metallb-crs.yaml\": Internal error occurred: failed calling webhook \"l2advertisementvalidationwebhook.metallb.io\": failed to call webhook: Post \"https://webhook-service.metallb-system.svc:443/validate-metallb-io-v1beta1-l2advertisement?timeout=10s\": EOF", "stderr_lines": ["Error from server (InternalError): error when creating \"/tmp/k3s/metallb-crs.yaml\": Internal error occurred: failed calling webhook \"l2advertisementvalidationwebhook.metallb.io\": failed to call webhook: Post \"https://webhook-service.metallb-system.svc:443/validate-metallb-io-v1beta1-l2advertisement?timeout=10s\": EOF"], "stdout": "ipaddresspool.metallb.io/first-pool created", "stdout_lines": ["ipaddresspool.metallb.io/first-pool created"]}
  ok: [control1] => (item=control2)
  ok: [control1] => (item=control3)
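
For context, a retry on that task could look roughly like the following; this is a sketch only, assuming the task shells out to k3s kubectl as the log suggests, and is not the repo's actual task definition:

- name: Apply metallb CRs
  ansible.builtin.command:
    cmd: k3s kubectl apply -f /tmp/k3s/metallb-crs.yaml
  register: metallb_crs
  changed_when: false
  until: metallb_crs.rc == 0  # retry while the MetalLB webhook is still starting up
  retries: 5
  delay: 10
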
timothystewart6 commented 2 years ago

@sleiner seems like there's some cache issue with molecule when cleaning up? https://github.com/techno-tim/k3s-ansible/runs/8194471058?check_suite_focus=true#step:7:225

sleiner commented 2 years ago

@timothystewart6 The problem originates here:

fatal: [control3]: UNREACHABLE! => {"changed": false, "msg": "Failed to connect to the host via ssh: Connection timed out during banner exchange", "unreachable": true}

As far as I can see, the CI runner performance is quite weak when running 5 VMs at the same time. I have experimented with a few timeout tweaks (in a merge request based on this one) and the results look quite promising (at least compared to the current level of flakiness). You can take a look at the diff here.
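
For illustration, such tweaks could take roughly this shape in molecule.yml's provisioner section; the keys map to ansible.cfg options, and the values here are examples rather than the ones in the referenced diff:

provisioner:
  name: ansible
  config_options:
    defaults:
      timeout: 60        # [defaults] timeout: SSH connection timeout in seconds
    ssh_connection:
      retries: 10        # [ssh_connection] retries: retry SSH connections that fail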

I could:

What do you prefer?

sleiner commented 2 years ago

@timothystewart6 I managed to make the flakiness mitigations completely independent of molecule.yml (and thus independent of this change), so I opened #70 - the merge order is irrelevant 😊

timothystewart6 commented 2 years ago

Sorry, conflicts now from the latest merge. I started reviewing and it looks good so far. Will merge after conflicts are solved! Thank you for this, this is huge!

timothystewart6 commented 2 years ago

Thank you for this! This is huge!