techno-tim / k3s-ansible

The easiest way to bootstrap a self-hosted High Availability Kubernetes cluster. A fully automated HA k3s etcd install with kube-vip, MetalLB, and more. Build. Destroy. Repeat.
https://technotim.live/posts/k3s-etcd-ansible/
Apache License 2.0

Execute Vagrant cluster in CI #57

Closed · sleiner closed this pull request 2 years ago

sleiner commented 2 years ago

Proposed Changes

Checklist

timothystewart6 commented 2 years ago

This is awesome. Would love for CI to test the cluster! I have fixed #56 with https://github.com/techno-tim/k3s-ansible/commit/aa05ab153e83042290e51960aecae36443171c77

would love for you to merge in that change and test again! Thank you!

timothystewart6 commented 2 years ago

Also @sleiner if this does work, I will open a PR with the latest k3s to test it! Would love to get this in! Thank you!

sleiner commented 2 years ago

@timothystewart6 I have merged the current master and tried again 👍🏻 Unfortunately, the new post role is failing. Seems like the steps that are supposed to run only on one control node are actually run on all control nodes...

TASK [k3s/post : Apply metallb-system namespace] *******************************
changed: [control2]
changed: [control3]
fatal: [control1]: FAILED! => {"changed": true, "cmd": ["k3s", "kubectl", "apply", "-f", "/tmp/k3s/metallb-namespace.yaml"], "delta": "0:00:03.241528", "end": "2022-08-28 19:57:36.360004", "msg": "non-zero return code", "rc": 1, "start": "2022-08-28 19:57:33.118476", "stderr": "Error from server (AlreadyExists): error when creating \"/tmp/k3s/metallb-namespace.yaml\": namespaces \"metallb-system\" already exists", "stderr_lines": ["Error from server (AlreadyExists): error when creating \"/tmp/k3s/metallb-namespace.yaml\": namespaces \"metallb-system\" already exists"], "stdout": "", "stdout_lines": []}

sleiner commented 2 years ago

> if this does work, I will open a PR with the latest k3s to test it!

One problem I am seeing with the current setup is that the Vagrantfile has its own group vars. So if you update anything under inventory, it will not affect the Vagrant environment :/

Is that by design or should we (in a next step?) actually use inventory for the Vagrant tests?

timothystewart6 commented 2 years ago

> > if this does work, I will open a PR with the latest k3s to test it!
>
> One problem I am seeing with the current setup is that the Vagrantfile has its own group vars. So if you update anything under inventory, it will not affect the Vagrant environment :/
>
> Is that by design or should we (in a next step?) actually use inventory for the Vagrant tests?

I would love for this to be factored out so that it uses the same files as ansible and that vagrant doesn't have its own.
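
(For illustration only, a single shared inventory that both a regular playbook run and the Vagrant environment consume could look roughly like the sketch below, written as a YAML inventory. The group names, the node1/node2 hosts, and ansible_user are assumptions, not taken from the repo; only control1-3 appear in the logs above.)

```yaml
# Hypothetical shared inventory: changes made here would then apply to both
# the normal ansible-playbook runs and the Vagrant CI cluster.
k3s_cluster:
  children:
    master:
      hosts:
        control1:
        control2:
        control3:
    node:
      hosts:
        node1:
        node2:
  vars:
    ansible_user: vagrant   # Vagrant boxes default to the "vagrant" user
```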

timothystewart6 commented 2 years ago

> @timothystewart6 I have merged the current master and tried again 👍🏻 Unfortunately, the new post role is failing. Seems like the steps that are supposed to run only on one control node are actually run on all control nodes...
>
> TASK [k3s/post : Apply metallb-system namespace] *******************************
> changed: [control2]
> changed: [control3]
> fatal: [control1]: FAILED! => {"changed": true, "cmd": ["k3s", "kubectl", "apply", "-f", "/tmp/k3s/metallb-namespace.yaml"], "delta": "0:00:03.241528", "end": "2022-08-28 19:57:36.360004", "msg": "non-zero return code", "rc": 1, "start": "2022-08-28 19:57:33.118476", "stderr": "Error from server (AlreadyExists): error when creating \"/tmp/k3s/metallb-namespace.yaml\": namespaces \"metallb-system\" already exists", "stderr_lines": ["Error from server (AlreadyExists): error when creating \"/tmp/k3s/metallb-namespace.yaml\": namespaces \"metallb-system\" already exists"], "stdout": "", "stdout_lines": []}

Good call. I will fix this

timothystewart6 commented 2 years ago

@sleiner OK, merge in the latest once more! I fixed it so that it only runs once, so the subsequent checks won't fail because the namespace already exists!
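
(For illustration: a common way to restrict a task like the failing one above to a single control node in Ansible is run_once. The task below is only a sketch of that idea, modeled on the error output, and not necessarily the exact change that was committed.)

```yaml
# Sketch only: apply the MetalLB namespace manifest from a single control
# node instead of every control node, avoiding the AlreadyExists race above.
- name: Apply metallb-system namespace
  ansible.builtin.command: k3s kubectl apply -f /tmp/k3s/metallb-namespace.yaml
  register: metallb_ns
  changed_when: "'created' in metallb_ns.stdout"
  run_once: true   # Ansible executes this on only one host of the play
```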

timothystewart6 commented 2 years ago

Odd, the failure is one that I typically see when a kube config isn't configured

https://github.com/techno-tim/k3s-ansible/runs/8061736643?check_suite_focus=true#step:8:182

Either that or it's taking a while in CI. If that's the case we might want to consider upping this count to something high like 40

https://github.com/techno-tim/k3s-ansible/blob/master/roles/k3s/master/tasks/main.yml#L60
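
(For context, the count being discussed is the retries value of an Ansible wait loop. The task below is only an illustrative sketch of that shape, not the exact task from the linked file; the command and the until condition are assumptions.)

```yaml
# Sketch of a retrying "wait until the control plane is up" task; raising
# "retries" (e.g. to 40) gives slow CI runners more time before failing.
- name: Verify that all control plane nodes joined the cluster
  ansible.builtin.command:
    cmd: k3s kubectl get nodes -o json
  register: nodes_json
  changed_when: false
  until: (nodes_json.stdout | from_json)['items'] | length >= (groups['master'] | length)
  retries: 40   # number of attempts before giving up
  delay: 10     # seconds between attempts
```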

timothystewart6 commented 2 years ago

Never mind, it looks like you have set it to 30 in your Vagrantfile.

sleiner commented 2 years ago

> Odd, the failure is one that I typically see when a kube config isn't configured
>
> https://github.com/techno-tim/k3s-ansible/runs/8061736643?check_suite_focus=true#step:8:182
>
> Either that or it's taking a while in CI. If that's the case we might want to consider upping this count to something high like 40
>
> https://github.com/techno-tim/k3s-ansible/blob/master/roles/k3s/master/tasks/main.yml#L60

Nope, that one failed because we lost a lot of nodes during the k3s binary download. GitHub seems to have had DNS issues...
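
(One way to make CI more tolerant of this kind of transient failure is to retry the download task itself. The following is only a sketch of that idea, not a task from the repo; the URL shape and the k3s_version variable are assumptions.)

```yaml
# Sketch only: retry the k3s binary download so a short DNS or network
# hiccup on the CI runner does not take out the whole node.
- name: Download k3s binary
  ansible.builtin.get_url:
    url: "https://github.com/k3s-io/k3s/releases/download/{{ k3s_version }}/k3s"
    dest: /usr/local/bin/k3s
    mode: "0755"
  register: k3s_download
  until: k3s_download is succeeded
  retries: 5    # re-attempt a few times before failing the play
  delay: 10     # seconds between attempts
```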

sleiner commented 2 years ago

@timothystewart6 the fact that the nginx/metallb integration test failed (apparently because 192.168.30.80 is down) is interesting though... I cannot reproduce this problem locally.

timothystewart6 commented 2 years ago

@sleiner

I just looked! So close. The only thing I can think of is that maybe MetalLB isn't assigning 192.168.30.80 to the service.

You can see it by running:

kubectl get services nginx -o jsonpath="{.status.loadBalancer.ingress[0].ip}"

This should print out the IP the service is using:

βœ— kubectl get services nginx -o jsonpath="{.status.loadBalancer.ingress[0].ip}"
192.168.30.80%
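
(Expressed as Ansible tasks, the same manual check could be automated roughly like this. A sketch for illustration only, not the PR's actual test; the service name and expected address follow the discussion above.)

```yaml
# Sketch: read the LoadBalancer IP MetalLB assigned to the nginx service
# and assert it matches the address the integration test expects.
- name: Read the IP assigned to the nginx service
  ansible.builtin.command:
    cmd: 'kubectl get services nginx -o jsonpath="{.status.loadBalancer.ingress[0].ip}"'
  register: nginx_lb_ip
  changed_when: false

- name: Assert that the expected MetalLB IP was assigned
  ansible.builtin.assert:
    that:
      - nginx_lb_ip.stdout == "192.168.30.80"
```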

sleiner commented 2 years ago

Hmm, querying kubectl is what the test already does. You can see the output of kubectl here in the log (admittedly, it's somewhat hidden 😅). So the IP was assigned by MetalLB - plus we retry for 5 seconds so the assigned IP was likely already broadcast by the speaker. I really have no idea what is going on here (except maybe this IP could be used by the CI runner's network itself somehow?)...

timothystewart6 commented 2 years ago

I see. Yeah, it is buried in the logs 😀 I see you have the timeout set to 1s (I think)? What if you set it to something like 30 seconds?

https://github.com/techno-tim/k3s-ansible/pull/57/files#diff-2b33537a1c4e225dfe831432d9564910458cdaa42641d8dbcbded3a0bf11a67cR101
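
(The timeout being discussed is the per-request timeout of the HTTP smoke test against the MetalLB address. As a rough illustration only, not the PR's actual script, the equivalent check with a 30-second timeout and a few retries could be written as an Ansible task like this.)

```yaml
# Sketch: poll the nginx service on its MetalLB IP with a generous
# per-request timeout, retrying a few times to ride out slow CI runners.
- name: Check that nginx answers on the MetalLB-assigned IP
  ansible.builtin.uri:
    url: http://192.168.30.80/
    status_code: 200
    timeout: 30   # seconds per request, instead of ~1 s
  register: nginx_http
  until: nginx_http is succeeded
  retries: 5
  delay: 5
```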

sleiner commented 2 years ago

@timothystewart6 Alright, it works now 🙌 (I solved it by using curl, which is more high-level than Python's urllib.request). I will rebase it and also test whether it works with Ubuntu 22.04 (since 21.10 is not supported anymore).

Two more things:

  1. The patchset for using the sample inventory and vars with Vagrant directly is ready. You can take a look at it: #60
  2. Also, in #60 @twistedgrim suggested using molecule instead of our own custom solution. I will check whether we can switch to that easily.

timothystewart6 commented 2 years ago

Ah, a 429 (too many requests). I will kick it off again in a bit.

timothystewart6 commented 2 years ago

> @timothystewart6 Alright, it works now 🙌 (I solved it by using curl, which is more high-level than Python's urllib.request). I will rebase it and also test whether it works with Ubuntu 22.04 (since 21.10 is not supported anymore).
>
> Two more things:
>
>   1. The patchset for using the sample inventory and vars with Vagrant directly is ready. You can take a look at it: Feat/vagrant uses same inventory #60
>   2. Also, in Feat/vagrant uses same inventory #60 @twistedgrim suggested using molecule instead of our own custom solution. I will check whether we can switch to that easily.

Woo hoo! Nice work!! 🙌

timothystewart6 commented 2 years ago

💥 Thank you @sleiner!