techno-tim / k3s-ansible

The easiest way to bootstrap a self-hosted High Availability Kubernetes cluster. A fully automated HA k3s etcd install with kube-vip, MetalLB, and more. Build. Destroy. Repeat.
https://technotim.live/posts/k3s-etcd-ansible/
Apache License 2.0

Execute Vagrant cluster in CI #57

Closed · sleiner closed this pull request 2 years ago

sleiner commented 2 years ago

Proposed Changes

Checklist

timothystewart6 commented 2 years ago

This is awesome. Would love for CI to test the cluster! I have fixed #56 with https://github.com/techno-tim/k3s-ansible/commit/aa05ab153e83042290e51960aecae36443171c77

would love for you to merge in that change and test again! Thank you!

timothystewart6 commented 2 years ago

Also @sleiner if this does work, I will open a PR with the latest k3s to test it! Would love to get this in! Thank you!

sleiner commented 2 years ago

@timothystewart6 I have merged the current master and tried again 👍🏻 Unfortunately, the new post role is failing. Seems like the steps that are supposed to run only on one control node are actually run on all control nodes...

TASK [k3s/post : Apply metallb-system namespace] *******************************
changed: [control2]
changed: [control3]
fatal: [control1]: FAILED! => {"changed": true, "cmd": ["k3s", "kubectl", "apply", "-f", "/tmp/k3s/metallb-namespace.yaml"], "delta": "0:00:03.241528", "end": "2022-08-28 19:57:36.360004", "msg": "non-zero return code", "rc": 1, "start": "2022-08-28 19:57:33.118476", "stderr": "Error from server (AlreadyExists): error when creating \"/tmp/k3s/metallb-namespace.yaml\": namespaces \"metallb-system\" already exists", "stderr_lines": ["Error from server (AlreadyExists): error when creating \"/tmp/k3s/metallb-namespace.yaml\": namespaces \"metallb-system\" already exists"], "stdout": "", "stdout_lines": []}

sleiner commented 2 years ago

> if this does work, I will open a PR with the latest k3s to test it!

One problem I am seeing with the current setup is that the Vagrantfile has its own group vars. So if you update anything under inventory, it will not affect the Vagrant environment :/

Is that by design or should we (in a next step?) actually use inventory for the Vagrant tests?

timothystewart6 commented 2 years ago

> > if this does work, I will open a PR with the latest k3s to test it!
>
> One problem I am seeing with the current setup is that the Vagrantfile has its own group vars. So if you update anything under inventory, it will not affect the Vagrant environment :/
>
> Is that by design or should we (in a next step?) actually use inventory for the Vagrant tests?

I would love for this to be factored out so that it uses the same files as ansible and that vagrant doesn't have its own.
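
(For illustration only, a single shared inventory that both a regular playbook run and the Vagrant environment consume could look roughly like the sketch below, written as a YAML inventory. The group names, the node1/node2 hosts, and ansible_user are assumptions, not taken from the repo; only control1-3 appear in the logs above.)

```yaml
# Hypothetical shared inventory: changes made here would then apply to both
# the normal ansible-playbook runs and the Vagrant CI cluster.
k3s_cluster:
  children:
    master:
      hosts:
        control1:
        control2:
        control3:
    node:
      hosts:
        node1:
        node2:
  vars:
    ansible_user: vagrant   # Vagrant boxes default to the "vagrant" user
```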

timothystewart6 commented 2 years ago

> @timothystewart6 I have merged the current master and tried again 👍🏻 Unfortunately, the new post role is failing. Seems like the steps that are supposed to run only on one control node are actually run on all control nodes...
>
> TASK [k3s/post : Apply metallb-system namespace] *******************************
> changed: [control2]
> changed: [control3]
> fatal: [control1]: FAILED! => {"changed": true, "cmd": ["k3s", "kubectl", "apply", "-f", "/tmp/k3s/metallb-namespace.yaml"], "delta": "0:00:03.241528", "end": "2022-08-28 19:57:36.360004", "msg": "non-zero return code", "rc": 1, "start": "2022-08-28 19:57:33.118476", "stderr": "Error from server (AlreadyExists): error when creating \"/tmp/k3s/metallb-namespace.yaml\": namespaces \"metallb-system\" already exists", "stderr_lines": ["Error from server (AlreadyExists): error when creating \"/tmp/k3s/metallb-namespace.yaml\": namespaces \"metallb-system\" already exists"], "stdout": "", "stdout_lines": []}

Good call. I will fix this

timothystewart6 commented 2 years ago

@sleiner OK, merge in the latest once more! I fixed it so that it only runs once, so the subsequent checks won't fail because the namespace already exists!
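
(For illustration: a common way to restrict a task like the failing one above to a single control node in Ansible is run_once. The task below is only a sketch of that idea, modeled on the error output, and not necessarily the exact change that was committed.)

```yaml
# Sketch only: apply the MetalLB namespace manifest from a single control
# node instead of every control node, avoiding the AlreadyExists race above.
- name: Apply metallb-system namespace
  ansible.builtin.command: k3s kubectl apply -f /tmp/k3s/metallb-namespace.yaml
  register: metallb_ns
  changed_when: "'created' in metallb_ns.stdout"
  run_once: true   # Ansible executes this on only one host of the play
```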

timothystewart6 commented 2 years ago

Odd, the failure is one that I typically see when a kube config isn't configured

https://github.com/techno-tim/k3s-ansible/runs/8061736643?check_suite_focus=true#step:8:182

Either that or it's taking a while in CI. If that's the case we might want to consider upping this count to something high like 40

https://github.com/techno-tim/k3s-ansible/blob/master/roles/k3s/master/tasks/main.yml#L60
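
(For context, the count being discussed is the retries value of an Ansible wait loop. The task below is only an illustrative sketch of that shape, not the exact task from the linked file; the command and the until condition are assumptions.)

```yaml
# Sketch of a retrying "wait until the control plane is up" task; raising
# "retries" (e.g. to 40) gives slow CI runners more time before failing.
- name: Verify that all control plane nodes joined the cluster
  ansible.builtin.command:
    cmd: k3s kubectl get nodes -o json
  register: nodes_json
  changed_when: false
  until: (nodes_json.stdout | from_json)['items'] | length >= (groups['master'] | length)
  retries: 40   # number of attempts before giving up
  delay: 10     # seconds between attempts
```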

timothystewart6 commented 2 years ago

Never mind, it looks like you have set it to 30 in your Vagrantfile.

sleiner commented 2 years ago

> Odd, the failure is one that I typically see when a kube config isn't configured
>
> https://github.com/techno-tim/k3s-ansible/runs/8061736643?check_suite_focus=true#step:8:182
>
> Either that or it's taking a while in CI. If that's the case we might want to consider upping this count to something high like 40
>
> https://github.com/techno-tim/k3s-ansible/blob/master/roles/k3s/master/tasks/main.yml#L60

Nope, that one failed because we lost a lot of nodes during the k3s binary download. GitHub seems to have had DNS issues...
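
(One way to make CI more tolerant of this kind of transient failure is to retry the download task itself. The following is only a sketch of that idea, not a task from the repo; the URL shape and the k3s_version variable are assumptions.)

```yaml
# Sketch only: retry the k3s binary download so a short DNS or network
# hiccup on the CI runner does not take out the whole node.
- name: Download k3s binary
  ansible.builtin.get_url:
    url: "https://github.com/k3s-io/k3s/releases/download/{{ k3s_version }}/k3s"
    dest: /usr/local/bin/k3s
    mode: "0755"
  register: k3s_download
  until: k3s_download is succeeded
  retries: 5    # re-attempt a few times before failing the play
  delay: 10     # seconds between attempts
```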

sleiner commented 2 years ago

@timothystewart6 the fact that the nginx/metallb integration test failed (apparently because 192.168.30.80 is down) is interesting though... I cannot reproduce this problem locally.

timothystewart6 commented 2 years ago

@sleiner

I just looked! So close. The only thing I can think of is that maybe MetalLB isn't assigning 192.168.30.80 to the service.

You can see it by running:

kubectl get services nginx -o jsonpath="{.status.loadBalancer.ingress[0].ip}"

This should print out the IP the service is using:

βœ— kubectl get services nginx -o jsonpath="{.status.loadBalancer.ingress[0].ip}"
192.168.30.80%
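
(Expressed as Ansible tasks, the same manual check could be automated roughly like this. A sketch for illustration only, not the PR's actual test; the service name and expected address follow the discussion above.)

```yaml
# Sketch: read the LoadBalancer IP MetalLB assigned to the nginx service
# and assert it matches the address the integration test expects.
- name: Read the IP assigned to the nginx service
  ansible.builtin.command:
    cmd: 'kubectl get services nginx -o jsonpath="{.status.loadBalancer.ingress[0].ip}"'
  register: nginx_lb_ip
  changed_when: false

- name: Assert that the expected MetalLB IP was assigned
  ansible.builtin.assert:
    that:
      - nginx_lb_ip.stdout == "192.168.30.80"
```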

sleiner commented 2 years ago

Hmm, querying kubectl is what the test already does. You can see the output of kubectl here in the log (admittedly, it's somewhat hidden 😅). So the IP was assigned by MetalLB - plus we retry for 5 seconds so the assigned IP was likely already broadcast by the speaker. I really have no idea what is going on here (except maybe this IP could be used by the CI runner's network itself somehow?)...

timothystewart6 commented 2 years ago

I see. Yeah, it is buried in the logs 😀 I see you have the timeout set to 1s (I think)? What if you set it to something like 30 seconds?

https://github.com/techno-tim/k3s-ansible/pull/57/files#diff-2b33537a1c4e225dfe831432d9564910458cdaa42641d8dbcbded3a0bf11a67cR101
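
(The timeout being discussed is the per-request timeout of the HTTP smoke test against the MetalLB address. As a rough illustration only, not the PR's actual script, the equivalent check with a 30-second timeout and a few retries could be written as an Ansible task like this.)

```yaml
# Sketch: poll the nginx service on its MetalLB IP with a generous
# per-request timeout, retrying a few times to ride out slow CI runners.
- name: Check that nginx answers on the MetalLB-assigned IP
  ansible.builtin.uri:
    url: http://192.168.30.80/
    status_code: 200
    timeout: 30   # seconds per request, instead of ~1 s
  register: nginx_http
  until: nginx_http is succeeded
  retries: 5
  delay: 5
```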

sleiner commented 2 years ago

@timothystewart6 Alright, it works now 🙌 (I solved it by using curl, which is more high-level than Python's urllib.request). I will rebase it and also test whether it works with Ubuntu 22.04 (since 21.10 is not supported anymore).

Two more things:

  1. The patchset for using the sample inventory and vars with Vagrant directly is ready. You can take a look at it: #60
  2. Also, in #60 @twistedgrim suggested using molecule instead of our own custom solution. I will check whether we can switch to that easily.

timothystewart6 commented 2 years ago

Ah, a 429 (too many requests). I will kick it off again in a bit.

timothystewart6 commented 2 years ago

> @timothystewart6 Alright, it works now 🙌 (I solved it by using curl, which is more high-level than Python's urllib.request). I will rebase it and also test whether it works with Ubuntu 22.04 (since 21.10 is not supported anymore).
>
> Two more things:
>
>   1. The patchset for using the sample inventory and vars with Vagrant directly is ready. You can take a look at it: Feat/vagrant uses same inventory #60
>   2. Also, in Feat/vagrant uses same inventory #60 @twistedgrim suggested using molecule instead of our own custom solution. I will check whether we can switch to that easily.

Woo hoo! Nice work!! 🙌

timothystewart6 commented 2 years ago

💥 Thank you @sleiner!