techno-tim / k3s-ansible

The easiest way to bootstrap a self-hosted High Availability Kubernetes cluster. A fully automated HA k3s etcd install with kube-vip, MetalLB, and more. Build. Destroy. Repeat.
https://technotim.live/posts/k3s-etcd-ansible/
Apache License 2.0

Mitigate CI flakiness #70

Closed: sleiner closed this pull request 2 years ago

sleiner commented 2 years ago

Proposed Changes

Checklist

timothystewart6 commented 2 years ago

Awesome! Thank you! Looks like the remaining issue is the molecule cache / destroy step, which only seems to fail if there isn't a cache?

https://github.com/techno-tim/k3s-ansible/pull/48#issuecomment-1237570983

sleiner commented 2 years ago

@timothystewart6

> Looks like the issue now is the molecule cache / destroy step which only seems to fail if there isn't a cache?

I don't see that. Can you explain?

The most recent job run fails with an error message I have not seen before:

```
failed: [control1] (item=controller) => {"ansible_loop_var": "item", "changed": false, "cmd": ["k3s", "kubectl", "wait", "deployment", "--namespace=metallb-system", "controller", "--for", "condition=Available=True", "--timeout=60s"], "delta": "0:00:01.804541", "end": "2022-09-06 10:07:00.859658", "item": {"condition": "--for condition=Available=True", "description": "controller", "name": "controller", "resource": "deployment"}, "msg": "non-zero return code", "rc": 1, "start": "2022-09-06 10:06:59.055117", "stderr": "error: the server doesn't have a resource type \"deployment\"", "stderr_lines": ["error: the server doesn't have a resource type \"deployment\""], "stdout": "", "stdout_lines": []}
```

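For context, the failing step is a looped `k3s kubectl wait` task. Based on the `cmd` and `item` fields in the error above, the task looks roughly like the sketch below (a hypothetical reconstruction; the actual task and item names in the role may differ):

```yaml
# Hypothetical sketch reconstructed from the error output above;
# the real role's task name and loop variables may differ.
- name: Wait for MetalLB resources to become ready
  command: >-
    k3s kubectl wait {{ item.resource }}
    --namespace=metallb-system
    {{ item.name }}
    {{ item.condition }}
    --timeout=60s
  loop:
    - name: controller
      description: controller
      resource: deployment
      condition: --for condition=Available=True
  loop_control:
    label: "{{ item.description }}"
```

The stderr (`the server doesn't have a resource type "deployment"`) suggests the API server was not fully ready when the wait ran, rather than the deployment itself being missing.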
sleiner commented 2 years ago

I just retried the job in my fork and it succeeded for the same commit, so it appears we have another source of flakiness. Unfortunately, I cannot reproduce it locally. I'll retry the job a few times to check how often this becomes a problem.

sleiner commented 2 years ago

After that rather mysterious failure, I let the latest revision (1a4346f1b8cdada7575447d5f67038a87ebc2622) run 11 more times using these retrigger commits. These are the results:

So my proposal is to double the retry delay for the MetalLB CR application and otherwise keep the changes as they currently stand. This is already a substantial improvement over the current state, and if flakiness becomes a significant issue again, we can tackle it then. What do you think, @timothystewart6?
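Concretely, doubling the retry delay would mean bumping the `delay` on the CR-apply task, roughly like this (a sketch; the task name, manifest path, and exact values are placeholders, not the role's real ones):

```yaml
# Hypothetical sketch: apply the MetalLB CRs with retries and a longer delay.
# Task name, file path, and numbers are illustrative placeholders.
- name: Apply MetalLB custom resources
  command: k3s kubectl apply -f /tmp/metallb-crs.yaml
  register: crs_result
  until: crs_result.rc == 0
  retries: 5
  delay: 20   # doubled relative to the role's previous value
```

With `until`/`retries`/`delay`, Ansible reruns the command until it succeeds or the retries are exhausted, so a longer delay simply gives the webhook and CRD registration more time to settle between attempts.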

timothystewart6 commented 2 years ago

Thank you very much!