techno-tim / k3s-ansible

The easiest way to bootstrap a self-hosted High Availability Kubernetes cluster. A fully automated HA k3s etcd install with kube-vip, MetalLB, and more. Build. Destroy. Repeat.
https://technotim.live/posts/k3s-etcd-ansible/
Apache License 2.0

Mitigate CI flakiness #70

Closed: sleiner closed this pull request 2 years ago

sleiner commented 2 years ago

Proposed Changes

Checklist

timothystewart6 commented 2 years ago

Awesome! Thank you! Looks like the remaining issue is the molecule cache / destroy step, which only seems to fail if there isn't a cache?

https://github.com/techno-tim/k3s-ansible/pull/48#issuecomment-1237570983

sleiner commented 2 years ago

@timothystewart6

> Looks like the issue now is the molecule cache / destroy step which only seems to fail if there isn't a cache?

I don't see that. Can you explain?

The most recent job run fails with an error message I have not seen before:

```
failed: [control1] (item=controller) => {"ansible_loop_var": "item", "changed": false, "cmd": ["k3s", "kubectl", "wait", "deployment", "--namespace=metallb-system", "controller", "--for", "condition=Available=True", "--timeout=60s"], "delta": "0:00:01.804541", "end": "2022-09-06 10:07:00.859658", "item": {"condition": "--for condition=Available=True", "description": "controller", "name": "controller", "resource": "deployment"}, "msg": "non-zero return code", "rc": 1, "start": "2022-09-06 10:06:59.055117", "stderr": "error: the server doesn't have a resource type \"deployment\"", "stderr_lines": ["error: the server doesn't have a resource type \"deployment\""], "stdout": "", "stdout_lines": []}
```

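For context, the failing step is a looped `k3s kubectl wait` task. Based on the `cmd` and `item` fields in the error above, the task looks roughly like the sketch below (a hypothetical reconstruction; the actual task and item names in the role may differ):

```yaml
# Hypothetical sketch reconstructed from the error output above;
# the real role's task name and loop variables may differ.
- name: Wait for MetalLB resources to become ready
  command: >-
    k3s kubectl wait {{ item.resource }}
    --namespace=metallb-system
    {{ item.name }}
    {{ item.condition }}
    --timeout=60s
  loop:
    - name: controller
      description: controller
      resource: deployment
      condition: --for condition=Available=True
  loop_control:
    label: "{{ item.description }}"
```

The stderr (`the server doesn't have a resource type "deployment"`) suggests the API server was not fully ready when the wait ran, rather than the deployment itself being missing.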
sleiner commented 2 years ago

I just retried the job in my fork and it succeeded for the same commit, so it appears we have another source of flakiness. Unfortunately, I cannot reproduce it locally. I'll retry the job a few times to check how often this becomes a problem.

sleiner commented 2 years ago

After that rather mysterious failure, I let the latest revision (1a4346f1b8cdada7575447d5f67038a87ebc2622) run 11 more times using these retrigger commits. These are the results:

So my proposal is to double the retry delay for the MetalLB CR application and otherwise keep the changes as they currently stand. This is already a substantial improvement over the current state, and if flakiness becomes a significant issue again, we can tackle it then. What do you think, @timothystewart6?
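Concretely, doubling the retry delay would mean bumping the `delay` on the CR-apply task, roughly like this (a sketch; the task name, manifest path, and exact values are placeholders, not the role's real ones):

```yaml
# Hypothetical sketch: apply the MetalLB CRs with retries and a longer delay.
# Task name, file path, and numbers are illustrative placeholders.
- name: Apply MetalLB custom resources
  command: k3s kubectl apply -f /tmp/metallb-crs.yaml
  register: crs_result
  until: crs_result.rc == 0
  retries: 5
  delay: 20   # doubled relative to the role's previous value
```

With `until`/`retries`/`delay`, Ansible reruns the command until it succeeds or the retries are exhausted, so a longer delay simply gives the webhook and CRD registration more time to settle between attempts.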

timothystewart6 commented 2 years ago

Thank you very much!