[Flake] Etcd timeout -> leader election failure -> webhook down

This issue is mostly to document and keep track of the test failures. The problem is not in BMO itself but in the CI system, which has a performance issue.

Which jobs are flaking:

Possibly all jobs running on Jenkins workers in Xerces. So far it has been observed at least in the BMO e2e tests.

Reason for failure (if possible):

Occasionally tests fail with a "failed to call webhook" error (see the test logs below), even though the webhook was working just before and no changes were made to it. The BMO logs reveal that the underlying issue is with etcd: requests time out, so BMO is unable to renew its leader election lease. Having lost the lease, the manager stops everything it runs, including the webhook server, and BMO then restarts. Until it is back up, calls to the webhook are refused.

Test logs:

[2024-05-21T02:06:02.489Z] • [FAILED] [7.212 seconds]
[2024-05-21T02:06:02.489Z] Inspection [It] should inspect a newly created BMH [required, inspection]
[2024-05-21T02:06:02.489Z] /home/metal3ci/workspace/metal3-bmo-e2e-test-periodic-release-0.6/test/e2e/inspection_test.go:85
[2024-05-21T02:06:02.489Z] 
[2024-05-21T02:06:02.489Z]   Timeline >>
[2024-05-21T02:06:02.489Z]   INFO: Creating namespace inspection-wcmx49
[2024-05-21T02:06:02.489Z]   INFO: Creating event watcher for namespace "inspection-wcmx49"
[2024-05-21T02:06:02.489Z]   STEP: Creating a secret with BMH credentials @ 05/21/24 02:05:55.385
[2024-05-21T02:06:02.489Z]   STEP: creating a BMH @ 05/21/24 02:05:55.761
[2024-05-21T02:06:02.489Z]   [FAILED] in [It] - /home/metal3ci/workspace/metal3-bmo-e2e-test-periodic-release-0.6/test/e2e/inspection_test.go:110 @ 05/21/24 02:05:55.79
[2024-05-21T02:06:02.489Z]   INFO: Deleting namespace inspection-wcmx49
[2024-05-21T02:06:02.489Z]   << Timeline
[2024-05-21T02:06:02.489Z] 
[2024-05-21T02:06:02.489Z]   [FAILED] Unexpected error:
[2024-05-21T02:06:02.489Z]       <*errors.StatusError | 0xc000383180>: 
[2024-05-21T02:06:02.489Z]       Internal error occurred: failed calling webhook "baremetalhost.metal3.io": failed to call webhook: Post "https://baremetal-operator-webhook-service.baremetal-operator-system.svc:443/validate-metal3-io-v1alpha1-baremetalhost?timeout=10s": dial tcp 10.99.193.108:443: connect: connection refused

BMO logs:

E0521 02:09:39.324811       1 leaderelection.go:369] Failed to update lock: etcdserver: request timed out
E0521 02:09:42.298968       1 leaderelection.go:332] error retrieving resource lock baremetal-operator-system/baremetal-operator: Get "https://10.96.0.1:443/apis/coordination.k8s.io/v1/namespaces/baremetal-operator-system/leases/baremetal-operator": context deadline exceeded
I0521 02:09:42.299218       1 leaderelection.go:285] failed to renew lease baremetal-operator-system/baremetal-operator: timed out waiting for the condition
{"level":"info","ts":1716257382.3178487,"msg":"Stopping and waiting for non leader election runnables"}
{"level":"info","ts":1716257382.3179276,"msg":"Stopping and waiting for leader election runnables"}
{"level":"info","ts":1716257382.3179662,"msg":"Stopping and waiting for caches"}
{"level":"info","ts":1716257382.3181348,"msg":"Stopping and waiting for webhooks"}
{"level":"info","ts":1716257382.3182237,"msg":"Stopping and waiting for HTTP servers"}
{"level":"info","ts":1716257382.3182437,"msg":"Wait completed, proceeding to shutdown the manager"}

Anything else you would like to add:

We could possibly work around this, or at least reduce its impact, by disabling leader election. I don't think this is a good idea though, since we may just be pushing the issue further down the line and making it even harder to realize why tests fail. The only real solution is to ensure that the CI environment is performant enough to avoid these flakes.

/kind flake

metal3-io / baremetal-operator #1743