kubernetes-sigs / cluster-api

Home for Cluster API, a subproject of sig-cluster-lifecycle
https://cluster-api.sigs.k8s.io

flakes in clusterctl upgrade tests #11133

Open cahillsf opened 1 week ago

cahillsf commented 1 week ago

summarized by @chrischdi šŸ™‡


According to the aggregated failures of the last two weeks, we still have some flakiness in our clusterctl upgrade tests.

Link to check whether the messages have changed or we have new flakes in the clusterctl upgrade tests: here

/kind flake

tormath1 commented 1 week ago

I'll have a look at the "Failed to create kind cluster" issue, as I already noticed something similar on my own Kind setup and I think it's not isolated: https://github.com/kubernetes-sigs/kind/issues/3554 - I guess it's something to fix upstream.

EDIT: It seems to be an issue with inotify limits:

$ kind create cluster --retain --name=cluster3
Creating cluster "cluster3" ...
 āœ“ Ensuring node image (kindest/node:v1.31.0) šŸ–¼
 āœ— Preparing nodes šŸ“¦
ERROR: failed to create cluster: could not find a log line that matches "Reached target .*Multi-User System.*|detected cgroup v1"
$ podman logs -f 7eb0838e6bb2
...
Detected virtualization podman.
Detected architecture x86-64.

Welcome to Debian GNU/Linux 12 (bookworm)!

Failed to create control group inotify object: Too many open files
Failed to allocate manager object: Too many open files
[!!!!!!] Failed to allocate manager object.
Exiting PID 1...
chrischdi commented 1 week ago

Failed to create control group inotify object: Too many open files
Failed to allocate manager object: Too many open files

That sounds very suspicious with regard to https://main.cluster-api.sigs.k8s.io/user/troubleshooting.html?highlight=sysctl#cluster-api-with-docker----too-many-open-files

Maybe a good start here would be to collect data about the values actually in use :-)
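
For reference, a minimal way to collect that data on an affected host or node (a sketch, assuming shell access; the sysctls and example values below are taken from the troubleshooting page linked above, not verified against the Prow nodes):

$ # current inotify limits
$ sysctl fs.inotify.max_user_watches fs.inotify.max_user_instances
$ # rough count of inotify instances currently open on the host
$ find /proc/*/fd -lname anon_inode:inotify 2>/dev/null | wc -l
$ # raise the limits temporarily, as the troubleshooting page suggests
$ sudo sysctl fs.inotify.max_user_watches=1048576
$ sudo sysctl fs.inotify.max_user_instances=8192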

BenTheElder commented 1 week ago

"cluster": "eks-prow-build-cluster",

I don't know if we're running https://github.com/kubernetes/k8s.io/blob/3f2c06a3c547765e21dce65d0adcb1144a93b518/infra/aws/terraform/prow-build-cluster/resources/kube-system/tune-sysctls_daemonset.yaml#L4 there or not

Also perhaps something else on the cluster is using a lot of them.
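
A quick way to double-check that (a sketch, assuming kubectl access to the build cluster and that the DaemonSet is named tune-sysctls as in the linked manifest):

$ # is the DaemonSet present and scheduled on all nodes?
$ kubectl -n kube-system get daemonset tune-sysctls -o wide
$ # what did it last apply? (logs from one of its pods)
$ kubectl -n kube-system logs daemonset/tune-sysctls --tail=20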

ameukam commented 1 week ago

I confirm the daemonset runs on the EKS cluster.

tormath1 commented 1 week ago

Thanks folks for confirming that the daemon set is correctly setting the sysctl parameters, so the error might be elsewhere. I noticed something else while reading the logs^1 of a failing test:

$ cat journal.log | grep -i "Journal started"
Aug 30 06:35:27 clusterctl-upgrade-management-fba3o1-control-plane systemd-journald[95]: Journal started
$ cat journal.log | grep -i "multi-user"
Aug 30 06:35:51 clusterctl-upgrade-management-fba3o1-control-plane systemd[1]: Reached target multi-user.target - Multi-User System.

While on a non-failing setup:

root@kind-control-plane:/# journalctl | grep -i "multi-user"
Sep 05 12:16:31 kind-control-plane systemd[1]: Reached target multi-user.target - Multi-User System.
root@kind-control-plane:/# journalctl | grep -i "Journal started"
Sep 05 12:16:31 kind-control-plane systemd-journald[98]: Journal started

We can see that multi-user.target^2 is reached at the same time as the journal starts logging. On the failing test, there is already a 24-second difference. I'm wondering whether, randomly (under heavy load), we hit the 30-second timeout^3 for reaching multi-user.target, hence the failure.
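
For what it's worth, a small sketch to compute that gap directly from a retained journal.log artifact (assuming GNU date and the timestamp format shown above):

$ start=$(grep -i -m1 "Journal started" journal.log | awk '{print $3}')
$ target=$(grep -i -m1 "Reached target multi-user.target" journal.log | awk '{print $3}')
$ echo "journald started: $start, multi-user.target reached: $target"
$ echo "gap: $(( $(date -d "$target" +%s) - $(date -d "$start" +%s) )) seconds"

Gaps approaching the timeout mentioned above would line up with this theory.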

BenTheElder commented 6 days ago

It's possible? This part shouldn't really take long though...

I suspect that would be a noisy neighbor problem on the EKS cluster (I/O?)

That doesn't explain the inotify-exhaustion-like failures, though.
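
If it is I/O contention, one cheap signal to grab from an affected node (a sketch, assuming node shell access, a kernel with pressure stall information enabled, and sysstat installed for iostat):

$ # PSI: how much time tasks recently spent stalled on I/O
$ cat /proc/pressure/io
$ # per-device utilization, 3 samples 5 seconds apart
$ iostat -x 5 3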

sbueringer commented 6 days ago

We recently increased concurrency in our tests. With that we were able to reduce Job durations from 2h to 1h.

We thought it was a nice way to save us time and the community money.

Maybe we have to roll that back

tormath1 commented 6 days ago

We recently increased concurrency in our tests. With that we were able to reduce Job durations from 2h to 1h.

We thought it was a nice way to save us time and the community money.

Maybe we have to roll that back

Do you remember when this change was applied? Those Kind failures seem to have started around the end of August.

sbueringer commented 6 days ago

main: https://github.com/kubernetes-sigs/cluster-api/pull/11067
release-1.8: https://github.com/kubernetes-sigs/cluster-api/pull/11144