cahillsf opened 1 week ago
I'll have a look at the "Failed to create kind cluster" issue, as I already noticed something similar on my own Kind setup and I think it's not isolated: https://github.com/kubernetes-sigs/kind/issues/3554 - I guess it's something to fix upstream.
EDIT: It seems to be an issue with inodes:
$ kind create cluster --retain --name=cluster3
Creating cluster "cluster3" ...
 ✓ Ensuring node image (kindest/node:v1.31.0) 🖼
 ✗ Preparing nodes 📦
ERROR: failed to create cluster: could not find a log line that matches "Reached target .*Multi-User System.*|detected cgroup v1"
$ podman logs -f 7eb0838e6bb2
...
Detected virtualization podman.
Detected architecture x86-64.
Welcome to Debian GNU/Linux 12 (bookworm)!
Failed to create control group inotify object: Too many open files
Failed to allocate manager object: Too many open files
[!!!!!!] Failed to allocate manager object.
Exiting PID 1...
That sounds very suspicious with regard to https://main.cluster-api.sigs.k8s.io/user/troubleshooting.html?highlight=sysctl#cluster-api-with-docker----too-many-open-files
Maybe a good start here would be to collect data about the values actually in use :-)
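For reference, this is roughly how to check them on a node (a sketch; the sysctl names and suggested values are from the troubleshooting page linked above, so double-check them there):
$ sysctl fs.inotify.max_user_watches fs.inotify.max_user_instances
# the troubleshooting page suggests bumping them along these lines:
$ sudo sysctl fs.inotify.max_user_watches=1048576
$ sudo sysctl fs.inotify.max_user_instances=8192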
"cluster": "eks-prow-build-cluster",
I don't know if we're running https://github.com/kubernetes/k8s.io/blob/3f2c06a3c547765e21dce65d0adcb1144a93b518/infra/aws/terraform/prow-build-cluster/resources/kube-system/tune-sysctls_daemonset.yaml#L4 there or not
Also perhaps something else on the cluster is using a lot of them.
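One rough way to check that (a sketch, assuming shell access on a node): count the open inotify instances per process and see who the top consumers are.
# each /proc/<pid>/fd entry pointing at anon_inode:inotify is one instance
$ find /proc/*/fd -lname 'anon_inode:inotify' 2>/dev/null | cut -d/ -f3 | sort | uniq -c | sort -rn | head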
I confirm the daemonset runs on the EKS cluster.
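For anyone else checking: assuming the DaemonSet keeps the name and namespace from the linked manifest, something like this is enough to verify it is scheduled on every node:
$ kubectl -n kube-system get daemonset tune-sysctls -o wide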
Thanks folks for confirming that the daemon set is correctly setting the sysctl parameters - so the error might be elsewhere. I noticed something else while reading the logs^1 of a failing test:
$ cat journal.log | grep -i "Journal started"
Aug 30 06:35:27 clusterctl-upgrade-management-fba3o1-control-plane systemd-journald[95]: Journal started
$ cat journal.log | grep -i "multi-user"
Aug 30 06:35:51 clusterctl-upgrade-management-fba3o1-control-plane systemd[1]: Reached target multi-user.target - Multi-User System.
While on a non-failing setup:
root@kind-control-plane:/# journalctl | grep -i "multi-user"
Sep 05 12:16:31 kind-control-plane systemd[1]: Reached target multi-user.target - Multi-User System.
root@kind-control-plane:/# journalctl | grep -i "Journal started"
Sep 05 12:16:31 kind-control-plane systemd-journald[98]: Journal started
We can see that the multi-user.target^2 is reached at the same time the journal starts logging. On a failing test, there is already a 24-second difference. I'm wondering whether, randomly (under heavy load), we exceed the 30-second timeout^3 for reaching the multi-user.target, hence the failure.
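A quick sketch to measure that gap from a retained journal.log (GNU date, and it assumes the syslog-style timestamps shown above):
$ start=$(grep -m1 "Journal started" journal.log | awk '{print $1, $2, $3}')
$ target=$(grep -m1 "Reached target multi-user" journal.log | awk '{print $1, $2, $3}')
$ echo "$(( $(date -d "$target" +%s) - $(date -d "$start" +%s) )) seconds from journald start to multi-user.target"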
Is that possible? This part shouldn't really take long though...
I suspect that would be a noisy-neighbor problem on the EKS cluster (I/O?).
That doesn't explain the inotify-exhaustion-like failures, though.
We recently increased concurrency in our tests. With that, we were able to reduce job durations from 2h to 1h.
We thought it was a nice way to save us time and the community money.
Maybe we have to roll that back.
Do you remember when this change was applied? Those Kind failures seem to have started by the end of August.
summarized by @chrischdi
According to the aggregated failures of the last two weeks, we still have some flakiness in our clusterctl upgrade tests.
36 failures:
Timed out waiting for all Machines to exist
16 failures:
Failed to create kind cluster
14 failures:
Internal error occurred: failed calling webhook [...] connect: connection refused
7 failures:
x509: certificate signed by unknown authority
5 failures:
Timed out waiting for Machine Deployment clusterctl-upgrade/clusterctl-upgrade-workload-... to have 2 replicas
2 failures:
Timed out waiting for Cluster clusterctl-upgrade/clusterctl-upgrade-workload-... to provision
Link to check whether the messages changed or we have new flakes in the clusterctl upgrade tests: here
/kind flake