adilGhaffarDev opened 3 months ago
/triage accepted
Thanks for reporting.
It would be good to add a link, either to a specific failed job or to k8s-triage filtered down to this failure, just to make it easier to find a failed job.
/priority important-soon
I did hit this issue in a local setup. However, it's quite hard to triage, because the machine had already been replaced by a new one (I guess because of MHC doing its thing) and the cluster then started successfully.
I still have everything around if there are ideas on how to filter the information.
I was able to hit it again and triage a bit.
It turns out that the node itself came up fine, except for the parts that try to reach the load-balanced control-plane endpoint.
TL;DR: The haproxy load balancer did not forward traffic to the control plane node.
My current theory is that haproxy missed or did not act on the SIGHUP used to reload its configuration, so it kept serving the old config without the control plane backend.
I was able to "fix" the issue in this case by sending SIGHUP to haproxy again: docker kill -s SIGHUP <container>
Afterwards haproxy did route requests to the running node.
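For reference, here is a quick way to inspect and manually reload the CAPD load balancer. The container name pattern and the config path are assumptions based on the usual CAPD/kind haproxy setup and may differ:

```sh
# Find the CAPD load balancer container for the workload cluster
# (CAPD typically names it <cluster-name>-lb).
docker ps --filter "name=-lb"

# Dump the haproxy config currently on disk inside the container
# (path assumes the haproxy image's default config location).
docker exec <cluster-name>-lb cat /usr/local/etc/haproxy/haproxy.cfg

# Ask haproxy to reload its configuration.
docker kill -s SIGHUP <cluster-name>-lb
```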
I'm currently testing the following fix locally, which is: reading back and comparing the config file in CAPD after writing it and before reloading haproxy:
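The actual change lives in CAPD's Go code; the sketch below expresses the same idea with plain docker commands for illustration (file paths and the container name are placeholders, not the real implementation):

```sh
# Write the new config into the load balancer container.
docker cp haproxy.cfg <cluster-name>-lb:/usr/local/etc/haproxy/haproxy.cfg

# Read the file back and only reload haproxy once the on-disk content
# matches what we intended to write.
if docker exec <cluster-name>-lb cat /usr/local/etc/haproxy/haproxy.cfg | diff -q - haproxy.cfg >/dev/null; then
  docker kill -s SIGHUP <cluster-name>-lb
else
  echo "config on disk does not match the desired content, not reloading" >&2
fi
```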
Test setup:

```sh
GINKGO_FOCUS="PR-Blocking"
GINKGO_SKIP="\[Conformance\]"
```

So it only runs a single test.
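Assuming cluster-api's usual make target honors these variables (an assumption about the local setup, not something stated above), that corresponds to something like:

```sh
# Run only the PR-Blocking e2e test, skipping the Conformance suite.
GINKGO_FOCUS="PR-Blocking" GINKGO_SKIP="\[Conformance\]" make test-e2e
```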
I used prowjob on kind to create a kind cluster and the pod YAML, which I then modified (adjusted timeouts + GINKGO_FOCUS + requests, probably other things too).
I then ran the loop using ./scripts/test-pj.sh.
All code changes for my setup are here for reference: https://github.com/kubernetes-sigs/cluster-api/commit/dfe9d5e2478dbdc73b59be5533ea59234c8d2a9c
I did some optimisations, like packing all required images into scripts/images.tar and loading them from there instead of building them (probably some others too), to make it faster and to rely less on the internet, so as not to run into rate limiting.
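For the image-loading part, the corresponding kind workflow looks roughly like this (image names and the cluster name are placeholders):

```sh
# Save all required images to an archive once up front.
docker save -o scripts/images.tar <image:tag> [<image:tag> ...]

# Load them into the kind cluster on each iteration instead of pulling
# or rebuilding them.
kind load image-archive scripts/images.tar --name <kind-cluster-name>
```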
The fixes are merged; let's check in a week or so whether the error occurs again.
The merged fix did not help.
For reference, I did hit the same issue (CAPD load balancer config not active) as described in this comment on a 0.4 => 1.6 => current upgrade test, but with a slightly different log:
STEP: Initializing the workload cluster with older versions of providers @ 04/24/24 08:03:54.723
INFO: clusterctl init --config /logs/artifacts/repository/clusterctl-config.v1.2.yaml --kubeconfig /tmp/e2e-kubeconfig2940028556 --wait-providers --core cluster-api:v0.4.8 --bootstrap kubeadm:v0.4.8 --control-plane kubeadm:v0.4.8 --infrastructure docker:v0.4.8
INFO: Waiting for provider controllers to be running
STEP: Waiting for deployment capd-system/capd-controller-manager to be available @ 04/24/24 08:04:31.768
INFO: Creating log watcher for controller capd-system/capd-controller-manager, pod capd-controller-manager-7cb759f76b-whdwb, container manager
STEP: Waiting for deployment capi-kubeadm-bootstrap-system/capi-kubeadm-bootstrap-controller-manager to be available @ 04/24/24 08:04:31.892
INFO: Creating log watcher for controller capi-kubeadm-bootstrap-system/capi-kubeadm-bootstrap-controller-manager, pod capi-kubeadm-bootstrap-controller-manager-b67d5f4cb-8kppj, container manager
STEP: Waiting for deployment capi-kubeadm-control-plane-system/capi-kubeadm-control-plane-controller-manager to be available @ 04/24/24 08:04:31.918
INFO: Creating log watcher for controller capi-kubeadm-control-plane-system/capi-kubeadm-control-plane-controller-manager, pod capi-kubeadm-control-plane-controller-manager-69846d766d-k8ptj, container manager
STEP: Waiting for deployment capi-system/capi-controller-manager to be available @ 04/24/24 08:04:31.95
INFO: Creating log watcher for controller capi-system/capi-controller-manager, pod capi-controller-manager-7c9ccb586-5mkbx, container manager
STEP: THE MANAGEMENT CLUSTER WITH THE OLDER VERSION OF PROVIDERS IS UP&RUNNING! @ 04/24/24 08:04:32.176
STEP: Creating a namespace for hosting the clusterctl-upgrade test workload cluster @ 04/24/24 08:04:32.177
INFO: Creating namespace clusterctl-upgrade
INFO: Creating event watcher for namespace "clusterctl-upgrade"
STEP: Creating a test workload cluster @ 04/24/24 08:04:32.193
INFO: Creating the workload cluster with name "clusterctl-upgrade-o3zf09" using the "(default)" template (Kubernetes v1.23.17, 1 control-plane machines, 1 worker machines)
INFO: Getting the cluster template yaml
INFO: clusterctl config cluster clusterctl-upgrade-o3zf09 --infrastructure docker --kubernetes-version v1.23.17 --control-plane-machine-count 1 --worker-machine-count 1 --flavor (default)
INFO: Applying the cluster template yaml to the cluster
STEP: Waiting for the machines to exist @ 04/24/24 08:04:44.941
[FAILED] in [It] - /home/prow/go/src/sigs.k8s.io/cluster-api/test/e2e/clusterctl_upgrade.go:456 @ 04/24/24 08:09:44.953
STEP: Dumping logs from the "clusterctl-upgrade-hs16jr" workload cluster @ 04/24/24 08:09:44.953
[FAILED] in [AfterEach] - /home/prow/go/src/sigs.k8s.io/cluster-api/test/framework/cluster_proxy.go:311 @ 04/24/24 08:12:44.955
<< Timeline
[FAILED] Timed out after 300.001s.
Timed out waiting for all Machines to exist
Expected
<int64>: 0
to equal
<int64>: 2
In [It] at: /home/prow/go/src/sigs.k8s.io/cluster-api/test/e2e/clusterctl_upgrade.go:456 @ 04/24/24 08:09:44.953
Full Stack Trace
sigs.k8s.io/cluster-api/test/e2e.ClusterctlUpgradeSpec.func2()
/home/prow/go/src/sigs.k8s.io/cluster-api/test/e2e/clusterctl_upgrade.go:456 +0x28cd
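In case it helps anyone triaging this locally: the failure above means no Machine objects ever showed up in the test namespace, which can be checked directly against the management cluster (namespace taken from the log above):

```sh
# List the cluster-api Machine objects in the test namespace.
kubectl get machines -n clusterctl-upgrade

# Recent events in the namespace often show why machine creation stalled.
kubectl get events -n clusterctl-upgrade --sort-by=.lastTimestamp
```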
I'll investigate this issue further.
/assign
Which jobs are flaking?
Which tests are flaking?
Since when has it been flaking?
Minor flakes with this error have been happening for a long time.
Testgrid link
https://testgrid.k8s.io/sig-cluster-lifecycle-cluster-api#capi-e2e-main
Reason for failure (if possible)
No response
Anything else we need to know?
No response
Label(s) to be applied
/kind flake