No Control Plane machines came into existence.

adilGhaffarDev commented 3 months ago

Which jobs are flaking?

periodic-cluster-api-e2e-main
periodic-cluster-api-e2e-mink8s-main
periodic-cluster-api-e2e-dualstack-and-ipv6-release-1-6
periodic-cluster-api-e2e-mink8s-main
periodic-cluster-api-e2e-release-1-4

Which tests are flaking?

When following the Cluster API quick-start with Ignition Should create a workload cluster
When upgrading a workload cluster using ClusterClass with a HA control plane [ClusterClass] Should create and upgrade a workload cluster and eventually run kubetest
When testing MachineDeployment scale out/in Should successfully scale a MachineDeployment up and down upon changes to the MachineDeployment replica count
When testing clusterctl upgrades using ClusterClass (v1.5=>current) [ClusterClass] Should create a management cluster and then upgrade all the providers
When testing ClusterClass changes [ClusterClass] Should successfully rollout the managed topology upon changes to the ClusterClass

Since when has it been flaking?

Minor flakes with this error have been happening for a long time.

Testgrid link

https://testgrid.k8s.io/sig-cluster-lifecycle-cluster-api#capi-e2e-main

Reason for failure (if possible)

No response

Anything else we need to know?

No response

Label(s) to be applied

/kind flake One or more /area label. See https://github.com/kubernetes-sigs/cluster-api/labels?q=area for the list of labels.

sbueringer commented 3 months ago

/triage accepted

Thx for reporting

sbueringer commented 3 months ago

Would be good to add a link either to a specific failed job or k8s-triage filtered down on this failure.

Just to make it easier to find a failed job

fabriziopandini commented 2 months ago

/priority important-soon

chrischdi commented 2 months ago

k8s-triage link

chrischdi commented 2 months ago

I did hit this issue in a local setup. However, its quite hard to triage, because the machine already got replaced by a new one (I guess because of MHC doing its thing) and the cluster successfully started then.

I still have stuff around if there are ideas to filter information.

chrischdi commented 2 months ago

I was able to hit it again and triage a bit.

It turns out that the node itself came up, except the parts which try to hit the load balanced control-plane endpoint.

TLDR: The haproxy load balancer did not forward traffic to the control plane node.

My current theory is:

CAPD wrote the new haproxy config file to the lb container, which includes the first control plane node as backend
- I can confirm that this one was correct in my setup by reading the file again using `docker run cp
CAPD did signal to haproxy to reload the config (by sending SIGHUP)
Afterwards CAPD did still not route requests to the running node.

I was able to "fix" the issue in this case by again sending SIGHUP to haproxy: docker kill -s SIGHUP <container> Afterwards haproxy did route requests to the running node.

I'm currently testing the following fix locally which is: reading and comparing the configfile in CAPD after writing it and before reloading haproxy:

10453

Test setup:

GINKGO_FOCUS="PR-Blocking"
GINKGO_SKIP="\[Conformance\]"

So it only runs a single test.

I used prowjob on kind to create a kind cluster and pod yaml, which I then modified (adjusted timeouts + GINKGO_FOCUS + requests, propably other things too).

I then run the loop using ./scripts/test-pj.sh.

All code changes for my setup are here for reference: https://github.com/kubernetes-sigs/cluster-api/commit/dfe9d5e2478dbdc73b59be5533ea59234c8d2a9c

I did some optimisations, like packing all required images to scripts/images.tar and loading them from there instead of building + propably some others to make it faster and rely less on internet to not run into rate limiting stuff too.

chrischdi commented 2 months ago

Fixes are merged, let's check next week or so if the error occurs again.

chrischdi commented 2 months ago

The merged fix did not help.

chrischdi commented 2 months ago

For reference, I did hit the same issue (CAPD load balancer config not active) as described in this comment on a 0.4 => 1.6 => current upgrade test but with a slightly different log:

  STEP: Initializing the workload cluster with older versions of providers @ 04/24/24 08:03:54.723
  INFO: clusterctl init --config /logs/artifacts/repository/clusterctl-config.v1.2.yaml --kubeconfig /tmp/e2e-kubeconfig2940028556 --wait-providers --core cluster-api:v0.4.8 --bootstrap kubeadm:v0.4.8 --control-plane kubeadm:v0.4.8 --infrastructure docker:v0.4.8
  INFO: Waiting for provider controllers to be running
  STEP: Waiting for deployment capd-system/capd-controller-manager to be available @ 04/24/24 08:04:31.768
  INFO: Creating log watcher for controller capd-system/capd-controller-manager, pod capd-controller-manager-7cb759f76b-whdwb, container manager
  STEP: Waiting for deployment capi-kubeadm-bootstrap-system/capi-kubeadm-bootstrap-controller-manager to be available @ 04/24/24 08:04:31.892
  INFO: Creating log watcher for controller capi-kubeadm-bootstrap-system/capi-kubeadm-bootstrap-controller-manager, pod capi-kubeadm-bootstrap-controller-manager-b67d5f4cb-8kppj, container manager
  STEP: Waiting for deployment capi-kubeadm-control-plane-system/capi-kubeadm-control-plane-controller-manager to be available @ 04/24/24 08:04:31.918
  INFO: Creating log watcher for controller capi-kubeadm-control-plane-system/capi-kubeadm-control-plane-controller-manager, pod capi-kubeadm-control-plane-controller-manager-69846d766d-k8ptj, container manager
  STEP: Waiting for deployment capi-system/capi-controller-manager to be available @ 04/24/24 08:04:31.95
  INFO: Creating log watcher for controller capi-system/capi-controller-manager, pod capi-controller-manager-7c9ccb586-5mkbx, container manager
  STEP: THE MANAGEMENT CLUSTER WITH THE OLDER VERSION OF PROVIDERS IS UP&RUNNING! @ 04/24/24 08:04:32.176
  STEP: Creating a namespace for hosting the clusterctl-upgrade test workload cluster @ 04/24/24 08:04:32.177
  INFO: Creating namespace clusterctl-upgrade
  INFO: Creating event watcher for namespace "clusterctl-upgrade"
  STEP: Creating a test workload cluster @ 04/24/24 08:04:32.193
  INFO: Creating the workload cluster with name "clusterctl-upgrade-o3zf09" using the "(default)" template (Kubernetes v1.23.17, 1 control-plane machines, 1 worker machines)
  INFO: Getting the cluster template yaml
  INFO: clusterctl config cluster clusterctl-upgrade-o3zf09 --infrastructure docker --kubernetes-version v1.23.17 --control-plane-machine-count 1 --worker-machine-count 1 --flavor (default)
  INFO: Applying the cluster template yaml to the cluster
  STEP: Waiting for the machines to exist @ 04/24/24 08:04:44.941
  [FAILED] in [It] - /home/prow/go/src/sigs.k8s.io/cluster-api/test/e2e/clusterctl_upgrade.go:456 @ 04/24/24 08:09:44.953
  STEP: Dumping logs from the "clusterctl-upgrade-hs16jr" workload cluster @ 04/24/24 08:09:44.953
  [FAILED] in [AfterEach] - /home/prow/go/src/sigs.k8s.io/cluster-api/test/framework/cluster_proxy.go:311 @ 04/24/24 08:12:44.955
  << Timeline

  [FAILED] Timed out after 300.001s.
  Timed out waiting for all Machines to exist
  Expected
      <int64>: 0
  to equal
      <int64>: 2
  In [It] at: /home/prow/go/src/sigs.k8s.io/cluster-api/test/e2e/clusterctl_upgrade.go:456 @ 04/24/24 08:09:44.953

  Full Stack Trace
    sigs.k8s.io/cluster-api/test/e2e.ClusterctlUpgradeSpec.func2()
        /home/prow/go/src/sigs.k8s.io/cluster-api/test/e2e/clusterctl_upgrade.go:456 +0x28cd

pravarag commented 1 month ago

I'll investigate more on this issue. /assign

kubernetes-sigs / cluster-api