kubernetes-sigs / cluster-api-provider-azure

Cluster API implementation for Microsoft Azure
https://capz.sigs.k8s.io/

Tests with highly-available control planes sometimes fail to reach Service targeting Windows workload #5100

Open nojnhuh opened 3 months ago

nojnhuh commented 3 months ago

Which jobs are flaky:

https://storage.googleapis.com/k8s-triage/index.html?pr=1&text=Job%20default%2Fcurl-to-ilb

Which tests are flaky:

Testgrid link:

https://prow.k8s.io/view/gs/kubernetes-jenkins/pr-logs/directory/pull-cluster-api-provider-azure-e2e/1829280096825905152

Reason for failure (if possible):

My current theory is that the failures occur when Calico fails to finish initializing on the Node before the test creates the Service that routes to that node and starts making requests to it.

The Node's NetworkUnavailable condition only becomes False about 27 minutes after the Node is created:

Conditions:
  Type                 Status  LastHeartbeatTime                 LastTransitionTime                Reason                       Message
  ----                 ------  -----------------                 ------------------                ------                       -------
  NetworkUnavailable   False   Thu, 29 Aug 2024 22:58:56 +0000   Thu, 29 Aug 2024 22:58:56 +0000   CalicoIsUp                   Calico is running on this node
  MemoryPressure       False   Thu, 29 Aug 2024 23:04:25 +0000   Thu, 29 Aug 2024 22:31:13 +0000   KubeletHasSufficientMemory   kubelet has sufficient memory available
  DiskPressure         False   Thu, 29 Aug 2024 23:04:25 +0000   Thu, 29 Aug 2024 22:31:13 +0000   KubeletHasNoDiskPressure     kubelet has no disk pressure
  PIDPressure          False   Thu, 29 Aug 2024 23:04:25 +0000   Thu, 29 Aug 2024 22:31:13 +0000   KubeletHasSufficientPID      kubelet has sufficient PID available
  Ready                True    Thu, 29 Aug 2024 23:04:25 +0000   Thu, 29 Aug 2024 22:32:37 +0000   KubeletReady                 kubelet is posting ready status
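
For reference, a minimal client-go sketch (not part of the e2e framework; the kubeconfig path is a placeholder) that prints each Node's NetworkUnavailable condition and when it last transitioned, so it can be lined up against the test's timestamps:

```go
// Minimal diagnostic sketch: print each Node's NetworkUnavailable condition and
// when it last transitioned. The kubeconfig path is a placeholder.
package main

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", "workload-cluster.kubeconfig") // placeholder path
	if err != nil {
		panic(err)
	}
	cs, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}
	nodes, err := cs.CoreV1().Nodes().List(context.Background(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	for _, n := range nodes.Items {
		for _, c := range n.Status.Conditions {
			if c.Type == corev1.NodeNetworkUnavailable {
				fmt.Printf("%s: NetworkUnavailable=%s since %s (%s)\n",
					n.Name, c.Status, c.LastTransitionTime, c.Reason)
			}
		}
	}
}
```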

The test happens to create a Service targeting the Node at around the same time the Node's network finishes initializing, which I'm guessing is the source of the flakiness:

  STEP: Creating an accessible load balancer for windows @ 08/29/24 22:58:05.269
  STEP: creating a Kubernetes client to the workload cluster @ 08/29/24 22:58:05.269
  STEP: creating an HTTP deployment @ 08/29/24 22:58:05.277
  STEP: waiting for deployment default/web-windowsu4c93z to be available @ 08/29/24 22:58:05.442
  Aug 29 22:58:05.442: INFO: starting to wait for deployment to become available
  Aug 29 22:58:55.638: INFO: Deployment default/web-windowsu4c93z is now available, took 50.195638518s
  STEP: creating an internal Load Balancer service @ 08/29/24 22:58:55.638
  Aug 29 22:58:55.638: INFO: starting to create an internal Load Balancer service
  STEP: waiting for service default/web-windowsu4c93z-ilb to be available @ 08/29/24 22:58:55.76
  Aug 29 22:58:55.760: INFO: waiting for service default/web-windowsu4c93z-ilb to be available
  Aug 29 23:00:06.001: INFO: service default/web-windowsu4c93z-ilb is available, took 1m10.241443118s
  STEP: connecting to the internal LB service from a curl pod @ 08/29/24 23:00:06.001
  Aug 29 23:00:06.030: INFO: starting to create a curl to ilb job
  STEP: waiting for job default/curl-to-ilb-jobw1ndo to be complete @ 08/29/24 23:00:06.069
  Aug 29 23:00:06.069: INFO: waiting for job default/curl-to-ilb-jobw1ndo to be complete
  [FAILED] in [It] - /home/prow/go/src/sigs.k8s.io/cluster-api-provider-azure/test/e2e/helpers.go:182 @ 08/29/24 23:05:06.29
  Aug 29 23:05:06.290: INFO: FAILED! 

I noticed the calico-node-startup container on the Node spends about 25 minutes looping through these logs. Maybe some bad or missing configuration is causing this behavior?

2024-08-29 22:33:52.343 [INFO][5260] startup/startup.go 432: Early log level set to info
2024-08-29 22:33:52.357 [WARNING][5260] startup/utils.go 135: Using auto-detected node name. It is recommended that an explicit value is supplied using the NODENAME environment variable.
2024-08-29 22:33:52.360 [INFO][5260] startup/utils.go 138: Determined node name: capz-e2e-qd5hp
2024-08-29 22:33:52.360 [INFO][5260] startup/startup.go 94: Starting node capz-e2e-qd5hp with version v3.26.1
2024-08-29 22:33:52.417 [INFO][5260] startup/startup.go 106: Skipping datastore connection test
2024-08-29 22:33:54.667 [ERROR][5260] startup/startup.go 158: failed to query kubeadm's config map error=Get "https://10.96.0.1:443/api/v1/namespaces/kube-system/configmaps/kubeadm-config?timeout=2s": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
2024-08-29 22:33:54.667 [WARNING][5260] startup/utils.go 48: Terminating
Calico node initialisation failed, will retry...
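
The call that times out is an in-cluster GET of the kube-system/kubeadm-config ConfigMap through the default Service VIP (10.96.0.1). Below is a minimal client-go sketch of that same request using in-cluster config; running it from a debug pod scheduled to the affected node is only an assumption about how one might check whether the VIP is reachable at all, not something the test currently does:

```go
// Minimal sketch: perform the same request Calico's startup makes (GET the
// kube-system/kubeadm-config ConfigMap), using in-cluster config so the API
// server is reached through the Service VIP.
package main

import (
	"context"
	"fmt"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	cfg, err := rest.InClusterConfig() // resolves the API server via KUBERNETES_SERVICE_HOST, i.e. the Service VIP
	if err != nil {
		panic(err)
	}
	cs, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}
	// Match Calico's 2s timeout so a hang shows up the same way it does in the logs.
	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
	defer cancel()
	cm, err := cs.CoreV1().ConfigMaps("kube-system").Get(ctx, "kubeadm-config", metav1.GetOptions{})
	if err != nil {
		fmt.Println("GET kubeadm-config failed:", err)
		return
	}
	fmt.Println("GET kubeadm-config succeeded, resourceVersion:", cm.ResourceVersion)
}
```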

Even if that Calico behavior is unexpected and gets fixed, we should probably add another checkpoint to the test framework to ensure that each Node's NetworkUnavailable condition is False, since that does not appear to block the Ready condition.
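
A rough sketch of what that checkpoint could look like, using client-go and the apimachinery wait helpers; the function name, polling interval, and where it would hook into the framework are illustrative assumptions, not the actual helper:

```go
// Sketch of a possible framework checkpoint: block until every Node reports
// NetworkUnavailable=False (or doesn't report the condition at all), so Services
// are only exercised after the CNI has initialized.
package e2e

import (
	"context"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
)

func WaitForNodeNetworkAvailable(ctx context.Context, cs kubernetes.Interface, timeout time.Duration) error {
	return wait.PollUntilContextTimeout(ctx, 10*time.Second, timeout, true, func(ctx context.Context) (bool, error) {
		nodes, err := cs.CoreV1().Nodes().List(ctx, metav1.ListOptions{})
		if err != nil {
			return false, nil // treat list errors as transient and keep polling
		}
		for _, n := range nodes.Items {
			for _, c := range n.Status.Conditions {
				if c.Type == corev1.NodeNetworkUnavailable && c.Status != corev1.ConditionFalse {
					return false, nil // this Node's network is not ready yet
				}
			}
		}
		return true, nil
	})
}
```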

Anything else we need to know:

/kind flake


nojnhuh commented 3 months ago

Doing some testing in https://github.com/kubernetes-sigs/cluster-api-provider-azure/pull/5101

nojnhuh commented 3 months ago

FYI @marosset @jsturtevant in case anything obvious jumps out to either of you here.

k8s-triage-robot commented 1 day ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale