nojnhuh opened this issue 3 months ago
Doing some testing in https://github.com/kubernetes-sigs/cluster-api-provider-azure/pull/5101
FYI @marosset @jsturtevant in case anything obvious jumps out to either of you here.
The Kubernetes project currently lacks enough contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:
- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
Which jobs are flaky:
https://storage.googleapis.com/k8s-triage/index.html?pr=1&text=Job%20default%2Fcurl-to-ilb
Which tests are flaky:
Testgrid link:
https://prow.k8s.io/view/gs/kubernetes-jenkins/pr-logs/directory/pull-cluster-api-provider-azure-e2e/1829280096825905152
Reason for failure (if possible):
My current theory is that the failures occur when Calico fails to finish initializing on a Node before the test creates the Service that routes to that Node and starts making requests to it.
The Node's NetworkUnavailable condition does not become False until about 27 minutes after the Node is created:
The test happens to create the Service routing to that Node at around the same time the Node's network is initialized, which I'm guessing is the source of the flakiness:
I noticed the calico-node-startup container on the Node spends about 25 minutes looping on these log messages. Maybe some bad or missing config is causing this behavior?
Even if that behavior turns out to be unexpected and gets fixed, we should probably add another checkpoint to the framework to ensure each Node's NetworkUnavailable condition is False, since that condition does not appear to block the Node's Ready condition. A rough sketch of what I have in mind is below.
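This is only a sketch using client-go directly rather than the framework's existing helpers; the function name, polling interval, and timeout are all placeholders:

```go
package e2e

import (
	"context"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
)

// WaitForNodeNetworkAvailable (hypothetical name) polls the workload cluster
// until no Node reports a NetworkUnavailable condition with status True, so a
// test doesn't start exercising Services before the CNI has finished
// initializing on every Node.
func WaitForNodeNetworkAvailable(ctx context.Context, clientset kubernetes.Interface) error {
	return wait.PollUntilContextTimeout(ctx, 10*time.Second, 30*time.Minute, true, func(ctx context.Context) (bool, error) {
		nodes, err := clientset.CoreV1().Nodes().List(ctx, metav1.ListOptions{})
		if err != nil {
			// Treat transient list errors as "not ready yet" and keep polling.
			return false, nil
		}
		for _, node := range nodes.Items {
			for _, cond := range node.Status.Conditions {
				if cond.Type == corev1.NodeNetworkUnavailable && cond.Status != corev1.ConditionFalse {
					return false, nil
				}
			}
		}
		return true, nil
	})
}
```

Note that a Node with no NetworkUnavailable condition at all is treated as OK here, which may or may not be the right call.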
Anything else we need to know:
/kind flake