kubernetes / kubernetes

Production-Grade Container Scheduling and Management
https://kubernetes.io
Apache License 2.0

[Flaky Test] capz-windows-master ci-kubernetes-e2e-capz-master-windows.Overall #127408

Open · drewhagen opened this issue 1 month ago

drewhagen commented 1 month ago

Which jobs are flaking?

Which tests are flaking?

ci-kubernetes-e2e-capz-master-windows.Overall

Since when has it been flaking?

Failed runs:

Testgrid link

Testgrid link

Reason for failure (if possible)

    Sun, 15 Sep 2024 19:12:24 +0000: cluster creation complete
    Sun, 15 Sep 2024 19:12:25 +0000: bastion info: capi@null:22
    Sun, 15 Sep 2024 19:12:25 +0000: wait for cluster to stabilize
    Sun, 15 Sep 2024 19:17:25 +0000: cleaning up
    ./capz/run-capz-e2e.sh: line 103: capz::ci-build-azure-ccm::cleanup: command not found
    E0915 19:17:55.212078    2635 memcache.go:265] "Unhandled Error" err="couldn't get current server API group list: Get \"https://capz-conf-l4mu8v-a076bee8.westus2.cloudapp.azure.com:6443/api?timeout=32s\": dial tcp 20.120.140.205:6443: i/o timeout"
    E0915 19:18:25.214460    2635 memcache.go:265] "Unhandled Error" err="couldn't get current server API group list: Get \"https://capz-conf-l4mu8v-a076bee8.westus2.cloudapp.azure.com:6443/api?timeout=32s\": dial tcp 20.120.140.205:6443: i/o timeout"
    E0915 19:18:55.216389    2635 memcache.go:265] "Unhandled Error" err="couldn't get current server API group list: Get \"https://capz-conf-l4mu8v-a076bee8.westus2.cloudapp.azure.com:6443/api?timeout=32s\": dial tcp 20.120.140.205:6443: i/o timeout"
    E0915 19:19:25.218324    2635 memcache.go:265] "Unhandled Error" err="couldn't get current server API group list: Get \"https://capz-conf-l4mu8v-a076bee8.westus2.cloudapp.azure.com:6443/api?timeout=32s\": dial tcp 20.120.140.205:6443: i/o timeout"
    E0915 19:19:55.220740    2635 memcache.go:265] "Unhandled Error" err="couldn't get current server API group list: Get \"https://capz-conf-l4mu8v-a076bee8.westus2.cloudapp.azure.com:6443/api?timeout=32s\": dial tcp 20.120.140.205:6443: i/o timeout"
    Unable to connect to the server: dial tcp 20.120.140.205:6443: i/o timeout
    + EXIT_VALUE=1
    + set +o xtrace
    Cleaning up after docker in docker.
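
Two distinct failures are visible here: the cleanup helper `capz::ci-build-azure-ccm::cleanup` is invoked without ever being defined (most likely a missing `source` of the script that provides it), and the API server never becomes reachable on port 6443. For the first problem, a minimal defensive sketch in bash, assuming the function name exactly as logged, so cleanup degrades gracefully instead of erroring out:

    # Hypothetical guard (not the actual run-capz-e2e.sh code): only invoke the
    # cleanup helper if it has actually been defined/sourced in this shell.
    if declare -F capz::ci-build-azure-ccm::cleanup >/dev/null; then
      capz::ci-build-azure-ccm::cleanup
    else
      echo "WARN: capz::ci-build-azure-ccm::cleanup is not defined; skipping" >&2
    fi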

Anything else we need to know?

Relevant SIG(s)

/sig windows
/kind flake

cc: @kubernetes/release-team-release-signal

k8s-ci-robot commented 1 month ago

This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes-sigs/prow](https://github.com/kubernetes-sigs/prow/issues/new?title=Prow%20issue:) repository.
drewhagen commented 1 month ago

Thanks y'all! I notice that #126096 is in active code review.

Also, @kubernetes/sig-windows-bugs: the first release cut (1.32.0-alpha.1) is due in less than a week, on Oct 1st 2024. Given that this flake is on the master-informing dashboard and is being addressed, can we consider it a non-blocker for this next release cut? Please advise - thank you!

marosset commented 1 month ago

This isn't a blocker.

These errors are failures in bringing up the test cluster and happen before any of the e2e tests run. I think we need to figure out how to get more logs for these failures: either Azure ARM logs or possibly logs from the CAPZ controllers.
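
A hedged sketch of what gathering those logs could look like; the `capz-system` namespace and the resource-group name below are assumptions for illustration, not values taken from this job:

    # CAPZ controller logs from the management cluster (namespace is an
    # assumption, matching a default CAPZ install):
    kubectl logs -n capz-system deployment/capz-controller-manager --since=2h --tail=500

    # Azure ARM activity log for the cluster's resource group (group name assumed
    # here to match the cluster name from the failure above):
    az monitor activity-log list --resource-group capz-conf-l4mu8v --offset 2h --output table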

/cc @jsturtevant @ritikaguptams

knabben commented 1 month ago

Updated Windows 2022 in the job, but it does not seem to have had a clear effect.

In parallel, another infra issue is happening on CAPZ, related to regional capacity availability; it only gets fixed on the next retry:

    --------------------------------------------------------------------------------
    RESPONSE 409: 409 Conflict
    ERROR CODE: SkuNotAvailable
    --------------------------------------------------------------------------------
    {
      "error": {
        "code": "SkuNotAvailable",
        "message": "The requested VM size for resource 'Following SKUs have failed for Capacity Restrictions: Standard_D2s_v3' is currently not available in location 'westus2'. Please try another size or deploy to a different location or different zone. See https://aka.ms/azureskunotavailable for details.",
        "target": "vmSize"
      }
    }
    --------------------------------------------------------------------------------
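
For capacity errors like this, one way to see up front whether a SKU is restricted in a region is the Azure CLI; a sketch (the actual CI retry/fallback logic would differ):

    # List availability for the failing SKU in westus2; a populated Restrictions
    # column corresponds to the SkuNotAvailable 409 above.
    az vm list-skus --location westus2 --size Standard_D2s_v3 --all --output table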
drewhagen commented 3 weeks ago

/milestone v1.32

drewhagen commented 3 weeks ago

Hello @knabben @marosset. Thanks for taking action on this!

A friendly reminder of what's ahead:

Given this timeline and capacity, will a fix for this still be targeted for the 1.32 release? Thanks! 😄 🚀

drewhagen commented 1 week ago

👋 @marosset @knabben Thanks for updating Windows 2022 in that job. Is this still an issue, and do we plan to resolve it for v1.32?

To that end, a friendly reminder that the code freeze starts at 02:00 UTC on Friday, November 8th 2024 (a little less than a week from now). Please make sure any new PRs have both lgtm and approved labels before the code freeze. Thanks! 👍

drewhagen commented 2 days ago

👋 Hello @marosset @knabben! Appreciate all of your efforts on this! Is the plan still to resolve this issue for v1.32? If so, a gentle reminder that the code freeze started at 02:00 UTC on Friday, November 8th 2024. Please make sure any PRs have both lgtm and approved labels ASAP, and file an Exception. Thanks!