kubernetes-sigs / cluster-api

Home for Cluster API, a subproject of sig-cluster-lifecycle
https://cluster-api.sigs.k8s.io
Apache License 2.0

Upgrading a workload cluster using ClusterClass with RuntimeSDK test is flaky with error: Resource versions didn't stay stable #10838

Open Sunnatillo opened 2 months ago

Sunnatillo commented 2 months ago

Which jobs are flaking?

capi-e2e-main

Which tests are flaking?

When upgrading a workload cluster using ClusterClass with RuntimeSDK [ClusterClass] [It] Should create, upgrade and delete a workload cluster /home/prow/go/src/sigs.k8s.io/cluster-api/test/e2e/cluster_upgrade_runtimesdk.go:155

Testgrid link

Edited: https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/periodic-cluster-api-e2e-mink8s-release-1-7/1809819550426861568


Reason for failure (if possible)

No response

Anything else we need to know?

No response

Label(s) to be applied

/kind flake
/area ci

  [FAILED] Failed after 63.517s.
  Resource versions didn't stay stable
  The function passed to Consistently failed at /home/prow/go/src/sigs.k8s.io/cluster-api/test/framework/resourceversion_helpers.go:53 with:
  Expected object to be comparable, diff:   map[string]string{
        ... // 11 identical entries
        "DockerMachine/k8s-upgrade-with-runtimesdk-05ptjc/worker-r5yi9k":                                              "38350",
        "DockerMachine/k8s-upgrade-with-runtimesdk-05ptjc/worker-vlj8b9":                                              "38404",
  -     "DockerMachinePool/k8s-upgrade-with-runtimesdk-05ptjc/k8s-upgrade-with-runtimesdk-z1t5eg-mp-0-vtdfd":          "39165",
  +     "DockerMachinePool/k8s-upgrade-with-runtimesdk-05ptjc/k8s-upgrade-with-runtimesdk-z1t5eg-mp-0-vtdfd":          "38721",
        "DockerMachinePoolTemplate/k8s-upgrade-with-runtimesdk-05ptjc/quick-start-default-worker-machinepooltemplate": "29519",
        "DockerMachineTemplate/k8s-upgrade-with-runtimesdk-05ptjc/k8s-upgrade-with-runtimesdk-z1t5eg-md-0-pgr5r":      "30876",
        ... // 16 identical entries
        "Machine/k8s-upgrade-with-runtimesdk-05ptjc/worker-vlj8b9":                                              "38573",
        "MachineDeployment/k8s-upgrade-with-runtimesdk-05ptjc/k8s-upgrade-with-runtimesdk-z1t5eg-md-md-0-qjkf9": "38854",
  -     "MachinePool/k8s-upgrade-with-runtimesdk-05ptjc/k8s-upgrade-with-runtimesdk-z1t5eg-mp-mp-0-b8r79":       "39168",
  +     "MachinePool/k8s-upgrade-with-runtimesdk-05ptjc/k8s-upgrade-with-runtimesdk-z1t5eg-mp-mp-0-b8r79":       "38728",
        "MachineSet/k8s-upgrade-with-runtimesdk-05ptjc/k8s-upgrade-with-runtimesdk-z1t5eg-md-md-0-qjkf9-h754k":  "38853",
        "MachineSet/k8s-upgrade-with-runtimesdk-05ptjc/k8s-upgrade-with-runtimesdk-z1t5eg-md-md-0-qjkf9-tg2vx":  "38777",
        ... // 9 identical entries
    }
  In [It] at: /home/prow/go/src/sigs.k8s.io/cluster-api/test/framework/resourceversion_helpers.go:54 @ 06/27/24 04:19:26.795
adilGhaffarDev commented 2 months ago

@Sunnatillo the link is pointing to a different failure.

Sunnatillo commented 2 months ago

I updated it with the correct link: https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/periodic-cluster-api-e2e-mink8s-release-1-7/1809819550426861568

fabriziopandini commented 2 months ago

/help

k8s-ci-robot commented 2 months ago

@fabriziopandini: This request has been marked as needing help from a contributor.

Guidelines

Please ensure that the issue body includes answers to the following questions:

For more details on the requirements of such an issue, please see here and ensure that they are met.

If this request no longer meets these requirements, the label can be removed by commenting with the /remove-help command.

In response to [this](https://github.com/kubernetes-sigs/cluster-api/issues/10838):

> /help

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes-sigs/prow](https://github.com/kubernetes-sigs/prow/issues/new?title=Prow%20issue:) repository.
willie-yao commented 1 month ago

/assign

willie-yao commented 1 month ago

I noticed that nodeVolumeDetachTimeout and minReadySeconds weren't added to the MachineDeployment spec for runtimesdk in #9393, so I'm going to update that and see if the flake still happens.

sbueringer commented 1 month ago

Fine to add; I don't think it will affect the results, though.

chrischdi commented 1 month ago

Query to find the latest failures

sbueringer commented 1 month ago

Improvement to make CAPD DockerMachinePools more deterministic: https://github.com/kubernetes-sigs/cluster-api/pull/10998

(I wouldn't expect it to solve the whole flake though)

sbueringer commented 1 month ago

The CAPD flake seems to be gone now.

We only have a relatively rare flake with KCP left: https://storage.googleapis.com/k8s-triage/index.html?text=Detected%20objects%20with%20changed%20resourceVersion&job=.*cluster-api.*e2e.*main&xjob=.*-provider-.*

Example: https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/periodic-cluster-api-e2e-mink8s-main/1822127420073840640

willie-yao commented 5 days ago

> The CAPD flake seems to be gone now.

Will unassign myself for now, but if this flake persists I can take another look when I have time.

/unassign

sbueringer commented 4 days ago

The MachinePool flake (https://github.com/kubernetes-sigs/cluster-api/issues/11162) is a lot more frequent/problematic

sivchari commented 4 days ago

I'll investigate it.

/assign