What steps did you take and what happened:
The test [It] [unmanaged] [Cluster API Framework] Clusterctl Upgrade Spec [from latest v1beta1 release to v1beta2] Should create a management cluster and then upgrade all the providers often fails in CI due to a timeout.
The upgrade itself passes, but the subsequent teardown fails.
The output looks something like this:
STEP: THE UPGRADED MANAGEMENT CLUSTER WORKS! @ 04/16/24 02:49:40.808
STEP: PASSED! @ 04/16/24 02:49:40.808
STEP: Dumping logs from the "clusterctl-upgrade-wcfhw0" workload cluster @ 04/16/24 02:49:40.818
STEP: Dumping all the Cluster API resources in the "clusterctl-upgrade" namespace @ 04/16/24 02:49:40.818
STEP: Deleting all cluster.x-k8s.io/v1beta1 clusters in namespace clusterctl-upgrade in management cluster clusterctl-upgrade-wcfhw0 @ 04/16/24 02:49:43.521
STEP: Deleting cluster clusterctl-upgrade/clusterctl-upgrade-nm67wf @ 04/16/24 02:49:43.623
INFO: Waiting for the Cluster clusterctl-upgrade/clusterctl-upgrade-nm67wf to be deleted
STEP: Waiting for cluster clusterctl-upgrade/clusterctl-upgrade-nm67wf to be deleted @ 04/16/24 02:49:43.685
[FAILED] in [AfterEach] - /home/prow/go/pkg/mod/sigs.k8s.io/cluster-api/test@v1.6.1/framework/cluster_helpers.go:176 @ 04/16/24 03:09:43.687
STEP: Node 8 released resources: {ec2-normal:0, vpc:2, eip:2, ngw:2, igw:2, classiclb:2, ec2-GPU:0, volume-gp2:0, eventBridge-rules:50} @ 04/16/24 03:09:44.688
<< Timeline
[FAILED] Timed out after 1200.001s.
Expected
<bool>: false
to be true
In [AfterEach] at: /home/prow/go/pkg/mod/sigs.k8s.io/cluster-api/test@v1.6.1/framework/cluster_helpers.go:176 @ 04/16/24 03:09:43.687
Full Stack Trace
sigs.k8s.io/cluster-api/test/framework.WaitForClusterDeleted({0x374cf98?, 0x5104ec0}, {{0x7f16fc3aa7a0?, 0xc000f16a20?}, 0xc002856700?}, {0xc001110ce0, 0x2, 0x2})
/home/prow/go/pkg/mod/sigs.k8s.io/cluster-api/test@v1.6.1/framework/cluster_helpers.go:176 +0x1c3
sigs.k8s.io/cluster-api/test/framework.DeleteAllClustersAndWait({0x374cf98?, 0x5104ec0}, {{0x375fd40?, 0xc000f16a20?}, {0xc000d77770?, 0x7?}}, {0xc001110ce0, 0x2, 0x2})
/home/prow/go/pkg/mod/sigs.k8s.io/cluster-api/test@v1.6.1/framework/cluster_helpers.go:272 +0x426
sigs.k8s.io/cluster-api/test/e2e.ClusterctlUpgradeSpec.func3()
/home/prow/go/pkg/mod/sigs.k8s.io/cluster-api/test@v1.6.1/e2e/clusterctl_upgrade.go:552 +0x4ba
------------------------------
[SynchronizedAfterSuite] PASSED [0.000 seconds]
[SynchronizedAfterSuite]
/home/prow/go/src/sigs.k8s.io/cluster-api-provider-aws/test/e2e/suites/unmanaged/unmanaged_suite_test.go:57
------------------------------
[SynchronizedAfterSuite] PASSED [1165.726 seconds]
[SynchronizedAfterSuite]
/home/prow/go/src/sigs.k8s.io/cluster-api-provider-aws/test/e2e/suites/unmanaged/unmanaged_suite_test.go:57
Timeline >>
STEP: Dumping all the Cluster API resources in the "functional-gpu-cluster-wqgqck" namespace @ 04/16/24 02:57:48.829
STEP: Dumping all EC2 instances in the "functional-gpu-cluster-wqgqck" namespace @ 04/16/24 02:57:49.149
STEP: Deleting all clusters in the "functional-gpu-cluster-wqgqck" namespace with intervals ["20m" "10s"] @ 04/16/24 02:58:17.15
STEP: Deleting cluster functional-gpu-cluster-wqgqck/functional-gpu-cluster-44q1nj @ 04/16/24 02:58:17.157
INFO: Waiting for the Cluster functional-gpu-cluster-wqgqck/functional-gpu-cluster-44q1nj to be deleted
STEP: Waiting for cluster functional-gpu-cluster-wqgqck/functional-gpu-cluster-44q1nj to be deleted @ 04/16/24 02:58:17.164
STEP: Deleting namespace used for hosting the "" test spec @ 04/16/24 03:04:47.359
INFO: Deleting namespace functional-gpu-cluster-wqgqck
folder created for eks clusters: /logs/artifacts/clusters/bootstrap/aws-resources
STEP: Tearing down the management cluster @ 04/16/24 03:16:09.45
INFO: Error getting pod capi-kubeadm-control-plane-system/capi-kubeadm-control-plane-controller-manager-66d8956f77-2n4zj, container manager: Get "https://127.0.0.1:38745/api/v1/namespaces/capi-kubeadm-control-plane-system/pods/capi-kubeadm-control-plane-controller-manager-66d8956f77-2n4zj": dial tcp 127.0.0.1:38745: connect: connection refused - error from a previous attempt: read tcp 127.0.0.1:58460->127.0.0.1:38745: read: connection reset by peer
INFO: Error getting pod capi-kubeadm-bootstrap-system/capi-kubeadm-bootstrap-controller-manager-78d8cb7cf6-kdr97, container manager: Get "https://127.0.0.1:38745/api/v1/namespaces/capi-kubeadm-bootstrap-system/pods/capi-kubeadm-bootstrap-controller-manager-78d8cb7cf6-kdr97": dial tcp 127.0.0.1:38745: connect: connection refused - error from a previous attempt: EOF
INFO: Error getting pod capa-system/capa-controller-manager-6b8f8b488c-9dcnc, container manager: Get "https://127.0.0.1:38745/api/v1/namespaces/capa-system/pods/capa-controller-manager-6b8f8b488c-9dcnc": dial tcp 127.0.0.1:38745: connect: connection refused - error from a previous attempt: read tcp 127.0.0.1:58472->127.0.0.1:38745: read: connection reset by peer
INFO: Error getting pod capi-system/capi-controller-manager-656b74646d-djwxs, container manager: Get "https://127.0.0.1:38745/api/v1/namespaces/capi-system/pods/capi-controller-manager-656b74646d-djwxs": dial tcp 127.0.0.1:38745: connect: connection refused - error from a previous attempt: read tcp 127.0.0.1:58458->127.0.0.1:38745: read: connection reset by peer
STEP: Deleting cluster-api-provider-aws-sigs-k8s-io CloudFormation stack @ 04/16/24 03:16:13.393
The above output may be a red herring, however: there is also a log line indicating that we are not collecting logs from the cluster created for the upgrade test, while the cluster being torn down in the pasted section is functional-gpu-cluster-wqgqck/functional-gpu-cluster-44q1nj.
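For context, the assertion that times out is essentially an existence poll on the Cluster object. A minimal sketch of what framework.WaitForClusterDeleted at cluster_helpers.go:176 boils down to (the helper name and signature below are illustrative, not the upstream API):

```go
package e2e

import (
	"context"

	. "github.com/onsi/gomega"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// waitForClusterDeleted approximates what the framework helper does: poll
// until a Get on the Cluster object returns NotFound. The Gomega failure
// "Expected <bool>: false to be true" means the Cluster object still
// existed when the interval (20m here) expired, i.e. the delete was
// accepted but reconciliation never finished removing finalizers.
func waitForClusterDeleted(ctx context.Context, c client.Client, namespace, name string, intervals ...interface{}) {
	Eventually(func() bool {
		cluster := &clusterv1.Cluster{}
		key := client.ObjectKey{Namespace: namespace, Name: name}
		return apierrors.IsNotFound(c.Get(ctx, key, cluster))
	}, intervals...).Should(BeTrue(), "cluster %s/%s was not deleted in time", namespace, name)
}
```

In other words, the "Timed out after 1200.001s" failure says only that the Cluster object never disappeared within the 20m interval; it does not say what blocked the deletion.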
/kind flake
What did you expect to happen:
Test resources are cleaned up without timeout.
Anything else you would like to add:
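A diagnostic sketch that could help narrow this down when reproducing locally: list the Cluster objects still present in the stuck namespace together with their finalizers, since a Cluster with a deletionTimestamp set but finalizers remaining points at the controller that stopped reconciling the delete. dumpBlockingClusters is a hypothetical helper, not part of the test framework:

```go
package e2e

import (
	"context"
	"fmt"

	clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// dumpBlockingClusters is a hypothetical debugging helper: it prints every
// Cluster still present in the namespace along with its deletionTimestamp
// and finalizers. It assumes c was built with a scheme that registers
// clusterv1 types.
func dumpBlockingClusters(ctx context.Context, c client.Client, namespace string) error {
	clusters := &clusterv1.ClusterList{}
	if err := c.List(ctx, clusters, client.InNamespace(namespace)); err != nil {
		return fmt.Errorf("listing clusters in %s: %w", namespace, err)
	}
	for _, cluster := range clusters.Items {
		// A set deletionTimestamp with non-empty finalizers means deletion
		// was requested but some controller has not released its finalizer.
		fmt.Printf("cluster %s/%s deletionTimestamp=%v finalizers=%v\n",
			cluster.Namespace, cluster.Name, cluster.DeletionTimestamp, cluster.Finalizers)
	}
	return nil
}
```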