ovn-org / ovn-kubernetes

A robust Kubernetes networking platform
https://ovn-kubernetes.io/
Apache License 2.0
806 stars 339 forks source link

Shard tests timing out since recently #2590

Closed andreaskaris closed 1 month ago

andreaskaris commented 2 years ago

I hit this regularly in my PR https://github.com/ovn-org/ovn-kubernetes/pull/2582, but I also see this here in an unrelated (yet merged) PR: https://github.com/ovn-org/ovn-kubernetes/runs/3885062356

From the above log:

  /go/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/test/e2e/framework/framework.go:640
STEP: Creating a pod to test substitution in container's args
Oct 13 18:36:52.691: INFO: Waiting up to 5m0s for pod "var-expansion-0ca1ecfc-0d38-4a24-89f6-4d34d65a99fb" in namespace "var-expansion-4389" to be "Succeeded or Failed"
Oct 13 18:36:52.696: INFO: Pod "var-expansion-0ca1ecfc-0d38-4a24-89f6-4d34d65a99fb": Phase="Pending", Reason="", readiness=false. Elapsed: 5.314034ms
Oct 13 18:36:54.699: INFO: Pod "var-expansion-0ca1ecfc-0d38-4a24-89f6-4d34d65a99fb": Phase="Pending", Reason="", readiness=false. Elapsed: 2.007981296s
Oct 13 18:36:56.702: INFO: Pod "var-expansion-0ca1ecfc-0d38-4a24-89f6-4d34d65a99fb": Phase="Pending", Reason="", readiness=false. Elapsed: 4.011122853s
Oct 13 18:36:58.706: INFO: Pod "var-expansion-0ca1ecfc-0d38-4a24-89f6-4d34d65a99fb": Phase="Pending", Reason="", readiness=false. Elapsed: 6.015325597s
Oct 13 18:37:00.712: INFO: Pod "var-expansion-0ca1ecfc-0d38-4a24-89f6-4d34d65a99fb": Phase="Pending", Reason="", readiness=false. Elapsed: 8.020988409s
Oct 13 18:37:02.721: INFO: Pod "var-expansion-0ca1ecfc-0d38-4a24-89f6-4d34d65a99fb": Phase="Pending", Reason="", readiness=false. Elapsed: 10.030537245s
Oct 13 18:37:04.726: INFO: Pod "var-expansion-0ca1ecfc-0d38-4a24-89f6-4d34d65a99fb": Phase="Pending", Reason="", readiness=false. Elapsed: 12.035047445s
Oct 13 18:37:06.729: INFO: Pod "var-expansion-0ca1ecfc-0d38-4a24-89f6-4d34d65a99fb": Phase="Pending", Reason="", readiness=false. Elapsed: 14.038036363s
Oct 13 18:37:08.736: INFO: Pod "var-expansion-0ca1ecfc-0d38-4a24-89f6-4d34d65a99fb": Phase="Pending", Reason="", readiness=false. Elapsed: 16.04512193s
Oct 13 18:37:10.740: INFO: Pod "var-expansion-0ca1ecfc-0d38-4a24-89f6-4d34d65a99fb": Phase="Pending", Reason="", readiness=false. Elapsed: 18.049340233s
Oct 13 18:37:12.744: INFO: Pod "var-expansion-0ca1ecfc-0d38-4a24-89f6-4d34d65a99fb": Phase="Pending", Reason="", readiness=false. Elapsed: 20.052952944s
Oct 13 18:37:14.747: INFO: Pod "var-expansion-0ca1ecfc-0d38-4a24-89f6-4d34d65a99fb": Phase="Pending", Reason="", readiness=false. Elapsed: 22.05613016s
Oct 13 18:37:16.750: INFO: Pod "var-expansion-0ca1ecfc-0d38-4a24-89f6-4d34d65a99fb": Phase="Pending", Reason="", readiness=false. Elapsed: 24.059297676s
Oct 13 18:37:18.755: INFO: Pod "var-expansion-0ca1ecfc-0d38-4a24-89f6-4d34d65a99fb": Phase="Pending", Reason="", readiness=false. Elapsed: 26.064347066s
Oct 13 18:37:20.760: INFO: Pod "var-expansion-0ca1ecfc-0d38-4a24-89f6-4d34d65a99fb": Phase="Pending", Reason="", readiness=false. Elapsed: 28.068999662s
Oct 13 18:37:22.763: INFO: Pod "var-expansion-0ca1ecfc-0d38-4a24-89f6-4d34d65a99fb": Phase="Pending", Reason="", readiness=false. Elapsed: 30.072502472s
Oct 13 18:37:24.766: INFO: Pod "var-expansion-0ca1ecfc-0d38-4a24-89f6-4d34d65a99fb": Phase="Pending", Reason="", readiness=false. Elapsed: 32.075033496s
Oct 13 18:37:26.769: INFO: Pod "var-expansion-0ca1ecfc-0d38-4a24-89f6-4d34d65a99fb": Phase="Pending", Reason="", readiness=false. Elapsed: 34.078737508s
Oct 13 18:37:28.774: INFO: Pod "var-expansion-0ca1ecfc-0d38-4a24-89f6-4d34d65a99fb": Phase="Pending", Reason="", readiness=false. Elapsed: 36.083354908s
Oct 13 18:37:30.778: INFO: Pod "var-expansion-0ca1ecfc-0d38-4a24-89f6-4d34d65a99fb": Phase="Pending", Reason="", readiness=false. Elapsed: 38.087565214s
Oct 13 18:37:32.782: INFO: Pod "var-expansion-0ca1ecfc-0d38-4a24-89f6-4d34d65a99fb": Phase="Pending", Reason="", readiness=false. Elapsed: 40.09099003s
Oct 13 18:37:34.786: INFO: Pod "var-expansion-0ca1ecfc-0d38-4a24-89f6-4d34d65a99fb": Phase="Pending", Reason="", readiness=false. Elapsed: 42.094997242s
Oct 13 18:37:36.790: INFO: Pod "var-expansion-0ca1ecfc-0d38-4a24-89f6-4d34d65a99fb": Phase="Pending", Reason="", readiness=false. Elapsed: 44.098847155s
Oct 13 18:37:38.794: INFO: Pod "var-expansion-0ca1ecfc-0d38-4a24-89f6-4d34d65a99fb": Phase="Pending", Reason="", readiness=false. Elapsed: 46.103108463s
Oct 13 18:37:40.798: INFO: Pod "var-expansion-0ca1ecfc-0d38-4a24-89f6-4d34d65a99fb": Phase="Pending", Reason="", readiness=false. Elapsed: 48.107289774s
Oct 13 18:37:42.803: INFO: Pod "var-expansion-0ca1ecfc-0d38-4a24-89f6-4d34d65a99fb": Phase="Pending", Reason="", readiness=false. Elapsed: 50.111869782s
Oct 13 18:37:44.807: INFO: Pod "var-expansion-0ca1ecfc-0d38-4a24-89f6-4d34d65a99fb": Phase="Pending", Reason="", readiness=false. Elapsed: 52.115786198s
Oct 13 18:37:46.810: INFO: Pod "var-expansion-0ca1ecfc-0d38-4a24-89f6-4d34d65a99fb": Phase="Pending", Reason="", readiness=false. Elapsed: 54.119049523s
Oct 13 18:37:48.813: INFO: Pod "var-expansion-0ca1ecfc-0d38-4a24-89f6-4d34d65a99fb": Phase="Pending", Reason="", readiness=false. Elapsed: 56.122608845s
Oct 13 18:37:50.821: INFO: Pod "var-expansion-0ca1ecfc-0d38-4a24-89f6-4d34d65a99fb": Phase="Pending", Reason="", readiness=false. Elapsed: 58.130758111s
Oct 13 18:37:52.825: INFO: Pod "var-expansion-0ca1ecfc-0d38-4a24-89f6-4d34d65a99fb": Phase="Pending", Reason="", readiness=false. Elapsed: 1m0.134578731s
Oct 13 18:37:54.829: INFO: Pod "var-expansion-0ca1ecfc-0d38-4a24-89f6-4d34d65a99fb": Phase="Pending", Reason="", readiness=false. Elapsed: 1m2.138472751s
Oct 13 18:37:56.836: INFO: Pod "var-expansion-0ca1ecfc-0d38-4a24-89f6-4d34d65a99fb": Phase="Pending", Reason="", readiness=false. Elapsed: 1m4.144843341s
Oct 13 18:37:58.840: INFO: Pod "var-expansion-0ca1ecfc-0d38-4a24-89f6-4d34d65a99fb": Phase="Pending", Reason="", readiness=false. Elapsed: 1m6.14906176s
Oct 13 18:38:00.843: INFO: Pod "var-expansion-0ca1ecfc-0d38-4a24-89f6-4d34d65a99fb": Phase="Pending", Reason="", readiness=false. Elapsed: 1m8.152608387s
Oct 13 18:38:02.849: INFO: Pod "var-expansion-0ca1ecfc-0d38-4a24-89f6-4d34d65a99fb": Phase="Pending", Reason="", readiness=false. Elapsed: 1m10.157929292s
Oct 13 18:38:04.852: INFO: Pod "var-expansion-0ca1ecfc-0d38-4a24-89f6-4d34d65a99fb": Phase="Pending", Reason="", readiness=false. Elapsed: 1m12.161648517s
Oct 13 18:38:06.856: INFO: Pod "var-expansion-0ca1ecfc-0d38-4a24-89f6-4d34d65a99fb": Phase="Pending", Reason="", readiness=false. Elapsed: 1m14.16566934s
Oct 13 18:38:08.860: INFO: Pod "var-expansion-0ca1ecfc-0d38-4a24-89f6-4d34d65a99fb": Phase="Pending", Reason="", readiness=false. Elapsed: 1m16.169231467s
Oct 13 18:38:10.864: INFO: Pod "var-expansion-0ca1ecfc-0d38-4a24-89f6-4d34d65a99fb": Phase="Pending", Reason="", readiness=false. Elapsed: 1m18.172932594s
Oct 13 18:38:12.868: INFO: Pod "var-expansion-0ca1ecfc-0d38-4a24-89f6-4d34d65a99fb": Phase="Pending", Reason="", readiness=false. Elapsed: 1m20.177050316s
Oct 13 18:38:14.872: INFO: Pod "var-expansion-0ca1ecfc-0d38-4a24-89f6-4d34d65a99fb": Phase="Pending", Reason="", readiness=false. Elapsed: 1m22.181056441s
Oct 13 18:41:38.027: INFO: Pod "var-expansion-0ca1ecfc-0d38-4a24-89f6-4d34d65a99fb": Phase="Pending", Reason="", readiness=false. Elapsed: 4m45.336716154s
Oct 13 18:41:40.032: INFO: Pod "var-expansion-0ca1ecfc-0d38-4a24-89f6-4d34d65a99fb": Phase="Pending", Reason="", readiness=false. Elapsed: 4m47.340945614s
Oct 13 18:41:42.037: INFO: Pod "var-expansion-0ca1ecfc-0d38-4a24-89f6-4d34d65a99fb": Phase="Pending", Reason="", readiness=false. Elapsed: 4m49.346358376s
Oct 13 18:41:44.041: INFO: Pod "var-expansion-0ca1ecfc-0d38-4a24-89f6-4d34d65a99fb": Phase="Pending", Reason="", readiness=false. Elapsed: 4m51.35065403s
Oct 13 18:41:46.046: INFO: Pod "var-expansion-0ca1ecfc-0d38-4a24-89f6-4d34d65a99fb": Phase="Pending", Reason="", readiness=false. Elapsed: 4m53.355079084s
Oct 13 18:41:48.050: INFO: Pod "var-expansion-0ca1ecfc-0d38-4a24-89f6-4d34d65a99fb": Phase="Pending", Reason="", readiness=false. Elapsed: 4m55.358804145s
Oct 13 18:41:50.054: INFO: Pod "var-expansion-0ca1ecfc-0d38-4a24-89f6-4d34d65a99fb": Phase="Pending", Reason="", readiness=false. Elapsed: 4m57.363065782s
Oct 13 18:41:52.058: INFO: Pod "var-expansion-0ca1ecfc-0d38-4a24-89f6-4d34d65a99fb": Phase="Pending", Reason="", readiness=false. Elapsed: 4m59.366881096s
Oct 13 18:41:54.154: INFO: Failed to get logs from node "ovn-worker2" pod "var-expansion-0ca1ecfc-0d38-4a24-89f6-4d34d65a99fb" container "dapi-container": the server rejected our request for an unknown reason (get pods var-expansion-0ca1ecfc-0d38-4a24-89f6-4d34d65a99fb)
STEP: delete the pod

I could also reproduce this locally. Under certain circumstances, one will see the following in kubelet logs:

Oct 19 04:55:07 ovn-worker2 kubelet[480]: E1019 04:55:07.065350     480 kuberuntime_sandbox.go:70] CreatePodSandbox for pod "fedora-deployment-7575f87ff9-zsxn4_default(303b7b8f-b06e-4665-8222-7c6d03
5f8657)" failed: rpc error: code = Unknown desc = failed to setup network for sandbox "ad79dd3cae2322a8f0d50bbfba9a3ea0cd0f3d9f3d42eac4b575768e1cc15292": CNI request failed with status 400: '[defaul
t/fedora-deployment-7575f87ff9-zsxn4 ad79dd3cae2322a8f0d50bbfba9a3ea0cd0f3d9f3d42eac4b575768e1cc15292] [default/fedora-deployment-7575f87ff9-zsxn4 ad79dd3cae2322a8f0d50bbfba9a3ea0cd0f3d9f3d42eac4b57
5768e1cc15292] failed to get pod annotation: timed out waiting for annotations: context deadline exceeded
Oct 19 04:55:07 ovn-worker2 kubelet[480]: '
Oct 19 04:55:07 ovn-worker2 kubelet[480]: E1019 04:55:07.065362     480 kuberuntime_manager.go:755] createPodSandbox for pod "fedora-deployment-7575f87ff9-zsxn4_default(303b7b8f-b06e-4665-8222-7c6d0
35f8657)" failed: rpc error: code = Unknown desc = failed to setup network for sandbox "ad79dd3cae2322a8f0d50bbfba9a3ea0cd0f3d9f3d42eac4b575768e1cc15292": CNI request failed with status 400: '[defau
lt/fedora-deployment-7575f87ff9-zsxn4 ad79dd3cae2322a8f0d50bbfba9a3ea0cd0f3d9f3d42eac4b575768e1cc15292] [default/fedora-deployment-7575f87ff9-zsxn4 ad79dd3cae2322a8f0d50bbfba9a3ea0cd0f3d9f3d42eac4b5
75768e1cc15292] failed to get pod annotation: timed out waiting for annotations: context deadline exceeded
Oct 19 04:55:07 ovn-worker2 kubelet[480]: '
Oct 19 04:55:07 ovn-worker2 kubelet[480]: E1019 04:55:07.065412     480 pod_workers.go:191] Error syncing pod 303b7b8f-b06e-4665-8222-7c6d035f8657 ("fedora-deployment-7575f87ff9-zsxn4_default(303b7b
8f-b06e-4665-8222-7c6d035f8657)"), skipping: failed to "CreatePodSandbox" for "fedora-deployment-7575f87ff9-zsxn4_default(303b7b8f-b06e-4665-8222-7c6d035f8657)" with CreatePodSandboxError: "CreatePo
dSandbox for pod \"fedora-deployment-7575f87ff9-zsxn4_default(303b7b8f-b06e-4665-8222-7c6d035f8657)\" failed: rpc error: code = Unknown desc = failed to setup network for sandbox \"ad79dd3cae2322a8f
0d50bbfba9a3ea0cd0f3d9f3d42eac4b575768e1cc15292\": CNI request failed with status 400: '[default/fedora-deployment-7575f87ff9-zsxn4 ad79dd3cae2322a8f0d50bbfba9a3ea0cd0f3d9f3d42eac4b575768e1cc15292] 
[default/fedora-deployment-7575f87ff9-zsxn4 ad79dd3cae2322a8f0d50bbfba9a3ea0cd0f3d9f3d42eac4b575768e1cc15292] failed to get pod annotation: timed out waiting for annotations: context deadline exceed
ed\n'"
Oct 19 04:55:07 ovn-worker2 kubelet[480]: I1019 04:55:07.065796     480 event.go:291] "Event occurred" object="default/fedora-deployment-7575f87ff9-zsxn4" kind="Pod" apiVersion="v1" type="Warning" r
eason="FailedCreatePodSandBox" message="Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox \"ad79dd3cae2322a8f0d50bbfba9a3ea0cd0f3d9f3d42eac4b575768e1
cc15292\": CNI request failed with status 400: '[default/fedora-deployment-7575f87ff9-zsxn4 ad79dd3cae2322a8f0d50bbfba9a3ea0cd0f3d9f3d42eac4b575768e1cc15292] [default/fedora-deployment-7575f87ff9-zs
xn4 ad79dd3cae2322a8f0d50bbfba9a3ea0cd0f3d9f3d42eac4b575768e1cc15292] failed to get pod annotation: timed out waiting for annotations: context deadline exceeded\n'"
Oct 19 04:55:07 ovn-worker2 kubelet[480]: I1019 04:55:07.585888     480 kubelet.go:1968] SyncLoop (housekeeping)
--More--

And this leads to a timeout of shard tests.

It times out here: go-controller/pkg/cni/cni.go:

func (pr *PodRequest) cmdAdd(kubeAuth *KubeAPIAuth, podLister corev1listers.PodLister, useOVSExternalIDs bool, kclient kubernetes.Interface) (*Response, error) {
        namespace := pr.PodNamespace
        podName := pr.PodName
        if namespace == "" || podName == "" {
                return nil, fmt.Errorf("required CNI variable missing")
        }

        kubecli := &kube.Kube{KClient: kclient}
        annotCondFn := isOvnReady

        if pr.IsSmartNIC {
                // Add Smart-NIC connection-details annotation so ovnkube-node running on smart-NIC
                // performs the needed network plumbing.
                if err := pr.addSmartNICConnectionDetailsAnnot(kubecli); err != nil {
                        return nil, err
                }
                annotCondFn = isSmartNICReady
        }
        // Get the IP address and MAC address of the pod
        // for Smart-Nic, ensure connection-details is present
        podUID, annotations, err := GetPodAnnotations(pr.ctx, podLister, kclient, namespace, podName, annotCondFn)
        if err != nil {
                return nil, fmt.Errorf("failed to get pod annotation: %v", err)
        }
        if err := pr.checkOrUpdatePodUID(podUID); err != nil {
                return nil, err
        }

        podInterfaceInfo, err := PodAnnotation2PodInfo(annotations, useOVSExternalIDs, pr.IsSmartNIC, pr.PodUID)
        if err != nil {
                return nil, err
        }
andreaskaris commented 2 years ago

I think https://github.com/ovn-org/ovn-kubernetes/commit/226a49b2a7b3a23e6e17f19ad4eeceeb256e747a might have improved this a lot.

jluhrsen commented 2 years ago

@andreaskaris , I was starting to look at what I think is this same problem last week. jira is here: https://issues.redhat.com/browse/SDN-2323

See my comment there where I noticed some job looked like it just hung for 45m before getting cancelled.

andreaskaris commented 2 years ago

The time out on pod creation can be observed in a lot of tests.

It's here, too: https://github.com/ovn-org/ovn-kubernetes/runs/3944199673

@jluhrsen I don't know if those are related. For the ones that I'm talking about, pod creation times out for the majority of tests of a specific test run. It happens only for a few tests though. I could reproduce this only once in my local setup, but since my laptop went into standby, I lost that cluster. From my observation last night, it seemed to be related to the CNI not being able to talk to the cluster, but without a reproducer, it's more speculation than anything.

github-actions[bot] commented 1 month ago

This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 5 days.

github-actions[bot] commented 1 month ago

This issue was closed because it has been stalled for 5 days with no activity.