knative / eventing

Event-driven application platform for Kubernetes
https://knative.dev/docs/eventing
Apache License 2.0

Kind e2e tests sometimes fail with the webhook pod not becoming ready. #4496

Closed vaikas closed 3 years ago

vaikas commented 4 years ago

Describe the bug The eventing webhook sometimes does not become ready. It looks like the specific pod that the wait loop is waiting for gets replaced (maybe because of chaos duck?) by another pod that does become ready.

From one example here: https://github.com/knative/eventing/pull/4492/checks?check_run_id=1381350649

pod/sugar-controller-7f7c8ddfc4-8gbfn condition met
error: timed out waiting for the condition on pods/eventing-webhook-5c8b8865c7-wbzjd
pod/zipkin-8fdcfcddc-d9rbm condition met
Error: Process completed with exit code 1.

Then when the artifacts are dumped, note a different webhook pod comes up:

eventing-controller-64768b7fcc-mpd29    1/1     Running   0          57s
eventing-webhook-5c8b8865c7-x7d7w       1/1     Running   0          57s
imc-controller-6f6b794fd6-m79tv         1/1     Running   0          117s
imc-dispatcher-64d5f8445-vwscb          1/1     Running   0          117s
mt-broker-controller-75475bcbc7-tks6f   1/1     Running   0          57s
mt-broker-filter-6f4c99cddd-xfjzw       1/1     Running   0          118s
mt-broker-ingress-64f6f6cb9f-spccf      1/1     Running   0          118s
sugar-controller-7f7c8ddfc4-8gbfn       1/1     Running   0          57s
zipkin-8fdcfcddc-d9rbm                  1/1     Running   0          104s

Expected behavior Tests should not fail due to test setup failures.

To Reproduce Look at some of these failing tests here: https://github.com/knative/eventing/actions?query=workflow%3A%22KinD+e2e+tests%22

Knative release version head


vaikas commented 4 years ago

Looking...

Here's the step doing the wait:

    - name: Wait for things to be up
      run: |
        kubectl wait pod --for=condition=Ready -n ${SYSTEM_NAMESPACE} -l '!job-name'
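
A note on why this step can be flaky: the debug log later in this thread shows `kubectl wait` watching each pod by name (`fieldSelector=metadata.name%3D...`), so a pod that is replaced after the command starts is still waited on until the timeout. A minimal sketch of one possible alternative, assuming it is acceptable to wait on Deployments instead of individual pods. The function name `wait_for_deployments` and the `300s` default are illustrative assumptions, not something taken from this repository's workflow:

```shell
# Sketch: wait on Deployments rather than on individual pod names.
# wait_for_deployments and the 300s default timeout are hypothetical.
#   $1: namespace to check (e.g. the workflow's SYSTEM_NAMESPACE)
#   $2: optional timeout passed through to kubectl wait
wait_for_deployments() {
  # The Available condition is reported on the Deployment itself,
  # so it does not matter which pod names currently back it.
  kubectl wait deployment --all \
    --for=condition=Available \
    -n "${1}" \
    --timeout="${2:-300s}"
}
```

In the workflow step this would be called as `wait_for_deployments "${SYSTEM_NAMESPACE}"`. It only covers workloads managed by Deployments, so anything else in the namespace would still need its own check.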

pierDipi commented 4 years ago

https://github.com/knative/eventing/pull/4517/checks?check_run_id=1391428727

vaikas commented 4 years ago

https://github.com/knative/eventing/runs/1406746500?check_suite_focus=true

I1116 14:19:57.809327   27973 round_trippers.go:423] curl -k -v -XGET  -H "Accept: application/json" -H "User-Agent: kubectl/v1.19.3 (linux/amd64) kubernetes/1e11e4a" 'https://127.0.0.1:36147/api/v1/namespaces/knative-eventing/pods?fieldSelector=metadata.name%3Deventing-webhook-6bd5798587-4zv5s&resourceVersion=2808&watch=true'
I1116 14:19:57.809985   27973 round_trippers.go:443] GET https://127.0.0.1:36147/api/v1/namespaces/knative-eventing/pods?fieldSelector=metadata.name%3Deventing-webhook-6bd5798587-4zv5s&resourceVersion=2808&watch=true 200 OK in 0 milliseconds
I1116 14:19:57.810003   27973 round_trippers.go:449] Response Headers:
I1116 14:19:57.810008   27973 round_trippers.go:452]     Cache-Control: no-cache, private
I1116 14:19:57.810011   27973 round_trippers.go:452]     Content-Type: application/json
I1116 14:19:57.810014   27973 round_trippers.go:452]     Date: Mon, 16 Nov 2020 14:19:57 GMT
I1116 14:20:27.810546   27973 round_trippers.go:423] curl -k -v -XGET  -H "Accept: application/json" -H "User-Agent: kubectl/v1.19.3 (linux/amd64) kubernetes/1e11e4a" 'https://127.0.0.1:36147/api/v1/namespaces/knative-eventing/pods?fieldSelector=metadata.name%3Deventing-webhook-6bd5798587-k6mr8'
I1116 14:20:27.813242   27973 round_trippers.go:443] GET https://127.0.0.1:36147/api/v1/namespaces/knative-eventing/pods?fieldSelector=metadata.name%3Deventing-webhook-6bd5798587-k6mr8 200 OK in 2 milliseconds
I1116 14:20:27.813259   27973 round_trippers.go:449] Response Headers:
pod/eventing-webhook-6bd5798587-k6mr8 condition met
I1116 14:20:27.813263   27973 round_trippers.go:452]     Cache-Control: no-cache, private
I1116 14:20:27.813267   27973 round_trippers.go:452]     Content-Type: application/json
I1116 14:20:27.813270   27973 round_trippers.go:452]     Date: Mon, 16 Nov 2020 14:20:27 GMT
I1116 14:20:27.813811   27973 request.go:1097] Response Body: {"kind":"PodList","apiVersion":"v1","metadata":{"selfLink":"/api/v1/namespaces/knative-eventing/pods","resourceVersion":"3232"},"items":[{"metadata":{"name":"eventing-webhook-6bd5798587-k6mr8","generateName":"eventing-webhook-6bd5798587-","namespace":"knative-eventing","selfLink":"/api/v1/namespaces/knative-eventing/pods/eventing-webhook-6bd5798587-k6mr8","uid":"015fa6f7-88e9-45e5-95b2-91b2da58ee13","resourceVersion":"2579","creationTimestamp":"2020-11-16T14:19:36Z","labels":{"app":"eventing-webhook","pod-template-hash":"6bd5798587","role":"eventing-webhook"},"ownerReferences":[{"apiVersion":"apps/v1","kind":"ReplicaSet","name":"eventing-webhook-6bd5798587","uid":"59b46e0b-631b-4f3e-8dba-584e35bc6dea","controller":true,"blockOwnerDeletion":true}],"managedFields":[{"manager":"kube-controller-manager","operation":"Update","apiVersion":"v1","time":"2020-11-16T14:19:36Z","fieldsType":"FieldsV1","fieldsV1":{"f:metadata":{"f:generateName":{},"f:labels":{".":{},"f:app":{},"f:pod-template-hash":{},"f:role":{}},"f:ownerReferences":{".":{},"k:{\"uid\":\"59b46e0b-631b-4f3e-8dba-584e35bc6dea\"}":{".":{},"f:apiVersion":{},"f:blockOwnerDeletion":{},"f:controller":{},"f:kind":{},"f:name":{},"f:uid":{}}}},"f:spec":{"f:affinity":{".":{},"f:podAntiAffinity":{".":{},"f:preferredDuringSchedulingIgnoredDuringExecution":{}}},"f:containers":{"k:{\"name\":\"eventing-webhook\"}":{".":{},"f:env":{".":{},"k:{\"name\":\"CONFIG_LOGGING_NAME\"}":{".":{},"f:name":{},"f:value":{}},"k:{\"name\":\"METRICS_DOMAIN\"}":{".":{},"f:name":{},"f:value":{}},"k:{\"name\":\"POD_NAME\"}":{".":{},"f:name":{},"f:valueFrom":{".":{},"f:fieldRef":{".":{},"f:apiVersion":{},"f:fieldPath":{}}}},"k:{\"name\":\"SINK_BINDING_SELECTION_MODE\"}":{".":{},"f:name":{},"f:value":{}},"k:{\"name\":\"SYSTEM_NAMESPACE\"}":{".":{},"f:name":{},"f:valueFrom":{".":{},"f:fieldRef":{".":{},"f:apiVersion":{},"f:fieldPath":{}}}},"k:{\"name\":\"WEBHOOK_NAME\"}":{".":{},"f:name":
{},"f:value":{}},"k:{\"name\":\"WEBHOOK_PORT\"}":{".":{},"f:name":{},"f:value":{}}},"f:image":{},"f:imagePullPolicy":{},"f:livenessProbe":{".":{},"f:failureThreshold":{},"f:httpGet":{".":{},"f:httpHeaders":{},"f:path":{},"f:port":{},"f:scheme":{}},"f:initialDelaySeconds":{},"f:periodSeconds":{},"f:successThreshold":{},"f:timeoutSeconds":{}},"f:name":{},"f:ports":{".":{},"k:{\"containerPort\":8008,\"protocol\":\"TCP\"}":{".":{},"f:containerPort":{},"f:name":{},"f:protocol":{}},"k:{\"containerPort\":8443,\"protocol\":\"TCP\"}":{".":{},"f:containerPort":{},"f:name":{},"f:protocol":{}},"k:{\"containerPort\":9090,\"protocol\":\"TCP\"}":{".":{},"f:containerPort":{},"f:name":{},"f:protocol":{}}},"f:readinessProbe":{".":{},"f:failureThreshold":{},"f:httpGet":{".":{},"f:httpHeaders":{},"f:path":{},"f:port":{},"f:scheme":{}},"f:periodSeconds":{},"f:successThreshold":{},"f:timeoutSeconds":{}},"f:resources":{".":{},"f:limits":{".":{},"f:cpu":{},"f:memory":{}},"f:requests":{".":{},"f:cpu":{},"f:memory":{}}},"f:securityContext":{".":{},"f:allowPrivilegeEscalation":{}},"f:terminationMessagePath":{},"f:terminationMessagePolicy":{}}},"f:dnsPolicy":{},"f:enableServiceLinks":{},"f:restartPolicy":{},"f:schedulerName":{},"f:securityContext":{},"f:serviceAccount":{},"f:serviceAccountName":{},"f:terminationGracePeriodSeconds":{}}}},{"manager":"kubelet","operation":"Update","apiVersion":"v1","time":"2020-11-16T14:19:42Z","fieldsType":"FieldsV1","fieldsV1":{"f:status":{"f:conditions":{"k:{\"type\":\"ContainersReady\"}":{".":{},"f:lastProbeTime":{},"f:lastTransitionTime":{},"f:status":{},"f:type":{}},"k:{\"type\":\"Initialized\"}":{".":{},"f:lastProbeTime":{},"f:lastTransitionTime":{},"f:status":{},"f:type":{}},"k:{\"type\":\"Ready\"}":{".":{},"f:lastProbeTime":{},"f:lastTransitionTime":{},"f:status":{},"f:type":{}}},"f:containerStatuses":{},"f:hostIP":{},"f:phase":{},"f:podIP":{},"f:podIPs":{".":{},"k:{\"ip\":\"10.244.1.19\"}":{".":{},"f:ip":{}}},"f:startTime":{}}}}]},"spec":{"volumes":[{"n
ame":"eventing-webhook-token-bqv7v","secret":{"secretName":"eventing-webhook-token-bqv7v","defaultMode":420}}],"containers":[{"name":"eventing-webhook","image":"kind.local/knative.dev/eventing/cmd/webhook:1af4fd82f9a9ff68e3f5768dda777cabfe0e349429cf8289bdc3f32b533b60a4","ports":[{"name":"https-webhook","containerPort":8443,"protocol":"TCP"},{"name":"metrics","containerPort":9090,"protocol":"TCP"},{"name":"profiling","containerPort":8008,"protocol":"TCP"}],"env":[{"name":"SYSTEM_NAMESPACE","valueFrom":{"fieldRef":{"apiVersion":"v1","fieldPath":"metadata.namespace"}}},{"name":"CONFIG_LOGGING_NAME","value":"config-logging"},{"name":"METRICS_DOMAIN","value":"knative.dev/eventing"},{"name":"WEBHOOK_NAME","value":"eventing-webhook"},{"name":"WEBHOOK_PORT","value":"8443"},{"name":"SINK_BINDING_SELECTION_MODE","value":"exclusion"},{"name":"POD_NAME","valueFrom":{"fieldRef":{"apiVersion":"v1","fieldPath":"metadata.name"}}}],"resources":{"limits":{"cpu":"200m","memory":"200Mi"},"requests":{"cpu":"20m","memory":"20Mi"}},"volumeMounts":[{"name":"eventing-webhook-token-bqv7v","readOnly":true,"mountPath":"/var/run/secrets/kubernetes.io/serviceaccount"}],"livenessProbe":{"httpGet":{"path":"/","port":8443,"scheme":"HTTPS","httpHeaders":[{"name":"k-kubelet-probe","value":"webhook"}]},"initialDelaySeconds":20,"timeoutSeconds":1,"periodSeconds":1,"successThreshold":1,"failureThreshold":3},"readinessProbe":{"httpGet":{"path":"/","port":8443,"scheme":"HTTPS","httpHeaders":[{"name":"k-kubelet-probe","value":"webhook"}]},"timeoutSeconds":1,"periodSeconds":1,"successThreshold":1,"failureThreshold":3},"terminationMessagePath":"/dev/termination-log","terminationMessagePolicy":"FallbackToLogsOnError","imagePullPolicy":"IfNotPresent","securityContext":{"allowPrivilegeEscalation":false}}],"restartPolicy":"Always","terminationGracePeriodSeconds":300,"dnsPolicy":"ClusterFirst","serviceAccountName":"eventing-webhook","serviceAccount":"eventing-webhook","nodeName":"kind-worker","securityContext":{}
,"affinity":{"podAntiAffinity":{"preferredDuringSchedulingIgnoredDuringExecution":[{"weight":100,"podAffinityTerm":{"labelSelector":{"matchLabels":{"app":"eventing-webhook"}},"topologyKey":"kubernetes.io/hostname"}}]}},"schedulerName":"default-scheduler","tolerations":[{"key":"node.kubernetes.io/not-ready","operator":"Exists","effect":"NoExecute","tolerationSeconds":300},{"key":"node.kubernetes.io/unreachable","operator":"Exists","effect":"NoExecute","tolerationSeconds":300}],"priority":0,"enableServiceLinks":true},"status":{"phase":"Running","conditions":[{"type":"Initialized","status":"True","lastProbeTime":null,"lastTransitionTime":"2020-11-16T14:19:37Z"},{"type":"Ready","status":"True","lastProbeTime":null,"lastTransitionTime":"2020-11-16T14:19:42Z"},{"type":"ContainersReady","status":"True","lastProbeTime":null,"lastTransitionTime":"2020-11-16T14:19:42Z"},{"type":"PodScheduled","status":"True","lastProbeTime":null,"lastTransitionTime":"2020-11-16T14:19:36Z"}],"hostIP":"172.18.0.3","podIP":"10.244.1.19","podIPs":[{"ip":"10.244.1.19"}],"startTime":"2020-11-16T14:19:37Z","containerStatuses":[{"name":"eventing-webhook","state":{"running":{"startedAt":"2020-11-16T14:19:41Z"}},"lastState":{},"ready":true,"restartCount":0,"image":"kind.local/knative.dev/eventing/cmd/webhook:1af4fd82f9a9ff68e3f5768dda777cabfe0e349429cf8289bdc3f32b533b60a4","imageID":"sha256:a5bffea29ff5b9b24ad286ce1981725ff772e6981a6e246583282cd96e094715","containerID":"containerd://3111ec724700c26c3191da72d020c217706ad6bea200668f0caac75882561733","started":true}],"qosClass":"Burstable"}}]

Yet the test failed with:

F1116 14:20:27.856761   27973 helpers.go:115] error: timed out waiting for the condition on pods/eventing-webhook-6bd5798587-4zv5s
goroutine 1 [running]:

zhongduo commented 4 years ago

Does this look like https://github.com/knative/eventing/issues/3244?

In knative-gcp we get a crashed webhook, but maybe Knative Eventing restarts it automatically?

vaikas commented 4 years ago

@zhongduo I don't think so because the webhook becomes ready.

zhongduo commented 4 years ago

> @zhongduo I don't think so because the webhook becomes ready.

But as you said, it is already a different pod. So it may well be that we have some logic that detects the crash or unreadiness and restarts the pod, which accidentally solves the problem.
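
If pods can legitimately be replaced during setup, one way to make the wait tolerant of that is to retry it, since each new `kubectl wait` invocation re-evaluates the label selector against the pods that exist at that moment. This is a rough sketch only: `wait_with_retries`, the attempt count, the `60s` per-attempt timeout, and the `RETRY_SLEEP` variable are all hypothetical and not taken from the repository.

```shell
# Sketch: retry the pod wait so a replaced pod does not fail the whole step.
# wait_with_retries, RETRY_SLEEP, and the defaults below are hypothetical.
#   $1: namespace
#   $2: optional maximum number of attempts (default 5)
wait_with_retries() {
  attempts="${2:-5}"
  i=1
  while [ "${i}" -le "${attempts}" ]; do
    # Each invocation lists the pods matching the selector right now,
    # so a pod deleted since the previous attempt is no longer waited on.
    if kubectl wait pod --for=condition=Ready -n "${1}" -l '!job-name' --timeout=60s; then
      return 0
    fi
    echo "wait attempt ${i}/${attempts} failed, retrying" >&2
    i=$((i + 1))
    sleep "${RETRY_SLEEP:-5}"
  done
  return 1
}
```

For example, `wait_with_retries "${SYSTEM_NAMESPACE}" 5` would give up only after five failed attempts. A retry hides the symptom rather than explaining why the pod was replaced, so it is a workaround, not a diagnosis.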

github-actions[bot] commented 3 years ago

This issue is stale because it has been open for 90 days with no activity. It will automatically close after 30 more days of inactivity. Reopen the issue with /reopen. Mark the issue as fresh by adding the comment /remove-lifecycle stale.

vaikas commented 3 years ago

This should've been fixed by: https://github.com/knative/eventing/pull/4741

Let's reopen if it comes back.