The logs show a failure to send. The failure is caused by a failing job (name: `kn-event-sender-kb2484dn896ccnmd`). The job had 2 pods: `kn-event-sender-kb2484dn896ccnmd-9vph7` and `kn-event-sender-kb2484dn896ccnmd-z59d2`.
The pods' logs are here: https://cloudlogging.app.goo.gl/a53EbCN396m36wf69
The first pod, `kn-event-sender-kb2484dn896ccnmd-9vph7`, failed with an exit code of 205. The log is:

```
{"env":[…], "level":"error", "msg":"{error 26 0 can't send with ICS sender: can't sent event: Post "http://sink-kfejwrqx-ksvc.test-effperxd.svc.cluster.local": dial tcp: lookup sink-kfejwrqx-ksvc.test-effperxd.svc.cluster.local on 10.104.0.10:53: no such host}", "ts":1.6686229840400817E9}
```
The second pod, `kn-event-sender-kb2484dn896ccnmd-z59d2`, was reported as not ready, but its logs indicate that the event was successfully sent.
```yaml
- lastTransitionTime: "2022-11-16T18:23:06Z"
  message: 'containers with unready status: [kn-event-sender]'
  reason: ContainersNotReady
  status: "False"
  type: Ready
```
Pod log:

```
{"ce-id":"test-event-rznrgmyf", "env":[…], "level":"info", "msg":"Event sent", "ts":1.6686229873890555E9}
```
From the above, the conclusion is that the flakiness is a production fault rather than a test-related one. Somehow, the sender job sends the event before Kubernetes reports the pods as ready.
There are still open questions.
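One possible mitigation, sketched here only as an assumption and not as what the suite currently does, would be to assert on Job completion rather than Pod readiness, since a short-lived container can exit before readiness is ever reported (the job name is from the logs above, the timeout is arbitrary):

```sh
# Sketch: wait for the sender Job's Complete condition instead of the Pod's
# Ready condition; the container may finish before readiness is reported.
kubectl wait job/kn-event-sender-kb2484dn896ccnmd \
  --for=condition=Complete --timeout=120s
```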
I made some progress. After adjusting my GKE cluster creation command to match our Knative one:

- `gcloud beta` instead of `gcloud`
- `--addons=NodeLocalDNS`

I reproduced the exact error on such a cluster at the first attempt.
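Roughly, the cluster creation looked like this (a sketch only; the cluster name, zone, and the other addons are placeholders, the point being the `beta` track and `NodeLocalDNS`):

```sh
# Sketch of the reproduction cluster: `beta` release track plus the
# NodeLocalDNS addon; name, zone, and default addons are placeholders.
gcloud beta container clusters create repro-cluster \
  --zone=us-central1-a \
  --addons=HttpLoadBalancing,HorizontalPodAutoscaling,NodeLocalDNS
```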
I bet that's the `NodeLocalDNS` addon's fault. Its job is to cache DNS queries, so it delays the propagation of DNS changes, and that's why I see the `no such host` error on the pod. On my previous clusters, without that addon, I couldn't replicate this.
I think we could try using the regular `gcloud` command instead of the `beta` variant, and without the `NodeLocalDNS` addon. With such a setup, I have never seen any DNS errors.
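The proposed setup would look roughly like this (again a sketch with placeholder name and zone):

```sh
# Sketch of the proposed setup: stable `gcloud` track, no NodeLocalDNS addon.
gcloud container clusters create repro-cluster \
  --zone=us-central1-a
```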
On https://github.com/knative-sandbox/kn-plugin-event/pull/249 I'm testing the integration of knative/hack#257 and https://github.com/knative/test-infra/pull/3623.
It looks stable: 10 out of 10 sequentially executed runs pass, with no DNS errors and no ContainerCreating errors (https://github.com/kubernetes/kubernetes/issues/98661).
I think the ContainerCreating issue may still lurk within Kubernetes, but without the initial DNS errors it stays hidden now. I'm guessing it occurs only after a previous pod has failed (a hunch).
From https://cloud.google.com/kubernetes-engine/docs/how-to/nodelocal-dns-cache:
DNS records are cached for the following periods:
- The record's time to live (TTL), or 30 seconds if the TTL is more than 30 seconds.
- 5 seconds if the DNS response is NXDOMAIN.
That 5-second cache in the NXDOMAIN case could lead to the failures I've seen: kube-dns already has the new domain, but the cache still responds with `no such host`.
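To illustrate the timing: a lookup retried for longer than that 5-second negative-cache window should eventually succeed even behind NodeLocalDNS. A hypothetical in-cluster check (not part of the tests; the host name is the one from the logs above):

```sh
# Hypothetical in-cluster check: retry the lookup for up to ~10s, longer than
# the 5s NXDOMAIN cache, so a freshly created Service name should eventually
# resolve even through NodeLocalDNS.
for i in $(seq 1 10); do
  if nslookup sink-kfejwrqx-ksvc.test-effperxd.svc.cluster.local >/dev/null 2>&1; then
    echo "resolved after ~${i}s"
    break
  fi
  sleep 1
done
```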
It turns out NodeLocalDNS wasn't the culprit after all... I did 10 runs in a row using production GKE with NodeLocalDNS enabled, and saw no errors. Same on https://github.com/knative-sandbox/kn-plugin-event/pull/250: already 4 runs in a row.
https://github.com/knative/hack/pull/258 will revert https://github.com/knative/hack/pull/257 (without the kubetest2 convenience install).
The sender job sends the event successfully before Kubernetes reports the pods as ready.
~The `TestInClusterSender/KnativeService/Assert/[Alpha/MUST]Event_send` test has become flaky recently.~ Failures can be seen at: https://testgrid.k8s.io/r/knative-own-testgrid/kn-plugin-event#continuous
Also, here: https://prow.knative.dev/pr-history/?org=knative-sandbox&repo=kn-plugin-event&pr=241
/kind bug
/triage accepted