The logs show a failure to send. The failure is caused by a failing job (name: `kn-event-sender-kb2484dn896ccnmd`). The job had 2 pods: `kn-event-sender-kb2484dn896ccnmd-9vph7` and `kn-event-sender-kb2484dn896ccnmd-z59d2`.
The pods' logs are here: https://cloudlogging.app.goo.gl/a53EbCN396m36wf69
The first pod, `kn-event-sender-kb2484dn896ccnmd-9vph7`, failed with an exit code of 205. The log is:

```
{"env":[…], "level":"error", "msg":"{error 26 0 can't send with ICS sender: can't sent event: Post "http://sink-kfejwrqx-ksvc.test-effperxd.svc.cluster.local": dial tcp: lookup sink-kfejwrqx-ksvc.test-effperxd.svc.cluster.local on 10.104.0.10:53: no such host}", "ts":1.6686229840400817E9}
```
The second pod, `kn-event-sender-kb2484dn896ccnmd-z59d2`, was reported as not ready, but its logs indicate that the event was successfully sent.
```yaml
- lastTransitionTime: "2022-11-16T18:23:06Z"
  message: 'containers with unready status: [kn-event-sender]'
  reason: ContainersNotReady
  status: "False"
  type: Ready
```
Pod log:

```
{"ce-id":"test-event-rznrgmyf", "env":[…], "level":"info", "msg":"Event sent", "ts":1.6686229873890555E9}
```
From the above, the conclusion is that the flakiness is a production fault rather than a test-related one. Somehow, the sender job sends the event before Kubernetes reports the pods as ready.
There are still open questions.
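One possible mitigation, sketched here only as an assumption and not as what the suite currently does, would be to assert on Job completion rather than Pod readiness, since a short-lived container can exit before readiness is ever reported (the job name is from the logs above, the timeout is arbitrary):

```sh
# Sketch: wait for the sender Job's Complete condition instead of the Pod's
# Ready condition; the container may finish before readiness is reported.
kubectl wait job/kn-event-sender-kb2484dn896ccnmd \
  --for=condition=Complete --timeout=120s
```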
I made some progress. After adjusting my GKE cluster creation command to match our Knative one:

- `gcloud beta` instead of `gcloud`
- `--addons=NodeLocalDNS`

I reproduced the exact error on such a cluster at the first attempt.
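Roughly, the cluster creation looked like this (a sketch only; the cluster name, zone, and the other addons are placeholders, the point being the `beta` track and `NodeLocalDNS`):

```sh
# Sketch of the reproduction cluster: `beta` release track plus the
# NodeLocalDNS addon; name, zone, and default addons are placeholders.
gcloud beta container clusters create repro-cluster \
  --zone=us-central1-a \
  --addons=HttpLoadBalancing,HorizontalPodAutoscaling,NodeLocalDNS
```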
I bet that's the `NodeLocalDNS` addon's fault. Its job is to cache DNS queries, so it delays the propagation of DNS changes, and that's why I see the `no such host` error on the pod. On my previous clusters, without that addon, I couldn't replicate this.
I think we could try using the regular `gcloud` command instead of the `beta` variant, and without the `NodeLocalDNS` addon. With such a setup, I have never seen any DNS errors.
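The proposed setup would look roughly like this (again a sketch with placeholder name and zone):

```sh
# Sketch of the proposed setup: stable `gcloud` track, no NodeLocalDNS addon.
gcloud container clusters create repro-cluster \
  --zone=us-central1-a
```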
On https://github.com/knative-sandbox/kn-plugin-event/pull/249 I'm testing the integration of knative/hack#257 and https://github.com/knative/test-infra/pull/3623.
It looks stable: 10 out of 10 sequentially executed runs pass, with no DNS errors and no ContainerCreating errors (https://github.com/kubernetes/kubernetes/issues/98661).
I think the ContainerCreating issue may still lurk within Kubernetes, but without the initial DNS errors it stays hidden now. I'm guessing it occurs only after a previous pod has failed (a hunch).
From https://cloud.google.com/kubernetes-engine/docs/how-to/nodelocal-dns-cache:
DNS records are cached for the following periods:
- The record's time to live (TTL), or 30 seconds if the TTL is more than 30 seconds.
- 5 seconds if the DNS response is NXDOMAIN.
That 5-second cache in the NXDOMAIN case could lead to the failures I've seen: kube-dns already has the new domain, but the cache still responds with `no such host`.
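To illustrate the timing: a lookup retried for longer than that 5-second negative-cache window should eventually succeed even behind NodeLocalDNS. A hypothetical in-cluster check (not part of the tests; the host name is the one from the logs above):

```sh
# Hypothetical in-cluster check: retry the lookup for up to ~10s, longer than
# the 5s NXDOMAIN cache, so a freshly created Service name should eventually
# resolve even through NodeLocalDNS.
for i in $(seq 1 10); do
  if nslookup sink-kfejwrqx-ksvc.test-effperxd.svc.cluster.local >/dev/null 2>&1; then
    echo "resolved after ~${i}s"
    break
  fi
  sleep 1
done
```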
It turns out NodeLocalDNS wasn't the culprit after all... I did 10 runs in a row using production GKE with NodeLocalDNS enabled, and saw no errors. Same on https://github.com/knative-sandbox/kn-plugin-event/pull/250: already 4 runs in a row.
https://github.com/knative/hack/pull/258 will revert https://github.com/knative/hack/pull/257 (without the kubetest2 convenience install).
The sender job sends the event successfully before Kubernetes reports the pods as ready.
~The `TestInClusterSender/KnativeService/Assert/[Alpha/MUST]Event_send` test has become flaky recently.~ Failures can be seen at: https://testgrid.k8s.io/r/knative-own-testgrid/kn-plugin-event#continuous
Also, here: https://prow.knative.dev/pr-history/?org=knative-sandbox&repo=kn-plugin-event&pr=241
/kind bug
/triage accepted