kudobuilder / kudo

Kubernetes Universal Declarative Operator (KUDO)
https://kudo.dev
Apache License 2.0

Flaky e2e test - terminal-failed-job #1691

Open · alenkacz opened 4 years ago

alenkacz commented 4 years ago

What happened: Failed test kudo/harness/terminal-failed-job on a PR introducing a KEP (no code)

What you expected to happen: no failure

How to reproduce it (as minimally and precisely as possible): run e2e tests

Anything else we need to know?: https://app.circleci.com/pipelines/github/kudobuilder/kudo/5320/workflows/2781ca5b-9107-4e7c-b2c2-c6ccd121b615/jobs/15615/steps

Environment:

ANeumann82 commented 4 years ago

I've had a first look at this, and it doesn't look easy. It seems that for some reason k8s does not time out the second job in the e2e test.

The failing test has two parts:

The first part works; the second part never reaches the timeout. The event log does not show the expected "DeadlineExceeded" event that would indicate the job actually timed out, so I'm not sure where the issue is. It seems that KUDO would detect the timeout correctly if it happened, but k8s never triggers it.
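
For context, the timeout the test waits for comes from the Job's activeDeadlineSeconds: once that deadline passes, k8s is supposed to mark the Job as failed with reason DeadlineExceeded and emit the corresponding event. A minimal Go sketch of waiting for that condition, roughly the way a harness could do it (this is not KUDO's actual health-check code; the namespace kudo-test-smiling-eft and the poll interval are assumptions taken from the logs further down):

package main

import (
	"context"
	"fmt"
	"time"

	batchv1 "k8s.io/api/batch/v1"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

// jobDeadlineExceeded returns true once k8s has marked the Job as failed
// because its activeDeadlineSeconds was exceeded.
func jobDeadlineExceeded(job *batchv1.Job) bool {
	for _, c := range job.Status.Conditions {
		if c.Type == batchv1.JobFailed && c.Status == corev1.ConditionTrue && c.Reason == "DeadlineExceeded" {
			return true
		}
	}
	return false
}

func main() {
	// Assumes a recent client-go and a kubeconfig in the default location.
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	cs := kubernetes.NewForConfigOrDie(cfg)

	// Poll every 250ms for up to 5 minutes, roughly the cadence and window
	// mentioned in the comments below.
	err = wait.PollImmediate(250*time.Millisecond, 5*time.Minute, func() (bool, error) {
		job, err := cs.BatchV1().Jobs("kudo-test-smiling-eft").Get(context.TODO(), "timeout-job", metav1.GetOptions{})
		if err != nil {
			return false, err // abort on API errors
		}
		return jobDeadlineExceeded(job), nil
	})
	fmt.Println("DeadlineExceeded observed:", err == nil)
}

If k8s never sets that condition, a loop like this just spins until its own timeout, which matches what the flake looks like from the outside.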

zen-dog commented 4 years ago

In addition to the above analysis: I've found this suspicious line in the etcd log:

2020-09-23T12:25:12.136185294Z stderr F 2020-09-23 12:25:12.136028 W | 
etcdserver: read-only range request "key:\"/registry/jobs/kudo-test-smiling-eft/timeout-job\" " with result 
"range_response_count:1 size:3361" took too long (215.5448ms) to execute

It seems that the request takes too long to return. I've also counted how many health requests we make for the above job; each one logs a line like:

2020-09-23T12:24:50.115387918Z stderr F 2020/09/23 12:24:50 HealthUtil: job "timeout-job" still running or failed

Within the 5 minutes of the test, we made 1196 requests, i.e. ~4 req/s. That's not that much, but still somewhat excessive. Otherwise, I don't think this is a KUDO issue; it looks more like a kind/docker flake.
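
As a sanity check, the ~4 req/s figure is just the count of HealthUtil lines divided by the 5-minute window; a quick Go sketch of that counting (the log file name is hypothetical, and the line pattern is taken from the excerpt above):

package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

func main() {
	f, err := os.Open("kudo-controller.log") // hypothetical captured log
	if err != nil {
		panic(err)
	}
	defer f.Close()

	count := 0
	scanner := bufio.NewScanner(f)
	for scanner.Scan() {
		// Count the health-check lines for the job from the excerpt above.
		if strings.Contains(scanner.Text(), `HealthUtil: job "timeout-job"`) {
			count++
		}
	}
	if err := scanner.Err(); err != nil {
		panic(err)
	}

	const testSeconds = 300.0 // the 5-minute test window
	fmt.Printf("%d health checks, ~%.1f req/s\n", count, float64(count)/testSeconds)
}

With 1196 matches over 300 seconds this works out to roughly 4 requests per second, as noted above.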

alenkacz commented 4 years ago

I don't think that issue fixed it :)

porridge commented 4 years ago

I think the point is that there isn't much we can do from the previous logs, and we can reopen once the issue resurfaces with the improved logging.

alenkacz commented 4 years ago

Alright, I would keep it open because people tend to ignore failures in PRs and I want to be reminded via an issue, but let's close it then :)

ANeumann82 commented 4 years ago

Well, I would have kept it open as well :D The close was not intentional; I hoped I could find the bug and fix it in the linked PR, but it seems to happen only rarely.

I don't mind either way - I'm pretty sure we'll find the issue again if the test flakes again :)

alenkacz commented 3 years ago

Happened again: https://app.circleci.com/pipelines/github/kudobuilder/kudo/5564/workflows/2d82fa0a-716d-4087-9995-3a47781e19f5/jobs/16874/tests

ANeumann82 commented 3 years ago

Even with the additional logging output, there's no clear indication of why the timeout does not occur. We should probably leave this open, but I don't think there's much we can do at the moment.