Open alenkacz opened 4 years ago
I've had a first look at this, and it doesn't look easy. For some reason, k8s does not seem to time out the second job in the e2e test.
The failing test has two parts:
The first part works, but the second part never reaches the timeout. The event log does not show the expected "Deadline Exceeded" event that would indicate that the job actually timed out, so I'm not sure where the issue is. It seems that KUDO correctly detects the timeout when it happens, but k8s does not trigger the timeout.
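For reference, here is a rough sketch of the kind of check that should start passing once k8s enforces the deadline. It is not KUDO's actual health code: the `deadline_exceeded` helper is a made-up name, and it assumes the Job is available as a plain dict (for example parsed from `kubectl get job timeout-job -o json`). It relies on Kubernetes marking a Job that runs past `activeDeadlineSeconds` with a `Failed` condition whose reason is `DeadlineExceeded`.

```python
def deadline_exceeded(job: dict) -> bool:
    """Return True if the Job carries a Failed condition with reason DeadlineExceeded.

    `job` is assumed to be the JSON form of a batch/v1 Job, already parsed
    into a dict. This helper is hypothetical and only illustrates the check.
    """
    conditions = job.get("status", {}).get("conditions") or []
    return any(
        c.get("type") == "Failed"
        and c.get("status") == "True"
        and c.get("reason") == "DeadlineExceeded"
        for c in conditions
    )


# A Job that is still running has no such condition yet.
running_job = {"status": {"active": 1}}

# A Job that k8s has timed out (condition fields written by hand for illustration).
timed_out_job = {
    "status": {
        "conditions": [
            {"type": "Failed", "status": "True", "reason": "DeadlineExceeded"}
        ]
    }
}

print(deadline_exceeded(running_job))    # False
print(deadline_exceeded(timed_out_job))  # True
```

In the failing run the second job would stay in the first state for the whole test, which matches the missing event.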
In addition to the above analysis: I've found this suspicious line in the etcd log:
2020-09-23T12:25:12.136185294Z stderr F 2020-09-23 12:25:12.136028 W | etcdserver: read-only range request "key:\"/registry/jobs/kudo-test-smiling-eft/timeout-job\" " with result "range_response_count:1 size:3361" took too long (215.5448ms) to execute
It seems that the request takes too long to return. I've also counted how many health requests we do for the above job resulting in:
2020-09-23T12:24:50.115387918Z stderr F 2020/09/23 12:24:50 HealthUtil: job "timeout-job" still running or failed
Within the 5 minutes of the test, we made 1196 requests, so ~4 req/s. That's not that much, but still somewhat excessive. Otherwise, I don't think this is a KUDO issue; it looks more like a kind/docker flake.
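The count and the rate can be reproduced with a few lines of Python. This is only a sketch of how the numbers were derived, not a tool from the repo: `count_health_checks` and `request_rate` are hypothetical helper names, and the sample list contains the one HealthUtil line quoted above plus a placeholder line that should not match.

```python
def count_health_checks(log_lines, job_name):
    """Count HealthUtil log lines that mention the given job."""
    marker = f'HealthUtil: job "{job_name}"'
    return sum(1 for line in log_lines if marker in line)


def request_rate(total_requests, window_seconds):
    """Average number of requests per second over the window."""
    if window_seconds <= 0:
        raise ValueError("window_seconds must be positive")
    return total_requests / window_seconds


sample = [
    '2020-09-23T12:24:50.115387918Z stderr F 2020/09/23 12:24:50 '
    'HealthUtil: job "timeout-job" still running or failed',
    "placeholder: unrelated log line",
]
print(count_health_checks(sample, "timeout-job"))  # 1

# 1196 health requests over the 5-minute test window.
rate = request_rate(1196, 5 * 60)
print(f"{rate:.2f} req/s")  # 3.99 req/s
```

At that rate the job is being polled roughly every 250 ms.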
I don't think that issue fixed it :)
I think the point is that there isn't much we could do from the previous logs, and we reopen once the issue resurfaces with the improved logging.
alright, I would keep it open because people tend to ignore failures in PRs and I want to be reminded via issue but let's close it then :)
Well, I would have kept it open as well :D The close was not intentional, I hoped I could find the bug and fix it in the linked PR, but it seems to happen only rarely.
I don't mind either way - I'm pretty sure we'll find the issue again if the test flakes again :)
Even with the additional logging output, there's no visible reason why the timeout does not occur. We should probably leave this open, but I don't think there's really anything we can do at the moment.
What happened: Failed test kudo/harness/terminal-failed-job on a PR introducing a KEP (no code)
What you expected to happen: no failure
How to reproduce it (as minimally and precisely as possible): run e2e tests
Anything else we need to know?: https://app.circleci.com/pipelines/github/kudobuilder/kudo/5320/workflows/2781ca5b-9107-4e7c-b2c2-c6ccd121b615/jobs/15615/steps
Environment:
kubectl version:
kubectl kudo version:
uname -a: