Flake: API server did not come up successfully

kubernetes-retired / service-catalog

Consume services in Kubernetes using the Open Service Broker API

https://svc-cat.io

Apache License 2.0

Flake: API server did not come up successfully #643

Closed arschles closed 7 years ago

arschles commented 7 years ago

Jenkins build #106, which started from https://github.com/kubernetes-incubator/service-catalog/pull/642, caused a flake. The offending logs seem to be these:

+ error_exit 'API server pod did not come up successfully.'
+ echo '/var/lib/jenkins/workspace/service-catalog-PR-testing2/src/github.com/kubernetes-incubator/service-catalog/contrib/hack/test_walkthrough.sh: line 85: API server pod did not come up successfully. (exit 1)'
/var/lib/jenkins/workspace/service-catalog-PR-testing2/src/github.com/kubernetes-incubator/service-catalog/contrib/hack/test_walkthrough.sh: line 85: API server pod did not come up successfully. (exit 1)

cc/ @kibbles-n-bytes

arschles commented 7 years ago

The same flake appears to be in https://service-catalog-jenkins.appspot.com/job/service-catalog-PR-testing2/142/console, except the controller pod was reported as not up successfully:

+ error_exit 'Controller pod did not come up successfully.'
+ echo '/var/lib/jenkins/workspace/service-catalog-PR-testing2/src/github.com/kubernetes-incubator/service-catalog/contrib/hack/test_walkthrough.sh: line 87: Controller pod did not come up successfully. (exit 1)'
/var/lib/jenkins/workspace/service-catalog-PR-testing2/src/github.com/kubernetes-incubator/service-catalog/contrib/hack/test_walkthrough.sh: line 87: Controller pod did not come up successfully. (exit 1)

MHBauer commented 7 years ago

Is this issue focused on pods, or on the random issues we seem to be having with servers?

I've seen a lot of local integration test failures lately, due to timeouts waiting for servers to come up. Some of the time it looks like a failure to contact a server after what should have been successful uses, meaning it's somehow becoming uncontactable after running for a little while.
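For reference, a "wait for the server to come up, with a timeout" step of the kind described here can be sketched as a polling loop. This is a minimal illustration only, not the code in `test_walkthrough.sh` or in the integration tests; the `wait_for` helper, its arguments, and the `check_server_up` command in the usage comment are all hypothetical names.

```shell
#!/usr/bin/env bash

# Hypothetical helper (not from the repository): run a check command
# repeatedly until it succeeds or the timeout elapses.
#   $1    timeout in seconds
#   $2    pause between attempts in seconds
#   $3..  the check command and its arguments
wait_for() {
  local timeout_secs="$1" interval_secs="$2"
  shift 2
  # SECONDS is a bash builtin counting seconds since the shell started.
  local deadline=$(( SECONDS + timeout_secs ))
  until "$@"; do
    if (( SECONDS >= deadline )); then
      return 1  # the check never succeeded within the timeout
    fi
    sleep "${interval_secs}"
  done
  return 0
}

# Usage sketch (check_server_up is an assumed, user-supplied command):
#   wait_for 120 2 check_server_up || echo "server did not come up" >&2
```

A loop like this only covers the startup case. It would not catch a server that becomes unreachable after it has already answered successfully, which is the second symptom mentioned above.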

arschles commented 7 years ago

@MHBauer I've seen the same flakes in Travis's integration tests. Those do not run our components inside pods afaik, correct?

MHBauer commented 7 years ago

Correct. I have not seen those myself, hence my description. Sounds like there's some general weirdness then.

kibbles-n-bytes commented 7 years ago

Definitely general weirdness. There seem to be three different things being reported in this issue:

1. API server image pull error
2. Controller-manager crash loop
3. Integration test timeout failures

I think some of the Jenkins changes that went in the other day should alleviate the first two. The integration test failures are a separate issue.

I'm going to close this issue for now and make a separate one just for the integration tests timing out. If we see any non-integration-test flakes, then let's make separate issues for them as they come.