openshift / origin

Conformance test suite for OpenShift
http://www.openshift.org
Apache License 2.0

e2e database pod never comes up #14090

Closed stevekuznetsov closed 7 years ago

stevekuznetsov commented 7 years ago
Running test/end-to-end/core.sh:15: executing 'oc get -n test pods -l name=database' expecting any result and text 'Running'; re-trying every 0.2s until completion or 60.000s...
FAILURE after 59.904s: test/end-to-end/core.sh:15: executing 'oc get -n test pods -l name=database' expecting any result and text 'Running'; re-trying every 0.2s until completion or 60.000s: the command timed out
Standard output from the command:
Standard error from the command:
No resources found.
... repeated 131 times

Seen in this job. @bparees can't really make heads or tails of this but we have all the pod logs and the master log in the artifacts so hopefully we can determine what went wrong.
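
For readers unfamiliar with the failure format above, the check amounts to polling a command until its output contains some expected text or a deadline passes. A minimal sketch of that pattern follows. The function name `try_until_text` and its arguments are hypothetical and this is not the actual Origin test library, only an illustration of the retry-until-timeout logic described in the log.

```shell
#!/usr/bin/env bash
# Hypothetical helper (NOT the real Origin test library): poll a command until
# its combined stdout/stderr contains the expected text or the timeout expires.
try_until_text() {
  local cmd="$1" expected="$2" timeout_s="${3:-60}" interval_s="${4:-0.2}"
  local deadline=$(( SECONDS + timeout_s ))
  local output=""
  while (( SECONDS < deadline )); do
    # Keep the last output so a timeout can report what the command printed.
    output="$(eval "${cmd}" 2>&1)" || true
    # -F treats the expected text as a fixed string, not a regular expression.
    if grep -qF -- "${expected}" <<<"${output}"; then
      return 0
    fi
    sleep "${interval_s}"
  done
  echo "FAILURE after ${timeout_s}s: '${cmd}' never printed '${expected}'" >&2
  echo "Last output: ${output}" >&2
  return 1
}
```

Under these assumptions, the failing check would correspond to something like `try_until_text "oc get -n test pods -l name=database" "Running" 60 0.2`. Because stderr is captured too, "No resources found." would show up in the timeout report.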

bparees commented 7 years ago

@stevekuznetsov the etcd dump is pretty much empty. do we have confidence that the etcd dump normally works?

@knobunc can you comment on these errors as seen in the origin.log:

W0506 05:02:49.178089    2247 docker_sandbox.go:263] Couldn't find network status for test/ruby-sample-build-1-build through plugin: invalid network status for

are they benign? anything we (as the build pod owners) should be doing differently to avoid them?

the database deploy log appears to show things stuck running the mid-hook pod (despite the DC in question not having a mid hook; this is the application-template-stibuild.json template), and there are no logs for those hooks.

@mfojtik seems like a deploy issue; based on the oc get pods output, it seems like the DB pod is never even getting created.

stevekuznetsov commented 7 years ago

@deads2k @sttts did we implement etcdv3 dump for these tests? Or are we still just doing v2/keys?recursive=true?
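
To make the difference concrete: etcd keeps v2 and v3 data in separate keyspaces, so a v2 `keys?recursive=true` request does not return keys written through the v3 API. That would be consistent with the near-empty dump mentioned above, though nothing in this thread confirms it. The sketch below shows what the two dumps could look like. The endpoint and the function names are assumptions, and a real cluster would likely also need TLS client certificate flags.

```shell
#!/usr/bin/env bash
# Sketch only: the endpoint and the function names are assumptions.
ETCD_ENDPOINT="${ETCD_ENDPOINT:-http://127.0.0.1:2379}"

# etcd v2 store: one recursive GET on the keys API returns the whole tree as JSON.
dump_etcd_v2() {
  curl --silent --fail "${ETCD_ENDPOINT}/v2/keys/?recursive=true"
}

# etcd v3 store: list keys with etcdctl in v3 mode. An empty key combined
# with --prefix matches every key; --keys-only omits the stored values.
dump_etcd_v3() {
  ETCDCTL_API=3 etcdctl --endpoints="${ETCD_ENDPOINT}" get "" --prefix --keys-only
}
```

A test teardown step could then write both outputs into the artifacts directory, for example `dump_etcd_v3 > etcd_v3_keys.txt` (file name hypothetical), so that either storage backend leaves something to inspect.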

deads2k commented 7 years ago

> @deads2k @sttts did we implement etcdv3 dump for these tests? Or are we still just doing v2/keys?recursive=true?

I don't remember doing it.

stevekuznetsov commented 7 years ago

I'll add it to the post-rebase tasks

deads2k commented 7 years ago

> I'll add it to the post-rebase tasks

Not a great place for it. etcd3 has been around for ages.

stevekuznetsov commented 7 years ago

Wasn't it turned on by default with the rebase?

deads2k commented 7 years ago

> Wasn't it turned on by default with the rebase?

I thought that happened back in 1.4 or 1.5 and we rolled it back for unrelated reasons.

stevekuznetsov commented 7 years ago

At the time it was first introduced I also asked for this dump, but it became unnecessary when it was turned off. Whoever pulls the trigger to turn it on owns this issue -- you can't turn it on and walk away without making sure the tests are outputting reasonable sets of debugging artifacts. I'm not particularly interested in arguing about who should or should not own this, but if you feel strongly and want to make an issue, triage it to someone else, and make sure they understand that it is a high priority, I have no issue with removing the item from the post-rebase task list.

deads2k commented 7 years ago

An issue already exists: #11837. It has been open for 6 months. It isn't a 1.6 rebase blocker.

stevekuznetsov commented 7 years ago

Sure, an issue was made at the time to track the issue. It hasn't been relevant to the product or the tests since then because we have not been using v3. I don't understand your point. It is valid and important now. I don't think I'd say it was a release blocker, as test infrastructure is not necessarily ever going to be in that category, but as I said before -- the engineers responsible for turning on v3 by default should also do the right thing for the community and update the test infrastructure so other engineers like @bparees can be effective when investigating test failures.

mfojtik commented 7 years ago

Closing as dupe and due to age.