openshift / origin

Conformance test suite for OpenShift
http://www.openshift.org
Apache License 2.0

Timeout: Waiting for a default service account to be provisioned in namespace #17325

Open · bparees opened this issue 6 years ago

bparees commented 6 years ago

https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/17092/test_pull_request_origin_extended_conformance_install/2451/

/tmp/openshift/build-rpm-release/rpm/BUILD/origin-3.8.0/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/test/e2e/framework/framework.go:130
Expected error:
    <*errors.errorString | 0xc42027d1c0>: {
        s: "timed out waiting for the condition",
    }
    timed out waiting for the condition
not to have occurred
/tmp/openshift/build-rpm-release/rpm/BUILD/origin-3.8.0/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/test/e2e/framework/framework.go:208
[BeforeEach] [Top Level]
  /go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/test/extended/util/test.go:53
[BeforeEach] [Feature:ImageLookup][registry] Image policy
  /tmp/openshift/build-rpm-release/rpm/BUILD/origin-3.8.0/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/test/e2e/framework/framework.go:130
STEP: Creating a kubernetes client
Nov 15 12:21:26.205: INFO: >>> kubeConfig: /etc/origin/master/admin.kubeconfig
STEP: Building a namespace api object
Nov 15 12:21:26.259: INFO: configPath is now "/tmp/extended-test-resolve-local-names-42xrk-2hjn6-user.kubeconfig"
Nov 15 12:21:26.259: INFO: The user is now "extended-test-resolve-local-names-42xrk-2hjn6-user"
Nov 15 12:21:26.259: INFO: Creating project "extended-test-resolve-local-names-42xrk-2hjn6"
Nov 15 12:21:26.363: INFO: Waiting on permissions in project "extended-test-resolve-local-names-42xrk-2hjn6" ...
STEP: Waiting for a default service account to be provisioned in namespace

The default timeout for this is 2 minutes.
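For reference, here is a minimal sketch of what that wait amounts to (this is not the vendored framework code itself; the helper name, the 2-second poll interval, and the context-free client-go signatures of that vintage are assumptions). Note that "timed out waiting for the condition" in the failure above is exactly the error string `wait.PollImmediate` returns when its timeout expires:

```go
// sketch.go — illustrative only, not the vendored e2e framework code.
package sketch

import (
	"time"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
)

// serviceAccountProvisionTimeout mirrors the 2-minute default noted above.
const serviceAccountProvisionTimeout = 2 * time.Minute

// waitForDefaultServiceAccount polls until the "default" ServiceAccount
// shows up in ns. On timeout, wait.PollImmediate returns the
// "timed out waiting for the condition" error seen in the failure above.
func waitForDefaultServiceAccount(c kubernetes.Interface, ns string) error {
	return wait.PollImmediate(2*time.Second, serviceAccountProvisionTimeout, func() (bool, error) {
		_, err := c.CoreV1().ServiceAccounts(ns).Get("default", metav1.GetOptions{})
		switch {
		case err == nil:
			return true, nil // the controller has provisioned the SA
		case apierrors.IsNotFound(err):
			return false, nil // not yet; keep polling
		default:
			return false, err // real API error; give up
		}
	})
}
```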

bparees commented 6 years ago

Still seeing this: https://ci.openshift.redhat.com/jenkins/job/test_branch_origin_extended_builds/348/

bparees commented 6 years ago

@mfojtik we're still seeing this a lot in our extended test runs. Any suggestions?

/cc @derekwaynecarr

https://ci.openshift.redhat.com/jenkins/job/test_branch_origin_extended_image_ecosystem/359/#showFailuresLink

bparees commented 6 years ago

/cc @liggitt @deads2k

deads2k commented 6 years ago

> @mfojtik we're still seeing this a lot in our extended test runs. Any suggestions?

"Still" or "just started again". What percentage of jobs are failing on it?

@stevekuznetsov I've got master and node metrics, but not controller metrics. Where is the script that describes what to gather?

bparees commented 6 years ago

@deads2k I've only started looking into this again, but in the last 3 runs of our extended tests I saw this in several test failures per run. So: 100% of the runs I've looked at over the last few days.

As to whether there was ever a period in recent history where it wasn't happening, I'm not sure.

bparees commented 6 years ago

(I'm also not sure why our extended jobs would be particularly vulnerable to it, versus conformance jobs.)

deads2k commented 6 years ago

> @deads2k I've only started looking into this again, but in the last 3 runs of our extended tests I saw this in several test failures per run. So: 100% of the runs I've looked at over the last few days.

Is there any sane way for you to see if it spiked about two weeks ago? I re-sliced some startup code that seemed to significantly improve our normal CI, but if it suddenly started spiking, that's where I'd be starting my search.

stevekuznetsov commented 6 years ago

What do you mean by "controller metrics"? We dump pprof output, but I'm not sure we gather anything from Prometheus.

bparees commented 6 years ago

> Is there any sane way for you to see if it spiked about two weeks ago?

Not really. Our extended test jobs have been a mess for a month and a half due to storage and devmapper issues, and I don't really want to try to weed through that.

openshift-bot commented 6 years ago

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close. Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

openshift-bot commented 6 years ago

Stale issues rot after 30d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle rotten. Rotten issues close after an additional 30d of inactivity. Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale

bparees commented 6 years ago

@smarterclayton heh......

/remove-lifecycle rotten
/lifecycle frozen

@deads2k @mfojtik: @smarterclayton indicated he has a bug open for this (issues with the service account controller getting bogged down).
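For readers landing here, a hypothetical, heavily simplified sketch of the invariant that controller maintains (the real serviceaccounts controller is informer- and workqueue-driven; the function name and the context-free client-go signatures are illustrative). If its queue backs up, newly created namespaces sit without a "default" ServiceAccount and the 2-minute wait above times out:

```go
// sketch_controller.go — hypothetical and heavily simplified.
package sketch

import (
	corev1 "k8s.io/api/core/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// ensureDefaultServiceAccount captures the per-namespace invariant:
// a ServiceAccount named "default" must exist. The e2e BeforeEach in
// this issue is waiting on exactly this to happen.
func ensureDefaultServiceAccount(c kubernetes.Interface, ns string) error {
	_, err := c.CoreV1().ServiceAccounts(ns).Get("default", metav1.GetOptions{})
	if err == nil {
		return nil // already provisioned
	}
	if !apierrors.IsNotFound(err) {
		return err
	}
	sa := &corev1.ServiceAccount{
		ObjectMeta: metav1.ObjectMeta{Name: "default", Namespace: ns},
	}
	// Tolerate races with another worker creating the same SA.
	if _, err := c.CoreV1().ServiceAccounts(ns).Create(sa); err != nil && !apierrors.IsAlreadyExists(err) {
		return err
	}
	return nil
}
```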