Closed: steven166 closed this issue 6 years ago
So it could be stuck in a number of places:
Run oc describe rs and make sure that there is a pod trying to start.
Run oc get events to see if the pod is being seen by the scheduler and assigned a node.
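A minimal sketch of those checks; the namespace and ReplicaSet name are placeholders, substitute your own:

  oc describe rs <replicaset-name> -n <namespace>              # check the Events section for pods failing to start
  oc get events -n <namespace> --sort-by='.lastTimestamp'      # look for scheduler messages assigning the pod to a node
  oc get pods -n <namespace> -o wide                           # confirm whether new pods appear and on which node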
After some time searching the logs and a couple of restarts of different components, I found that restarting one master-controller fixed it. So somehow it was probably stuck or not working correctly, without providing any logging or health indication of this.
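For reference, on a systemd-based origin install that restart would look roughly like the sketch below; the unit name is an assumption and differs between installs (for example origin-master on a single-unit install, or atomic-openshift-master-controllers on OCP):

  systemctl restart origin-master-controllers                  # or origin-master / atomic-openshift-master-controllers, depending on install
  journalctl -u origin-master-controllers --since today        # check for errors around the time pods stopped scheduling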
Anyway, thanks for those instructions, I'll keep them in mind for next time. But is it possible to provide some logging or health checks for these kinds of bugs? We had no clue where the problem was.
I know your case would need more diagnosis, but it could also be your environment/cloud provider changing the DNS resolver, which affects your node service uptime, which in turn causes your pods to get stuck.
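A quick way to check that theory on an affected node; the origin-node unit name is an assumption for an origin install and may differ on yours:

  cat /etc/resolv.conf                                         # has the resolver been changed by the cloud provider?
  systemctl status origin-node                                 # how long has the node service been up?
  journalctl -u origin-node --since yesterday | grep -i dns    # any resolution errors in the node logs?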
Similar issue if you're interested:
https://github.com/melvz/adop-docker-compose/wiki/How-to-deploy-ADOP-using-docker-compose-----------------(NOT-quickstart.sh!!!
And yes, I ended up restarting the Master controller and the node service every 24 hours.
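Purely as an illustration of that workaround (the file path and unit names are assumptions, and this is a stopgap rather than a fix), a nightly restart could be scheduled with cron:

  # /etc/cron.d/openshift-nightly-restart (hypothetical file)
  0 3 * * * root systemctl restart origin-master-controllers
  10 3 * * * root systemctl restart origin-node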
Not sure if that is directly related, as our nodes were all showing a Ready status. (By the way, your link is broken.)
@joelsmith PTAL
Issues go stale after 90d of inactivity.
Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.
If this issue is safe to close now please do so with /close.
/lifecycle stale
Stale issues rot after 30d of inactivity.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
Exclude this issue from closing by commenting /lifecycle frozen.
If this issue is safe to close now please do so with /close.
/lifecycle rotten
/remove-lifecycle stale
Rotten issues close after 30d of inactivity.
Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.
/close
Suddenly OpenShift is not able to schedule new pods, either when a pod is deleted or when a new deployment starts. Any idea what is causing this? Could this be handled with priority, as we cannot restart any pods in our production cluster right now?
Thanks in advance!
Version
oc v1.5.0
kubernetes v1.5.2+43a9be4
features: Basic-Auth GSSAPI Kerberos SPNEGO

Server https://mxt-ocmaster.newyse.maxxton:8443
openshift v1.5.0
kubernetes v1.5.2+43a9be4
Steps To Reproduce
Current Result
Expected Result
Additional Information