Closed: steven166 closed this issue 6 years ago
So it could be stuck in a number of places:
Run oc describe rs and make sure that there is a pod trying to start.
Run oc get events to see if the pod is being seen by the scheduler and assigned a node.
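A minimal sketch of those checks; the namespace and ReplicaSet name are placeholders, substitute your own:

  oc describe rs <replicaset-name> -n <namespace>              # check the Events section for pods failing to start
  oc get events -n <namespace> --sort-by='.lastTimestamp'      # look for scheduler messages assigning the pod to a node
  oc get pods -n <namespace> -o wide                           # confirm whether new pods appear and on which node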
After some time searching the logs and a couple of restarts of different components, I found that restarting one master-controller fixed it. So somehow it was probably stuck or not working correctly, without providing any logging or health indication of this.
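For reference, on a systemd-based origin install that restart would look roughly like the sketch below; the unit name is an assumption and differs between installs (for example origin-master on a single-unit install, or atomic-openshift-master-controllers on OCP):

  systemctl restart origin-master-controllers                  # or origin-master / atomic-openshift-master-controllers, depending on install
  journalctl -u origin-master-controllers --since today        # check for errors around the time pods stopped scheduling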
Anyway, thanks for those instructions, I'll keep them in mind for next time. But is it possible to provide some logging or health checks for these kinds of bugs? We had no clue where the problem was.
I know your case would need more diagnosis, but it could also be your environment/cloud provider changing the DNS resolver, which affects your node service uptime, which in turn causes your pods to get stuck.
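A quick way to check that theory on an affected node; the origin-node unit name is an assumption for an origin install and may differ on yours:

  cat /etc/resolv.conf                                         # has the resolver been changed by the cloud provider?
  systemctl status origin-node                                 # how long has the node service been up?
  journalctl -u origin-node --since yesterday | grep -i dns    # any resolution errors in the node logs?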
Similar issue if you're interested:
https://github.com/melvz/adop-docker-compose/wiki/How-to-deploy-ADOP-using-docker-compose-----------------(NOT-quickstart.sh!!!
And yes, I ended up restarting the Master controller and the node service every 24 hours.
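Purely as an illustration of that workaround (the file path and unit names are assumptions, and this is a stopgap rather than a fix), a nightly restart could be scheduled with cron:

  # /etc/cron.d/openshift-nightly-restart (hypothetical file)
  0 3 * * * root systemctl restart origin-master-controllers
  10 3 * * * root systemctl restart origin-node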
Not sure if that is directly related, as our nodes were all showing a Ready status. (By the way, your link is broken.)
@joelsmith PTAL
Issues go stale after 90d of inactivity.
Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.
If this issue is safe to close now please do so with /close.
/lifecycle stale
Stale issues rot after 30d of inactivity.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
Exclude this issue from closing by commenting /lifecycle frozen.
If this issue is safe to close now please do so with /close.
/lifecycle rotten
/remove-lifecycle stale
Rotten issues close after 30d of inactivity.
Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.
/close
Suddenly OpenShift is not able to schedule new pods, either when a pod is deleted or when a new deployment starts. Any idea what is causing this? Could this be handled with priority, as we cannot restart any pods in our production cluster right now?
Thanks in advance!
Version
oc v1.5.0
kubernetes v1.5.2+43a9be4
features: Basic-Auth GSSAPI Kerberos SPNEGO

Server https://mxt-ocmaster.newyse.maxxton:8443
openshift v1.5.0
kubernetes v1.5.2+43a9be4
Steps To Reproduce
Current Result
Expected Result
Additional Information