openshift / origin

Conformance test suite for OpenShift
http://www.openshift.org
Apache License 2.0

Pod in pending status when node has no connectivity with external registry #2048

Closed: roldancer closed this issue 9 years ago

roldancer commented 9 years ago

If the node doesn't have connectivity with the external registry/hub, the pod will stay in Pending status forever.

[oseuser@ose1 beta3]$ osc get pods
POD                       IP         CONTAINER(S)      IMAGE(S)                    HOST                        LABELS                                                                                STATUS    CREATED
hello-openshift-1-ivcje   10.1.1.5   hello-openshift   openshift/hello-openshift   ose3.novalocal/10.78.29.4   deployment=hello-openshift-1,deploymentconfig=hello-openshift,name=hello-openshift   Running   3 days
tomcat-openshift          10.1.1.6   tomcat-openshift  tomcat:latest               ose3.novalocal/10.78.29.4   name=tomcat-openshift                                                                 Pending   33 minutes

smarterclayton commented 9 years ago

@derekwaynecarr we need to investigate whether this is working as designed (try forever) and we just need to do a better job of conveying it.

derekwaynecarr commented 9 years ago

Or we could have a PendingTtl field on the pod spec, set in seconds, that says how long a pod may remain in the Pending phase.

We would need a timestamp in PodStatus for when the Kubelet first observed the pod. If a pod was Pending for an insane amount of time, fail the pod.

While I am mucking with the RunningTtl tomorrow, I will check out what happens here and report back.
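
For illustration, here is a minimal sketch of what that check might look like. PendingTtlSeconds and FirstObservedTime are hypothetical names invented for this sketch; neither field exists in the actual pod API.

package main

import (
	"fmt"
	"time"
)

// Hypothetical fields for illustration only: neither PendingTtlSeconds nor
// FirstObservedTime exists in the real pod API.
type PodSpec struct {
	PendingTtlSeconds int64 // 0 means no limit: stay Pending forever
}

type PodStatus struct {
	Phase             string    // "Pending", "Running", "Failed", ...
	FirstObservedTime time.Time // when the Kubelet first observed the pod
}

// shouldFailPending reports whether a Pending pod has outlived its TTL.
func shouldFailPending(spec PodSpec, status PodStatus, now time.Time) bool {
	if spec.PendingTtlSeconds == 0 || status.Phase != "Pending" {
		return false
	}
	deadline := status.FirstObservedTime.Add(time.Duration(spec.PendingTtlSeconds) * time.Second)
	return now.After(deadline)
}

func main() {
	spec := PodSpec{PendingTtlSeconds: 300}
	status := PodStatus{Phase: "Pending", FirstObservedTime: time.Now().Add(-10 * time.Minute)}
	fmt.Println(shouldFailPending(spec, status, time.Now())) // true: Pending for 10m > 5m TTL
}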

bgrant0607 commented 9 years ago

This should be reflected by ContainerStatuses[*].State.Waiting.Reason. If that's not enough detail, we should add a Message field.
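
For concreteness, a sketch of the status types in play, mirrored here for illustration (Reason existed at the time; Message is the field proposed above, and the reason/message strings are examples, not canonical values):

package main

import "fmt"

// Mirror of the container status shape under discussion, for illustration.
type ContainerStateWaiting struct {
	Reason  string // short machine-readable cause
	Message string // proposed addition: human-readable detail from the runtime
}

type ContainerState struct {
	Waiting *ContainerStateWaiting // set while the container cannot start
}

type ContainerStatus struct {
	Name  string
	State ContainerState
}

func main() {
	cs := ContainerStatus{
		Name: "no-image",
		State: ContainerState{Waiting: &ContainerStateWaiting{
			Reason:  "PullImageError",
			Message: `image "openshift/foo:latest" not found`,
		}},
	}
	if w := cs.State.Waiting; w != nil {
		fmt.Printf("%s is waiting: %s (%s)\n", cs.Name, w.Reason, w.Message)
	}
}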

In general, I'm in favor of surfacing more container status info in kubectl, as described here: https://github.com/GoogleCloudPlatform/kubernetes/issues/6014#issuecomment-86763070

cc @dchen1107

bgrant0607 commented 9 years ago

Also, we're definitely going to need to make all our components more resilient, with the ability to infer failures, back off, anti-recidivistic retry, speculative repair, etc.
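
One small ingredient of that resilience, sketched generically (this is not the kubelet's actual retry logic): capped exponential backoff between attempts.

package main

import (
	"fmt"
	"time"
)

// nextDelay returns a capped exponential backoff delay for a retry attempt:
// base, 2*base, 4*base, ... up to max. A generic sketch, not the actual
// kubelet implementation.
func nextDelay(attempt int, base, max time.Duration) time.Duration {
	d := base << uint(attempt)
	if d > max || d < base { // second test guards against shift overflow
		return max
	}
	return d
}

func main() {
	for attempt := 0; attempt < 8; attempt++ {
		fmt.Printf("attempt %d: wait %v\n", attempt, nextDelay(attempt, time.Second, time.Minute))
	}
}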

derekwaynecarr commented 9 years ago

I tested a few scenarios on the latest Kubernetes:

Scenario 1: Create a pod that references a non-existent image

$ cluster/kubectl.sh run no-image --image=openshift/foo --replicas=1
CONTROLLER   CONTAINER(S)   IMAGE(S)        SELECTOR       REPLICAS
no-image     no-image       openshift/foo   run=no-image   1
$ cluster/kubectl.sh get pods no-image-5fo07
POD              IP           CONTAINER(S)   IMAGE(S)        HOST                    LABELS         STATUS    CREATED     MESSAGE
no-image-5fo07   10.246.1.9                                  10.245.1.3/10.245.1.3   run=no-image   Pending   4 minutes   
                              no-image       openshift/foo                                          Waiting               Error: image openshift/foo:latest not found
$ cluster/kubectl.sh describe pods no-image-5fo07
Name:               no-image-5fo07
Image(s):           openshift/foo
Node:               10.245.1.3/10.245.1.3
Labels:             run=no-image
Status:             Pending
Replication Controllers:    no-image (1/1 replicas created)
Containers:
  no-image:
    Image:      openshift/foo
    State:      Waiting
      Reason:       Error: image openshift/foo:latest not found
    Ready:      False
    Restart Count:  0
Conditions:
  Type      Status
  Ready     False 
Events:
  FirstSeen             LastSeen            Count   From            SubobjectPath               Reason      Message
  Tue, 02 Jun 2015 16:19:56 -0400   Tue, 02 Jun 2015 16:19:56 -0400 1   {scheduler }                            scheduled   Successfully assigned no-image-5fo07 to 10.245.1.3
  Tue, 02 Jun 2015 16:19:56 -0400   Tue, 02 Jun 2015 16:19:56 -0400 1   {kubelet 10.245.1.3}    implicitly required container POD   pulled      Successfully pulled image "gcr.io/google_containers/pause:0.8.0"
  Tue, 02 Jun 2015 16:19:56 -0400   Tue, 02 Jun 2015 16:19:56 -0400 1   {kubelet 10.245.1.3}    implicitly required container POD   created     Created with docker id f62240b9031e126c28c341a6c7f660888620ade822ebe520de4f37c1e0f74dc8
  Tue, 02 Jun 2015 16:19:56 -0400   Tue, 02 Jun 2015 16:19:56 -0400 1   {kubelet 10.245.1.3}    implicitly required container POD   started     Started with docker id f62240b9031e126c28c341a6c7f660888620ade822ebe520de4f37c1e0f74dc8
  Tue, 02 Jun 2015 16:19:57 -0400   Tue, 02 Jun 2015 16:25:17 -0400 33  {kubelet 10.245.1.3}    spec.containers{no-image}       failed      Failed to pull image "openshift/foo": Error: image openshift/foo:latest not found

This looks correct to me; after all, the image may appear soon, and if so, it should be run. The event count is incremented with every kubelet sync loop forever, so maybe after a few weeks it could reach too big a number ;-)
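
The describe output above also shows why the repeats stay readable: identical events are folded into a single record with FirstSeen/LastSeen timestamps and a Count. A sketch of that folding (not the actual Kubernetes event recorder):

package main

import (
	"fmt"
	"time"
)

// Event folding as seen in the describe output above: repeats of the same
// event become one record with FirstSeen/LastSeen and a Count. A sketch,
// not the actual Kubernetes event recorder.
type Event struct {
	Reason, Message     string
	FirstSeen, LastSeen time.Time
	Count               int
}

type recorder struct {
	events map[string]*Event // keyed by reason plus message
}

func (r *recorder) record(reason, message string, now time.Time) {
	key := reason + "/" + message
	if e, ok := r.events[key]; ok {
		e.LastSeen = now
		e.Count++
		return
	}
	r.events[key] = &Event{Reason: reason, Message: message, FirstSeen: now, LastSeen: now, Count: 1}
}

func main() {
	r := &recorder{events: map[string]*Event{}}
	for i := 0; i < 33; i++ { // one failed pull per kubelet sync loop
		r.record("failed", `Failed to pull image "openshift/foo"`, time.Now())
	}
	for _, e := range r.events {
		fmt.Printf("%s x%d: %s\n", e.Reason, e.Count, e.Message)
	}
}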

Scenario 2: Attempt to pull an image from a non-existent registry

$ cluster/kubectl.sh run no-image --image=gce.google.com/foo --replicas=1
CONTROLLER   CONTAINER(S)   IMAGE(S)             SELECTOR       REPLICAS
no-image     no-image       gce.google.com/foo   run=no-image   1
$ cluster/kubectl.sh get pods no-image-81les
POD              IP            CONTAINER(S)   IMAGE(S)             HOST                    LABELS         STATUS    CREATED      MESSAGE
no-image-81les   10.246.1.10                                       10.245.1.3/10.245.1.3   run=no-image   Pending   26 seconds   
                               no-image       gce.google.com/foo                                          Waiting                Error: image foo:latest not found

Scenario 3: Attempt to pull an image from a blocked registry

Modified /etc/sysconfig/docker and added the following:

OPTIONS='--block-registry *'

Restarted docker on the node, and ran an image that docker could not pull.

$ cluster/kubectl.sh get pods test-mpp3p
POD          IP           CONTAINER(S)   IMAGE(S)                    HOST                    LABELS     STATUS    CREATED      MESSAGE
test-mpp3p   10.246.1.4                                              10.245.1.3/10.245.1.3   run=test   Pending   31 seconds   
                          test           openshift/hello-openshift                                      Waiting                API error (500): No configured registry to pull from.

So that is good as well.
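
Pulling the three scenarios together: in every case the pod stays Pending while each container sits in Waiting with a populated reason. A helper like this sketch (reusing the illustrative mirror types from earlier) could summarize why a Pending pod is stuck:

package main

import "fmt"

// Mirror types, as in the earlier sketch; not the real API package.
type ContainerStateWaiting struct{ Reason, Message string }
type ContainerState struct{ Waiting *ContainerStateWaiting }
type ContainerStatus struct {
	Name  string
	State ContainerState
}

// whyPending collects each container's Waiting reason; it covers all three
// scenarios above (missing image, bad registry host, blocked registry).
func whyPending(statuses []ContainerStatus) map[string]string {
	reasons := map[string]string{}
	for _, cs := range statuses {
		if w := cs.State.Waiting; w != nil {
			reasons[cs.Name] = w.Reason + ": " + w.Message
		}
	}
	return reasons
}

func main() {
	statuses := []ContainerStatus{{
		Name: "test",
		State: ContainerState{Waiting: &ContainerStateWaiting{
			Reason:  "failed",
			Message: "API error (500): No configured registry to pull from.",
		}},
	}}
	for name, why := range whyPending(statuses) {
		fmt.Printf("%s is waiting (%s)\n", name, why)
	}
}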

smarterclayton commented 9 years ago

Awesome testing

derekwaynecarr commented 9 years ago

Given the behavior observed in the testing above, I think it's safe to close this issue. Note that we will continue trying to pull the image in perpetuity, but the system reports an event explaining why the image could or could not be retrieved, and now displays a message.