Closed roldancer closed 9 years ago
@derekwaynecarr we need to investigate whether this is working as designed (try forever) and we just need to do a better job of conveying it.
Or we could add a PendingTtl field to the pod spec, set in seconds, that says how long a pod can stay in the Pending phase.
We would need a timestamp in PodStatus for when the Kubelet first observed the pod. If a pod has been Pending for an insane amount of time, fail the pod.
While I am mucking with the RunningTtl tomorrow, I will check out what happens here and report back.
This should be reflected by ContainerStatuses[*].State.Waiting.Reason. If that's not enough detail, we should add a Message field.
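For illustration, reading that field for every container might look like the Go sketch below. The structs are pared-down, hypothetical mirrors of the status types (only the fields named above), and `waitingReasons` is an illustrative helper, not an existing function.

```go
package main

import "fmt"

// ContainerStateWaiting holds why a container is not yet running.
type ContainerStateWaiting struct {
	Reason string
}

// ContainerState is reduced to the Waiting case for this sketch.
type ContainerState struct {
	Waiting *ContainerStateWaiting
}

// ContainerStatus pairs a container name with its current state.
type ContainerStatus struct {
	Name  string
	State ContainerState
}

// waitingReasons maps each waiting container's name to its Reason,
// skipping containers that are not waiting or have no reason set.
func waitingReasons(statuses []ContainerStatus) map[string]string {
	reasons := make(map[string]string)
	for _, s := range statuses {
		if s.State.Waiting != nil && s.State.Waiting.Reason != "" {
			reasons[s.Name] = s.State.Waiting.Reason
		}
	}
	return reasons
}

func main() {
	statuses := []ContainerStatus{
		{
			Name: "no-image",
			State: ContainerState{Waiting: &ContainerStateWaiting{
				Reason: "Error: image openshift/foo:latest not found",
			}},
		},
		{Name: "sidecar"}, // not waiting, so it is skipped
	}
	for name, reason := range waitingReasons(statuses) {
		fmt.Printf("%s: %s\n", name, reason)
	}
}
```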
In general, I'm in favor of surfacing more container status info in kubectl, as described here: https://github.com/GoogleCloudPlatform/kubernetes/issues/6014#issuecomment-86763070
cc @dchen1107
Also, we're definitely going to need to make all our components more resilient, with the ability to infer failures, back off, anti-recidivistic retry, speculative repair, etc.
I tested a few scenarios on latest Kubernetes:
Scenario 1: Create a pod that references a non-existent image
$ cluster/kubectl.sh run no-image --image=openshift/foo --replicas=1
CONTROLLER   CONTAINER(S)   IMAGE(S)        SELECTOR       REPLICAS
no-image     no-image       openshift/foo   run=no-image   1
$ cluster/kubectl.sh get pods no-image-5fo07
POD              IP           CONTAINER(S)   IMAGE(S)        HOST                    LABELS         STATUS    CREATED     MESSAGE
no-image-5fo07   10.246.1.9                                  10.245.1.3/10.245.1.3   run=no-image   Pending   4 minutes
                              no-image       openshift/foo                                          Waiting               Error: image openshift/foo:latest not found
$ cluster/kubectl.sh describe pods no-image-5fo07
correctly.
Name: no-image-5fo07
Image(s): openshift/foo
Node: 10.245.1.3/10.245.1.3
Labels: run=no-image
Status: Pending
Replication Controllers: no-image (1/1 replicas created)
Containers:
no-image:
Image: openshift/foo
State: Waiting
Reason: Error: image openshift/foo:latest not found
Ready: False
Restart Count: 0
Conditions:
Type Status
Ready False
Events:
FirstSeen LastSeen Count From SubobjectPath Reason Message
Tue, 02 Jun 2015 16:19:56 -0400 Tue, 02 Jun 2015 16:19:56 -0400 1 {scheduler } scheduled Successfully assigned no-image-5fo07 to 10.245.1.3
Tue, 02 Jun 2015 16:19:56 -0400 Tue, 02 Jun 2015 16:19:56 -0400 1 {kubelet 10.245.1.3} implicitly required container POD pulled Successfully pulled image "gcr.io/google_containers/pause:0.8.0"
Tue, 02 Jun 2015 16:19:56 -0400 Tue, 02 Jun 2015 16:19:56 -0400 1 {kubelet 10.245.1.3} implicitly required container POD created Created with docker id f62240b9031e126c28c341a6c7f660888620ade822ebe520de4f37c1e0f74dc8
Tue, 02 Jun 2015 16:19:56 -0400 Tue, 02 Jun 2015 16:19:56 -0400 1 {kubelet 10.245.1.3} implicitly required container POD started Started with docker id f62240b9031e126c28c341a6c7f660888620ade822ebe520de4f37c1e0f74dc8
Tue, 02 Jun 2015 16:19:57 -0400 Tue, 02 Jun 2015 16:25:17 -0400 33 {kubelet 10.245.1.3} spec.containers{no-image} failed Failed to pull image "openshift/foo": Error: image openshift/foo:latest not found
This looks correct to me: after all, the image may appear soon, and if it does, it should be run. The event count is incremented forever with every kubelet sync loop, so maybe after a few weeks it could reach too big a number ;-)
Scenario 2: Attempt to pull an image from a non-existent registry
$ cluster/kubectl.sh run no-image --image=gce.google.com/foo --replicas=1
CONTROLLER   CONTAINER(S)   IMAGE(S)             SELECTOR       REPLICAS
no-image     no-image       gce.google.com/foo   run=no-image   1
$ cluster/kubectl.sh get pods no-image-81les
POD              IP            CONTAINER(S)   IMAGE(S)             HOST                    LABELS         STATUS    CREATED      MESSAGE
no-image-81les   10.246.1.10                                       10.245.1.3/10.245.1.3   run=no-image   Pending   26 seconds
                               no-image       gce.google.com/foo                                          Waiting                Error: image foo:latest not found
Scenario 3: Attempt to pull an image from a blocked registry
Modified /etc/sysconfig/docker and added the following:
OPTIONS='--block-registry *'
Restarted docker on the node, and ran an image that docker could not pull.
$ cluster/kubectl.sh get pods test-mpp3p
POD          IP           CONTAINER(S)   IMAGE(S)                    HOST                    LABELS     STATUS    CREATED      MESSAGE
test-mpp3p   10.246.1.4                                              10.245.1.3/10.245.1.3   run=test   Pending   31 seconds
                          test           openshift/hello-openshift                                      Waiting                API error (500): No configured registry to pull from.
So that is good as well.
Awesome testing
Given the behavior observed in the testing reported above, I think it's safe to close this issue. To be clear, we will continue to try to pull the image in perpetuity, but the system does report an event on why the image could or could not be retrieved, and it now displays a message.
If the node doesn't have connectivity to the external registry/hub, the pod will stay in Pending status forever.
[oseuser@ose1 beta3]$ osc get pods
POD IP CONTAINER(S) IMAGE(S) HOST LABELS STATUS CREATED
hello-openshift-1-ivcje 10.1.1.5 hello-openshift openshift/hello-openshift ose3.novalocal/10.78.29.4 deployment=hello-openshift-1,deploymentconfig=hello-openshift,name=hello-openshift Running 3 days
tomcat-openshift 10.1.1.6 tomcat-openshift tomcat:latest ose3.novalocal/10.78.29.4 name=tomcat-openshift Pending 33 minutes