openshift / jenkins-plugin


Doesn't detect failed replication controller/deployment configuration #33

Closed livelace closed 8 years ago

livelace commented 8 years ago
  1. Create example DC with:

https://paste.fedoraproject.org/350947/60028840/

  2. Deploy DC, oc get -o yaml rc:

https://paste.fedoraproject.org/350950/60028895/

  3. Status of pod, oc get -o yaml pod:

https://paste.fedoraproject.org/350952/28958146/

==>

As a result, we think that the RC is running, but the pod inside the RC is not running. How can we detect that the DC is not running and avoid starting the next build steps?
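
The paste links above may expire; based on the later comments in this thread, the DC most likely defines a container lifecycle postStart exec hook that runs a script which can exit non-zero, roughly like the hedged sketch below (the names, labels, and image are placeholders, not the actual config):

```yaml
apiVersion: v1
kind: DeploymentConfig
metadata:
  name: example-dc                   # placeholder name
spec:
  replicas: 0                        # livelace later notes the DC starts scaled to zero
  selector:
    app: example-dc
  template:
    metadata:
      labels:
        app: example-dc
    spec:
      containers:
      - name: service
        image: example/image:latest  # placeholder image
        lifecycle:
          postStart:
            exec:
              # hook script that can exit non-zero, which is what triggers
              # the failure discussed in this issue
              command: ["/bin/sh", "-c", "/share/run.sh"]
```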

livelace commented 8 years ago

In other words, why don't we subscribe to pod status events, or wait for pod creation to complete? Can we check pod status through "Verify OpenShift Deployment"?

gabemontero commented 8 years ago

@livelace the "Verify OpenShift Deployment" step currently stops after seeing the RC go to Complete, but after seeing you scenario, I realize it could do better.

I'll start looking into including a monitor of the deploy pod status in that step's logic (perhaps in the other deploy-related steps as well - we'll review).

@bparees - FYI

bparees commented 8 years ago

@livelace perhaps you could use the http check step to confirm the pod is running? or a readiness check in your DC that confirms the pod came up (which will block the deployment completion).
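
For illustration, the readiness-check suggestion would amount to adding a readinessProbe to the container spec in the DC, roughly like this hedged fragment (the path and port are placeholders, and it assumes the pod exposes an HTTP endpoint, which livelace notes below is not the case here):

```yaml
# container spec fragment (hypothetical /healthz endpoint and port)
containers:
- name: service
  image: example/image:latest
  readinessProbe:
    httpGet:
      path: /healthz
      port: 8080
    initialDelaySeconds: 5
    timeoutSeconds: 1
```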

livelace commented 8 years ago

@bparees My service is not HTTP-capable; I had thought about this.

My case:

  1. First build step - start service1.
  2. Second build step - start service2.
  3. I want to start a third build step, which depends on 1 and 2. I run into these problems:

a) I don't know whether service1 and service2 are up and running and all hooks have completed, so I can't stop the Jenkins tasks, because as far as I can tell everything is all right.

b) I can't scale the deployments to zero at the proper time, because I don't know whether all the tasks inside the pods have completed.

I can't properly manage the tasks because I don't know their states.

gabemontero commented 8 years ago

Not to overly distract from this thread, but I should have deployer pod state verification working either later today or tomorrow.


livelace commented 8 years ago

@gabemontero It will be great!

bparees commented 8 years ago

@gabemontero deployer pod state, or just pod state?

gabemontero commented 8 years ago

@bparees I'll look for both to a degree. Testing shows the deployer pod is pruned, at a minimum when it is successful. So I'll first see if we have a deployer pod in a non-complete state. If a deployer pod no longer exists, I'll confirm that a running pod exists for the correct generation of the deployment.


bparees commented 8 years ago

the replication controller (deployment) ought to reflect the state of the deployer pod, so i don't see the value in looking at the deployer pod.

gabemontero commented 8 years ago

I have not seen that yet, at least in what I was previously examining from the output provided and my reproduction with the evil postStart hook, but I'll double check when I get back to the office. The deployment phase still said Complete.


gabemontero commented 8 years ago

Yep, at least with the latest level from upstream origin, @bparees is correct wrt the RC being sufficient. Adding the same lifecycle: postStart sabotage, the RC ends up in the Failed state per the deployment.phase annotation on the RC. I think my earlier repro did not go far enough or something. I could have sworn I saw it go to Complete, but I now consistently see it go to Failed after several runs.

So we are either at two spots @livelace :
1) you could try adding a "Verify OpenShift Deployment" step and hopefully you see the same results 2) if your output at https://paste.fedoraproject.org/350950/60028895/ was in fact captured after the Pod failed, then I suspect your version of OpenShift is far back enough from the latest were you are seeing a difference in deployment behavior (certainly that component has evolved some this last release cycle). If that is the case, it may be simply a matter of when you can upgrade.

livelace commented 8 years ago

Not working:

  1. oc version:

```
[root@openshift-master1 ~]# oc version
oc v1.1.6
kubernetes v1.2.0-36-g4a3f9c5
```

  2. Jenkins console output (verbose mode), job with verification, job completed without any errors:

https://paste.fedoraproject.org/351461/91294146/

  3. RC status:

https://paste.fedoraproject.org/351462/46009139/

```
[root@openshift-master1 ~]# oc get rc
NAME                                        DESIRED   CURRENT   AGE
testing-11.0-drweb-netcheck-nossl-peer1-1   0         0         17h
testing-11.0-drweb-netcheck-nossl-peer1-2   1         1         16h
testing-11.0-drweb-netcheck-nossl-peer2-1   0         0         17h
testing-11.0-drweb-netcheck-nossl-peer2-2   0         0         16h
testing-11.0-drweb-netcheck-nossl-peer3-1   0         0         17h
testing-11.0-drweb-netcheck-nossl-peer3-2   0         0         16h
```

  4. Pod status:

https://paste.fedoraproject.org/351463/46009150/
http://prntscr.com/apk3ey

livelace commented 8 years ago

```
NAME                                              READY     STATUS             RESTARTS   AGE
testing-11.0-drweb-netcheck-nossl-peer1-2-6zkg7   0/1       CrashLoopBackOff   14         1h
```

livelace commented 8 years ago

"Verify whether the pods are up" in settings will be enough :)

gabemontero commented 8 years ago

@livelace I'll see if I can pull a v1.1.6 version of OpenShift and reproduce what you are seeing, but at the moment it appears that we are falling into category 2) from my earlier comment. If that does prove to be true, then rather than adding the new step, we'll want you to try the existing step against v1.2.0 when it becomes available (that is the "latest version" I was testing against).

gabemontero commented 8 years ago

@livelace - one additional request while I try to reproduce at a lower level of code - when you reproduce, does the equivalent of the testing-11.0-drweb-netcheck-nossl-peer1-2-deploy pod from your last repro stay around long enough for you to dump its contents to json/yaml? If so, can you provide that as well (assuming you'll need to reproduce again to do so)?

thanks

gabemontero commented 8 years ago

OK, I went to the same level as @livelace and could not reproduce. One additional question did occur to me ... do you create a successful deployment, then scale it down and edit the DC to introduce the `lifecycle: postStart: exec: command:` hook after the fact?

livelace commented 8 years ago

@gabemontero Hello.

No, DC has hook from the beginning.

livelace commented 8 years ago

After creating "DC" has zero count.

livelace commented 8 years ago

Creation progress - https://paste.fedoraproject.org/351916/14601346/

livelace commented 8 years ago

Error - https://paste.fedoraproject.org/351917/60134739/

livelace commented 8 years ago

After the error occurs, I can scale down the DC and repeat it all again.

livelace commented 8 years ago

I can modify the script (to exit 0) that runs inside the hook, and everything is fine with the DC (without any modification of the configuration).

I can modify the script (to exit 0) during an attempt to set up the DC, and the DC will work fine.

PS. This is possible because I can use a dedicated script that contains "exit 1".

gabemontero commented 8 years ago


Hey @livelace - not sure what you mean by "creation progress". I just see another Pod yaml for a Pod created by a replication controller.


livelace commented 8 years ago

"Creation progress" - scale DC to 1.

gabemontero commented 8 years ago

Thanks for the additional details. I have a couple of thoughts on reworking my repro attempts. I'll report back when I have something tangible.


livelace commented 8 years ago

I think I can grant access to my test environment within an hour.

livelace commented 8 years ago

@gabemontero Can you connect to my environment over SSH?

gabemontero commented 8 years ago

OK, I've reproduced it. I did the following:

1) Before starting a deployment, added your lifecycle postStart hook with exit 0.
2) Deployed, then scaled back down to 0.
3) Edited the DC, changing the lifecycle postStart hook to exit 1.
4) Scaled to 1 ... the Pod fails, but the next gen of the RC says it completed successfully.

Note, if I start with the lifecycle postStart hook exiting with 1 and initial replicas of 1, then the RC is marked as Failed. This is basically what my recent repro attempts did. And now that I understand what is going on, I'm pretty positive that my very first repro attempt, where I saw the RC in the Complete state, was when I edited a previously used DC to add the lifecycle postStart with the exit 1 check. So good for me that I was not imagining things originally :-).
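
For anyone following along, that repro corresponds roughly to the command sequence sketched below (the DC name is a placeholder, and `oc edit` stands in for however the postStart command actually gets flipped from exit 0 to exit 1):

```bash
# scale the successfully deployed DC down to zero
oc scale dc/example-dc --replicas=0

# change the postStart hook so it now exits 1
oc edit dc/example-dc

# scale back up: the new pod fails to start cleanly, yet the latest RC
# still reports a Complete deployment phase
oc scale dc/example-dc --replicas=1
oc get pods   # pod is not Running (e.g. CrashLoopBackOff)
oc get rc     # DESIRED/CURRENT still show 1; the phase annotation (see -o yaml) stays Complete
```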

Now, what to do about this? It is not a given that we want to address this with a new plugin step:

1) This could be a deployment bug that needs to get addressed, with the RC reflecting the state of the pod.
2) The nuance of updating a DC which has been deployed one way, scaled down, edited, and redeployed could be "against current design" or some such.
3) Certainly the lifecycle postStart induced failure is merely a means of producing an unexpected container startup failure, but there may be nuances to using it to tank the container, where a container dying on startup "naturally" would have different characteristics.

@bparees: thoughts? ... and I thought about tagging our friends in platform mgmt now, but decided on getting a sanity check from you before officially pulling that trigger.

gabemontero commented 8 years ago

I'll try the exit 1, but initial replicas 0 permutation, then scale to 1, as well ... see if that is different.

livelace commented 8 years ago

It's strange; in my situation the problem exists immediately after the DC import, and the initial replica count equals 0.

gabemontero commented 8 years ago

So it also occurred for me in this sequence:

1) Create with exit 1 and replicas 0.
2) The RC is created with state "Complete", but of course no pod is started up.
3) Scale to 1; the RC stays Complete when the pod fails.

So one interpretation is that the openshift.io/deployment.phase annotation on the RC is only updated when the RC is initially created (as part of doing the first deployment, where replicas could be either 1 or 0). If we cause the Pod failure in conjunction with the RC initially coming up, that annotation reflects the error. But once the RC is created, perhaps that annotation is no longer maintained (either by design, or incorrectly, and hence a bug). If by design, then I'm not seeing where else in the RC we could infer Pod state. Perhaps I'm missing something, but if not, then the plugin step does in fact have to pull the Pod up directly.

Next steps from my perspective: the @bparees sanity check, followed most likely by platform team engagement, with either a bug fix on their end or the original change I was envisioning for "verify openshift deployment" to check Pod state in addition to RC state.

livelace commented 8 years ago

It is cool, thanks for your help!

bparees commented 8 years ago

@gabemontero it sounds like it's probably working correctly if i understand the scenario. The RC did complete successfully (deployed successfully). The fact that the RC can't be scaled up to 1 because basically you've got a bad pod definition isn't going to reverse the fact that the deployment succeeded. (that is, scaling is not the same as deploying).

it is a bit hokey since you have to start with a count of 0 to get there. If the original replica count was 1, the deployment never would have succeeded, as you saw.

so you can run it by platform management, but i think it's basically working as we'd expect... so the question comes back to "what, if anything, can we do about this?"

doesn't the replica count verification step handle this scenario? that is, you can always add another step to verify that the correct number of replicas are running, which in this scenario, they won't be.

livelace commented 8 years ago

@bparees I'm not sure this is a "bad pod definition", because it is correct. I set my script to start after pod initialization; that script can legitimately return 0 or 1, and the DC should reflect either of these possible outcomes.

gabemontero commented 8 years ago

The replica count of the RC is showing 1 in this error scenario @bparees. I have a "live" version of the error state and that is what it is showing. And yeah, based on what you outlined, I don't think we should pull in platform mgmt.

Thus, I think we need to introduce the Pod state verification I had started earlier this week.


gabemontero commented 8 years ago

Add it within the existing "verify openshift deployment"


bparees commented 8 years ago

if the pod isn't reported as running, it seems like a mistake for the RC to be reporting the replica count as 1.

bparees commented 8 years ago

@livelace if you want to prevent the deployment from succeeding, you need to use a readiness check, not a post-start hook.

gabemontero commented 8 years ago


I could see applying your earlier rationale to this facet as well. Perhaps we should engage platform mgmt here, but I'm still of the mind that adding the redundancy of checking the pod state is a good thing regardless, given the complexities and the ongoing (at least from my perspective) evolution of this particular area.


livelace commented 8 years ago

@bparees I understand this, but not every configuration contains a service. I think we could devise something with liveness probes, but that complicates the task compared with simply detecting the pod status.
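
For what it's worth, a readiness check does not have to be HTTP-based: an exec probe can run any command inside the container, such as a script that checks a status file. A hedged sketch (the script path is a placeholder, modeled on the /share scripts shown later in this thread):

```yaml
# container spec fragment: exec-based readiness probe for a non-HTTP service
containers:
- name: service
  image: example/image:latest       # placeholder
  readinessProbe:
    exec:
      # exits 0 only once the service reports itself ready
      command: ["/bin/sh", "-c", "/share/check.sh"]
    initialDelaySeconds: 10
    timeoutSeconds: 5
```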

ironcladlou commented 8 years ago

@kargakis with a post lifecycle hook, can the container become ready and then not ready if the hook fails? It's possible the old rolling updater will observe the first ready state and move on even though the pod will become not ready soon.

ironcladlou commented 8 years ago

Upon closer inspection, @bparees was right with his earlier statements: the example is using the default rolling params, so 25% max unavailable (for a desired replica count of 1, this means the minimum number of ready pods to maintain during the update is 0). So, scale-down of the old RC will proceed regardless of new RC pod readiness. The absence of a readiness check on the RC's container specs means that the new RC's active pod count will not be coupled to pod readiness, and it doesn't seem like a lifecycle post hook failure will affect the RC's active replica count.

Seems like you need a readiness check to define what it means for your pod to be ready; the failure of a post hook may not necessarily imply the pod is not ready.

0xmichalis commented 8 years ago

The rolling updater doesn't care about readiness when scaling up:/

livelace commented 8 years ago

Not working with liveness/readiness probes:

  1. Liveness case:

  nginx-live.yaml - https://paste.fedoraproject.org/352642/14602069/
  RC status - https://paste.fedoraproject.org/352639/46020692/
  Pod status - https://paste.fedoraproject.org/352640/02069641/

  2. Readiness case:

  nginx-ready.yaml - https://paste.fedoraproject.org/352646/60207412/
  RC status - https://paste.fedoraproject.org/352650/20746814/
  Pod status - https://paste.fedoraproject.org/352652/02075051/

All cases were checked with a Jenkins task (scale + verify).

livelace commented 8 years ago

```
[root@openshift-master1 ~]# cat /share/run.sh
#!/bin/bash
exit 1

[root@openshift-master1 ~]# cat /share/check.sh
#!/bin/bash
exit 0
```

livelace commented 8 years ago

Guys, tell me something, please :)

ironcladlou commented 8 years ago

Two things here:

  1. If your concern is a violation of your minimum availability requirements, use readiness checks and a >0 availability threshold; the rolling updater won't scale down below your threshold given you have readiness checks in place (a hedged config sketch follows after this list).
  2. If your concern is the updater scaling up without regard to pod readiness, we'll need to take the discussion to origin/kubernetes, because the rolling updater progresses scale-ups based on the replica count of the RC, which doesn't take readiness into account.
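
A hedged sketch of point 1, with illustrative values for the DC's rolling strategy (a readinessProbe must also be defined on the container spec for the threshold to mean anything):

```yaml
# DC strategy fragment (values are placeholders, not a recommendation)
strategy:
  type: Rolling
  rollingParams:
    maxUnavailable: 0    # never drop below the desired ready count during the update
    maxSurge: 1
    timeoutSeconds: 120  # fail the deployment if new pods never become ready
```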

cc @kargakis

gabemontero commented 8 years ago

Hey @livelace - to build on what @ironcladlou outlined, @bparees and I have had some discussion offline. I have a prototype for the plugin which inspects Pod state, but @bparees has convinced me that it only handles rudimentary cases, and that we should finish the path of understanding what is needed for the ReplicationController to more accurately reflect the state of the Pods.... ideally, we still want the jenkins plugin to stop its examination at the ReplicationController level, and leverage all the infrastructure in place on the DC, RC, and Pod side (and avoid duplicating similar tech in the plugin).

But let's see how @ironcladlou 's 1) and 2) progress, and then we'll level set.

livelace commented 8 years ago

@ironcladlou @gabemontero

OK, thanks. I understand this and agree with your conclusions. For now I can detect DC and hook statuses myself (through "status" files), but I need to know how this case will be resolved.