opendevstack / ods-jenkins-shared-library

Shared Jenkins library which all ODS projects & components use - provisioning, SonarQube code scanning, Nexus publishing, OpenShift template based deployments and repository orchestration
Apache License 2.0

MRO fails during promotion to P/Q with mono-repo and 2 images #688

Status: Open · clemensutschig opened 3 years ago

clemensutschig commented 3 years ago

Describe the bug: A mono-repo with 2 images (and image triggers on :latest) fails during promotion to Q/P with script exit -1 at .rollout

step 1: import 2 images:

Importing images - deployment: front-back, container: front-back-frontend, image: front-back-frontend@sha256:b711ac347ac72aba428d717b9988949faccc83fc47cfd00a3342c2752d96c213, source: gihkw1-test

Importing images - deployment: front-back, container: front-back-backend, image: front-back-backend@sha256:fa7d63b6f7d98fe776d7de5d23e53a4fcaa9a95bad159de6527640abedd82537, source: gihkw1-test

Later we try to roll out the DC, and a rollout is (obviously) already running. This case should be accounted for, but for some reason it still breaks:

+ oc -n gihkw1-prod get dc/front-back -o 'jsonpath={.status.latestVersion}'
[Pipeline] sh (Rollout latest version of dc/front-back)
+ oc -n gihkw1-prod rollout latest dc/front-back
error: #11 is already in progress (Running).
[Pipeline] sh (Get latest version of dc/front-back)
+ oc -n gihkw1-prod get dc/front-back -o 'jsonpath={.status.latestVersion}'
[Pipeline] }
[Pipeline] // dir
[Pipeline] echo
WARN: script returned exit code 1
[Pipeline] }
Failed in branch front-back
[Pipeline] // parallel
[Pipeline] }
Failed in branch Deploy
[Pipeline] // parallel
[Pipeline] }
[Pipeline] // node
[Pipeline] echo
WARN: Error occured within the orchestration pipeline: script returned exit code 1
[Pipeline] echo
[Deploy] **** ENDED orchestration stage **** (took 42280 ms)
[Pipeline] }
[Pipeline] // stage
[Pipeline] }
[Pipeline] // withEnv
[Pipeline] echo
[ods-mro-pipeline] **** ENDED orchestration pipeline **** (took 95127 ms)

Affected version (please complete the following information):

clemensutschig commented 3 years ago

@michaelsauter can you take a look into this, this is pretty critical

clemensutschig commented 3 years ago

It's weird, because we honor this case in https://github.com/opendevstack/ods-jenkins-shared-library/blob/3.x/src/org/ods/services/OpenShiftService.groovy#L195-L197 - but somehow this does NOT seem to work?!

clemensutschig commented 3 years ago

@jorge-romero @metmajer - this is a blocker for 4 ...

clemensutschig commented 3 years ago

I am inclined to put a while loop around the getLatestVersion - this should never fail ..
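The proposal above could look roughly like the following. This is a minimal sketch, assuming a Jenkins `steps` context providing `sh` and `sleep`; the helper name, signature, and timeout values are hypothetical, not the library's actual API:

```groovy
// Hypothetical sketch: poll getLatestVersion until it advances past the
// version observed before the rollout, or a bounded number of attempts
// elapses - so a still-propagating trigger cannot fail the pipeline.
int waitForNewVersion(def steps, String project, String dc, int priorVersion, int maxAttempts = 30) {
    int attempts = 0
    while (attempts < maxAttempts) {
        def out = steps.sh(
            script: "oc -n ${project} get dc/${dc} -o jsonpath='{.status.latestVersion}'",
            returnStdout: true
        ).trim()
        if ((out as int) > priorVersion) {
            return out as int
        }
        steps.sleep(2) // give the trigger time to bump latestVersion
        attempts++
    }
    throw new RuntimeException("dc/${dc} did not advance past version ${priorVersion}")
}
```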

michaelsauter commented 3 years ago

Do you have logs prior to what is shown above? I am slightly confused by the output as it seems to continue after error: #11 is already in progress? I wonder on which command it actually fails ... it looks like https://github.com/opendevstack/ods-jenkins-shared-library/blob/3.x/src/org/ods/services/OpenShiftService.groovy#L213 fails ... which I don't get?

michaelsauter commented 3 years ago

Actually, maybe the failure is from https://github.com/opendevstack/ods-jenkins-shared-library/blob/3.x/src/org/ods/services/OpenShiftService.groovy#L199. That would occur if the rollout is running already (which it is, looking at the logs) but somehow status.latestVersion is not greater than the version passed.

clemensutschig commented 3 years ago

@michaelsauter same here, and it's bubbling up - and stopping .. (there is no catch, except at the outermost layer)

https://github.com/opendevstack/ods-jenkins-shared-library/blob/3.x/src/org/ods/orchestration/util/MROPipelineUtil.groovy#L322

status.latestVersion is what I am expecting as well (as the rollout is still running) ...

michaelsauter commented 3 years ago

my theory: https://github.com/opendevstack/ods-jenkins-shared-library/blob/3.x/src/org/ods/orchestration/phases/DeployOdsComponent.groovy#L48 is a loop, somehow in one of the iterations the "prior version" is already the "new version"?

clemensutschig commented 3 years ago

with the 2 images that are imported and cancelled (deployments) versions - that could be indeed the case... hmmm

what if we were to check if there is a deployment running - cancel that, and then rollout ourselves? .. just thoughts ...

the logic - sort of "latest+1" - does not seem to work as we hoped it would
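The "cancel, then rollout ourselves" thought could be sketched like this. A minimal sketch only, assuming a Jenkins `steps` context with `sh`; the function name is hypothetical and the error handling is deliberately naive:

```groovy
// Hypothetical sketch: if `oc rollout latest` fails because a rollout is
// already in progress, cancel the running one and retry once, so our own
// rollout deterministically creates the next version.
def rolloutFresh(def steps, String project, String dc) {
    def status = steps.sh(
        script: "oc -n ${project} rollout latest dc/${dc}",
        returnStatus: true
    )
    if (status != 0) {
        // assume the failure means a rollout is running; cancel and retry once
        steps.sh(script: "oc -n ${project} rollout cancel dc/${dc}")
        steps.sh(script: "oc -n ${project} rollout latest dc/${dc}")
    }
}
```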

clemensutschig commented 3 years ago

the other option I could think of - in case of an exception - is to verify the containers, and if they have the latest SHAs, skip the error?
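That check could be sketched roughly as follows - a hypothetical helper, assuming a Jenkins `steps` context and a map of container name to expected image reference (as in deployments.json); none of these names exist in the library:

```groovy
// Hypothetical sketch: on rollout failure, compare the image references in
// the live DC pod template against the ones we imported; if they already
// match, the desired state is deployed and the error could be ignored.
boolean desiredImagesDeployed(def steps, String project, String dc, Map<String, String> wantedImages) {
    def json = steps.sh(
        script: "oc -n ${project} get dc/${dc} -o json",
        returnStdout: true
    )
    def containers = new groovy.json.JsonSlurperClassic()
        .parseText(json).spec.template.spec.containers
    wantedImages.every { name, image ->
        def c = containers.find { it.name == name }
        // image is e.g. "front-back-backend@sha256:..."; the live value
        // carries the registry prefix, hence endsWith
        c != null && c.image.endsWith(image)
    }
}
```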

michaelsauter commented 3 years ago

Actually, what about passing the priorVersion into https://github.com/opendevstack/ods-jenkins-shared-library/blob/3.x/src/org/ods/services/OpenShiftService.groovy#L154? The prior version is collected BEFORE we import images or change any config by applying templates. Wouldn't this priorVersion be "enough" as a sanity check in startRollout if oc rollout fails there?

Still, I do not understand the failure. Can you share the deployment descriptor file in use? And the triggers set on the DC?
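The priorVersion idea above could look roughly like this - a sketch under the assumption that startRollout receives the version recorded before any images were imported; the exact signature is hypothetical:

```groovy
// Hypothetical sketch: treat a failed `oc rollout latest` as benign
// whenever latestVersion has already moved past the pre-import version,
// i.e. some trigger already started the rollout we wanted.
void startRollout(def steps, String project, String dc, int priorVersion) {
    def status = steps.sh(
        script: "oc -n ${project} rollout latest dc/${dc}",
        returnStatus: true
    )
    if (status != 0) {
        int latest = steps.sh(
            script: "oc -n ${project} get dc/${dc} -o jsonpath='{.status.latestVersion}'",
            returnStdout: true
        ).trim() as int
        if (latest <= priorVersion) {
            throw new RuntimeException(
                "Rollout of dc/${dc} failed and latestVersion (${latest}) " +
                "did not advance past prior version (${priorVersion})"
            )
        }
        // otherwise: continue - the desired rollout is already running
    }
}
```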

clemensutschig commented 3 years ago

3 triggers on the DC .. config change and image change (for both images) - so in reality you can get 3 deployments

a) tailor import b) image 1 c) image 2

I like the idea of "just" checking whether it's greater than the prior version we get ..

clemensutschig commented 3 years ago

the deployments.json looks as follows

{
     "deployments": {
          "front-back": {
               "containers": {
                    "front-back-frontend": "front-back-frontend@sha256:b711ac347ac72aba428d717b9988949faccc83fc47cfd00a3342c2752d96c213",
                    "front-back-backend": "front-back-backend@sha256:fa7d63b6f7d98fe776d7de5d23e53a4fcaa9a95bad159de6527640abedd82537"
               }
          }
     },
     "CREATED_BY_BUILD": "front-0.19.0/55"
}
clemensutschig commented 3 years ago

deployment config in question:

apiVersion: v1
kind: ReplicationController
metadata:
  annotations:
    openshift.io/deployer-pod.completed-at: '2021-07-12 16:48:19 +0200 CEST'
    openshift.io/deployer-pod.created-at: '2021-07-12 15:11:35 +0200 CEST'
    openshift.io/deployer-pod.name: front-back-11-deploy
    openshift.io/deployment-config.latest-version: '11'
    openshift.io/deployment-config.name: front-back
    openshift.io/deployment.phase: Complete
    openshift.io/deployment.replicas: ''
    openshift.io/deployment.status-reason: image change
    openshift.io/encoded-deployment-config: >
      {"kind":"DeploymentConfig","apiVersion":"apps.openshift.io/v1","metadata":{"name":"front-back","namespace":"gihkw1-prod","selfLink":"/apis/apps.openshift.io/v1/namespaces/gihkw1-prod/deploymentconfigs/front-back","uid":"837c70c3-a0e6-11eb-84af-0050569e7b02","resourceVersion":"410791490","generation":14,"creationTimestamp":"2021-04-19T08:09:07Z","labels":{"app":"gihkw1-front-back","template":"monorepo-component-template"},"annotations":{"kubectl.kubernetes.io/last-applied-configuration":"{\"apiVersion\":\"apps.openshift.io/v1\",\"kind\":\"DeploymentConfig\",\"metadata\":{\"annotations\":{},\"labels\":{\"app\":\"gihkw1-front-back\",\"template\":\"monorepo-component-template\"},\"name\":\"front-back\",\"namespace\":\"gihkw1-prod\"},\"spec\":{\"replicas\":1,\"revisionHistoryLimit\":10,\"selector\":{\"app\":\"gihkw1-front-back\",\"deploymentconfig\":\"front-back\"},\"strategy\":{\"activeDeadlineSeconds\":21600,\"resources\":{},\"rollingParams\":{\"intervalSeconds\":1,\"maxSurge\":\"25%\",\"maxUnavailable\":\"25%\",\"timeoutSeconds\":600,\"updatePeriodSeconds\":1},\"type\":\"Rolling\"},\"template\":{\"metadata\":{\"labels\":{\"app\":\"gihkw1-front-back\",\"deploymentconfig\":\"front-back\",\"env\":\"dev\"}},\"spec\":{\"containers\":[{\"image\":\"gihkw1-prod/front-back-frontend:latest\",\"imagePullPolicy\":\"IfNotPresent\",\"name\":\"front-back-frontend\",\"ports\":[{\"containerPort\":8080,\"protocol\":\"TCP\"}],\"resources\":{\"limits\":{\"cpu\":\"100m\",\"memory\":\"128Mi\"},\"requests\":{\"cpu\":\"50m\",\"memory\":\"128Mi\"}},\"terminationMessagePath\":\"/dev/termination-log\",\"terminationMessagePolicy\":\"File\"},{\"image\":\"gihkw1-prod/front-back-backend:latest\",\"imagePullPolicy\":\"IfNotPresent\",\"name\":\"front-back-backend\",\"ports\":[{\"containerPort\":8081,\"protocol\":\"TCP\"}],\"resources\":{\"limits\":{\"cpu\":\"100m\",\"memory\":\"128Mi\"},\"requests\":{\"cpu\":\"50m\",\"memory\":\"128Mi\"}},\"terminationMessagePath\":\"/dev/termination-log\",\"te
rminationMessagePolicy\":\"File\"}],\"dnsPolicy\":\"ClusterFirst\",\"restartPolicy\":\"Always\",\"schedulerName\":\"default-scheduler\",\"securityContext\":{},\"terminationGracePeriodSeconds\":30}},\"test\":false,\"triggers\":[{\"type\":\"ConfigChange\"},{\"imageChangeParams\":{\"automatic\":true,\"containerNames\":[\"front-back-backend\"],\"from\":{\"kind\":\"ImageStreamTag\",\"name\":\"front-back-backend:latest\",\"namespace\":\"gihkw1-prod\"}},\"type\":\"ImageChange\"},{\"imageChangeParams\":{\"automatic\":true,\"containerNames\":[\"front-back-frontend\"],\"from\":{\"kind\":\"ImageStreamTag\",\"name\":\"front-back-frontend:latest\",\"namespace\":\"gihkw1-prod\"}},\"type\":\"ImageChange\"}]}}\n"}},"spec":{"strategy":{"type":"Rolling","rollingParams":{"updatePeriodSeconds":1,"intervalSeconds":1,"timeoutSeconds":600,"maxUnavailable":"25%","maxSurge":"25%"},"resources":{},"activeDeadlineSeconds":21600},"triggers":[{"type":"ConfigChange"},{"type":"ImageChange","imageChangeParams":{"automatic":true,"containerNames":["front-back-backend"],"from":{"kind":"ImageStreamTag","namespace":"gihkw1-prod","name":"front-back-backend:latest"},"lastTriggeredImage":"..../gihkw1-test/front-back-backend@sha256:fa7d63b6f7d98fe776d7de5d23e53a4fcaa9a95bad159de6527640abedd82537"}},{"type":"ImageChange","imageChangeParams":{"automatic":true,"containerNames":["front-back-frontend"],"from":{"kind":"ImageStreamTag","namespace":"gihkw1-prod","name":"front-back-frontend:latest"},"lastTriggeredImage":"...../gihkw1-test/front-back-frontend@sha256:b711ac347ac72aba428d717b9988949faccc83fc47cfd00a3342c2752d96c213"}}],"replicas":1,"revisionHistoryLimit":10,"test":false,"selector":{"app":"gihkw1-front-back","deploymentconfig":"front-back"},"template":{"metadata":{"creationTimestamp":null,"labels":{"app":"gihkw1-front-back","deploymentconfig":"front-back","env":"dev"}},"spec":{"containers":[{"name":"front-back-frontend","image":"....../gihkw1-test/front-back-frontend@sha256:b711ac347ac72aba428d717b99889
49faccc83fc47cfd00a3342c2752d96c213","ports":[{"containerPort":8080,"protocol":"TCP"}],"resources":{"limits":{"cpu":"100m","memory":"128Mi"},"requests":{"cpu":"50m","memory":"128Mi"}},"terminationMessagePath":"/dev/termination-log","terminationMessagePolicy":"File","imagePullPolicy":"IfNotPresent"},{"name":"front-back-backend","image":"..../gihkw1-test/front-back-backend@sha256:fa7d63b6f7d98fe776d7de5d23e53a4fcaa9a95bad159de6527640abedd82537","ports":[{"containerPort":8081,"protocol":"TCP"}],"resources":{"limits":{"cpu":"100m","memory":"128Mi"},"requests":{"cpu":"50m","memory":"128Mi"}},"terminationMessagePath":"/dev/termination-log","terminationMessagePolicy":"File","imagePullPolicy":"IfNotPresent"}],"restartPolicy":"Always","terminationGracePeriodSeconds":30,"dnsPolicy":"ClusterFirst","securityContext":{},"schedulerName":"default-scheduler"}}},"status":{"latestVersion":11,"observedGeneration":13,"replicas":1,"updatedReplicas":0,"availableReplicas":1,"unavailableReplicas":0,"details":{"message":"image
      change","causes":[{"type":"ImageChange","imageTrigger":{"from":{"kind":"DockerImage","name":"..../gihkw1-test/front-back-backend@sha256:fa7d63b6f7d98fe776d7de5d23e53a4fcaa9a95bad159de6527640abedd82537"}}}]},"conditions":[{"type":"Available","status":"True","lastUpdateTime":"2021-07-11T16:32:33Z","lastTransitionTime":"2021-07-11T16:32:33Z","message":"Deployment
      config has minimum
      availability."},{"type":"Progressing","status":"Unknown","lastUpdateTime":"2021-07-12T13:11:24Z","lastTransitionTime":"2021-07-12T13:11:24Z","message":"replication
      controller \"front-back-10\" is waiting for pod \"front-back-10-deploy\"
      to run"}],"readyReplicas":1}}
  creationTimestamp: '2021-07-12T13:11:35Z'
  generation: 2
  labels:
    app: gihkw1-front-back
    openshift.io/deployment-config.name: front-back
    template: monorepo-component-template
  name: front-back-11
  namespace: gihkw1-prod
  ownerReferences:
    - apiVersion: apps.openshift.io/v1
      blockOwnerDeletion: true
      controller: true
      kind: DeploymentConfig
      name: front-back
      uid: 837c70c3-a0e6-11eb-84af-0050569e7b02
  resourceVersion: '410828255'
  selfLink: /api/v1/namespaces/gihkw1-prod/replicationcontrollers/front-back-11
  uid: af261b81-e312-11eb-bd77-0050569e3b56
spec:
  replicas: 1
  selector:
    app: gihkw1-front-back
    deployment: front-back-11
    deploymentconfig: front-back
  template:
    metadata:
      annotations:
        openshift.io/deployment-config.latest-version: '11'
        openshift.io/deployment-config.name: front-back
        openshift.io/deployment.name: front-back-11
      creationTimestamp: null
      labels:
        app: gihkw1-front-back
        deployment: front-back-11
        deploymentconfig: front-back
        env: dev
    spec:
      containers:
        - image: >-
            ...../gihkw1-test/front-back-frontend@sha256:b711ac347ac72aba428d717b9988949faccc83fc47cfd00a3342c2752d96c213
          imagePullPolicy: IfNotPresent
          name: front-back-frontend
          ports:
            - containerPort: 8080
              protocol: TCP
          resources:
            limits:
              cpu: 100m
              memory: 128Mi
            requests:
              cpu: 50m
              memory: 128Mi
          terminationMessagePath: /dev/termination-log
          terminationMessagePolicy: File
        - image: >-
            ...../gihkw1-test/front-back-backend@sha256:fa7d63b6f7d98fe776d7de5d23e53a4fcaa9a95bad159de6527640abedd82537
          imagePullPolicy: IfNotPresent
          name: front-back-backend
          ports:
            - containerPort: 8081
              protocol: TCP
          resources:
            limits:
              cpu: 100m
              memory: 128Mi
            requests:
              cpu: 50m
              memory: 128Mi
          terminationMessagePath: /dev/termination-log
          terminationMessagePolicy: File
      dnsPolicy: ClusterFirst
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      terminationGracePeriodSeconds: 30
status:
  availableReplicas: 1
  fullyLabeledReplicas: 1
  observedGeneration: 2
  readyReplicas: 1
  replicas: 1
michaelsauter commented 3 years ago

Thanks for sharing. Unfortunately I do not understand it.

Given there is only one deployment, the loop only runs once. As you say, three things might trigger a rollout, but all of them would happen AFTER we get the priorVersion in https://github.com/opendevstack/ods-jenkins-shared-library/blob/3.x/src/org/ods/orchestration/phases/DeployOdsComponent.groovy#L44. Triggering multiple times should be OK as we only check if the number is greater, not that it is n+1.

If https://github.com/opendevstack/ods-jenkins-shared-library/blob/3.x/src/org/ods/services/OpenShiftService.groovy#L150 still returns the same version as priorVersion, we attempt a rollout, which might fail, but then get the version again in https://github.com/opendevstack/ods-jenkins-shared-library/blob/3.x/src/org/ods/services/OpenShiftService.groovy#L195. At this point it most certainly should be updated as a rollout is definitely running. So how could it be the same version still? I fail to see a reason .... or maybe our understanding of when this value gets updated is wrong?

From https://docs.openshift.com/container-platform/3.9/rest_api/oapi/v1.DeploymentConfig.html:

A deployment is "triggered" when its configuration is changed or a tag in an Image Stream is changed. Triggers can be disabled to allow manual control over a deployment. The "strategy" determines how the deployment is carried out and may be changed at any time. The latestVersion field is updated when a new deployment is triggered by any means.

I don't think anymore that passing the priorVersion will really help. Only additional logging will help there.

BTW, is this reproducible?

clemensutschig commented 3 years ago

yup . it's one of those fun bugs ... :) only reproduces on the way to prod

clemensutschig commented 3 years ago

[screenshot]

shows the "two deployments" .. one cancelled - one rolled out ..

@michaelsauter the PR will dump information on the deployment ids .. so hopefully that also helps to diagnose this ..

metmajer commented 3 years ago

@jorge-romero @s2oBCN please have a look if this bug affects our demo application

clemensutschig commented 3 years ago

I believe, @martin, that you may need quite some luck to repro this ...