q-shift / backstage-playground


Race condition with idpbuilder and argocd on e2e GH workflow #163

Closed (cmoulliard closed this 5 months ago)

cmoulliard commented 5 months ago

Issue

From time to time we hit the following issue when running ArgoCD 2.10.7 as part of a GitHub workflow in a Kubernetes v1.29 cluster:

The GitHub workflow sometimes succeeds and sometimes fails!
Job succeeded: https://github.com/ch007m/test-2e2-job/actions/runs/9611362409
Job failed: https://github.com/ch007m/test-2e2-job/actions/runs/9611571457

Investigation

After digging into the logs and discussing it with the idpbuilder folks, it appeared that the ArgoCD Application (the one installing ArgoCD itself) was not refreshed. As a consequence, the Application controller started without the patched ConfigMap, which overrides some default values and adds new keys such as application.namespaces.

If ArgoCD is started without the application.namespaces property, which defines the namespaces where Applications can be created, then Applications created in those namespaces are not processed. This is exactly the problem we faced ;-)
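For reference, here is a minimal sketch of the kind of setting the controller was missing: the `argocd-cmd-params-cm` ConfigMap carrying an `application.namespaces` entry. The namespace names (`team-a`, `team-b`) are placeholders and are not the values idpbuilder actually writes.

```yaml
# Hypothetical example: the namespace names below are placeholders
apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-cmd-params-cm
  namespace: argocd
data:
  # Comma-separated list of extra namespaces (glob patterns are allowed)
  # in which ArgoCD reconciles Application resources
  application.namespaces: "team-a, team-b"
```

ArgoCD reads `argocd-cmd-params-cm` only when its components start, so `argocd-server` and `argocd-application-controller` have to be restarted after the ConfigMap changes. That is why the workaround restarts both of them. Note also that the AppProject used by those Applications must list the same namespaces in `spec.sourceNamespaces`.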

A temporary workaround has been added to the job until idpbuilder 0.6.0 fixes the problem:

      - name: Wait until the IDP ArgoCD application is synced and the ConfigMap is patched
        run: |
          SCRIPTS=$(pwd)/.github/scripts

          echo "Temporary workaround to refresh ArgoCD Application till https://github.com/cnoe-io/idpbuilder/pull/307 is released"
          kubectl annotate --overwrite applications -n argocd argocd argocd.argoproj.io/refresh='normal'

          if ! $SCRIPTS/waitFor.sh application argocd argocd Healthy; then
            echo "Failed to watch application argocd in namespace argocd"
            exit 1;
          fi

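          echo "Wait until ConfigMap argocd-cmd-params-cm contains data: application.namespaces ..."
          # Poll every 10s and give up after 60 attempts (~10 minutes) instead of looping forever
          attempts=0
          until kubectl get -n argocd cm/argocd-cmd-params-cm -o json | jq -e '.data | has("application.namespaces")' > /dev/null; do
            attempts=$((attempts + 1))
            if [ "$attempts" -ge 60 ]; then
              echo "ConfigMap argocd-cmd-params-cm was not patched in time"
              exit 1
            fi
            echo "Still waiting ..."
            sleep 10
          done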

          echo "Restarting ArgoCD as its configuration changed ..."
          kubectl rollout restart -n argocd deployment argocd-server
          kubectl rollout restart -n argocd statefulset argocd-application-controller

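          # Wait for both restarted components, not only the application controller
          kubectl rollout status --watch deployment/argocd-server -n argocd --timeout=600s
          kubectl rollout status --watch statefulset/argocd-application-controller -n argocd --timeout=600s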