tektoncd / pipeline

A cloud-native Pipeline resource.
https://tekton.dev
Apache License 2.0
8.48k stars 1.78k forks source link

Potential Deadlock in the relationship between tektoncd-pipeline-webhook and configMaps like config-defaults #4542

Open nikhil-thomas opened 2 years ago

nikhil-thomas commented 2 years ago

Summary

The dependency between ValidatingWebhookConfiguration config.webhook.pipeline.tekton.dev and Deployment tekton-pipelines-webhook is cyclic. This can lead to a potential deadlock which can block upgrades.

How to get "deadlocked"

  1. Install TektonPipelines
    kubectl apply -f https://github.com/tektoncd/pipeline/releases/download/v0.32.1/release.notags.yaml
  2. wait until all deployments are live
    
    kubectl get deploy -n tekton-pipelines

NAME READY UP-TO-DATE AVAILABLE AGE

tekton-pipelines-controller 1/1 1 1 15s

tekton-pipelines-webhook 1/1 1 1 12s

3. scale webhook deployment down to 0
```bash
kubectl scale -n tekton-pipelines deploy/tekton-pipelines-webhook --replicas=0

deployment.apps/tekton-pipelines-webhook scaled
  1. try editing config-defaults config map
    
    kubectl edit cm config-defaults -n tekton-pipelines

error: configmaps "config-defaults" could not be patched: Internal error occurred: failed calling webhook "config.webhook.pipeline.tekton.dev": Post "https://tekton-pipelines-webhook.tekton-pipelines.svc:443/config-validation?timeout=10s": no endpoints available for service "tekton-pipelines-webhook"

You can run kubectl replace -f /tmp/kubectl-edit-mnnza.yaml to try this update again.

`no endpoints available for service "tekton-pipelines-webhook"` πŸ§‘β€πŸ’» is expected

5. delete config-defaults configMap
```bash
kubectl delete cm config-defaults -n tekton-pipelines 

configmap "config-defaults" deleted
  1. try to scale the webhook deployment back to 1
    
    kubectl scale -n tekton-pipelines deploy/tekton-pipelines-webhook --replicas=1

deployment.apps/tekton-pipelines-webhook scaled


7. check webhook pods

```bash

kc get deploy,pods -n tekton-pipelines

NAME                                          READY   UP-TO-DATE   AVAILABLE   AGE

deployment.apps/tekton-pipelines-controller   1/1     1            1           5m4s

deployment.apps/tekton-pipelines-webhook      0/1     1            0           5m1s

NAME                                               READY   STATUS             RESTARTS   AGE

pod/tekton-pipelines-controller-7d8588667b-dkjcz   1/1     Running            0          5m5s

pod/tekton-pipelines-webhook-54944cd445-bvhbg      0/1     CrashLoopBackOff   2          50s
  1. check logs from webhook pod "error":"configmap \"config-defaults\" not found"

    {"level":"fatal","ts":"2022-02-02T13:07:38.796Z","logger":"tekton-pipelines-webhook","caller":"sharedmain/main.go:244","msg":"Failed to start configuration manager","commit":"3a88399","error":"configmap \"config-defaults\" not found","stacktrace":"github.com/tektoncd/pipeline/vendor/knative.dev/pkg/injection/sharedmain.MainWithConfig\n\tgithub.com/tektoncd/pipeline/vendor/knative.dev/pkg/injection/sharedmain/main.go:244\ngithub.com/tektoncd/pipeline/vendor/knative.dev/pkg/injection/sharedmain.MainWithContext\n\tgithub.com/tektoncd/pipeline/vendor/knative.dev/pkg/injection/sharedmain/main.go:171\nmain.main\n\tgithub.com/tektoncd/pipeline/cmd/webhook/main.go:234\nruntime.main\n\truntime/proc.go:225"}
  2. try to recreate-config-defaults configMap

    
    kubectl apply -f https://github.com/tektoncd/pipeline/releases/download/v0.32.1/release.notags.yaml

Error from server (InternalError): error when creating "https://github.com/tektoncd/pipeline/releases/download/v0.32.1/release.notags.yaml": Internal error occurred: failed

calling webhook "config.webhook.pipeline.tekton.dev": Post "https://tekton-pipelines-webhook.tekton-pipelines.svc:443/config-validation?timeout=10s": no endpoints available

for service "tekton-pipelines-webhook"


10. We have reached a Deadlock πŸ§‘β€πŸ’» 

 # The cyclic dependency

The ValidatingWebhookConfiguration config.webhook.pipeline.tekton.dev, validates various ConfigMaps created as part of Tektoncd Pipeline installation ([here](https://gist.github.com/nikhil-thomas/d9b9d5be45aaeefc44abfee59745d771#file-config-webhook-pipeline-tekton-dev-with-rules-yaml-L27)). 

The Deployment tekton-pipelines-webhook needs multiple ConfigMaps to be present on the cluster to start (eg: [here](https://github.com/tektoncd/pipeline/blob/3a88399531dc6f6ef1e401071a546cb2e6713d7e/cmd/webhook/main.go#L68)).

# Questions

1. Why does lack of webhook end points does not block creation of configMaps like config-defaults in the step 1 above (fresh install)?

In a fresh install the validatingWebhookCofiguration is not fully registred. When [config.webhook.pipeline.tekton.dev](https://github.com/tektoncd/pipeline/blob/main/config/500-webhooks.yaml#L71) is created the rules are not registered until the webhook pod comes up (knative/pkg magic). 

So when the validatingwebhookconfiguration is created it initally looks like [this](https://gist.github.com/nikhil-thomas/e7f1d8c050b32c003d7d0652747b7df3). So the ConfigMaps like config-defaults gets created πŸ‘‰ without validation. Then when the webhoook pod comes up, the rules in the validatingwebhookconfiguration are populated like [this](https://gist.github.com/nikhil-thomas/d9b9d5be45aaeefc44abfee59745d771#file-config-webhook-pipeline-tekton-dev-with-rules-yaml-L27). After this any edit operation on the configMaps are validated.

2. What happens during an upgrade?

Consider a scenario where tektoncd/pipeline is freshly installed with the command `kubectl apply -f pipelines-version-1.yaml`. Then it is upgraded to version 2 using `kubectl apply -f pipelines-version-2.yaml`. 

πŸ‘‰ In this case, any update on configMaps from version 2 are validated with the version 1 webhook as the webhook rules are already registered during version 1 installation. In effect, this hides the cyclic dependency. This is not ideal.

3. How did we (operator-dev) stumble upon this?

Operator does "replace" upgrades not "in-place" upgrades. That is operator deletes and recreates all installation resources except CRDs during any install. In addition, operator manages updates on configMaps like cofnig-defauts and feature-flags. 

During upgrades, webhook pods were not coming up as the config-maps were missing. Why where the configMaps missing? The new create request was not validated as the webhook was down.

4. How is this deadlock preempted in Operator?

If the installation of tektoncd/pipeline fails and there are no endpoints in webhook, then the operator resets the webhook rules to the initial state. This briefly breaks the validation dependency of the configMaps. The configMaps gets created, webhook pods comes up and the validation rules gets reinstated. 

[Fix upgrade deadlock in Pipeline and Triggers #592
](https://github.com/tektoncd/operator/pull/592)

# Note

This situation exists in almost all Tektoncd projects (including operator). Ideally we need to break this cyclic dependency in the webhook itself. 
nikhil-thomas commented 2 years ago

@vdemeester @afrittoli @dibyom @bobcatfish @imjasonh

tekton-robot commented 2 years ago

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale with a justification. Stale issues rot after an additional 30d of inactivity and eventually close. If this issue is safe to close now please do so with /close with a justification. If this issue should be exempted, mark the issue as frozen with /lifecycle frozen with a justification.

/lifecycle stale

Send feedback to tektoncd/plumbing.

vdemeester commented 2 years ago

/lifecycle frozen

errordeveloper commented 2 years ago

This does become quite a problem when the installation manifest is used with a GitOps operato like Flux. I found that dropping ValidatingWebhookConfiguration/config.webhook.pipeline.tekton.dev emilinates the probel, however webhook pod ends up logging plenty of errors. I think that disabling config map validation should be allowed, perhaps there should be some official way of doing that avoids endless errors also.

shipit commented 1 year ago

@errordeveloper did you find a resolution to this issue? I am seeing exactly the same issue during tekton installation using flux on GKE

errordeveloper commented 1 year ago

@errordeveloper did you find a resolution to this issue? I am seeing exactly the same issue during tekton installation using flux on GKE

I managed to break up webhooks into a separate kustomization. However, it didn't survive for long as tekton was too large of a dependency for the project I am working on.

hrivera-ntap commented 1 year ago

For anyone using ArgoCD to manage a tekton-pipelines helm chart that is getting burned by this. Utilizing ArgoCD annotations for sync phases and sync waves on each of the MutatingWebhookConfiguration and ValidatingWebhookConfiguration resources will ensure that tekton's ConfigMaps can properly get created preventing the race condition/deadlock on a cold install.

I used this annotation specifically and had some success:

argocd.argoproj.io/hook: PostSync

Not the best solution and may not solve the upgrade workflow. ideally this would be something that is handled internally inside of tekton's controllers.