pravega / zookeeper-operator

Kubernetes Operator for Zookeeper
Apache License 2.0

Helm Chart pre-delete hook results in "Error: job failed: DeadlineExceeded" #324

Open mogul opened 3 years ago

mogul commented 3 years ago

Description

We used Helm to install the zookeeper-operator chart on Kubernetes 1.19. When we run helm uninstall zookeeper, we see:

% helm uninstall zookeeper -n kube-system                                                    
W0423 17:24:43.013279   86682 warnings.go:67] rbac.authorization.k8s.io/v1beta1 ClusterRole is deprecated in v1.17+, unavailable in v1.22+; use rbac.authorization.k8s.io/v1 ClusterRole
W0423 17:24:43.301953   86682 warnings.go:67] rbac.authorization.k8s.io/v1beta1 ClusterRole is deprecated in v1.17+, unavailable in v1.22+; use rbac.authorization.k8s.io/v1 ClusterRole
W0423 17:24:43.432049   86682 warnings.go:67] rbac.authorization.k8s.io/v1beta1 ClusterRoleBinding is deprecated in v1.17+, unavailable in v1.22+; use rbac.authorization.k8s.io/v1 ClusterRoleBinding
W0423 17:24:43.890420   86682 warnings.go:67] rbac.authorization.k8s.io/v1beta1 ClusterRoleBinding is deprecated in v1.17+, unavailable in v1.22+; use rbac.authorization.k8s.io/v1 ClusterRoleBinding
Error: job failed: DeadlineExceeded

...and the release is stuck in state "uninstalling":

% helm ls -a -A        
NAME            NAMESPACE       REVISION        UPDATED                                 STATUS          CHART                           APP VERSION
zookeeper       kube-system     1               2021-04-23 15:57:38.450719 -0700 PDT    uninstalling    zookeeper-operator-0.2.10       0.2.10     
% helm status zookeeper -n kube-system --show-desc
NAME: zookeeper
LAST DEPLOYED: Fri Apr 23 15:57:38 2021
NAMESPACE: kube-system
STATUS: uninstalling
REVISION: 1
DESCRIPTION: Deletion in progress (or silently failed)
TEST SUITE: None

Importance

blocker: We are trying to automate everything we do with terraform and this prevents us from being able to run terraform destroy without having to manually intervene to remove the release.

Location

This appears to be a result of the code introduced in https://github.com/pravega/zookeeper-operator/pull/301

When we try uninstalling with debugging on we see:

% helm uninstall zookeeper -n kube-system --debug
uninstall.go:93: [debug] uninstall: Deleting zookeeper
client.go:268: [debug] Starting delete for "zookeeper-zookeeper-operator-pre-delete" ServiceAccount
client.go:122: [debug] creating 1 resource(s)
client.go:268: [debug] Starting delete for "zookeeper-zookeeper-operator-pre-delete" ConfigMap
client.go:122: [debug] creating 1 resource(s)
client.go:268: [debug] Starting delete for "zookeeper-zookeeper-operator-pre-delete" ClusterRole
W0423 17:11:21.360925   85624 warnings.go:67] rbac.authorization.k8s.io/v1beta1 ClusterRole is deprecated in v1.17+, unavailable in v1.22+; use rbac.authorization.k8s.io/v1 ClusterRole
client.go:122: [debug] creating 1 resource(s)
W0423 17:11:21.680620   85624 warnings.go:67] rbac.authorization.k8s.io/v1beta1 ClusterRole is deprecated in v1.17+, unavailable in v1.22+; use rbac.authorization.k8s.io/v1 ClusterRole
client.go:268: [debug] Starting delete for "zookeeper-zookeeper-operator-pre-delete" ClusterRoleBinding
W0423 17:11:21.812071   85624 warnings.go:67] rbac.authorization.k8s.io/v1beta1 ClusterRoleBinding is deprecated in v1.17+, unavailable in v1.22+; use rbac.authorization.k8s.io/v1 ClusterRoleBinding
client.go:122: [debug] creating 1 resource(s)
W0423 17:11:22.137746   85624 warnings.go:67] rbac.authorization.k8s.io/v1beta1 ClusterRoleBinding is deprecated in v1.17+, unavailable in v1.22+; use rbac.authorization.k8s.io/v1 ClusterRoleBinding
client.go:268: [debug] Starting delete for "zookeeper-zookeeper-operator-pre-delete" Job
client.go:297: [debug] jobs.batch "zookeeper-zookeeper-operator-pre-delete" not found
client.go:122: [debug] creating 1 resource(s)
client.go:477: [debug] Watching for changes to Job zookeeper-zookeeper-operator-pre-delete with timeout of 5m0s
client.go:505: [debug] Add/Modify event for zookeeper-zookeeper-operator-pre-delete: ADDED
client.go:544: [debug] zookeeper-zookeeper-operator-pre-delete: Jobs active: 1, jobs failed: 0, jobs succeeded: 0
client.go:505: [debug] Add/Modify event for zookeeper-zookeeper-operator-pre-delete: MODIFIED
client.go:268: [debug] Starting delete for "zookeeper-zookeeper-operator-pre-delete" Job
Error: job failed: DeadlineExceeded
helm.go:81: [debug] job failed: DeadlineExceeded
helm.sh/helm/v3/pkg/kube.(*Client).waitForJob
        /private/tmp/helm-20201209-32798-18s27r6/pkg/kube/client.go:540
helm.sh/helm/v3/pkg/kube.(*Client).watchUntilReady.func1
        /private/tmp/helm-20201209-32798-18s27r6/pkg/kube/client.go:508
k8s.io/client-go/tools/watch.UntilWithoutRetry
        /Users/brew/Library/Caches/Homebrew/go_mod_cache/pkg/mod/k8s.io/client-go@v0.19.4/tools/watch/until.go:80
k8s.io/client-go/tools/watch.UntilWithSync
        /Users/brew/Library/Caches/Homebrew/go_mod_cache/pkg/mod/k8s.io/client-go@v0.19.4/tools/watch/until.go:151
helm.sh/helm/v3/pkg/kube.(*Client).watchUntilReady
        /private/tmp/helm-20201209-32798-18s27r6/pkg/kube/client.go:495
helm.sh/helm/v3/pkg/kube.(*Client).watchTimeout.func1
        /private/tmp/helm-20201209-32798-18s27r6/pkg/kube/client.go:305
helm.sh/helm/v3/pkg/kube.batchPerform.func1
        /private/tmp/helm-20201209-32798-18s27r6/pkg/kube/client.go:357
runtime.goexit
        /usr/local/Cellar/go/1.15.5/libexec/src/runtime/asm_amd64.s:1374

We looked at the pre-delete hook and saw that it's checking for existing Zookeeper instances... We didn't create any while the chart was installed, and when we run the command from the hook we can confirm there are none:

% kubectl get zookeepercluster --all-namespaces --no-headers                                 
No resources found
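
For reference, the kind of check described above can be sketched in shell. This is not the chart's actual hook script; the function names are hypothetical, and the only command taken from this report is the kubectl get zookeepercluster call. It shows that an empty listing yields a count of zero, because kubectl writes "No resources found" to stderr rather than stdout.

```shell
# Hypothetical helpers; NOT the script shipped in the chart's pre-delete hook.

# Print the number of ZookeeperCluster resources across all namespaces.
# "No resources found" goes to stderr, so stdout is empty when none exist.
# Caveat: if kubectl itself fails, this also prints 0.
count_zookeeper_clusters() {
  kubectl get zookeepercluster --all-namespaces --no-headers 2>/dev/null \
    | wc -l | tr -d ' '
}

# Succeed only when no ZookeeperCluster resources remain.
safe_to_uninstall_operator() {
  [ "$(count_zookeeper_clusters)" -eq 0 ]
}

# Usage:
#   safe_to_uninstall_operator && echo "no clusters left"
```

With no clusters present, as in our case, a check of this shape would pass, which is why the DeadlineExceeded failure is surprising.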

Suggestions for an improvement

We can get around this manually for now by skipping the hooks during uninstall:

% helm uninstall zookeeper -n kube-system --no-hooks --debug                                 
uninstall.go:93: [debug] uninstall: Deleting zookeeper
uninstall.go:104: [debug] delete hooks disabled for zookeeper
client.go:268: [debug] Starting delete for "zookeeper-zookeeper-operator" Deployment
client.go:268: [debug] Starting delete for "zookeeper-zookeeper-operator" RoleBinding
client.go:268: [debug] Starting delete for "zookeeper-zookeeper-operator" Role
client.go:268: [debug] Starting delete for "zookeeper-zookeeper-operator" ClusterRoleBinding
client.go:268: [debug] Starting delete for "zookeeper-zookeeper-operator" ClusterRole
client.go:268: [debug] Starting delete for "zookeeperclusters.zookeeper.pravega.io" CustomResourceDefinition
W0423 17:33:02.716749   87396 warnings.go:67] apiextensions.k8s.io/v1beta1 CustomResourceDefinition is deprecated in v1.16+, unavailable in v1.22+; use apiextensions.k8s.io/v1 CustomResourceDefinition
client.go:268: [debug] Starting delete for "zookeeper-operator" ServiceAccount
uninstall.go:130: [debug] purge requested for zookeeper
release "zookeeper" uninstalled

We can use the disable_webhooks option in the Terraform provider to get the same result, but that will skip all hooks (which is probably a bad thing to do... not sure what other hooks the chart has in it).

For our current situation the best workaround is to use the previous version of the chart, but we'd rather not miss out on future improvements, so we're hoping to see this fixed.

anishakj commented 3 years ago

@mogul Have you uninstalled the zookeeper cluster before uninstalling the zookeeper operator? You can check by using the kubectl get zk command.

mogul commented 3 years ago

Yes:

We looked at the pre-delete hook and saw that it's checking for existing Zookeeper instances... We didn't create any while the chart was installed, and when we run the command from the hook we can confirm there are none:

anishakj commented 3 years ago

@mogul Could you please paste the logs from the pre-delete hook pod that gets created?

SrishT commented 3 years ago

@mogul if the pre-delete hook is something you do not need, you can easily disable it by setting hooks.delete to false while installing the zookeeper operator. But in order to understand why the job is failing for you, we would need to see the logs from the pre-delete hook pod that gets created.
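
A minimal sketch of installing with the pre-delete hook disabled, using the hooks.delete value mentioned above. The function name is hypothetical, and the chart reference pravega/zookeeper-operator assumes a Helm repository alias named pravega has already been added; adjust it to however the chart is sourced in your setup.

```shell
# Hypothetical wrapper; assumes a Helm repo alias "pravega" that serves the
# zookeeper-operator chart has already been added with `helm repo add`.
# $1 = release name, $2 = namespace
install_operator_without_delete_hook() {
  helm install "$1" pravega/zookeeper-operator \
    --namespace "$2" \
    --set hooks.delete=false
}

# Usage:
#   install_operator_without_delete_hook zookeeper kube-system
```

With the hook disabled, nothing checks for leftover ZookeeperCluster resources at uninstall time, so that check becomes a manual step.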

anishakj commented 3 years ago

@mogul Could you please provide the logs if you are still seeing the issue? Otherwise, can we close this?

anishakj commented 3 years ago

Closing this issue as there has been no response from the submitter. Please feel free to reopen the issue with logs if it is seen again.

mogul commented 3 years ago

Hello, I'm once again hitting this problem now that the solr-operator requires zookeeper-operator 0.2.12.

@mogul if the pre-delete hook is something you do not need, you can easily disable it by setting hooks.delete to false while installing the zookeeper operator

This was enormously helpful, thanks! I'm able to use this setting to stay on 0.2.12 now despite the pre-delete hook problem.

but in order to understand why the job is failing for you, we would need to see the logs within pre-delete hook pod that gets created.

I tried to capture logs of the pre-delete pod, but the time between the job starting and the DeadlineExceeded message in the logs quoted above is just a few seconds:

client.go:268: [debug] Starting delete for "zookeeper-zookeeper-operator-pre-delete" Job
client.go:297: [debug] jobs.batch "zookeeper-zookeeper-operator-pre-delete" not found
client.go:122: [debug] creating 1 resource(s)
client.go:477: [debug] Watching for changes to Job zookeeper-zookeeper-operator-pre-delete with timeout of 5m0s
client.go:505: [debug] Add/Modify event for zookeeper-zookeeper-operator-pre-delete: ADDED
client.go:544: [debug] zookeeper-zookeeper-operator-pre-delete: Jobs active: 1, jobs failed: 0, jobs succeeded: 0
client.go:505: [debug] Add/Modify event for zookeeper-zookeeper-operator-pre-delete: MODIFIED
client.go:268: [debug] Starting delete for "zookeeper-zookeeper-operator-pre-delete" Job
[...JUST A FEW SECONDS...]
Error: job failed: DeadlineExceeded
helm.go:81: [debug] job failed: DeadlineExceeded

The pod is created and then gone again so fast that I'm not sure how to capture them... Is there some kubectl magic that would help with that? Or maybe the deadline is being expressed in the wrong magnitude units...?
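
One possible way to catch a short-lived hook pod is to poll for it from a second terminal and dump its description and logs as soon as it appears. The sketch below is only that: the function name and polling interval are made up, and it relies on the job-name label that Kubernetes sets on pods created by a Job.

```shell
# Hypothetical helper: poll for the pod of a Job and print its details.
# $1 = namespace, $2 = job name, $3 = max attempts (default 120, ~1s apart)
capture_hook_job_logs() {
  i=0
  while [ "$i" -lt "${3:-120}" ]; do
    # Pods created by a Job carry the label job-name=<job>.
    pod="$(kubectl get pods -n "$1" -l "job-name=$2" -o name 2>/dev/null | head -n 1)"
    if [ -n "$pod" ]; then
      # describe includes events, useful if the container never started.
      kubectl describe -n "$1" "$pod"
      kubectl logs -n "$1" "$pod" --all-containers
      return 0
    fi
    i=$((i + 1))
    sleep 1
  done
  return 1
}

# Usage (start this first, then run `helm uninstall ... --debug` elsewhere):
#   capture_hook_job_logs kube-system zookeeper-zookeeper-operator-pre-delete
```

If the pod disappears within a second, this can still miss it; in that case the describe output, when it is captured, is usually more informative than the logs.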

anishakj commented 3 years ago

@mogul Could you please try collecting the logs by removing the delete-policy annotation from the job: "helm.sh/hook-delete-policy": hook-succeeded, before-hook-creation, hook-failed
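
One way to do this is to work from a local copy of the chart and strip the annotation there. The sketch below assumes the chart has been unpacked locally (for example with helm pull --untar), that the annotation sits on a single line, and that the hook templates live under templates/; the function name is hypothetical.

```shell
# Hypothetical helper: remove every "helm.sh/hook-delete-policy" line from
# the templates of an unpacked chart directory.
# $1 = path to the unpacked chart
strip_hook_delete_policy() {
  grep -rl 'helm.sh/hook-delete-policy' "$1/templates" | while IFS= read -r file; do
    # Write to a temp file and move it, to avoid non-portable `sed -i`.
    sed '/helm.sh\/hook-delete-policy/d' "$file" > "$file.tmp" && mv "$file.tmp" "$file"
    echo "patched $file"
  done
}

# Usage (chart reference and version are assumptions):
#   helm pull pravega/zookeeper-operator --version 0.2.12 --untar
#   strip_hook_delete_policy ./zookeeper-operator
#   helm install zookeeper ./zookeeper-operator -n kube-system
```

Without the annotation Helm falls back to its default policy, before-hook-creation, so a failed Job and its pod should remain after the uninstall attempt and can be inspected with kubectl logs. Because hooks are stored with the release, the release has to be installed or upgraded from the patched chart before uninstalling.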

anishakj commented 2 years ago

@mogul Could you please provide the logs?