teamhephy / controller

Hephy Workflow Controller (API)
https://teamhephy.com
MIT License

After deis apps:destroy the namespace stuck at Terminating forever #76

Closed edisonwang closed 5 years ago

edisonwang commented 5 years ago

I've had this issue since I first started using deis, and I found a related issue at the following link: https://github.com/kubernetes/kubernetes/issues/60807#issuecomment-408599873

It seems to be caused by this snippet in the namespace configuration, but I'm not sure how it ended up there.

  "finalizers": [
            "kubernetes"
        ]

The cluster has Rook.io and the following Helm charts installed:

$kubectl -n rook-ceph exec -it rook-ceph-mon0-snxrq rook version
 rook: v0.8.1
$ helm ls 
NAME                REVISION    UPDATED                     STATUS      CHART                       NAMESPACE
alertmanager        1           Thu Aug 23 01:33:20 2018    DEPLOYED    alertmanager-0.1.7          monitoring
grafana             1           Thu Aug 23 01:33:30 2018    DEPLOYED    grafana-0.0.37              monitoring
hephy               3           Tue Sep  4 23:50:11 2018    DEPLOYED    workflow-v2.19.4            deis
kube-prometheus     1           Thu Aug 23 01:34:16 2018    DEPLOYED    kube-prometheus-0.0.105     monitoring
kube-system         1           Sun Sep 30 13:40:09 2018    FAILED      kubedb-catalog-0.9.0-beta.0 default
kubedb-catalog      1           Sun Sep 30 13:34:12 2018    DEPLOYED    kubedb-catalog-0.9.0-beta.0 default
kubedb-operator     15          Sun Sep 30 13:50:37 2018    DEPLOYED    kubedb-0.8.0                kube-system
phpmyadmin          1           Tue Sep 18 12:02:55 2018    DEPLOYED    phpmyadmin-0.1.10           phpmyadmin
prometheus          1           Thu Aug 23 01:33:09 2018    DEPLOYED    prometheus-0.0.51           monitoring
prometheus-operator 1           Thu Aug 23 01:32:47 2018    DEPLOYED    prometheus-operator-0.0.29  monitoring
traefik             5           Sun Sep 30 00:44:44 2018    DEPLOYED    traefik-1.44.0              default
$ deis create --no-remote                                                                           
Creating Application... done, created excess-ironwood
If you want to add a git remote for this app later, use `deis git:remote -a excess-ironwood`
$  kubectl get ns excess-ironwood
NAME              STATUS   AGE
excess-ironwood   Active   23s
 $  deis apps:destroy -a excess-ironwood
 !    WARNING: Potentially Destructive Action
 !    This command will destroy the application: excess-ironwood
 !    To proceed, type "excess-ironwood" or re-run this command with --confirm=excess-ironwood

> excess-ironwood
Destroying excess-ironwood...
done in 11s
$  kubectl get ns excess-ironwood
NAME              STATUS        AGE
excess-ironwood   Terminating   87s
 $ kubectl get ns excess-ironwood -o json 
{
    "apiVersion": "v1",
    "kind": "Namespace",
    "metadata": {
        "creationTimestamp": "2018-10-02T00:06:03Z",
        "deletionTimestamp": "2018-10-02T00:07:25Z",
        "labels": {
            "heritage": "deis"
        },
        "name": "excess-ironwood",
        "resourceVersion": "8323427",
        "selfLink": "/api/v1/namespaces/excess-ironwood",
        "uid": "f3cb5433-c5d6-11e8-9cb4-fa163e5ab78f"
    },
    "spec": {
        "finalizers": [
            "kubernetes"
        ]
    },
    "status": {
        "phase": "Terminating"
    }
}
kingdonb commented 5 years ago

Thank you for the output of helm ls! That's very helpful.

We will investigate this. As you and I discussed, we are not sure this issue has anything to do with Workflow itself, so there may or may not be any change required in the controller and workflow-cli repos. One possible resolution would be to add a flag to workflow-cli that tells it to clean up a finalizer, e.g. deis apps:destroy -a <appname> -f kubernetes to remove the kubernetes finalizer as you've shown here.

It would perhaps be better to find out why this seems to be a new issue and get it fixed upstream. From our research, it seems the finalizer is doing exactly what it's designed to do:

https://kubernetes.io/docs/tasks/access-kubernetes-api/custom-resources/custom-resource-definitions/#finalizers

Finalizers allow controllers to implement asynchronous pre-delete hooks. Custom objects support finalizers just like built-in objects.

You can add a finalizer to a custom object like this:

apiVersion: "stable.example.com/v1"
kind: CronTab
metadata:
  finalizers:
  - finalizer.stable.example.com

Finalizers are arbitrary string values, that when present ensure that a hard delete of a resource is not possible while they exist.

So presumably some process has left this behind when it should have been cleaned up automatically, or maybe there is something else going on. For namespaces, the kubernetes entry in spec.finalizers is present on every namespace, and the namespace controller removes it only after everything inside the namespace has been deleted, so a namespace stuck in Terminating usually means that cleanup could not finish. I would hesitate to advise you to simply delete the finalizer with kubectl, or with the proposed deis apps:destroy -a <appname> -f kubernetes.
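
For reference, the manual workaround that circulates for namespaces stuck on this finalizer looks roughly like the sketch below. It is only a sketch, not something we recommend running blindly: it assumes kubectl proxy is listening on 127.0.0.1:8001, that jq is installed, and it reuses the example app name from above.

$ kubectl get ns excess-ironwood -o json | jq '.spec.finalizers = []' > /tmp/ns.json
$ kubectl proxy &
$ curl -H "Content-Type: application/json" -X PUT --data-binary @/tmp/ns.json \
    http://127.0.0.1:8001/api/v1/namespaces/excess-ironwood/finalize

The PUT against the /finalize subresource is the same call the namespace controller makes once cleanup completes; forcing it can leave orphaned resources behind, which is why finding the root cause is preferable.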

Thanks again for the report and documentation!

edisonwang commented 5 years ago

I finally figured out what's going on (sort of). I noticed an error in my metrics-server pod (it showed CrashLoopBackOff), caused by an incomplete deploy yaml (the same issue described here: https://github.com/kubernetes-incubator/metrics-server/issues/105), and somehow this caused the finalizer problem. After I fixed it, all the stuck namespaces were gone, and creating and destroying new apps went smoothly.
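
For anyone else who hits this: I believe the connection is that namespace deletion has to enumerate resources in every registered API group, so when the aggregated metrics.k8s.io API (served by metrics-server) is unavailable, the namespace controller cannot finish and the namespace stays in Terminating. A quick way to check is something like the following (a general diagnostic sketch, not the exact commands I ran; the k8s-app=metrics-server label is what the stock manifests use and may differ on your cluster):

$ kubectl get apiservice v1beta1.metrics.k8s.io
$ kubectl get apiservice v1beta1.metrics.k8s.io -o jsonpath='{.status.conditions[?(@.type=="Available")].message}'
$ kubectl -n kube-system get pods -l k8s-app=metrics-server

If the APIService reports Available=False, fixing or removing metrics-server should unstick the namespaces.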

kingdonb commented 5 years ago

So it looks like metrics-server is installed on most 1.8+ clusters by default:

https://kubernetes.io/docs/tasks/debug-application-cluster/core-metrics-pipeline/#metrics-server

I am going to keep this open until it's clearer to me how this is reproduced (and what upstreams need to change so new users won't hit this issue anymore).
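
If my reading of the upstream reports is right, it should be reproducible by making the metrics APIService unavailable and then destroying an app, roughly like the sketch below. It is only a sketch: it assumes the stock metrics-server Deployment name in kube-system, and I have not verified it end to end.

$ kubectl -n kube-system scale deployment metrics-server --replicas=0
$ deis create --no-remote
$ deis apps:destroy -a <appname> --confirm=<appname>
$ kubectl get ns <appname>
# expected: the namespace sits in Terminating until metrics-server is scaled back up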

kingdonb commented 5 years ago

After reading through some more of the reports, it looks like this was the bad diff: https://github.com/kubernetes-incubator/metrics-server/commit/a823af80d438d642c29e038ca5336004b2a8b97e#diff-241930222cfd9fbea1ea3654fbabff5b

and it appears that it's being tracked in kubernetes-incubator/metrics-server#97 (although it does not look like it has been fixed yet)

Cryptophobia commented 5 years ago

This turned out to be an issue with metrics-server rather than Hephy.