grafana / tanka

Flexible, reusable and concise configuration for Kubernetes
https://tanka.dev
Apache License 2.0

exit status 1 on tanka delete without helpful log #1229

Closed: Its-Alex closed this issue 6 days ago

Its-Alex commented 2 weeks ago

Hello 👋

I'm running into a bug that I can't solve: after installing hivemq/hivemq-operator version 0.11.14 on a cluster with tanka, I can't delete it:

$ tk delete --name kind-local environments/dev
...
Deleting from namespace 'default' of cluster 'kind-local' at 'https://127.0.0.1:6443' using context 'kind-local'.
Please type 'yes' to confirm: yes
Error: exit status 1

You can find the versions of all the tooling I use in https://github.com/Its-Alex/bug-tanka-delete-without-log/blob/main/.mise.toml, and the bug is reproducible with https://github.com/Its-Alex/bug-tanka-delete-without-log/.

Tell me if you need any other information. For now, I will try to find where the error happens in the tanka source code.

Thanks for your precious time.

Its-Alex commented 2 weeks ago

When deleting resources directly from the exported files (from tk export), I get an error on the file containing the custom resource:

$ kubectl delete -f hivemq.com-v1.HiveMQCluster-hivemq-operator.yaml
Error from server (NotFound): error when deleting "hivemq.com-v1.HiveMQCluster-hivemq-operator.yaml": the server could not find the requested resource (delete hivemq-clusters.hivemq.com hivemq-operator)

I will try to dig further.

Its-Alex commented 2 weeks ago

Using a target filter that excludes the HiveMQCluster resource:

$ tk delete --name kind-local environments/dev -t '!HiveMQCluster/.+'
...
Deleting from namespace 'default' of cluster 'kind-local' at 'https://127.0.0.1:6443' using context 'kind-local'.
Please type 'yes' to confirm: yes
Delete failed: Error from server (NotFound): deployments.apps "hivemq-operator-operator" not found
Delete failed: Error from server (NotFound): services "hivemq-operator-operator" not found
Delete failed: Warning: deleting cluster-scoped resources, not scoped to the provided namespace
Error from server (NotFound): clusterroles.rbac.authorization.k8s.io "hivemq-operator-operator" not found
Delete failed: Error from server (NotFound): customresourcedefinitions.apiextensions.k8s.io "hivemq-clusters.hivemq.com" not found
Delete failed: Error from server (NotFound): configmaps "hivemq-operator-operator-templates" not found
Delete failed: Error from server (NotFound): serviceaccounts "hivemq-operator-operator" not found
Delete failed: Error from server (NotFound): serviceaccounts "hivemq-operator-hivemq" not found

That works, but I have two questions:

zerok commented 1 week ago

Sorry for the delay 🙂 I'll try to find time for debugging this this week.

Its-Alex commented 1 week ago

@zerok No problem, thanks for your answer 🙏

zerok commented 1 week ago

Thank YOU for that great test project! I've played around with the logging there and so far I see this error:

error: the server doesn't have a resource type "HiveMQCluster"

I get the same error from kubectl:

❯ kubectl --namespace kind-local get HiveMQCluster
error: the server doesn't have a resource type "HiveMQCluster"

When I disable the auto-creation (operator.deployCr = false) and manually create an instance of the HiveMQCluster kind, the creation works, but delete again returns the same error as before.

From what I can see so far, this problem comes from the way that resource is named:

kind: HiveMQCluster
singular: hivemq-cluster

If I add something like this to the delete logic in Tanka...

    if kind == "HiveMQCluster" {
        kind = "hivemq-cluster"
    }

... then the deletion works. Underneath, we are using kubectl for pretty much all operations. This means that we'd either have to translate the Kind to its singular/plural form there or find a way to make kubectl accept the Kind of a resource.

From what I can tell, we might be lucky in that there is support for the Type.Version.Group format:

❯ k get HiveMQCluster.v1.hivemq.com
NAME   SIZE   IMAGE            VERSION   STATUS     ENDPOINT   MESSAGE
test   3      hivemq/hivemq4   4.3.3     Updating              Waiting for deployment to become ready, ready: 0/3

❯ k get HiveMQCluster
error: the server doesn't have a resource type "HiveMQCluster"

So perhaps we can get away with modifying the delete abstraction to use not only the Kind of a resource but also its version and group.
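The idea above can be sketched in Go as follows. This is not the prototype mentioned in this thread: `qualifiedKind` is a hypothetical helper that derives the `Type.Version.Group` form from a manifest's `apiVersion` and `kind`. Falling back to the bare Kind for core resources (an `apiVersion` such as `v1`, which has no group) is an assumption made for this sketch.

```go
package main

import (
	"fmt"
	"strings"
)

// qualifiedKind (hypothetical helper) builds the Kind.version.group form,
// e.g. "HiveMQCluster.v1.hivemq.com", from a manifest's apiVersion and kind.
func qualifiedKind(apiVersion, kind string) string {
	// apiVersion is "group/version" for non-core resources.
	group, version, found := strings.Cut(apiVersion, "/")
	if !found {
		// Core resources use an apiVersion like "v1" with no group.
		// Assumption: the bare Kind is sufficient in that case.
		return kind
	}
	return fmt.Sprintf("%s.%s.%s", kind, version, group)
}

func main() {
	fmt.Println(qualifiedKind("hivemq.com/v1", "HiveMQCluster"))
	fmt.Println(qualifiedKind("apps/v1", "Deployment"))
	fmt.Println(qualifiedKind("v1", "ConfigMap"))
}
```

The resulting string could then be passed to kubectl in place of the bare Kind, which avoids hardcoding a Kind-to-singular mapping for each custom resource.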

I have a working prototype that seems to fix that. Will iterate on it a bit and then create a PR 🙂
