projectsyn / lieutenant-operator

The Project Syn Inventory API Operator
https://docs.syn.tools/lieutenant-operator/
BSD 3-Clause "New" or "Revised" License

Panic on Cluster Deletion #183

glrf closed 2 years ago

glrf commented 2 years ago

When deleting a cluster, the operator panics with a nil pointer dereference while trying to remove the steward secret.

Steps to Reproduce the Problem

It is unclear exactly which steps lead to this. The general steps were:

  1. Create cluster
  2. Reset the bootstrap token a day later
  3. Remove all secrets referenced by the cluster catalog in Vault (according to https://kb.vshn.ch/vshnsyn/how-tos/decommission.html)
  4. Delete cluster

Actual Behavior

The operator panics while trying to handle the deletion of a cluster.

Cluster Resource

apiVersion: syn.tools/v1alpha1
kind: Cluster
metadata:
  creationTimestamp: "2021-07-27T07:10:58Z"
  deletionGracePeriodSeconds: 0
  deletionTimestamp: "2021-07-28T13:58:00Z"
  finalizers:
  - cluster.lieutenant.syn.tools
  generation: 6
  labels:
    syn.tools/tenant: t-ancient-morning-1764
  name: c-cold-morning-3608
  namespace: lieutenant-int
  ownerReferences:
  - apiVersion: syn.tools/v1alpha1
    blockOwnerDeletion: true
    controller: true
    kind: Tenant
    name: t-ancient-morning-1764
    uid: 4e48dae4-8604-4dfb-9422-51792da07c5d
  resourceVersion: "315984491"
  selfLink: /apis/syn.tools/v1alpha1/namespaces/lieutenant-int/clusters/c-cold-morning-3608
  uid: ae1f94ff-4725-45a9-967d-aa09b12ee096
spec:
  displayName: APPUiO OCP4 Exoscale Setup Test
  facts:
    cloud: exoscale
    distribution: openshift4
    lieutenant-instance: lieutenant-int
    region: ch-gva-2
    service_level: zero
  gitHostKeys: |
    git.vshn.net ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQCnE1dMkh+3uHWck+cTvQqeNUW0lj1uVcIC9JX2Tg6gmkKCYA73+o+I7vo4g6nPtSOAfITvYdHJLzwE9GwlSFsXHMR9q0ErWl2wC+w6FawLMz9//5XqiBi2qq/8WnWp3ecY16jDoGRW4eymT+USFHKJVi696XBy3WE/0BBapPZ58WPqkKN6A27qkIK6FehI80f+zN4ZqikdwWuCFs35fsimcmLnWqWPm8zbOkgCiB+ov4O/xmRNHwJWCk/qzU6X/M9YtMXzAa5mjwDvcHSAizFD3a3Fv68G1VsmRZ0THLrRKM/WOxrWNZoimSNgyjTzoCwiKeckvL5+hpNcNSW+eBPt
    git.vshn.net ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIO9EkPcVdsz/oVTI2VJkBlq8Mv/dg3rhcbgzAEKyiwUG
  gitRepoTemplate:
    apiSecretRef:
      name: vshn-gitlab
    deletionPolicy: Delete
    deployKeys:
      steward:
        key: |
          AAAAB3NzaC1yc2EAAAADAQABAAACAQC9WS07j2M7Sz5+ox8ew7y3bZJ0OHA6lkSNAXu+eUvVTOqlCMFjujaNZo5tX+019e/KZnhi2/JtBK8mCTXAyzs3xrJvYbACIOwHa33IAhfyEmYa/KtNYJK3dhYVclh10+jUJqMo5cK/41vIw2ApCEMykpbU4rPFsomjGp7igcGq9Zb3vyvf1dtgVJ0bf5psb45a6dsnKSoHMqxGkrfTj9kb/kURMSLkGGSxEdhUSkonbAdNToq+2TjTJEPWM488r7MlG/rsd/7+RQhzHGD8V/+90dDzyJ5YnEfpkCrPB2UxVJWRt5ccMTpnuPtuzrn57NY0mruWpK0JmlnryoEv0aoDaT80YSqZVy09WKcXc4bZ2oR2oaFJkKEfBfo5nVsEaX6fEUYUxX8ALgF1+7SDBwYwg2+/km4o/flUE3UhP0fqpkTGOlB9W/hkZe6ksNPcSWtuPJmeVRghHoY19Kw2nIZ0+3CDpEIWopfmzZNzI8A0T0xv1gNFyM+NiIrDb0ju2YL090WJ4X9iwdIluRMmoL0Nu2yLB/YIHRlFPszZma6c2ZPTeq2O0o7zxqk0ynT8GGiyD4Ns1X+ei2k5uAi7pxG2KdrrF0NpNPBpCHefe184ZtSie5ySkn2agTayyLZGJrbqxa/9uI+2KgIyhyvI5MaMziOP9PhrIhcjtombNedukQ==
        type: ssh-rsa
    displayName: APPUiO OCP4 Exoscale Setup Test
    path: syn-dev/cluster-catalogs
    repoName: c-cold-morning-3608
    repoType: auto
  gitRepoURL: ssh://git@git.vshn.net/syn-dev/cluster-catalogs/c-cold-morning-3608.git
  tenantRef:
    name: t-ancient-morning-1764
status:
  bootstrapToken:
    token: <token>
    validUntil: "2021-07-29T13:10:21Z"

Vault Secret

$ vault kv get clusters/kv/t-ancient-morning-1764/c-cold-morning-3608/steward 
====== Metadata ======
Key              Value
---              -----
created_time     2021-07-27T07:10:58.227801984Z
deletion_time    n/a
destroyed        false
version          1

==== Data ====
Key      Value
---      -----
token    <token>

The original secret should still be present on vault-int.

Log

{"level":"info","ts":1627541860.7444935,"logger":"controller_cluster","msg":"Reconciling Cluster","Request.Namespace":"lieutenant-int","Request.Name":"c-cold-morning-3608"}
E0729 06:57:40.976044       1 runtime.go:78] Observed a panic: "invalid memory address or nil pointer dereference" (runtime error: invalid memory address or nil pointer dereference)
goroutine 2089 [running]:
k8s.io/apimachinery/pkg/util/runtime.logPanic(0x1551940, 0x223a5e0)
    /go/pkg/mod/k8s.io/apimachinery@v0.17.4/pkg/util/runtime/runtime.go:74 +0xa6
k8s.io/apimachinery/pkg/util/runtime.HandleCrash(0x0, 0x0, 0x0)
    /go/pkg/mod/k8s.io/apimachinery@v0.17.4/pkg/util/runtime/runtime.go:48 +0x86
panic(0x1551940, 0x223a5e0)
    /usr/local/go/src/runtime/panic.go:965 +0x1b9
github.com/projectsyn/lieutenant-operator/pkg/vault.(*BankVaultClient).removeSecret(0xc000bf6680, 0xc0007350c0, 0x2a, 0x0, 0x0, 0x20, 0x14fb360)
    /app/pkg/vault/client.go:178 +0x1d0
github.com/projectsyn/lieutenant-operator/pkg/vault.(*BankVaultClient).RemoveSecrets(0xc000bf6680, 0xc0004af4a0, 0x1, 0x1, 0xc000bf6680, 0x0)
    /app/pkg/vault/client.go:153 +0x7e
github.com/projectsyn/lieutenant-operator/pkg/vault.HandleVaultDeletion(0x192d338, 0xc000b396c0, 0xc00047dd40, 0x203000, 0x0, 0x0, 0xc00042de00)
    /app/pkg/vault/reconcile_steps.go:99 +0x2b9
github.com/projectsyn/lieutenant-operator/pkg/pipeline.RunPipeline(0x192d338, 0xc000b396c0, 0xc00047dd40, 0xc0006a19a8, 0x8, 0x8, 0x0, 0xc0006a19c0, 0x40e078, 0x30)
    /app/pkg/pipeline/pipeline.go:61 +0xb1
github.com/projectsyn/lieutenant-operator/pkg/controller/cluster.clusterSpecificSteps(0x192d338, 0xc000b396c0, 0xc00047dd40, 0x0, 0x0, 0x0, 0x0)
    /app/pkg/controller/cluster/cluster_reconcile.go:70 +0x1d8
github.com/projectsyn/lieutenant-operator/pkg/pipeline.RunPipeline(0x192d338, 0xc000b396c0, 0xc00047dd40, 0xc0006a1bd8, 0x6, 0x6, 0x13, 0x18f7b60, 0xc000b396c0, 0x0)
    /app/pkg/pipeline/pipeline.go:61 +0xb1
github.com/projectsyn/lieutenant-operator/pkg/controller/cluster.(*ReconcileCluster).Reconcile(0xc000678558, 0xc000c07810, 0xe, 0xc000755170, 0x13, 0x1aeac146c1, 0xc000364480, 0xc00014e788, 0xc00014e750)
    /app/pkg/controller/cluster/cluster_reconcile.go:52 +0x570
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler(0xc000660300, 0x15bb220, 0xc0000ab620, 0x0)
    /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.5.2/pkg/internal/controller/controller.go:256 +0x166
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem(0xc000660300, 0xc000308500)
    /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.5.2/pkg/internal/controller/controller.go:232 +0xb0
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).worker(...)
    /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.5.2/pkg/internal/controller/controller.go:211
k8s.io/apimachinery/pkg/util/wait.JitterUntil.func1(0xc00067a420)
    /go/pkg/mod/k8s.io/apimachinery@v0.17.4/pkg/util/wait/wait.go:152 +0x5f
k8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc00067a420, 0x3b9aca00, 0x0, 0x1, 0xc000114420)
    /go/pkg/mod/k8s.io/apimachinery@v0.17.4/pkg/util/wait/wait.go:153 +0x105
k8s.io/apimachinery/pkg/util/wait.Until(0xc00067a420, 0x3b9aca00, 0xc000114420)
    /go/pkg/mod/k8s.io/apimachinery@v0.17.4/pkg/util/wait/wait.go:88 +0x4d
created by sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func1
    /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.5.2/pkg/internal/controller/controller.go:193 +0x32d
panic: runtime error: invalid memory address or nil pointer dereference [recovered]
    panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x30 pc=0x130e9f0]

goroutine 2089 [running]:
k8s.io/apimachinery/pkg/util/runtime.HandleCrash(0x0, 0x0, 0x0)
    /go/pkg/mod/k8s.io/apimachinery@v0.17.4/pkg/util/runtime/runtime.go:55 +0x109
panic(0x1551940, 0x223a5e0)
    /usr/local/go/src/runtime/panic.go:965 +0x1b9
github.com/projectsyn/lieutenant-operator/pkg/vault.(*BankVaultClient).removeSecret(0xc000bf6680, 0xc0007350c0, 0x2a, 0x0, 0x0, 0x20, 0x14fb360)
    /app/pkg/vault/client.go:178 +0x1d0
github.com/projectsyn/lieutenant-operator/pkg/vault.(*BankVaultClient).RemoveSecrets(0xc000bf6680, 0xc0004af4a0, 0x1, 0x1, 0xc000bf6680, 0x0)
    /app/pkg/vault/client.go:153 +0x7e
github.com/projectsyn/lieutenant-operator/pkg/vault.HandleVaultDeletion(0x192d338, 0xc000b396c0, 0xc00047dd40, 0x203000, 0x0, 0x0, 0xc00042de00)
    /app/pkg/vault/reconcile_steps.go:99 +0x2b9
github.com/projectsyn/lieutenant-operator/pkg/pipeline.RunPipeline(0x192d338, 0xc000b396c0, 0xc00047dd40, 0xc0006a19a8, 0x8, 0x8, 0x0, 0xc0006a19c0, 0x40e078, 0x30)
    /app/pkg/pipeline/pipeline.go:61 +0xb1
github.com/projectsyn/lieutenant-operator/pkg/controller/cluster.clusterSpecificSteps(0x192d338, 0xc000b396c0, 0xc00047dd40, 0x0, 0x0, 0x0, 0x0)
    /app/pkg/controller/cluster/cluster_reconcile.go:70 +0x1d8
github.com/projectsyn/lieutenant-operator/pkg/pipeline.RunPipeline(0x192d338, 0xc000b396c0, 0xc00047dd40, 0xc0006a1bd8, 0x6, 0x6, 0x13, 0x18f7b60, 0xc000b396c0, 0x0)
    /app/pkg/pipeline/pipeline.go:61 +0xb1
github.com/projectsyn/lieutenant-operator/pkg/controller/cluster.(*ReconcileCluster).Reconcile(0xc000678558, 0xc000c07810, 0xe, 0xc000755170, 0x13, 0x1aeac146c1, 0xc000364480, 0xc00014e788, 0xc00014e750)
    /app/pkg/controller/cluster/cluster_reconcile.go:52 +0x570
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler(0xc000660300, 0x15bb220, 0xc0000ab620, 0x0)
    /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.5.2/pkg/internal/controller/controller.go:256 +0x166
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem(0xc000660300, 0xc000308500)
    /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.5.2/pkg/internal/controller/controller.go:232 +0xb0
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).worker(...)
    /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.5.2/pkg/internal/controller/controller.go:211
k8s.io/apimachinery/pkg/util/wait.JitterUntil.func1(0xc00067a420)
    /go/pkg/mod/k8s.io/apimachinery@v0.17.4/pkg/util/wait/wait.go:152 +0x5f
k8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc00067a420, 0x3b9aca00, 0x0, 0x1, 0xc000114420)
    /go/pkg/mod/k8s.io/apimachinery@v0.17.4/pkg/util/wait/wait.go:153 +0x105
k8s.io/apimachinery/pkg/util/wait.Until(0xc00067a420, 0x3b9aca00, 0xc000114420)
    /go/pkg/mod/k8s.io/apimachinery@v0.17.4/pkg/util/wait/wait.go:88 +0x4d
created by sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func1
    /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.5.2/pkg/internal/controller/controller.go:193 +0x32d
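
The stack appears twice in the log above, once under "Observed a panic" and once after "[recovered]". That is consistent with a crash handler that logs the panic and then panics again. The following is a minimal sketch of that pattern using only the standard library. The names `handleCrash`, `reallyCrash`, `worker` and `runWorker` are made up for illustration and are not the operator's or apimachinery's code.

```go
package main

import "fmt"

// reallyCrash is a hypothetical switch: when true, the handler
// re-panics after logging instead of swallowing the panic.
var reallyCrash = true

// logged collects what the handler would have written to the log.
var logged []string

type item struct{ n int }

// deref panics with a nil pointer dereference when p is nil.
func deref(p *item) int { return p.n }

// handleCrash must be called via defer. It logs a panic and, if
// reallyCrash is set, raises the same panic again.
func handleCrash() {
	if r := recover(); r != nil {
		logged = append(logged, fmt.Sprintf("Observed a panic: %v", r))
		if reallyCrash {
			panic(r)
		}
	}
}

// worker stands in for a reconcile loop that hits a nil pointer.
func worker() {
	defer handleCrash()
	var missing *item
	_ = deref(missing)
}

// runWorker catches the second panic so this demo can report it
// instead of terminating the process.
func runWorker() (recovered interface{}) {
	defer func() { recovered = recover() }()
	worker()
	return nil
}

func main() {
	r := runWorker()
	fmt.Println(logged[0])
	fmt.Println("re-panicked with:", r)
}
```

In this sketch the process only survives because `runWorker` recovers a second time. Without that outer recover, the re-panic ends the program, which matches the crash reported here.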

Expected Behavior

The operator is able to handle the cluster deletion or return an error without crashing if the cluster resource is in an inconsistent state.
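
A rough sketch of what "return an error without crashing" could look like for the secret removal step. It assumes, without having confirmed it against `client.go:178`, that the panic comes from using a read result that is nil because the secret no longer exists. The `secret` type, the `kvClient` interface and `fakeClient` are hypothetical stand-ins, not the operator's or the Vault client's actual types.

```go
package main

import "fmt"

// secret is a hypothetical stand-in for a KV read result.
type secret struct {
	Data map[string]interface{}
}

// kvClient is a hypothetical interface. Read is assumed to return
// (nil, nil) when nothing is stored at the path.
type kvClient interface {
	Read(path string) (*secret, error)
	Delete(path string) error
}

// removeSecret deletes the secret at path. A missing secret is
// treated as already removed instead of being dereferenced.
func removeSecret(c kvClient, path string) error {
	s, err := c.Read(path)
	if err != nil {
		return fmt.Errorf("reading %q: %w", path, err)
	}
	if s == nil || s.Data == nil {
		// Nothing to delete; the desired state is already reached.
		return nil
	}
	if err := c.Delete(path); err != nil {
		return fmt.Errorf("deleting %q: %w", path, err)
	}
	return nil
}

// fakeClient is an in-memory kvClient used only for this demo.
type fakeClient struct {
	secrets map[string]*secret
}

func (f *fakeClient) Read(path string) (*secret, error) {
	return f.secrets[path], nil // nil for unknown paths
}

func (f *fakeClient) Delete(path string) error {
	delete(f.secrets, path)
	return nil
}

func main() {
	c := &fakeClient{secrets: map[string]*secret{
		"t-example/c-example/steward": {Data: map[string]interface{}{"token": "x"}},
	}}
	fmt.Println(removeSecret(c, "t-example/c-example/steward"))
	fmt.Println(removeSecret(c, "t-example/c-example/steward"))
}
```

Treating a missing secret as success keeps the deletion idempotent, so the finalizer can still be removed when the Vault data was cleaned up manually beforehand. Returning an error in that case would be an equally valid choice if the inconsistency should be surfaced.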

ccremer commented 2 years ago

First investigation on our instance on lieutenant-dev

Lieutenant-dev was deployed with the image tag latest, which pointed to https://hub.docker.com/layers/projectsyn/lieutenant-operator/latest/images/sha256-23de5bff707464e9068404179f247de4411441b22409e23f84e5179a7bb34b9d?context=explore. This image was built before the Operator SDK upgrade merged in #175. So the bug report refers to an outdated version; it should have been tested with the master image tag (which gets pushed with every merge commit to master).

When deploying with the master tag, the reported bug doesn't appear. However, the operator logs a number of other errors, such as "can't find Vault Secret" or "finalizer not removed".

It's currently unclear whether the Dev environment is simply outdated and should be cleaned up manually, or whether the new version cannot handle the old data. Neither case inspires confidence in deploying the operator to Int or production environments.

It's currently a mess. I will try to isolate the errors and open new issues. It's currently not possible to reproduce this one.