spiffe / spire-controller-manager

Kubernetes controller manager that reconciles workload registration and federation relationships.
Apache License 2.0

SPIRE Controller Manager nightly enters a crash loop when the ClusterStaticEntries CRD is missing. #177

Open v0lkan opened 1 year ago

v0lkan commented 1 year ago

The component was working as expected ~5 days ago (today is July 9, 2023).

The YAML files used to deploy SPIRE can be found at this snapshot:

https://github.com/shieldworks/aegis/tree/fbeb28f97761a768498aa9f03ca7521f41b641d6/k8s/spire

What happens:

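The SPIRE Server pod crash-loops because the spire-controller-manager container keeps restarting. Here are the pod events and the logs from the SPIRE controller manager container: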

Events:
  Type     Reason     Age                 From               Message
  ----     ------     ----                ----               -------
  Normal   Scheduled  36m                 default-scheduler  Successfully assigned spire-system/spire-server-6fb4f57c8-6dcpc to minikube
  Normal   Pulling    36m                 kubelet            Pulling image "ghcr.io/spiffe/spire-server:1.6.3"
  Normal   Pulled     36m                 kubelet            Successfully pulled image "ghcr.io/spiffe/spire-server:1.6.3" in 2.192539709s (3.846152012s including waiting)
  Normal   Created    36m                 kubelet            Created container spire-server
  Normal   Started    36m                 kubelet            Started container spire-server
  Normal   Pulling    36m                 kubelet            Pulling image "ghcr.io/spiffe/spire-controller-manager:nightly"
  Normal   Pulled     36m                 kubelet            Successfully pulled image "ghcr.io/spiffe/spire-controller-manager:nightly" in 2.192111448s (2.963491192s including waiting)
  Normal   Created    26m (x5 over 36m)   kubelet            Created container spire-controller-manager
  Normal   Started    26m (x5 over 36m)   kubelet            Started container spire-controller-manager
  Normal   Pulled     26m (x4 over 34m)   kubelet            Container image "ghcr.io/spiffe/spire-controller-manager:nightly" already present on machine
  Warning  BackOff    23s (x75 over 32m)  kubelet            Back-off restarting failed container spire-controller-manager in pod spire-server-6fb4f57c8-6dcpc_spire-system(ed1688e0-1e49-4beb-9585-dbcedebd4af3)
~/WORKSPACE/aegis (main) 🐢⚡️ k logs spire-server-6fb4f57c8-6dcpc -n spire-system -c spire-controller-manager
2023-07-09T21:47:26Z    INFO    setup   Config loaded   {"cluster name": "aegis-cluster", "cluster domain": "cluster.local", "trust domain": "aegis.ist", "ignore namespaces": ["kube-system", "kube-public", "spire-system", "local-path-storage", "kube-node-lease", "kube-public", "kubernetes-dashboard"], "gc interval": "10s", "spire server socket path": "/spire-server/api.sock"}
2023-07-09T21:47:26Z    INFO    setup   Dialing SPIRE Server socket
2023-07-09T21:47:26Z    INFO    controller-runtime.metrics  Metrics server is starting to listen    {"addr": "127.0.0.1:8082"}
2023-07-09T21:47:26Z    INFO    webhook-manager Minting webhook certificate {"reason": "initializing", "dnsNames": ["spire-controller-manager-webhook-service.spire-system.svc"]}
2023-07-09T21:47:26Z    INFO    webhook-manager Minted webhook certificate
2023-07-09T21:47:26Z    INFO    controller-runtime.builder  skip registering a mutating webhook, object does not implement admission.Defaulter or WithDefaulter wasn't called   {"GVK": "spire.spiffe.io/v1alpha1, Kind=ClusterFederatedTrustDomain"}
2023-07-09T21:47:26Z    INFO    controller-runtime.builder  Registering a validating webhook    {"GVK": "spire.spiffe.io/v1alpha1, Kind=ClusterFederatedTrustDomain", "path": "/validate-spire-spiffe-io-v1alpha1-clusterfederatedtrustdomain"}
2023-07-09T21:47:26Z    INFO    controller-runtime.webhook  Registering webhook {"path": "/validate-spire-spiffe-io-v1alpha1-clusterfederatedtrustdomain"}
2023-07-09T21:47:26Z    INFO    controller-runtime.builder  skip registering a mutating webhook, object does not implement admission.Defaulter or WithDefaulter wasn't called   {"GVK": "spire.spiffe.io/v1alpha1, Kind=ClusterSPIFFEID"}
2023-07-09T21:47:26Z    INFO    controller-runtime.builder  Registering a validating webhook    {"GVK": "spire.spiffe.io/v1alpha1, Kind=ClusterSPIFFEID", "path": "/validate-spire-spiffe-io-v1alpha1-clusterspiffeid"}
2023-07-09T21:47:26Z    INFO    controller-runtime.webhook  Registering webhook {"path": "/validate-spire-spiffe-io-v1alpha1-clusterspiffeid"}
2023-07-09T21:47:26Z    INFO    setup   starting manager
2023-07-09T21:47:26Z    INFO    controller-runtime.webhook.webhooks Starting webhook server
2023-07-09T21:47:26Z    INFO    starting server {"path": "/metrics", "kind": "metrics", "addr": "127.0.0.1:8082"}
2023-07-09T21:47:26Z    INFO    controller-runtime.certwatcher  Updated current TLS certificate
I0709 21:47:26.492611     229 leaderelection.go:245] attempting to acquire leader lease spire-system/98c9c988.spiffe.io...
2023-07-09T21:47:26Z    INFO    controller-runtime.certwatcher  Starting certificate watcher
2023-07-09T21:47:26Z    INFO    controller-runtime.webhook  Serving webhook server  {"host": "", "port": 9443}
I0709 21:47:44.050693     229 leaderelection.go:255] successfully acquired lease spire-system/98c9c988.spiffe.io
2023-07-09T21:47:44Z    INFO    Starting EventSource    {"controller": "clusterspiffeid", "controllerGroup": "spire.spiffe.io", "controllerKind": "ClusterSPIFFEID", "source": "kind source: *v1alpha1.ClusterSPIFFEID"}
2023-07-09T21:47:44Z    INFO    Starting Controller {"controller": "clusterspiffeid", "controllerGroup": "spire.spiffe.io", "controllerKind": "ClusterSPIFFEID"}
2023-07-09T21:47:44Z    INFO    Starting EventSource    {"controller": "clusterstaticentry", "controllerGroup": "spire.spiffe.io", "controllerKind": "ClusterStaticEntry", "source": "kind source: *v1alpha1.ClusterStaticEntry"}
2023-07-09T21:47:44Z    INFO    Starting Controller {"controller": "clusterstaticentry", "controllerGroup": "spire.spiffe.io", "controllerKind": "ClusterStaticEntry"}
2023-07-09T21:47:44Z    INFO    Starting EventSource    {"controller": "clusterfederatedtrustdomain", "controllerGroup": "spire.spiffe.io", "controllerKind": "ClusterFederatedTrustDomain", "source": "kind source: *v1alpha1.ClusterFederatedTrustDomain"}
2023-07-09T21:47:44Z    INFO    Starting Controller {"controller": "clusterfederatedtrustdomain", "controllerGroup": "spire.spiffe.io", "controllerKind": "ClusterFederatedTrustDomain"}
2023-07-09T21:47:44Z    INFO    Starting EventSource    {"controller": "pod", "controllerGroup": "", "controllerKind": "Pod", "source": "kind source: *v1.Pod"}
2023-07-09T21:47:44Z    INFO    Starting Controller {"controller": "pod", "controllerGroup": "", "controllerKind": "Pod"}
2023-07-09T21:47:44Z    DEBUG   events  spire-server-6fb4f57c8-6dcpc_111c5818-baeb-4a17-a464-921151f83677 became leader {"type": "Normal", "object": {"kind":"Lease","namespace":"spire-system","name":"98c9c988.spiffe.io","uid":"5e3e9d0f-e23b-4970-8709-ef5dc1a4a9a5","apiVersion":"coordination.k8s.io/v1","resourceVersion":"7484"}, "reason": "LeaderElection"}
2023-07-09T21:47:44Z    INFO    webhook-manager Received webhook added event
2023-07-09T21:47:44Z    ERROR   controller-runtime.source.EventHandler  if kind is a CRD, it should be installed before calling Start   {"kind": "ClusterStaticEntry.spire.spiffe.io", "error": "no matches for kind \"ClusterStaticEntry\" in version \"spire.spiffe.io/v1alpha1\""}
sigs.k8s.io/controller-runtime/pkg/internal/source.(*Kind).Start.func1.1
    /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.15.0/pkg/internal/source/kind.go:63
k8s.io/apimachinery/pkg/util/wait.loopConditionUntilContext.func1
    /go/pkg/mod/k8s.io/apimachinery@v0.27.3/pkg/util/wait/loop.go:62
k8s.io/apimachinery/pkg/util/wait.loopConditionUntilContext
    /go/pkg/mod/k8s.io/apimachinery@v0.27.3/pkg/util/wait/loop.go:63
k8s.io/apimachinery/pkg/util/wait.PollUntilContextCancel
    /go/pkg/mod/k8s.io/apimachinery@v0.27.3/pkg/util/wait/poll.go:33
sigs.k8s.io/controller-runtime/pkg/internal/source.(*Kind).Start.func1
    /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.15.0/pkg/internal/source/kind.go:56
2023-07-09T21:47:44Z    ERROR   entry-reconciler    Failed to list ClusterStaticEntries {"error": "no matches for kind \"ClusterStaticEntry\" in version \"spire.spiffe.io/v1alpha1\""}
github.com/spiffe/spire-controller-manager/pkg/spireentry.(*entryReconciler).reconcile
    /workspace/pkg/spireentry/reconciler.go:89
github.com/spiffe/spire-controller-manager/pkg/reconciler.(*reconciler).Run
    /workspace/pkg/reconciler/reconciler.go:84
sigs.k8s.io/controller-runtime/pkg/manager.RunnableFunc.Start
    /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.15.0/pkg/manager/manager.go:382
sigs.k8s.io/controller-runtime/pkg/manager.(*runnableGroup).reconcile.func1
    /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.15.0/pkg/manager/runnable_group.go:219
2023-07-09T21:47:44Z    INFO    Starting workers    {"controller": "clusterspiffeid", "controllerGroup": "spire.spiffe.io", "controllerKind": "ClusterSPIFFEID", "worker count": 1}
2023-07-09T21:47:44Z    DEBUG   Triggering reconciliation   {"controller": "clusterspiffeid", "controllerGroup": "spire.spiffe.io", "controllerKind": "ClusterSPIFFEID", "ClusterSPIFFEID": {"name":"aegis-safe"}, "namespace": "", "name": "aegis-safe", "reconcileID": "24ae242b-1917-45f6-9533-86ed7f4310ab"}
2023-07-09T21:47:44Z    DEBUG   Triggering reconciliation   {"controller": "clusterspiffeid", "controllerGroup": "spire.spiffe.io", "controllerKind": "ClusterSPIFFEID", "ClusterSPIFFEID": {"name":"aegis-sentinel"}, "namespace": "", "name": "aegis-sentinel", "reconcileID": "e509004d-66f0-40ef-8fc4-7eb675f8b6d0"}
2023-07-09T21:47:44Z    INFO    Starting workers    {"controller": "clusterfederatedtrustdomain", "controllerGroup": "spire.spiffe.io", "controllerKind": "ClusterFederatedTrustDomain", "worker count": 1}
2023-07-09T21:47:44Z    ERROR   entry-reconciler    Failed to list ClusterStaticEntries {"error": "no matches for kind \"ClusterStaticEntry\" in version \"spire.spiffe.io/v1alpha1\""}
github.com/spiffe/spire-controller-manager/pkg/spireentry.(*entryReconciler).reconcile
    /workspace/pkg/spireentry/reconciler.go:89
github.com/spiffe/spire-controller-manager/pkg/reconciler.(*reconciler).Run
    /workspace/pkg/reconciler/reconciler.go:84
sigs.k8s.io/controller-runtime/pkg/manager.RunnableFunc.Start
    /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.15.0/pkg/manager/manager.go:382
sigs.k8s.io/controller-runtime/pkg/manager.(*runnableGroup).reconcile.func1
    /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.15.0/pkg/manager/runnable_group.go:219
2023-07-09T21:47:44Z    INFO    Starting workers    {"controller": "pod", "controllerGroup": "", "controllerKind": "Pod", "worker count": 1}
2023-07-09T21:47:44Z    DEBUG   Triggering reconciliation   {"controller": "pod", "controllerGroup": "", "controllerKind": "Pod", "Pod": {"name":"aegis-sentinel-547bc8f7f6-84nj9","namespace":"aegis-system"}, "namespace": "aegis-system", "name": "aegis-sentinel-547bc8f7f6-84nj9", "reconcileID": "6efc20f1-a69a-4e78-9b24-782715247a1f"}
2023-07-09T21:47:44Z    DEBUG   Triggering reconciliation   {"controller": "pod", "controllerGroup": "", "controllerKind": "Pod", "Pod": {"name":"aegis-safe-6b4bc89c78-7gpl5","namespace":"aegis-system"}, "namespace": "aegis-system", "name": "aegis-safe-6b4bc89c78-7gpl5", "reconcileID": "801f2c0b-f03e-452b-a397-c6b44dd9361b"}
2023-07-09T21:47:44Z    ERROR   entry-reconciler    Failed to list ClusterStaticEntries {"error": "no matches for kind \"ClusterStaticEntry\" in version \"spire.spiffe.io/v1alpha1\""}
github.com/spiffe/spire-controller-manager/pkg/spireentry.(*entryReconciler).reconcile
    /workspace/pkg/spireentry/reconciler.go:89
github.com/spiffe/spire-controller-manager/pkg/reconciler.(*reconciler).Run
    /workspace/pkg/reconciler/reconciler.go:84
sigs.k8s.io/controller-runtime/pkg/manager.RunnableFunc.Start
    /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.15.0/pkg/manager/manager.go:382
sigs.k8s.io/controller-runtime/pkg/manager.(*runnableGroup).reconcile.func1
    /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.15.0/pkg/manager/runnable_group.go:219

Expectation:

SPIRE server should have given a warning (along the lines of “ClusterStaticEntry CRD is missing, please download and install it from {URL}”).

Or the SPIRE Controller Manager container should have done a self-diagnosis and exited with a reason.

Or both. Or something along those lines.

Other Notes and Resolutions:

v0lkan commented 1 year ago

Also, this is a breaking change (understandable, since it’s a nightly build). I’m not sure of the best way to handle it, though, since it is up to the user to add that CRD in the first place.

azdagron commented 1 year ago

This should hopefully be as easy as detecting this particular failure reason when listing the CRDs during reconciliation and treating it as "no CRDs present".

MarcosDY commented 1 year ago

We initially released it without this feature and then added documentation to ensure that users always upgrade CRDs when upgrading versions.