Monitoring and reconciliation

The injector is critical for Pods with vault.hashicorp.com/... annotations. To make the injector a less critical component of the cluster, the webhook's failurePolicy should be set to Ignore (which is the case in the Helm deployment).

If the injector is unavailable, Pods that need the agent are still created but will probably fail to run properly. Liveness and readiness probes do not help in this case, because they do not recreate Pods. Without looking closely at the resulting Pod spec, the only indication of the cause is a log line from the API server. Metrics are only available for webhooks with failurePolicy set to Fail.

This issue is to discuss approaches to monitoring and/or reconciling unavailability of the webhook. Here is one approach:

Continuously check for Pods that have the annotation vault.hashicorp.com/agent-inject: "true" but are missing vault.hashicorp.com/agent-inject-status: injected, and expose a metric for them (e.g. to trigger alerts).
Optionally, have the injector delete those Pods to trigger their recreation.
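
To make the check above concrete, here is a minimal sketch of the detection logic in Python. It is only an illustration, not how the injector works today: PodInfo, missed_injection, find_missed_pods and count_by_namespace are hypothetical names, the Pods are plain in-memory objects rather than results from the Kubernetes API, and it assumes that only the literal value "true" requests injection.

```python
from dataclasses import dataclass, field

# Annotation keys taken from the issue text above.
INJECT_ANNOTATION = "vault.hashicorp.com/agent-inject"
STATUS_ANNOTATION = "vault.hashicorp.com/agent-inject-status"


@dataclass
class PodInfo:
    """Hypothetical, simplified stand-in for a Pod's metadata."""
    namespace: str
    name: str
    annotations: dict = field(default_factory=dict)


def missed_injection(pod):
    """Return True if the Pod asked for injection but was not injected.

    Assumption: only the literal string "true" counts as a request.
    """
    wants_agent = pod.annotations.get(INJECT_ANNOTATION) == "true"
    was_injected = pod.annotations.get(STATUS_ANNOTATION) == "injected"
    return wants_agent and not was_injected


def find_missed_pods(pods):
    """Return the Pods that should be reported (or optionally deleted)."""
    return [pod for pod in pods if missed_injection(pod)]


def count_by_namespace(pods):
    """Aggregate missed Pods per namespace, e.g. as the value of a gauge metric."""
    counts = {}
    for pod in find_missed_pods(pods):
        counts[pod.namespace] = counts.get(pod.namespace, 0) + 1
    return counts
```

In a real deployment the list of Pods would come from the Kubernetes API (for example through a watch or a periodic list), and the per-namespace counts could back an alerting metric. Deleting the reported Pods would be the optional second step.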

I'm sure there are many other approaches. Would be interested to hear them!

hashicorp / vault-k8s

Monitoring and reconciliation #40