hashicorp / vault-plugin-auth-kubernetes

Vault authentication plugin for Kubernetes Service Accounts
https://www.vaultproject.io/docs/auth/kubernetes.html
Mozilla Public License 2.0

BREAKING CHANGE: proposal: make `kubernetes_host` array to provide fallback mechanism and retry resiliency #157


Dentrax commented 2 years ago

Abstract

In the current implementation, `kubernetes_host` only takes a `string` type, as we can see in the schema. The problem is that we can only pass a single host.
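For reference, here is a rough sketch of what the proposed schema change could look like in the plugin's Go field definitions (the field map and the `TypeCommaStringSlice` choice are assumptions for illustration, not the plugin's actual code):

```go
package kubeauth

import (
	"github.com/hashicorp/vault/sdk/framework"
)

// Rough sketch of the proposed schema change; today the plugin declares
// kubernetes_host as a single string, so only one host fits.
var configFields = map[string]*framework.FieldSchema{
	"kubernetes_host": {
		// Proposed (BREAKING): accept an ordered list of API server URLs,
		// earlier entries preferred, later ones used as fallbacks.
		Type:        framework.TypeCommaStringSlice,
		Description: "Ordered list of Kubernetes API server URLs to try in turn.",
	},
}
```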

High availability for the API server is normally achieved via a service/LB in front of the master nodes in the cluster, but that requires a lot of work in a typical infrastructure. People also may not want to take on extra operational overhead just to add resiliency for the Vault Kubernetes auth method.

Problem

We (@yilmazo, @erkanzileli, @developer-guy) filed this issue because one of our master nodes (the one we set in the `kubernetes_host` variable) went down and caused an incident within a short time window. We don't have an LB in front of our master nodes; if we had, we probably wouldn't have hit this issue.

But even if we had an LB, what if the LB itself went down? That scenario would still not be covered.

Solution

We should implement a fallback mechanism and add some resiliency measures, covering both scenarios above (a master node going down, and an LB in front of the master nodes going down):

  1. BREAKING!: make `kubernetes_host` a string array: `[]string`

If we get a 5xx or similar error from `host[0]`, fall back to `host[1]`: try `host[index]` -> `host[index+1]` until the last one (see the sketch after this list).

  2. Resiliency: we should retry while talking to the Kubernetes API.

For example, we could use a retryable HTTP client (e.g., hashicorp/go-retryablehttp) when calling the API in the token review function, as shown below.
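A minimal sketch of how the two ideas could compose, using hashicorp/go-retryablehttp for per-host retries and a plain loop for the cross-host fallback (the helper name and signature are hypothetical, not code from the plugin):

```go
package kubeauth

import (
	"fmt"
	"net/http"

	retryablehttp "github.com/hashicorp/go-retryablehttp"
)

// postWithFallback is a hypothetical helper: each host gets a retry budget
// (idea 2), and when a host keeps failing we fall through to the next entry
// of the kubernetes_host array (idea 1).
func postWithFallback(hosts []string, path string, payload []byte) (*http.Response, error) {
	client := retryablehttp.NewClient()
	client.RetryMax = 3 // retries per host, with exponential backoff

	var lastErr error
	for _, host := range hosts {
		// retryablehttp.NewRequest takes the raw body so it can safely
		// replay it across retries.
		req, err := retryablehttp.NewRequest(http.MethodPost, host+path, payload)
		if err != nil {
			return nil, err
		}
		req.Header.Set("Content-Type", "application/json")

		resp, err := client.Do(req)
		if err != nil {
			// 5xx responses and connection errors are retried by the
			// client; once retries are exhausted we get an error here
			// and fall back to the next host.
			lastErr = err
			continue
		}
		return resp, nil
	}
	return nil, fmt.Errorf("all kubernetes_host entries failed: %w", lastErr)
}
```

The token review call would then pass the TokenReview endpoint, e.g. `/apis/authentication.k8s.io/v1/tokenreviews`, as `path`.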

Both of the above ideas are essential for a highly resilient system, since we use Vault Kubernetes auth with a production Kubernetes cluster.

Alternative Solutions

  1. Create TCP load balancer infrastructure from scratch, put it in front of the master nodes, and ignore this issue.
  2. Create a Kubernetes operator from scratch that watches shared informers. If any change is observed across the master nodes (i.e., if one of them goes down), call the Vault API to update the `kubernetes_host` key in the `auth/kubernetes/config` path with another master node that is actively running and healthy (see the sketch below).
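For the operator alternative, the reconcile step boils down to a single write to the Vault API; here is a minimal sketch with github.com/hashicorp/vault/api (the function name and host value are made up for illustration):

```go
package main

import (
	"log"

	vault "github.com/hashicorp/vault/api"
)

// updateKubernetesHost points the auth method at a healthy master node.
// The operator would call this when it observes the current host go down.
func updateKubernetesHost(newHost string) error {
	client, err := vault.NewClient(vault.DefaultConfig())
	if err != nil {
		return err
	}
	// Equivalent to: vault write auth/kubernetes/config kubernetes_host=<newHost>
	_, err = client.Logical().Write("auth/kubernetes/config", map[string]interface{}{
		"kubernetes_host": newHost,
	})
	return err
}

func main() {
	if err := updateKubernetesHost("https://master-2.example.internal:6443"); err != nil {
		log.Fatal(err)
	}
}
```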

A similar problem was discussed in https://github.com/hashicorp/vault/issues/5408 almost 3 years ago; since the underlying issue still hasn't been resolved, we came up with this new proposal.

cc @briankassouf @catsby @jefferai fyi @mitchellmaler @m1kola