hashicorp / vault-plugin-auth-kubernetes

Vault authentication plugin for Kubernetes Service Accounts
https://www.vaultproject.io/docs/auth/kubernetes.html
Mozilla Public License 2.0

TLS errors after failed plugin initialization #169

Closed mafos closed 1 year ago

mafos commented 1 year ago

Hi, I'm seeing persistent login failures with 403 responses after leadership failover in Vault v1.12.1.

Server log error message:

login unauthorized: err="Post \"https://kube-apiserver.example/apis/authentication.k8s.io/v1/tokenreviews\": x509: certificate signed by unknown authority"

This is with unchanged mount & role configuration that was previously functional. The configured kubernetes_ca_cert value served via /config also aligns with the live Kubernetes API.
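
For anyone who wants to repeat that comparison, here is a minimal sketch. It assumes the default mount path `auth/kubernetes` and that the cluster publishes its CA in the `kube-root-ca.crt` ConfigMap, as in the reproduction steps in this issue. The function name `ca_cert_matches` is hypothetical and is not part of Vault or kubectl.

```bash
# Hypothetical helper: compare the CA stored in the Vault mount config with
# the CA currently published by the cluster.
# Returns 0 if they match, 1 if they differ, 2 if either lookup failed.
ca_cert_matches() {
  local mount="${1:-auth/kubernetes}"   # assumption: default mount path
  local configured live

  # CA certificate as stored in the auth mount configuration
  configured="$(vault read -field=kubernetes_ca_cert "${mount}/config")" || return 2

  # CA certificate published in the current namespace's kube-root-ca.crt ConfigMap
  live="$(kubectl get configmap kube-root-ca.crt -o jsonpath='{.data.ca\.crt}')" || return 2

  [ "$configured" = "$live" ]
}

# Usage:
#   ca_cert_matches && echo "CA matches" || echo "CA differs or lookup failed"
```

A match here, combined with the x509 error above, suggests the stored configuration is fine and the problem lies in what the running plugin instance loaded.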

It appears to be caused by intermittent storage errors at plugin initialization, combined with the change to HTTP client initialization introduced in https://github.com/hashicorp/vault-plugin-auth-kubernetes/pull/142.

Server logs near unseal time (Azure storage backend in this case):

error=
| -> github.com/Azure/azure-pipeline-go/pipeline.NewError, /home/runner/go/pkg/mod/github.com/!azure/azure-pipeline-go@v0.2.3/pipeline/error.go:157
| HTTP request failed
| 
| Get "https://examplecontainer.blob.core.windows.net/vault-data/auth/kubernetes/config?timeout=61": context canceled

Executing vault plugin reload is necessary to fix the 403 responses and re-enable successful TokenReview requests to the Kubernetes API. The plugin mount does not recover on its own otherwise.
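
A minimal sketch of that workaround, assuming the auth method is mounted at the default path `auth/kubernetes/` and that the caller's token is allowed to reload plugins. The wrapper name `reload_k8s_auth_mount` is hypothetical; `-mounts` is the `vault plugin reload` option that selects mounts by path.

```bash
# Hypothetical wrapper around the workaround described above:
# reload the plugin instance behind a single auth mount.
reload_k8s_auth_mount() {
  local mount="${1:-auth/kubernetes/}"   # assumption: default mount path
  vault plugin reload -mounts="${mount}"
}

# Usage:
#   reload_k8s_auth_mount                  # reloads auth/kubernetes/
#   reload_k8s_auth_mount auth/k8s-prod/   # hypothetical non-default mount
```

Reloading by mount path rather than by plugin name limits the reload to the affected mount.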

Steps to reproduce

  1. Provision a local Kubernetes cluster using kind and serviceaccount tokens

     ```bash
     # create k8s cluster via kind
     kind create cluster --name demo

     # create vault serviceaccount & binding
     kubectl create sa vault
     kubectl create clusterrolebinding \
       system:auth-delegator:vault \
       --clusterrole=system:auth-delegator \
       --serviceaccount=default:vault

     # create vault reviewer token
     VAULT_REVIEWER_JWT=$(kubectl create token vault)
     ```

  2. Start a vault server using file storage

     ```bash
     cat <<- EOF > vault.hcl
     listener "tcp" {
       tls_disable = true
       address = "127.0.0.1:8200"
     }

     storage "file" {
       path = "./vault-data"
     }

     log_level = "Debug"
     EOF

     vault server -config=vault.hcl
     ```

  3. Test login after forced plugin config storage read failure on initialization

     ```bash
     # initialize & unseal vault
     vault operator init -key-shares=1 -key-threshold=1
     vault operator unseal $VAULT_UNSEAL_KEY

     # configure kubernetes auth backend
     vault auth enable kubernetes
     vault write auth/kubernetes/config \
       kubernetes_host="$(kubectl config view -ojson | jq -r '.clusters[] | select(.name == "kind-demo") | .cluster.server')" \
       kubernetes_ca_cert="$(kubectl get cm kube-root-ca.crt -ojson | jq -r '.data["ca.crt"]')" \
       token_reviewer_jwt=$VAULT_REVIEWER_JWT

     # create kubernetes "default" auth role
     vault write auth/kubernetes/role/default \
       bound_service_account_names=default \
       bound_service_account_namespaces=default \
       policies=default

     # generate test serviceaccount token
     DEFAULT_JWT=$(kubectl create token default)

     # attempt login => success
     vault write auth/kubernetes/login role=default jwt=$DEFAULT_JWT

     # stop vault
     pkill vault

     # disable reads of underlying backend config to make plugin initialization fail
     chmod 200 vault-data/auth/*/_config

     # restart vault separately & unseal
     vault operator unseal $VAULT_UNSEAL_KEY

     # re-enable reads of underlying backend config
     chmod 600 vault-data/auth/*/_config

     # attempt login => failure (x509: “kube-apiserver” certificate is not trusted)
     vault write auth/kubernetes/login role=default jwt=$DEFAULT_JWT
     ```

heatherezell commented 1 year ago

Thank you for reporting this! Our engineers are currently discussing a fix. It might not be able to go out in 1.12.2, but we will do our best to get this fixed quickly. Thanks again!

nsimons commented 1 year ago

@hsimon-hashicorp, we bumped into a similar issue with the cached HTTP client improvement in https://github.com/hashicorp/vault-plugin-auth-kubernetes/pull/142 not reloading the CA certificate from local disk. I have a fix proposal about 90% complete that solves both that issue and this one. Would you be open to a contribution or are you handling it internally?

Should I also create a separate issue about that, since it's a different use case?

anthonyralston commented 1 year ago

@hsimon-hashicorp We have also seen this issue in our Vault clusters. We initially contacted HashiCorp support about it 15 days ago. +1 for a fix ASAP.