hashicorp / vault-k8s

First-class support for Vault and Kubernetes.
Mozilla Public License 2.0

Injector does not inject sidecar container #32

Closed ky0shiro closed 2 years ago

ky0shiro commented 4 years ago

Hello, I'm trying to deploy Vault with the sidecar injector. I'm using this chart: https://github.com/hashicorp/vault-helm and following this guide: https://www.hashicorp.com/blog/injecting-vault-secrets-into-kubernetes-pods-via-a-sidecar/; the only difference is that I don't use dev server mode.

Everything works fine except the injector. When I deploy an app with the injector annotations, the pod starts as usual with a single container and the mounted app-token secret, but there is no injected sidecar container:

app-57d4f4c645-9npng
Namespace:      my-namespace
Priority:       0
Node:           node
Start Time:     Mon, 06 Jan 2020 16:19:21 +0100
Labels:         app=vault-agent-demo
                pod-template-hash=57d4f4c645
Annotations:    vault.hashicorp.com/agent-inject: true
                vault.hashicorp.com/agent-inject-secret-test: secret/data/test-secret
                vault.hashicorp.com/role: test
Status:         Running
IP:             xxxxxx
IPs:            <none>
Controlled By:  ReplicaSet/app-57d4f4c645
Containers:
  app:
    Container ID:   docker://7348a9d4a9c0c9a3d831d3f84fa078081dcc3648f469aa2b0195b55242d26613
    Image:          jweissig/app:0.0.1
    Image ID:       docker-pullable://jweissig/app@sha256:54e7159831602dd8ffd8b81e1d4534c664a73e88f3f340df9c637fc16a5cf0b7
    Port:           <none>
    Host Port:      <none>
    State:          Running
      Started:      Mon, 06 Jan 2020 16:19:22 +0100
    Ready:          True
    Restart Count:  0
    Environment:    <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from app-token-kmzkr (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             True 
  ContainersReady   True 
  PodScheduled      True 
Volumes:
  app-token-kmzkr:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  app-token-kmzkr
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:          <none>

There are no errors in the logs from the vault-agent-injector pod:

2020-01-06T13:55:55.369Z [INFO]  handler: Starting handler..
Listening on ":8080"...
Updated certificate bundle received. Updating certs...

Here is my deployment:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: app
  namespace: my-namespace
  labels:
    app: vault-agent-demo
spec:
  selector:
    matchLabels:
      app: vault-agent-demo
  replicas: 1
  template:
    metadata:
      annotations:
        vault.hashicorp.com/agent-inject: "true"
        vault.hashicorp.com/agent-inject-secret-test: "secret/data/test-secret"
        vault.hashicorp.com/role: "test"
      labels:
        app: vault-agent-demo
    spec:
      serviceAccountName: app
      containers:
      - name: app
        image: jweissig/app:0.0.1
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: test
  namespace: my-namespace
  labels:
    app: vault-agent-demo
---
apiVersion: flux.weave.works/v1beta1
kind: HelmRelease
metadata:
  name: vault
  namespace: my-namespace
  annotations:
    flux.weave.works/automated: 'true'
spec:
  chart:
    path: "."
    git: git@github.com:hashicorp/vault-helm.git
    ref: master
  releaseName: vault
  values:
    replicaCount: 1
    server:
      ingress:
        enabled: true
        annotations:
          ....... 
        hosts:
          .......
        tls:
          .......

Is there any way to debug this issue?

jasonodonnell commented 4 years ago

Hi @ky0shiro, based on the logs you sent, it seems like the request never made it to your injector. If it had, you would see a log entry like this:

2020-01-06T15:10:18.658Z [INFO]  handler: Request received: Method=POST URL=/mutate?timeout=30s

Can you provide the following:

kubectl describe service vault-agent-injector-svc
kubectl describe mutatingwebhookconfigurations vault-agent-injector-cfg
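
To check basic reachability, you can also curl the injector service from a temporary pod. A sketch (expect an error response rather than a real mutation, since this isn't a valid AdmissionReview request):

# -k skips verification of the injector's internally-generated TLS cert
kubectl run tmp-curl --rm -it --restart=Never --image=curlimages/curl -- \
  -k https://vault-agent-injector-svc.my-namespace.svc/mutate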
ky0shiro commented 4 years ago

@jasonodonnell: service:

Name:              vault-agent-injector-svc
Namespace:         my-namespace
Labels:            app.kubernetes.io/instance=vault
                   app.kubernetes.io/managed-by=Tiller
                   app.kubernetes.io/name=vault-agent-injector
Annotations:       flux.weave.works/antecedent: my-namespace:helmrelease/vault
Selector:          app.kubernetes.io/instance=vault,app.kubernetes.io/name=vault-agent-injector,component=webhook
Type:              ClusterIP
IP:                10.210.4.175
Port:              <unset>  443/TCP
TargetPort:        8080/TCP
Endpoints:         10.16.0.198:8080
Session Affinity:  None
Events:            <none>

mutatingwebhookconfigurations:

Name:         vault-agent-injector-cfg
Namespace:    
Labels:       app.kubernetes.io/instance=vault
              app.kubernetes.io/managed-by=Tiller
              app.kubernetes.io/name=vault-agent-injector
Annotations:  flux.weave.works/antecedent: my-namespace:helmrelease/vault
API Version:  admissionregistration.k8s.io/v1beta1
Kind:         MutatingWebhookConfiguration
Metadata:
  Creation Timestamp:  2020-01-06T13:55:54Z
  Generation:          2
  Resource Version:    56445806
  Self Link:           /apis/admissionregistration.k8s.io/v1beta1/mutatingwebhookconfigurations/vault-agent-injector-cfg
  UID:                 4195285e-308c-11ea-8917-4201ac10000a
Webhooks:
  Client Config:
    Ca Bundle:  << REDACTED >>
    Service:
      Name:        vault-agent-injector-svc
      Namespace:   my-namespace
      Path:        /mutate
  Failure Policy:  Ignore
  Name:            vault.hashicorp.com
  Namespace Selector:
  Rules:
    API Groups:

    API Versions:
      v1
    Operations:
      CREATE
      UPDATE
    Resources:
      pods
  Side Effects:  Unknown
Events:          <none>
jasonodonnell commented 4 years ago

What version of Kube are you using?

Are you using a managed Kube service such as GKE/EKS or did you deploy your own?

ky0shiro commented 4 years ago

The version is 1.13.11 and it is GKE:

Client Version: version.Info{Major:"", Minor:"", GitVersion:"v0.0.0-master+70132b0f13", GitCommit:"70132b0f130acc0bed193d9ba59dd186f0e634cf", GitTreeState:"", BuildDate:"1970-01-01T00:00:00Z", GoVersion:"go1.13.5", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"13+", GitVersion:"v1.13.11-gke.14", GitCommit:"56d89863d1033f9668ddd6e1c1aea81cd846ef88", GitTreeState:"clean", BuildDate:"2019-11-07T19:12:22Z", GoVersion:"go1.12.11b4", Compiler:"gc", Platform:"linux/amd64"}
jasonodonnell commented 4 years ago

@ky0shiro Interesting, our acceptance testing runs on a GKE cluster and works fine. What you showed me looks correct, but the request doesn't seem to make it to the injector. Do you have access to the Kube apiserver logs? I wonder if an error shows up there when Kube tries to contact the webhook.

jasonodonnell commented 4 years ago

@ky0shiro Can you also provide the output of the following command by exec'ing into the Vault injector container?

cat /etc/resolv.conf
ky0shiro commented 4 years ago

@jasonodonnell Here is /etc/resolv.conf

nameserver 10.210.0.10
search my-namespace.svc.cluster.local svc.cluster.local cluster.local c.my-project.internal google.internal
options ndots:5
ky0shiro commented 4 years ago

@jasonodonnell logs from /logs/kube-apiserver.log (only the lines containing the word "injector"): injector.txt

popamatei commented 4 years ago

I've upgraded my vault helm chart from 0.2.1 to 0.3.3 on a GKE cluster - everything was working fine before since I was using the vault-agent and consul-template sidecar containers to render the secrets on the pod.

Now that I've upgraded, I can't get vault-k8s to work. Is there any chance we're somehow ending up with a development version of hashicorp/vault-k8s:v0.1.2?

I'm in exactly the same situation as ky0shiro, and looking at the Dockerfiles and the vault-server-agent-injector container I end up with, it seems to be running the development version:

$ ps xa
PID   USER     TIME  COMMAND
    1 vault     0:14 /vault-k8s agent-inject 2>&1
   62 vault     0:00 sh
   74 vault     0:00 ps xa
/ $
$ /vault-k8s --version
0.1.2-dev
/ $ curl
sh: curl: not found
/ $ 

Is this the way it should be? Maybe we're not seeing any connections on the vault-server-agent-injector because we're missing the certs altogether and the API server can't actually connect.

Thanks

jasonodonnell commented 4 years ago

Hi @mateipopa, these are the correct builds. Release engineering had not completed the official build pipeline, so the images are currently built internally as dev builds.

One thing you might investigate is the firewall rules on your GKE nodes. We've seen similar injection issues caused by port 8080 being blocked: https://github.com/hashicorp/vault-k8s/issues/46
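
For reference, on a private GKE cluster such a rule generally looks like this (a sketch; the network name, master CIDR range, and node target tags are placeholders for your cluster):

# allow the GKE control plane to reach the injector webhook on the nodes
gcloud compute firewall-rules create allow-master-to-injector \
    --network my-network \
    --source-ranges 172.16.0.0/28 \
    --target-tags my-gke-node-tag \
    --allow tcp:8080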

popamatei commented 4 years ago

Hello @jasonodonnell, thanks for pointing me in the right direction. Although I didn't get any errors in Stackdriver, connections were indeed blocked by the firewall. Adding a rule to allow traffic from the master to the worker nodes solved the problem, and requests now reach the injector. Thanks again!

Kampe commented 4 years ago

Having this same issue with GKE; opening port 8080 to my apiserver did not do the trick for me.

mikemowgli commented 4 years ago

I have the same issue.

I wonder whether the problem is Kubernetes not contacting the webhook, or the webhook not contacting Vault.

How can I troubleshoot further?

Kampe commented 4 years ago

Ensure all the components of the vault-injector are installed in the same namespace where you're looking to retrieve your secrets.

mikemowgli commented 4 years ago

@Kampe, this would mean that in any namespace I'd like to fetch secrets from, I'd have to deploy new vault-injector components. I can test your suggestion, but it can't be the officially recommended solution, right?

Kampe commented 4 years ago

Check out https://github.com/hashicorp/vault-k8s/issues/15#issuecomment-591749580

pravargauba commented 4 years ago

@ky0shiro @jasonodonnell any luck yet? I am facing exactly the same problem @ky0shiro described. I double-verified everything. After applying the patch annotations to the deployment, again only 1 container got spawned on the new pod, when I was hoping for 2 (the second one being the Vault sidecar). Does this problem occur on specific Vault charts?

tvoran commented 4 years ago

A lot of these issues sound like what we've seen happen with private GKE clusters, for example: https://github.com/hashicorp/vault-helm/issues/214#issuecomment-592702596

So if that matches your setup, please try adding a firewall rule to allow the master to access 8080 on the nodes: https://cloud.google.com/kubernetes-engine/docs/how-to/private-clusters#add_firewall_rules

If it doesn't, then it would help to know where your k8s cluster is running and how it's configured. If the configurations are too varied we might need to break this up into separate issues for clarity. Cheers!

mikemowgli commented 4 years ago

I found the issue: as I'm running on OpenShift 3.11 (Kubernetes 1.11), the API config had to be changed so it supports admission controllers.

    MutatingAdmissionWebhook:
      configuration:
        apiVersion: v1
        disable: false
        kind: DefaultAdmissionConfig
    ValidatingAdmissionWebhook:
      configuration:
        apiVersion: v1
        disable: false
        kind: DefaultAdmissionConfig

This block must be present in master-config.yml, in the admissionConfig.pluginConfig section. After restarting the apiserver, the webhook started to kick in. But the sidecar was still not injected because of some permission issues. Granting the consumer app's service account cluster-admin permissions or access to the privileged SCC (the equivalent of a PSP) helped, but that also introduces other security issues.
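
For clarity, the nesting in master-config.yml looks roughly like this (only the surrounding structure is sketched; the plugin blocks are the ones from above):

admissionConfig:
  pluginConfig:
    MutatingAdmissionWebhook:
      configuration:
        apiVersion: v1
        disable: false
        kind: DefaultAdmissionConfig
    ValidatingAdmissionWebhook:
      configuration:
        apiVersion: v1
        disable: false
        kind: DefaultAdmissionConfig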

pravargauba commented 4 years ago

A lot of these issues sound like what we've seen happen with private GKE clusters, for example: hashicorp/vault-helm#214 (comment)

So if that matches your setup, please try adding a firewall rule to allow the master to access 8080 on the nodes: https://cloud.google.com/kubernetes-engine/docs/how-to/private-clusters#add_firewall_rules

If it doesn't, then it would help to know where your k8s cluster is running and how it's configured. If the configurations are too varied we might need to break this up into separate issues for clarity. Cheers!

This worked like a charm!!! Thanks @tvoran

jdbohrman commented 4 years ago

I'm experiencing this as well. Like @Kampe I updated the firewall to no avail. I'm getting logs almost exactly like @ky0shiro.

I'm on GKE as well... and I'm beginning to see a pattern.

h0x91b-wix commented 4 years ago

My 2 cents.

It happens to me when GKE replaces a node (upgrade/maintenance), and in my case the cluster is public.

h0x91b-wix commented 4 years ago

Same situation right now:

❯ kubectl get nodes
NAME                                          STATUS   ROLES    AGE   VERSION
gke-ero-cluster-ero-node-pool-2cf384d7-79b3   Ready    <none>   62m   v1.16.8-gke.15
gke-ero-cluster-ero-node-pool-6430b315-un3c   Ready    <none>   20h   v1.16.8-gke.15
gke-ero-cluster-ero-node-pool-b00f5513-o3c7   Ready    <none>   21h   v1.16.8-gke.15

Google replaced one of the nodes 62 minutes ago, and then:

❯ kubectl get pods
NAME                                   READY   STATUS             RESTARTS   AGE
ero-app-d85b548c4-bfk9s                2/2     Running            0          21h
ero-app-d85b548c4-df864                0/1     CrashLoopBackOff   25         62m
ero-app-d85b548c4-jv9b5                0/1     CrashLoopBackOff   25         62m
ero-app-d85b548c4-nr7b4                0/1     CrashLoopBackOff   25         62m
ero-app-d85b548c4-q5q4j                0/1     CrashLoopBackOff   25         62m
ero-app-d85b548c4-x4zbj                2/2     Running            0          21h

To recover I need to scale the deployment down to 2 and then back up to 6; this happens on each Kubernetes node replacement. This bug happens almost every day. Tell me if you want me to run something the next time it occurs...
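
For the record, the manual recovery described above amounts to (using the deployment from the output above):

kubectl scale deployment ero-app --replicas=2
kubectl scale deployment ero-app --replicas=6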

h0x91b-wix commented 4 years ago

The reason for this behaviour: https://github.com/hashicorp/vault-helm/issues/238

The vault-agent-injector was also recreated, and all rescheduled pods come back without the Vault container inside.

rchenzheng commented 4 years ago

I'm using the latest version of the vault-helm chart (0.6.0) and this issue still seems to be happening on Kubernetes v1.15.11-gke.5.

However, unlike @ky0shiro, I am seeing the handler requests, as I whitelisted 8080:

2020-06-22T18:47:00.891Z [INFO]  handler: Request received: Method=POST URL=/mutate?timeout=30s
2020-06-22T18:47:00.893Z [DEBUG] handler: checking if should inject agent..
rchenzheng commented 4 years ago

Looks like I had to put the annotations in the right spot.

Annotations: https://www.vaultproject.io/docs/platform/k8s/injector/examples
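
For anyone else hitting this: the annotations must sit on the pod template, not on the Deployment's own metadata, because the webhook mutates pods. A minimal sketch:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: app            # annotations placed here are ignored by the injector
spec:
  selector:
    matchLabels:
      app: vault-agent-demo
  template:
    metadata:
      annotations:     # the webhook only sees annotations on the pod template
        vault.hashicorp.com/agent-inject: "true"
        vault.hashicorp.com/agent-inject-secret-test: "secret/data/test-secret"
        vault.hashicorp.com/role: "test"
      labels:
        app: vault-agent-demo
    spec:
      containers:
      - name: app
        image: jweissig/app:0.0.1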

HariNarayananMohan commented 4 years ago

I found the issue: as I'm running on OpenShift 3.11 (Kubernetes 1.11), the API config had to be changed so it supports admission controllers.

    MutatingAdmissionWebhook:
      configuration:
        apiVersion: v1
        disable: false
        kind: DefaultAdmissionConfig
    ValidatingAdmissionWebhook:
      configuration:
        apiVersion: v1
        disable: false
        kind: DefaultAdmissionConfig

This block must be present in master-config.yml, in the admissionConfig.pluginConfig section. After restarting the apiserver, the webhook started to kick in. But the sidecar was still not injected because of some permission issues. Granting the consumer app's service account cluster-admin permissions or access to the privileged SCC (the equivalent of a PSP) helped, but that also introduces other security issues.

@mikemowgli Thanks for the info. I added that block of lines to the master-config.yaml, but my OpenShift cluster's logs say the plugin is not enabled. Can you tell me how you enabled it?

I0719 07:28:30.711521       1 plugins.go:84] Registered admission plugin "MutatingAdmissionWebhook"
I0719 07:28:31.408404       1 plugins.go:84] Registered admission plugin "MutatingAdmissionWebhook"
I0719 07:28:32.187811       1 register.go:151] Admission plugin MutatingAdmissionWebhook is not enabled.  It will not be started.
I0719 07:28:32.361736       1 plugins.go:84] Registered admission plugin "MutatingAdmissionWebhook"
HariNarayananMohan commented 4 years ago

I was able to make it work; I had to restart a few services on the master node after making these changes. I followed this link: https://access.redhat.com/solutions/3869391

bienkma commented 4 years ago

Opening ports 8080 and 443 in your VPC's firewall resolves the problem. It works fine for me.

pksurferdad commented 3 years ago

I've been fighting this same issue for several days as well: the agent injector does not finish initializing and the container does not start. I'm running on an AWS EKS cluster, and if it's a port issue between the control plane and the nodes, does anyone know how to enable 8080 on AWS EKS?

pksurferdad commented 3 years ago

Following these instructions from AWS, I added an inbound rule to the node security group to allow all TCP traffic on ports 0-65535 from the control plane security group, but no luck with the sample deployment initializing. Below is some log data from Vault and the Vault injector, as well as a kubectl describe of the sample deployment. I could definitely use some guidance on how to troubleshoot this further.

Vault Logs

identity: creating a new entity: alias="id:"892e70b6-508d-6b11-0fe7-4e3d273cb868" canonical_id:"fc7332af-4001-bc99-9252-eaebaa41b826" mount_type:"kubernetes" mount_accessor:"auth_kubernetes_edb6b310" mount_path:"auth/kubernetes/" metadata:{key:"service_account_name" value:"vault-auth"} metadata:{key:"service_account_namespace" value:"default"} metadata:{key:"service_account_secret_name" value:"vault-auth-token-jl7h4"} metadata:{key:"service_account_uid" value:"1a40077e-f3ae-4953-bc6d-9f742d0278d2"} name:"1a40077e-f3ae-4953-bc6d-9f742d0278d2" creation_time:{seconds:1602346759 nanos:237072421} last_update_time:{seconds:1602346759 nanos:237072421} namespace_id:"root""

Injector Logs

Registering telemetry path on "/metrics"
2020-10-10T16:18:12.311Z [INFO]  handler: Starting handler..
Listening on ":8080"...
Updated certificate bundle received. Updating certs...
2020-10-10T16:19:14.074Z [INFO]  handler: Request received: Method=POST URL=/mutate?timeout=30s
2020-10-10T16:19:14.076Z [DEBUG] handler: checking if should inject agent..
2020-10-10T16:19:14.076Z [DEBUG] handler: checking namespaces..
2020-10-10T16:19:14.076Z [DEBUG] handler: setting default annotations..
2020-10-10T16:19:14.077Z [DEBUG] handler: creating new agent..
2020-10-10T16:19:14.077Z [DEBUG] handler: validating agent configuration..
2020-10-10T16:19:14.077Z [DEBUG] handler: creating patches for the pod..

kubectl describe from the sample deployment

Name:           app-d6d9b9755-2l856
Namespace:      default
Priority:       0
Node:           ip-192-168-26-157.ec2.internal/192.168.26.157
Start Time:     Sat, 10 Oct 2020 11:19:14 -0500
Labels:         app=vault-agent-demo
                pod-template-hash=d6d9b9755
Annotations:    kubernetes.io/psp: eks.privileged
                vault.hashicorp.com/agent-inject: true
                vault.hashicorp.com/agent-inject-secret-poc-secret: secrets/dev/poc-secret
                vault.hashicorp.com/agent-inject-status: injected
                vault.hashicorp.com/role: app-user
                vault.hashicorp.com/tls-skip-verify: true
Status:         Pending
IP:             192.168.9.35
Controlled By:  ReplicaSet/app-d6d9b9755
Init Containers:
  vault-agent-init:
    Container ID:  docker://6ab5f0688f5dea6a416fa5ad8fc5395675ebba37ea1f54a1b4f7e1b56d4cb768
    Image:         vault:1.5.2
    Image ID:      docker-pullable://vault@sha256:9aa46d9d9987562013bfadce166570e1705de619c9ae543be7c61953f3229923
    Port:          <none>
    Host Port:     <none>
    Command:
      /bin/sh
      -ec
    Args:
      echo ${VAULT_CONFIG?} | base64 -d > /home/vault/config.json && vault agent -config=/home/vault/config.json
    State:          Running
      Started:      Sat, 10 Oct 2020 11:19:19 -0500
    Ready:          False
    Restart Count:  0
    Limits:
      cpu:     500m
      memory:  128Mi
    Requests:
      cpu:     250m
      memory:  64Mi
    Environment:
      VAULT_LOG_LEVEL:  info
      VAULT_CONFIG:     eyJhdXRvX2F1dGgiOnsibWV0aG9kIjp7InR5cGUiOiJrdWJlcm5ldGVzIiwibW91bnRfcGF0aCI6ImF1dGgva3ViZXJuZXRlcyIsImNvbmZpZyI6eyJyb2xlIjoiYXBwLXVzZXIifX0sInNpbmsiOlt7InR5cGUiOiJmaWxlIiwiY29uZmlnIjp7InBhdGgiOiIvaG9tZS92YXVsdC8udmF1bHQtdG9rZW4ifX1dfSwiZXhpdF9hZnRlcl9hdXRoIjp0cnVlLCJwaWRfZmlsZSI6Ii9ob21lL3ZhdWx0Ly5waWQiLCJ2YXVsdCI6eyJhZGRyZXNzIjoiaHR0cHM6Ly92YXVsdC52YXVsdC5zdmM6ODIwMCIsInRsc19za2lwX3ZlcmlmeSI6dHJ1ZX0sInRlbXBsYXRlIjpbeyJkZXN0aW5hdGlvbiI6Ii92YXVsdC9zZWNyZXRzL3BvYy1zZWNyZXQiLCJjb250ZW50cyI6Int7IHdpdGggc2VjcmV0IFwic2VjcmV0cy9kZXYvcG9jLXNlY3JldFwiIH19e3sgcmFuZ2UgJGssICR2IDo9IC5EYXRhIH19e3sgJGsgfX06IHt7ICR2IH19XG57eyBlbmQgfX17eyBlbmQgfX0iLCJsZWZ0X2RlbGltaXRlciI6Int7IiwicmlnaHRfZGVsaW1pdGVyIjoifX0ifV19
    Mounts:
      /home/vault from home-init (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from vault-auth-token-jl7h4 (ro)
      /vault/secrets from vault-secrets (rw)
Containers:
  app:
    Container ID:   
    Image:          jweissig/app:0.0.1
    Image ID:       
    Port:           <none>
    Host Port:      <none>
    State:          Waiting
      Reason:       PodInitializing
    Ready:          False
    Restart Count:  0
    Environment:    <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from vault-auth-token-jl7h4 (ro)
      /vault/secrets from vault-secrets (rw)
  vault-agent:
    Container ID:  
    Image:         vault:1.5.2
    Image ID:      
    Port:          <none>
    Host Port:     <none>
    Command:
      /bin/sh
      -ec
    Args:
      echo ${VAULT_CONFIG?} | base64 -d > /home/vault/config.json && vault agent -config=/home/vault/config.json
    State:          Waiting
      Reason:       PodInitializing
    Ready:          False
    Restart Count:  0
    Limits:
      cpu:     500m
      memory:  128Mi
    Requests:
      cpu:     250m
      memory:  64Mi
    Environment:
      VAULT_LOG_LEVEL:  info
      VAULT_CONFIG:     eyJhdXRvX2F1dGgiOnsibWV0aG9kIjp7InR5cGUiOiJrdWJlcm5ldGVzIiwibW91bnRfcGF0aCI6ImF1dGgva3ViZXJuZXRlcyIsImNvbmZpZyI6eyJyb2xlIjoiYXBwLXVzZXIifX0sInNpbmsiOlt7InR5cGUiOiJmaWxlIiwiY29uZmlnIjp7InBhdGgiOiIvaG9tZS92YXVsdC8udmF1bHQtdG9rZW4ifX1dfSwiZXhpdF9hZnRlcl9hdXRoIjpmYWxzZSwicGlkX2ZpbGUiOiIvaG9tZS92YXVsdC8ucGlkIiwidmF1bHQiOnsiYWRkcmVzcyI6Imh0dHBzOi8vdmF1bHQudmF1bHQuc3ZjOjgyMDAiLCJ0bHNfc2tpcF92ZXJpZnkiOnRydWV9LCJ0ZW1wbGF0ZSI6W3siZGVzdGluYXRpb24iOiIvdmF1bHQvc2VjcmV0cy9wb2Mtc2VjcmV0IiwiY29udGVudHMiOiJ7eyB3aXRoIHNlY3JldCBcInNlY3JldHMvZGV2L3BvYy1zZWNyZXRcIiB9fXt7IHJhbmdlICRrLCAkdiA6PSAuRGF0YSB9fXt7ICRrIH19OiB7eyAkdiB9fVxue3sgZW5kIH19e3sgZW5kIH19IiwibGVmdF9kZWxpbWl0ZXIiOiJ7eyIsInJpZ2h0X2RlbGltaXRlciI6In19In1dfQ==
    Mounts:
      /home/vault from home-sidecar (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from vault-auth-token-jl7h4 (ro)
      /vault/secrets from vault-secrets (rw)
Conditions:
  Type              Status
  Initialized       False 
  Ready             False 
  ContainersReady   False 
  PodScheduled      True 
Volumes:
  vault-auth-token-jl7h4:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  vault-auth-token-jl7h4
    Optional:    false
  home-init:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:     Memory
    SizeLimit:  <unset>
  home-sidecar:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:     Memory
    SizeLimit:  <unset>
  vault-secrets:
    Type:        EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:      Memory
    SizeLimit:   <unset>
QoS Class:       Burstable
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type    Reason     Age        From                                     Message
  ----    ------     ----       ----                                     -------
  Normal  Scheduled  <unknown>  default-scheduler                        Successfully assigned default/app-d6d9b9755-2l856 to ip-192-168-26-157.ec2.internal
  Normal  Pulling    12m        kubelet, ip-192-168-26-157.ec2.internal  Pulling image "vault:1.5.2"
  Normal  Pulled     12m        kubelet, ip-192-168-26-157.ec2.internal  Successfully pulled image "vault:1.5.2"
  Normal  Created    12m        kubelet, ip-192-168-26-157.ec2.internal  Created container vault-agent-init
  Normal  Started    12m        kubelet, ip-192-168-26-157.ec2.internal  Started container vault-agent-init
jasonodonnell commented 3 years ago

@pksurferdad This all looks good, it did indeed inject. You need to check the vault-agent-init logs to see what's wrong (likely permissions with your Vault role).

kubectl logs <your app pod> -c vault-agent-init

I see you're trying to get a KV secret. Which version of KV is this (1 or 2)? If you're not sure, provide the output from:

vault secrets list -detailed

Additionally you should provide the policy that you're attaching to app-user so I can verify you have the correct permissions.

If you're getting "permission denied" errors at login, there could be something wrong on Vault's end (like the K8s auth method wasn't configured correctly). Please provide the Vault server logs.
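
For reference, the inject path differs between KV versions: KV v2 reads go through an extra data/ segment in the API path, while KV v1 paths do not. A sketch (mount names are examples):

# KV v2: mount "secret", secret written to secret/test-secret
vault.hashicorp.com/agent-inject-secret-test: "secret/data/test-secret"
# KV v1: mount "kv", secret written to kv/test-secret
vault.hashicorp.com/agent-inject-secret-test: "kv/test-secret"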

pksurferdad commented 3 years ago

Thanks for responding @jasonodonnell. Well, the init logs were certainly helpful and they led me to my problem: an incorrect secret path. Thanks so much, it's working now. I am going to undo some of the AWS networking changes I made to see if they were even necessary.

vault-agent-init logs

2020-10-10T18:27:17.282Z [INFO]  sink.file: creating file sink
2020-10-10T18:27:17.282Z [INFO]  sink.file: file sink configured: path=/home/vault/.vault-token mode=-rw-r-----
2020-10-10T18:27:17.283Z [INFO]  auth.handler: starting auth handler
2020-10-10T18:27:17.283Z [INFO]  auth.handler: authenticating
2020-10-10T18:27:17.283Z [INFO]  template.server: starting template server
2020/10/10 18:27:17.283255 [INFO] (runner) creating new runner (dry: false, once: false)
2020-10-10T18:27:17.283Z [INFO]  sink.server: starting sink server
2020/10/10 18:27:17.283831 [WARN] (clients) disabling vault SSL verification
2020/10/10 18:27:17.283843 [INFO] (runner) creating watcher
2020-10-10T18:27:17.297Z [INFO]  auth.handler: authentication successful, sending token to sinks
2020-10-10T18:27:17.297Z [INFO]  auth.handler: starting renewal process
2020-10-10T18:27:17.297Z [INFO]  sink.file: token written: path=/home/vault/.vault-token
2020-10-10T18:27:17.297Z [INFO]  sink.server: sink server stopped
2020-10-10T18:27:17.297Z [INFO]  sinks finished, exiting
2020-10-10T18:27:17.297Z [INFO]  template.server: template server received new token
2020/10/10 18:27:17.297652 [INFO] (runner) stopping
2020/10/10 18:27:17.297677 [INFO] (runner) creating new runner (dry: false, once: false)
2020/10/10 18:27:17.297800 [WARN] (clients) disabling vault SSL verification
2020/10/10 18:27:17.297825 [INFO] (runner) creating watcher
2020/10/10 18:27:17.297863 [INFO] (runner) starting
2020-10-10T18:27:17.306Z [INFO]  auth.handler: renewed auth token
2020/10/10 18:27:17.314963 [WARN] (view) vault.read(secrets/dev/poc-secret): no secret exists at secrets/dev/poc-secret (retry attempt 1 after "250ms")
2020/10/10 18:27:17.572730 [WARN] (view) vault.read(secrets/dev/poc-secret): no secret exists at secrets/dev/poc-secret (retry attempt 2 after "500ms")
2020/10/10 18:27:18.080373 [WARN] (view) vault.read(secrets/dev/poc-secret): no secret exists at secrets/dev/poc-secret (retry attempt 3 after "1s")
2020/10/10 18:27:19.088366 [WARN] (view) vault.read(secrets/dev/poc-secret): no secret exists at secrets/dev/poc-secret (retry attempt 4 after "2s")
2020/10/10 18:27:21.096020 [WARN] (view) vault.read(secrets/dev/poc-secret): no secret exists at secrets/dev/poc-secret (retry attempt 5 after "4s")
2020/10/10 18:27:25.104668 [WARN] (view) vault.read(secrets/dev/poc-secret): no secret exists at secrets/dev/poc-secret (retry attempt 6 after "8s")
2020/10/10 18:27:33.112358 [WARN] (view) vault.read(secrets/dev/poc-secret): no secret exists at secrets/dev/poc-secret (retry attempt 7 after "16s")
pksurferdad commented 3 years ago

I also confirmed that the AWS security group changes I made here https://github.com/hashicorp/vault-k8s/issues/32#issuecomment-706575682 were not necessary. Looks like the default AWS EKS cluster deployment using eksctl doesn't require any additional inbound or outbound security group rules.

aRestless commented 3 years ago

I found the issue: as I'm running on OpenShift 3.11 (Kubernetes 1.11), the API config had to be changed so it supports admission controllers.

    MutatingAdmissionWebhook:
      configuration:
        apiVersion: v1
        disable: false
        kind: DefaultAdmissionConfig
    ValidatingAdmissionWebhook:
      configuration:
        apiVersion: v1
        disable: false
        kind: DefaultAdmissionConfig

This block must be present in master-config.yml, in the admissionConfig.pluginConfig section. After restarting the apiserver, the webhook started to kick in. But the sidecar was still not injected because of some permission issues. Granting the consumer app's service account cluster-admin permissions or access to the privileged SCC (the equivalent of a PSP) helped, but that also introduces other security issues.

I too am running OpenShift 3.11. The error @mikemowgli hinted at here, which comes up if the privileged SCC isn't set, is: Error creating: pods "<podname>" is forbidden: unable to validate against any pod security policy: []

Adding the privileged SCC to the pod's service account worked for me, but that's not an option for production. The cluster-admin permission implies the privileged SCC, which is why adding that role also works.

Upon further investigation I'm convinced this relates to https://github.com/kubernetes/kubernetes/issues/65716 and may have been changed in newer Kubernetes versions. The way I understand it, there are multiple hooks being called before Kubernetes spins up the pod, and the last in that chain is the hook that checks against the Pod Security Policy, i.e. it answers the question "is this pod allowed to run in the configuration that may have been altered by the other hooks?".

Apparently on OpenShift 3.11 / Kubernetes 1.11, while or after executing the MutatingAdmissionWebhook, the securityContext of the resulting pod is lost or not available. This also explains why the list of pod security policies is empty ([]) in the error message.

Knowing from @mikemowgli's answer that it could be fixed with an SCC, I played around and found that, to avoid the error, the requiredDropCapabilities property in the SCC must be empty. It is not a specific entry in the list that makes the check fail; I think that if there is any entry in the list, a check is executed that is then missing the aforementioned context.

I was able to copy the restricted SCC, set requiredDropCapabilities: [], and assign the SCC to my pod's service account, and the pod with the injector came up.

This is not as bad as assigning the privileged SCC, but it certainly has its security implications and I'm not sure yet if that's okay for production. The capabilities being dropped by default are SET_UID, SET_GID, MKNOD, and KILL.
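
For anyone who wants to try the same, a sketch of such an SCC (the name and bound service account are placeholders; the remaining fields should be copied from your cluster's restricted SCC and may differ from what's shown):

apiVersion: security.openshift.io/v1
kind: SecurityContextConstraints
metadata:
  name: restricted-no-drop               # hypothetical name
allowPrivilegedContainer: false
requiredDropCapabilities: []             # the one change vs. restricted
runAsUser:
  type: MustRunAsRange
seLinuxContext:
  type: MustRunAs
fsGroup:
  type: MustRunAs
supplementalGroups:
  type: RunAsAny
users:
- system:serviceaccount:my-namespace:app # the consumer app's service account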

If anyone could shed more light on this, that would be great. Otherwise there's probably nothing left but to upgrade to OpenShift 4.x to use the Vault injector.

agates4 commented 3 years ago

Hey @jasonodonnell, I see you are a great resource for updating configurations.

Intention

I am working on getting all the Vault setup steps (all that is left is to include the injector) fully automated with Terraform: https://github.com/sethvargo/vault-on-gke/pull/98

Problem

The above PR shows my changes for getting the vault-injector up and running via this Terraform project.

I added this firewall rule: https://github.com/sethvargo/vault-on-gke/pull/98/files#diff-833c22bd299aef6aabfe1b427e9ee5f6fe6ca27f9f54ef81f2fb9fb32a5ddb8dR389-R406 which allows mutating requests to come into the sidecar injector:

2021-08-13T15:33:42.096Z [INFO]  handler: Request received: Method=POST URL=/mutate?timeout=10s
2021-08-13T15:33:42.105Z [DEBUG] handler: checking if should inject agent..
2021-08-13T15:33:42.106Z [DEBUG] handler: checking namespaces..
2021-08-13T15:33:42.106Z [DEBUG] handler: setting default annotations..
2021-08-13T15:33:42.106Z [DEBUG] handler: creating new agent..
2021-08-13T15:33:42.107Z [DEBUG] handler: validating agent configuration..
2021-08-13T15:33:42.107Z [DEBUG] handler: creating patches for the pod..

however, no patches were made to the pod

Annotations:  cni.projectcalico.org/podIP: 10.0.94.28/32
              cni.projectcalico.org/podIPs: 10.0.94.28/32
              vault.hashicorp.com/agent-inject: true
              vault.hashicorp.com/agent-inject-secret-foo: secret/foo
              vault.hashicorp.com/role: internal-app

^ the pod still has the same annotations, no additional annotations added, and no secrets injected.

To replicate

On this PR, https://github.com/sethvargo/vault-on-gke/pull/98: clone it locally and run the README instructions.

Then run these CLI commands (after the README's export-env-variables instructions):

# enable secrets, add a secret, write a new policy
vault secrets enable -path=secret -version=2 kv
vault kv put secret/foo a=b
vault policy write internal-app - <<EOH
path "secret/*" {
  capabilities = ["read"]
}
EOH

# get into the vault container
gcloud container clusters get-credentials vault --region us-central1
kubectl exec -n vault -it vault-0 --container vault /bin/sh

-- inside container --
# enable service to service auth via kubernetes
export VAULT_TOKEN="put in master token"
vault auth enable kubernetes
vault write auth/kubernetes/config \
    token_reviewer_jwt="$(cat /var/run/secrets/kubernetes.io/serviceaccount/token)" \
    kubernetes_host="https://$KUBERNETES_PORT_443_TCP_ADDR:443" \
    kubernetes_ca_cert=@/var/run/secrets/kubernetes.io/serviceaccount/ca.crt
--

# add a specific role for the internal-app service account
vault write auth/kubernetes/role/internal-app \
    bound_service_account_names=internal-app \
    bound_service_account_namespaces=vault \
    policies=internal-app \
    ttl=24h

And then I simply deploy this Helm app, which defines the annotations: https://github.com/agates4/sample-vault-helm-template (inside the repo: helm install python-service .)

then I check all the logs and see the problem I first described above ^

Hypothesis

  1. maybe the vault address I listed for the vault injector app is wrong
  2. maybe the vault injector is blocked by firewall rules to the vault app
  3. random config mess ups?

My ask

@jasonodonnell, do you think you can help me update this Terraform project to work fully out of the box? Can you point me in the right direction?

Thank you!

agates4 commented 3 years ago

UPDATE

The problem was that the MutatingWebhookConfiguration I created in Terraform was using

admission_review_versions = ["v1", "v1beta"]

when it should be using

admission_review_versions = ["v1beta1"]

thanks to https://githubmemory.com/repo/hashicorp/vault-k8s/issues?cursor=Y3Vyc29yOnYyOpK5MjAyMS0wNS0yMVQxNjozNjoyOSswODowMM41g65h&pagination=next&page=2
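
For anyone else defining the webhook in Terraform, the relevant part of the resource looks roughly like this (a sketch; names, namespace, and CA bundle handling are specific to your setup):

resource "kubernetes_mutating_webhook_configuration" "vault" {
  metadata {
    name = "vault-agent-injector-cfg"
  }

  webhook {
    name                      = "vault.hashicorp.com"
    admission_review_versions = ["v1beta1"]   # not ["v1", "v1beta"]

    client_config {
      service {
        name      = "vault-agent-injector-svc"
        namespace = "vault"
        path      = "/mutate"
      }
    }

    rule {
      api_groups   = [""]
      api_versions = ["v1"]
      operations   = ["CREATE", "UPDATE"]
      resources    = ["pods"]
    }

    failure_policy = "Ignore"
    side_effects   = "None"
  }
}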

Now the injector injects a vault-init container into the deployed python-starter pod.

however I am getting this error on the init container:

error authenticating: error="context deadline exceeded" backoff=1s

This error means the init container is stuck forever and the python-starter app is never fully deployed or ready...

diving into this..

agates4 commented 3 years ago

Alright folks -

I codified in Terraform the entire process of getting a sidecar injector working, even on external clusters: https://github.com/sethvargo/vault-on-gke/pull/98

^ Fully documented in this PR 👍

I hope this helps someone! It took me quite a bit of diving in to get this fully working out of the box!

pcgeek86 commented 2 years ago

I'm having the same symptom; however, in my case the MutatingWebhookConfiguration resource is never created by the Helm chart release.

PS > kubectl get mutatingwebhookconfigurations
NAME                                    WEBHOOKS   AGE
linkerd-proxy-injector-webhook-config   1          46d
linkerd-tap-injector-webhook-config     1          46d
webhook.pipeline.tekton.dev             1          6d20h

As you can see from the above output, I ran kubectl from PowerShell, and there is no mutating webhook for Vault, even though I installed it with the Helm Chart.

EDIT: The issue, at least in my case, was that I had installed the Vault Helm Chart in different namespaces, and had deleted one of them. That caused the MutatingWebhookConfiguration to be deleted, even though I still had a valid Helm release in a different namespace.
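
If you end up in that state, re-syncing the surviving release should recreate the webhook (Helm 3 restores manifest resources that were deleted out-of-band); a sketch, assuming the release is named vault in the vault namespace:

helm repo add hashicorp https://helm.releases.hashicorp.com
helm upgrade vault hashicorp/vault --namespace vault --reuse-values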

tomhjp commented 2 years ago

I'm going to close this as it seems the original issue is resolved. Please feel free to post in our discuss forum if anyone is still having issues debugging their deployment: https://discuss.hashicorp.com/c/vault/30