jtblin / kube2iam

kube2iam provides different AWS IAM roles for pods running on Kubernetes
BSD 3-Clause "New" or "Revised" License

Role not being picked up by a container in an annotated pod #225

Open jtafurth opened 5 years ago

jtafurth commented 5 years ago

A GitLab runner pod is running on a Kubernetes cluster with kube2iam; the runner spins up build pods with two containers, "build" and "helper".

AWS calls from the "build" container are intercepted correctly and the correct role is assumed, but calls from the "helper" container either are not intercepted or the annotation is not recognized, and kube2iam falls back to the default role. This eventually causes GitLab's cache functionality to return a 403 because the wrong role is assumed.

Has anyone experienced this issue before?

kube2iam logs for the build container:

time="2019-08-07T13:36:02Z" level=debug msg="retrieved credentials from sts endpoint: https://sts.eu-west-1.amazonaws.com" ns.name=gitlab pod.iam.role="arn:aws:iam::XXXXXXXXXX:role/gitlab-GitLabRunnerStack-5IXLXS-RunnerInstanceRole-SB6GE6XITAKV" req.method=GET req.path=/latest/meta-data/iam/security-credentials/gitlab-GitLabRunnerStack-5IXLXS-RunnerInstanceRole-SB6GE6XITAKV req.remote=10.137.4.97

kube2iam logs for the helper container:

time="2019-08-07T13:37:29Z" level=warning msg="Using fallback role for IP 10.137.4.156"
time="2019-08-07T13:37:29Z" level=debug msg="retrieved credentials from sts endpoint: https://sts.eu-west-1.amazonaws.com" ns.name=gitlab pod.iam.role="arn:aws:iam::XXXXXXXXXX:role:role/kube2iam-default" req.method=GET req.path=/latest/meta-data/iam/security-credentials/kube2iam-default req.remote=10.137.4.156
time="2019-08-07T13:37:29Z" level=info msg="GET /latest/meta-data/iam/security-credentials/kube2iam-default (200) took 68986.000000 ns" req.method=GET req.path=/latest/meta-data/iam/security-credentials/kube2iam-default req.remote=10.137.4.156 res.duration=68986 res.status=200

The build pod configuration:

Name:         runner-xqul42y4-project-149-concurrent-0b27w7
Namespace:    gitlab
Priority:     0
Node:         ip-10-137-4-189.eu-west-1.compute.internal/10.137.4.189
Start Time:   Wed, 07 Aug 2019 15:35:14 +0200
Labels:       pod=runner-xqul42y4-project-149-concurrent-0
Annotations:  iam.amazonaws.com/role: arn:aws:iam::296095062504:role/gitlab-GitLabRunnerStack-5IXLXS-RunnerInstanceRole-SB6GE6XITAKV
              kubernetes.io/psp: eks.privileged
Status:       Running
IP:           10.137.4.97
Containers:
  build:
    Container ID:  docker://13a1ef0798db732550e39035b62e77b866ffb4338696b2d08b696fa2c3344122
    Image:         XXXXXXXXXX:role.dkr.ecr.eu-west-1.amazonaws.com/platform/frontend-builder:7f2369e9-59903
    Image ID:      docker-pullable://XXXXXXXXXX:role.dkr.ecr.eu-west-1.amazonaws.com/platform/frontend-builder@sha256:086fbd257efba5e7fcbe432ced3ef36d8dc2424ae3e897d3f47f84572d382df9
    Port:          <none>
    Host Port:     <none>
    Command:
      sh
      -c
      if [ -x /usr/local/bin/bash ]; then
        exec /usr/local/bin/bash 
      elif [ -x /usr/bin/bash ]; then
        exec /usr/bin/bash 
      elif [ -x /bin/bash ]; then
        exec /bin/bash 
      elif [ -x /usr/local/bin/sh ]; then
        exec /usr/local/bin/sh 
      elif [ -x /usr/bin/sh ]; then
        exec /usr/bin/sh 
      elif [ -x /bin/sh ]; then
        exec /bin/sh 
      elif [ -x /busybox/sh ]; then
        exec /busybox/sh 
      else
        echo shell not found
        exit 1
      fi

    State:          Running
      Started:      Wed, 07 Aug 2019 15:35:15 +0200
    Ready:          True
    Restart Count:  0
    Environment: REMOVED
    Mounts:
      /builds from repo (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-lrtn9 (ro)
  helper:
    Container ID:  docker://a2ac54aa03bccab1010d0595b0ad6958a2a448f44040663020f0466a7172674a
    Image:         gitlab/gitlab-runner-helper:x86_64-de7731dd
    Image ID:      docker-pullable://gitlab/gitlab-runner-helper@sha256:a68dc1b0468d5d01b2b70b85aa90acfbb13434e0ae84b1fea5bedccaa9847301
    Port:          <none>
    Host Port:     <none>
    Command:
      sh
      -c
      if [ -x /usr/local/bin/bash ]; then
        exec /usr/local/bin/bash 
      elif [ -x /usr/bin/bash ]; then
        exec /usr/bin/bash 
      elif [ -x /bin/bash ]; then
        exec /bin/bash 
      elif [ -x /usr/local/bin/sh ]; then
        exec /usr/local/bin/sh 
      elif [ -x /usr/bin/sh ]; then
        exec /usr/bin/sh 
      elif [ -x /bin/sh ]; then
        exec /bin/sh 
      elif [ -x /busybox/sh ]; then
        exec /busybox/sh 
      else
        echo shell not found
        exit 1
      fi

    State:          Running
      Started:      Wed, 07 Aug 2019 15:35:15 +0200
    Ready:          True
    Restart Count:  0
    Environment: REMOVED
    Mounts:
      /builds from repo (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-lrtn9 (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             True 
  ContainersReady   True 
  PodScheduled      True 
Volumes:
  repo:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:     
    SizeLimit:  <unset>
  default-token-lrtn9:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-lrtn9
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type    Reason     Age   From                                                 Message
  ----    ------     ----  ----                                                 -------
  Normal  Scheduled  100s  default-scheduler                                    Successfully assigned gitlab/runner-xqul42y4-project-149-concurrent-0b27w7 to ip-10-137-4-189.eu-west-1.compute.internal
  Normal  Pulled     100s  kubelet, ip-10-137-4-189.eu-west-1.compute.internal  Container image "296095062504.dkr.ecr.eu-west-1.amazonaws.com/platform/frontend-builder:7f2369e9-59903" already present on machine
  Normal  Created    100s  kubelet, ip-10-137-4-189.eu-west-1.compute.internal  Created container
  Normal  Started    99s   kubelet, ip-10-137-4-189.eu-west-1.compute.internal  Started container
  Normal  Pulled     99s   kubelet, ip-10-137-4-189.eu-west-1.compute.internal  Container image "gitlab/gitlab-runner-helper:x86_64-de7731dd" already present on machine
  Normal  Created    99s   kubelet, ip-10-137-4-189.eu-west-1.compute.internal  Created container
  Normal  Started    99s   kubelet, ip-10-137-4-189.eu-west-1.compute.internal  Started container

As you can see, the annotation is there and the pod has two containers.

The IP 10.137.4.156 in the logs corresponds to the parent runner pod (the one that launches the child pod with the two containers). Its configuration is below; see the sketch after it.

Name:           gitlab-runner-shared-86ddbfdd59-fv9q9
Namespace:      gitlab
Priority:       0
Node:           ip-10-137-4-189.eu-west-1.compute.internal/10.137.4.189
Start Time:     Wed, 07 Aug 2019 14:38:40 +0200
Labels:         app.kubernetes.io/app=gitlab-runner-shared
                pod-template-hash=86ddbfdd59
Annotations:    kubernetes.io/psp: eks.privileged
                prometheus.io/port: 9252
                prometheus.io/scrape: true
Status:         Running
IP:             10.137.4.156
Controlled By:  ReplicaSet/gitlab-runner-shared-86ddbfdd59
Init Containers:
  configure:
    Container ID:  docker://53b9ee3e1ff8a5209025912441a0157586548aa39f6249945a6677c1f91500fa
    Image:         gitlab/gitlab-runner:alpine
    Image ID:      docker-pullable://gitlab/gitlab-runner@sha256:efdf04d68586fa6a203b25354f7eafab37c2ef2ae7df2fe22a944fe6d0662085
    Port:          <none>
    Host Port:     <none>
    Command:
      sh
      /config/configure
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Wed, 07 Aug 2019 14:38:41 +0200
      Finished:     Wed, 07 Aug 2019 14:38:41 +0200
    Ready:          True
    Restart Count:  0
    Environment:
      CI_SERVER_URL:                      https://gitlab.leaseplan.io
      CLONE_URL:                          
      RUNNER_EXECUTOR:                    kubernetes
      REGISTER_LOCKED:                    false
      RUNNER_TAG_LIST:                    k8s, shared
      KUBERNETES_IMAGE:                   alpine:latest
      KUBERNETES_NAMESPACE:               gitlab
      KUBERNETES_CPU_LIMIT:               
      KUBERNETES_MEMORY_LIMIT:            
      KUBERNETES_CPU_REQUEST:             
      KUBERNETES_MEMORY_REQUEST:          
      KUBERNETES_SERVICE_ACCOUNT:         
      KUBERNETES_SERVICE_CPU_LIMIT:       
      KUBERNETES_SERVICE_MEMORY_LIMIT:    
      KUBERNETES_SERVICE_CPU_REQUEST:     
      KUBERNETES_SERVICE_MEMORY_REQUEST:  
      KUBERNETES_HELPER_CPU_LIMIT:        
      KUBERNETES_HELPER_MEMORY_LIMIT:     
      KUBERNETES_HELPER_CPU_REQUEST:      
      KUBERNETES_HELPER_MEMORY_REQUEST:   
      KUBERNETES_HELPER_IMAGE:            
      KUBERNETES_PULL_POLICY:             
      CACHE_TYPE:                         s3
      CACHE_PATH:                         
      CACHE_SHARED:                       true
      CACHE_S3_SERVER_ADDRESS:            s3.amazonaws.com
      CACHE_S3_BUCKET_NAME:               gitlab-gitlabrunnerstack-5ixlxs1i4isf-cachebucket-1fod5sdo77cht
      CACHE_S3_BUCKET_LOCATION:           eu-west-1
      GITLAB_RUNNER_ROLE:                 gitlab-GitLabRunnerStack-5IXLXS-RunnerInstanceRole-SB6GE6XITAKV
      AWS_ACCOUNT:                        296095062504
      RUN_UNTAGGED:                       true
    Mounts:
      /config from scripts (ro)
      /init-secrets from init-runner-secrets (ro)
      /secrets from runner-secrets (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from gitlab-runner-shared-token-hwnnz (ro)
Containers:
  gitlab-runner:
    Container ID:  docker://b5f47b3805e91ae20d9a3661750d1f7a7619e2315237d9fa8de792e33ff4b530
    Image:         gitlab/gitlab-runner:alpine
    Image ID:      docker-pullable://gitlab/gitlab-runner@sha256:efdf04d68586fa6a203b25354f7eafab37c2ef2ae7df2fe22a944fe6d0662085
    Port:          9252/TCP
    Host Port:     0/TCP
    Command:
      /bin/bash
      /scripts/entrypoint
    State:          Running
      Started:      Wed, 07 Aug 2019 14:38:43 +0200
    Ready:          True
    Restart Count:  0
    Liveness:       exec [/bin/bash /scripts/check-live] delay=60s timeout=1s period=10s #success=1 #failure=3
    Readiness:      exec [/usr/bin/pgrep gitlab.*runner] delay=10s timeout=1s period=10s #success=1 #failure=3
    Environment:
      CI_SERVER_URL:                      https://gitlab.leaseplan.io
      CLONE_URL:                          
      RUNNER_EXECUTOR:                    kubernetes
      REGISTER_LOCKED:                    false
      RUNNER_TAG_LIST:                    k8s, shared
      KUBERNETES_IMAGE:                   alpine:latest
      KUBERNETES_NAMESPACE:               gitlab
      KUBERNETES_CPU_LIMIT:               
      KUBERNETES_MEMORY_LIMIT:            
      KUBERNETES_CPU_REQUEST:             
      KUBERNETES_MEMORY_REQUEST:          
      KUBERNETES_SERVICE_ACCOUNT:         
      KUBERNETES_SERVICE_CPU_LIMIT:       
      KUBERNETES_SERVICE_MEMORY_LIMIT:    
      KUBERNETES_SERVICE_CPU_REQUEST:     
      KUBERNETES_SERVICE_MEMORY_REQUEST:  
      KUBERNETES_HELPER_CPU_LIMIT:        
      KUBERNETES_HELPER_MEMORY_LIMIT:     
      KUBERNETES_HELPER_CPU_REQUEST:      
      KUBERNETES_HELPER_MEMORY_REQUEST:   
      KUBERNETES_HELPER_IMAGE:            
      KUBERNETES_PULL_POLICY:             
      CACHE_TYPE:                         s3
      CACHE_PATH:                         
      CACHE_SHARED:                       true
      CACHE_S3_SERVER_ADDRESS:            s3.amazonaws.com
      CACHE_S3_BUCKET_NAME:               gitlab-gitlabrunnerstack-5ixlxs1i4isf-cachebucket-1fod5sdo77cht
      CACHE_S3_BUCKET_LOCATION:           eu-west-1
      GITLAB_RUNNER_ROLE:                 gitlab-GitLabRunnerStack-5IXLXS-RunnerInstanceRole-SB6GE6XITAKV
      AWS_ACCOUNT:                        296095062504
      RUN_UNTAGGED:                       true
    Mounts:
      /home/gitlab-runner/.gitlab-runner from etc-gitlab-runner (rw)
      /scripts from scripts (rw)
      /secrets from runner-secrets (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from gitlab-runner-shared-token-hwnnz (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             True 
  ContainersReady   True 
  PodScheduled      True 
Volumes:
  init-runner-secrets:
    Type:                Projected (a volume that contains injected data from multiple sources)
    SecretName:          gitlab-runner-registration-token-shared-8c95cbthk6
    SecretOptionalName:  <nil>
  runner-secrets:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:     Memory
    SizeLimit:  <unset>
  etc-gitlab-runner:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:     Memory
    SizeLimit:  <unset>
  scripts:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      gitlab-runner-shared-42fg8df727
    Optional:  false
  gitlab-runner-shared-token-hwnnz:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  gitlab-runner-shared-token-hwnnz
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:          <none>
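
Worth noting from the describe above: the runner pod itself only carries the kubernetes.io/psp and prometheus annotations, with no iam.amazonaws.com/role, so kube2iam falling back to the default role for 10.137.4.156 is its expected behaviour. If the cache upload is in fact performed from the runner pod, a possible workaround (a sketch, assuming the Deployment is named gitlab-runner-shared) is to add the same annotation to the runner Deployment's pod template:

```sh
# Hypothetical fix: annotate the runner Deployment's pod template so the runner pod
# (10.137.4.156) maps to the runner role instead of kube2iam's default role.
kubectl -n gitlab patch deployment gitlab-runner-shared --type merge -p \
  '{"spec":{"template":{"metadata":{"annotations":{"iam.amazonaws.com/role":"arn:aws:iam::296095062504:role/gitlab-GitLabRunnerStack-5IXLXS-RunnerInstanceRole-SB6GE6XITAKV"}}}}}'
```

If kube2iam is started with namespace restrictions, the gitlab namespace would also need to allow that role via its iam.amazonaws.com/allowed-roles annotation.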
onimsha commented 4 years ago

Yeah, I'm having a similar situation with Jenkins + k8s agents. Sometimes the pod gets the correct assumed role, sometimes it doesn't. Not sure why.

mo-hit commented 4 years ago

Also experiencing this issue. Most of the time the correct role gets assumed, but it intermittently falls back to the worker group role. If anyone needs any more info, please feel free to reach out.

onimsha commented 4 years ago

Based on our findings so far, the root cause mostly comes from using a random name for the pod label. When we let Jenkins generate the label randomly, the error rate (the pod failing to get the correct credentials) was very high, up to 50%. When we switched to a static label for each Jenkins pod spec per job, the error rate dropped dramatically.

It is not a complete solution, since we still hit the problem, just at a small percentage. Normal workloads (like application deployments) are usually fine because they can tolerate a few seconds of failure in getting IAM credentials before retrying. I guess there is nothing we can do about this on our side, since it depends on kube2iam. For now we just retry the job when it fails.
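
For that intermittent case, a minimal retry sketch (assuming a POSIX shell in the job image and that the expected role name is known ahead of time; the role below is the one from this issue and would need adjusting) that waits for kube2iam to serve the right role before any AWS calls are made:

```sh
# Hypothetical pre-step for a job: wait until kube2iam serves the expected role
# to this pod before touching AWS, instead of failing straight away on the default role.
EXPECTED_ROLE="gitlab-GitLabRunnerStack-5IXLXS-RunnerInstanceRole-SB6GE6XITAKV"
for i in $(seq 1 30); do
  served="$(wget -qO- http://169.254.169.254/latest/meta-data/iam/security-credentials/ 2>/dev/null || true)"
  if [ "$served" = "$EXPECTED_ROLE" ]; then
    echo "kube2iam is serving ${EXPECTED_ROLE}"
    break
  fi
  echo "kube2iam not serving ${EXPECTED_ROLE} yet (got '${served}'), retrying..."
  sleep 2
done
```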