jtblin / kube2iam

kube2iam provides different AWS IAM roles for pods running on Kubernetes
BSD 3-Clause "New" or "Revised" License

kube2iam Race Condition? Role Trying To Assume Itself #280

Open better-sachin opened 3 years ago

better-sachin commented 3 years ago

We are running into an intermittent issue on our Kubernetes pods that use kube2iam to provide IAM credentials to containers: the assumed role tries to assume itself.
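For context, the role is handed to the pod through kube2iam's iam.amazonaws.com/role annotation; the relevant part of our pod spec looks roughly like this (names, image, and ARNs are placeholders):

apiVersion: v1
kind: Pod
metadata:
  name: <pod>
  namespace: <namespace>
  annotations:
    # kube2iam reads this annotation and serves temporary credentials for
    # this role to the pod via the EC2 metadata endpoint it proxies.
    iam.amazonaws.com/role: arn:aws:iam::<account-id>:role/service/<role-name>
spec:
  containers:
    - name: app
      image: <image>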

The first thing our pod does on startup is decrypt its SOPS-encrypted secrets.

We are getting this error message while decrypting:

Failed to get the data key required to decrypt the SOPS file.
Group 0: FAILED
  arn:aws:kms:us-east-1:<account-id>:key/<uuid>: FAILED
    - | Error decrypting key: NoCredentialProviders: no valid
      | providers in chain. Deprecated.
      |     For verbose messaging see
      | aws.Config.CredentialsChainVerboseErrors

  arn:aws:kms:us-east-1:<account-id>:key/<uuid>: FAILED
    - | Error creating AWS session: Failed to assume role
      | "arn:aws:iam::<account-id>:role/service/<role-name>":
      | AccessDenied: User:
      | arn:aws:sts::<account-id>:assumed-role/<role-name>/<role-session-name>
      | is not authorized to perform: sts:AssumeRole on resource:
      | arn:aws:iam::<account-id>:role/service/<role-name>
      |     status code: 403, request id:
      | a401448b-6242-46d1-80d7-7e14396b4ad0
Recovery failed because no master key was able to decrypt the file. In
order for SOPS to recover the file, at least one key has to be successful,
but none were.

https://docs.aws.amazon.com/IAM/latest/UserGuide/reference_identifiers.html
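The "Failed to assume role" lines suggest SOPS itself is calling sts:AssumeRole: when a file is encrypted with a KMS key that has a role attached (the key+role syntax), SOPS assumes that role before talking to KMS, even if the caller already holds that role's credentials via kube2iam. Presumably this comes from a creation rule roughly like the following (ARNs are placeholders; the +role suffix is what triggers the extra AssumeRole):

# .sops.yaml (sketch)
creation_rules:
  - path_regex: .*
    # The +<role-arn> suffix tells SOPS to assume this role before calling
    # KMS; when the caller is already running as this role but the role's
    # trust policy does not let it assume itself, that call fails with
    # AccessDenied as shown above.
    kms: "arn:aws:kms:us-east-1:<account-id>:key/<uuid>+arn:aws:iam::<account-id>:role/service/<role-name>"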

We have enabled verbose logging on kube2iam and see these entries related to the pod that hits the error:

time="2020-09-11T09:00:04Z" level=debug msg="Pod OnAdd" pod.iam.role="arn:aws:iam::<account-id>:role/service/<role-name>" pod.name=<pod> pod.namespace=<namespace> pod.status.ip= pod.status.phase=Pending
time="2020-09-11T09:00:05Z" level=debug msg="Pod OnUpdate" pod.iam.role="arn:aws:iam::<account-id>:role/service/<role-name>" pod.name=<pod> pod.namespace=<namespace> pod.status.ip= pod.status.phase=Pending
time="2020-09-11T09:00:08Z" level=debug msg="Pod OnUpdate" pod.iam.role="arn:aws:iam::<account-id>:role/service/<role-name>" pod.name=<pod> pod.namespace=<namespace> pod.status.ip= pod.status.phase=Pending
time="2020-09-11T09:00:08Z" level=debug msg="Pod OnUpdate" pod.iam.role="arn:aws:iam::<account-id>:role/service/<role-name>" pod.name=<pod> pod.namespace=<namespace> pod.status.ip=<ip> pod.status.phase=Running
time="2020-09-11T09:00:10Z" level=debug msg="Pod OnUpdate" pod.iam.role="arn:aws:iam::<account-id>:role/service/<role-name>" pod.name=<pod> pod.namespace=<namespace> pod.status.ip=<ip> pod.status.phase=Failed
time="2020-09-11T09:22:54Z" level=debug msg="Pod OnUpdate" pod.iam.role="arn:aws:iam::<account-id>:role/service/<role-name>" pod.name=<pod> pod.namespace=<namespace> pod.status.ip=<ip> pod.status.phase=Failed
time="2020-09-11T09:52:54Z" level=debug msg="Pod OnUpdate" pod.iam.role="arn:aws:iam::<account-id>:role/service/<role-name>" pod.name=<pod> pod.namespace=<namespace> pod.status.ip=<ip> pod.status.phase=Failed
time="2020-09-11T10:22:54Z" level=debug msg="Pod OnUpdate" pod.iam.role="arn:aws:iam::<account-id>:role/service/<role-name>" pod.name=<pod> pod.namespace=<namespace> pod.status.ip=<ip> pod.status.phase=Failed
time="2020-09-11T10:52:54Z" level=debug msg="Pod OnUpdate" pod.iam.role="arn:aws:iam::<account-id>:role/service/<role-name>" pod.name=<pod> pod.namespace=<namespace> pod.status.ip=<ip> pod.status.phase=Failed
time="2020-09-11T11:22:54Z" level=debug msg="Pod OnUpdate" pod.iam.role="arn:aws:iam::<account-id>:role/service/<role-name>" pod.name=<pod> pod.namespace=<namespace> pod.status.ip=<ip> pod.status.phase=Failed
time="2020-09-11T11:52:54Z" level=debug msg="Pod OnUpdate" pod.iam.role="arn:aws:iam::<account-id>:role/service/<role-name>" pod.name=<pod> pod.namespace=<namespace> pod.status.ip=<ip> pod.status.phase=Failed
time="2020-09-11T12:22:54Z" level=debug msg="Pod OnUpdate" pod.iam.role="arn:aws:iam::<account-id>:role/service/<role-name>" pod.name=<pod> pod.namespace=<namespace> pod.status.ip=<ip> pod.status.phase=Failed
time="2020-09-11T12:52:54Z" level=debug msg="Pod OnUpdate" pod.iam.role="arn:aws:iam::<account-id>:role/service/<role-name>" pod.name=<pod> pod.namespace=<namespace> pod.status.ip=<ip> pod.status.phase=Failed
time="2020-09-11T13:22:54Z" level=debug msg="Pod OnUpdate" pod.iam.role="arn:aws:iam::<account-id>:role/service/<role-name>" pod.name=<pod> pod.namespace=<namespace> pod.status.ip=<ip> pod.status.phase=Failed
time="2020-09-11T13:52:54Z" level=debug msg="Pod OnUpdate" pod.iam.role="arn:aws:iam::<account-id>:role/service/<role-name>" pod.name=<pod> pod.namespace=<namespace> pod.status.ip=<ip> pod.status.phase=Failed

We're also seeing this a lot in the kube2iam pods' logs:

time="2020-09-11T08:56:52Z" level=info msg="GET /latest/meta-data/iam/security-credentials/ (404) took 2464435378.000000 ns" req.method=GET req.path=/latest/meta-data/iam/security-credentials/ req.remote=<ip> res.duration=2.464435378e+09 res.status=404

Is this log line expected?

- Is this a race condition between kube2iam and SOPS, where SOPS tries to assume a role before kube2iam has fully assumed it?
- Is there a way to set the trust relationship of the role so that it can assume itself? Something like the sketch below is what we have in mind.
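Roughly, we are thinking of adding the role's own ARN as a trusted principal in its trust policy, e.g. in CloudFormation-style YAML (logical name and ARNs are placeholders; the existing statement trusting the kube2iam worker-node role would stay as is):

ServiceRole:
  Type: AWS::IAM::Role
  Properties:
    RoleName: <role-name>
    Path: /service/
    AssumeRolePolicyDocument:
      Version: "2012-10-17"
      Statement:
        # ... existing statement trusting the kube2iam worker-node role ...
        # Extra statement so the role may assume itself:
        - Effect: Allow
          Principal:
            AWS: arn:aws:iam::<account-id>:role/service/<role-name>
          Action: sts:AssumeRole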

mwhittington21 commented 3 years ago

Try adding an init container that sleeps for a while and see if that solves the problem. If the pod starts up too quickly, it may cause problems with the role assumption; however, I believe this particular race was fixed a while ago.
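Something along these lines, as a minimal sketch (image and sleep duration are arbitrary; tune to your cluster):

spec:
  initContainers:
    # Give kube2iam a few seconds to register the pod before the main
    # container starts asking the metadata endpoint for credentials.
    - name: wait-for-kube2iam
      image: busybox:1.33
      command: ["sh", "-c", "sleep 15"]

If a blind sleep feels fragile, the same init container could instead loop until http://169.254.169.254/latest/meta-data/iam/security-credentials/ returns the role name rather than the 404 you are seeing in the kube2iam logs.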

If it is not that, then see if you can assume the role from a node directly, without kube2iam as a proxy. If you cannot, the problem lies in the role and trust configuration.

better-sachin commented 3 years ago

Looks like this might be a kube2iam issue with starting up too many pods at once: https://github.com/jtblin/kube2iam/issues/136