uswitch / kiam

Integrate AWS IAM with Kubernetes

Issues with IAM role access after a rolling update of the k8s cluster #48

Open dgeorgievski opened 6 years ago

dgeorgievski commented 6 years ago

I am using kiam v2.6 with a kops-managed Kubernetes 1.9.5 cluster in AWS. Whenever a rolling update is performed (kops rolling-update), some pods have issues accessing their respective IAM roles.

From the agent error log messages it seems that the mapping between IAM roles and pods is lost. The affected pod is trying to access one IAM role, but kiam believes it is annotated with another.

{
    "addr":"100.124.0.4:36284",
    "level":"error",
    "method":"GET",
    "msg":"error processing request: assume role forbidden: requested 'iam-role-one' but annotated with 'iam-role-two', forbidden",
    "path":"/latest/meta-data/iam/security-credentials/iam-role-one",
    "status":403,
    "time":"2018-03-22T19:45:29Z"
}

The workaround is to restart all kiam servers, after which the correct mapping is restored. I was able to replicate the issue twice in a row.
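For reference, the affected pods request their role via the usual kiam pod annotation, roughly like this (the pod name and image are made up for illustration; the role matches the log above):

    apiVersion: v1
    kind: Pod
    metadata:
      name: example-app                        # hypothetical pod name
      annotations:
        iam.amazonaws.com/role: iam-role-one   # the role kiam should assume for this pod
    spec:
      containers:
        - name: app
          image: example/app:latest            # hypothetical image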

The server is started with the following settings:

        - --json-log
        - --level=info
        - --bind=0.0.0.0:443
        - --cert=/etc/kiam/tls/server.pem
        - --key=/etc/kiam/tls/server-key.pem
        - --ca=/etc/kiam/tls/ca.pem
        - --role-base-arn-autodetect
        - --sync=1m
        - --prometheus-listen-addr=0.0.0.0:9620
        - --prometheus-sync-interval=5s

while the agent is started with:

        - --level=info
        - --iptables
        - --host-interface=weave+
        - --json-log
        - --port=8181
        - --cert=/etc/kiam/tls/agent.pem
        - --key=/etc/kiam/tls/agent-key.pem
        - --ca=/etc/kiam/tls/ca.pem
        - --server-address=kiam-server:443
        - --prometheus-listen-addr=0.0.0.0:9620
        - --prometheus-sync-interval=5s

Please let me know if you need any other information.

Regards,

Dimitar

pingles commented 6 years ago

Interesting. This sounds similar to #47: pods are starting before the watch deltas have been delivered, so the pod cache has the wrong data. Restarting the servers would obviously clear everything and cause a fresh sync, fixing things.

Could you re-run and check whether it recovers after 1 minute (the server's --sync flag controls how frequently it runs a full sync of pods), or reduce the sync interval?
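For example, in the server args you posted, only the sync flag would need to change (a sketch; every other flag stays as-is):

        - --sync=30s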

I'll try and investigate more tomorrow and see what I can find.

To help, could you give me ballpark figures for the number of server processes, the number of pods being restarted together, etc.? Could you also check that the server processes aren't being throttled and have sufficient limits, please?

Thanks for reporting!

dgeorgievski commented 6 years ago

The status of the pods was tested about 10 minutes after the last EC2 node was rotated, which means that, left on their own, the kiam servers would most likely not have purged the invalid entries from the cache.

I'll try repeating the steps with the --sync value lowered to 30s, and with the servers and agents running in debug mode, which should hopefully produce helpful clues about this issue with the cache.

This is a newly provisioned cluster, and it is not used at the moment. The only significant load is produced by the kiam servers, which run on 3 m4.xlarge EC2 instances (4 vCPUs), while the agents run on 4 m4.2xlarge instances (8 vCPUs). Kiam server utilization is relatively low, 6% on average, with occasional spikes up to 19-20%.

There are about 50 pods, across roughly 20 different deployments, for which kiam manages IAM access.

pingles commented 6 years ago

This definitely sounds funky. We've been running kiam for around 6-9 months on clusters with thousands of pods and probably under 100 roles, and haven't come across this problem.

Could you describe a bit more about the connection between the kiam server and the API server(s), please? I wonder if some additional reliability handling needs to be added around the cache's watcher/sync processes; if the sync happens every minute, there shouldn't be a problem after 10 minutes (assuming no other pod movement).

dgeorgievski commented 6 years ago

I've repeated the rolling update of the cluster with the server's --sync=30s, and this time I didn't observe the issue with the cache. Both the servers and agents were started with the log level set to debug.

The following is a summary of what was happening during the rolling update.

Rolling update started. Kops rotates the masters first, one node at a time.

kops rolling-update cluster --name $NAME

node "ip-10-26-66-12.ec2.internal" cordoned
node "ip-10-26-66-12.ec2.internal" cordoned
WARNING: Deleting pods not managed by ReplicationController, ReplicaSet, Job, 
DaemonSet or StatefulSet: etcd-server-events-ip-10-26-66-12.ec2.internal, 
etcd-server-ip-10-26-66-12.ec2.internal, kube-apiserver-ip-10-26-66-12.ec2.internal, 
kube-controller-manager-ip-10-26-66-12.ec2.internal, kube-proxy-ip-10-26-66-12.ec2.internal, kube-scheduler-ip-10-26-66-12.ec2.internal; 
Ignoring DaemonSet-managed pods: kiam-server-xmxhn, weave-net-hz57j
node "ip-10-26-66-12.ec2.internal" drained
I0323 22:17:20.202050    2347 instancegroups.go:237] Stopping instance "i-019efe0582157955b", node "ip-10-26-66-12.ec2.internal", in group "my-master-group".
I0323 22:22:20.474951    2347 instancegroups.go:161] Validating the cluster.
I0323 22:22:20.745330    2347 instancegroups.go:212] Cluster validated.
I0323 22:22:20.891394    2347 instancegroups.go:130] Draining the node: "ip-10-26-76-147.ec2.internal".

Access to the metadata endpoint was observed from a test pod, with the curl command executed every second. While the masters were being rotated, curl would occasionally return the rpc error below; otherwise it returned the correct IAM role associated with the pod.

curl http://169.254.169.254/latest/meta-data/iam/security-credentials/
rpc error: code = DeadlineExceeded desc = context deadline exceeded
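The check was driven by a simple loop inside the test pod, roughly:

    # poll the kiam-proxied metadata endpoint once a second
    while true; do
      curl -s http://169.254.169.254/latest/meta-data/iam/security-credentials/
      echo
      sleep 1
    done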

During the period when the kiam agent nodes were rotated, no curl errors were reported. The agents correctly reported when pods were deleted (rpc error: code = Unknown desc = pod not found), while the kiam servers reported the deltas being processed.

The results of this test, with the sync value set to 30s, are encouraging, but to prove the result was not an anomaly we need at least a couple of additional data points. I might need a week to complete the new series of tests, and I'll be happy to report the results back in this issue.

pingles commented 6 years ago

Please do, thank you for doing some tests!


dgeorgievski commented 6 years ago

Unfortunately the issue appeared again in the 2nd test with the server sync set to 30s. I can also confirm that I'm seeing the issue mentioned in #47 as well.

For instance, in one deployment with 3 replicas, 2 pods reported the "pod not found" error, while the 3rd reported the issue raised by this ticket ("assume role forbidden"). As usual, restarting the servers fixed the issue.

I also noticed in the server logs that the servers were reporting "announced pod" with pod.status.phase set to Running for the very pods that the agents reported as missing.

Unfortunately, log collection is not in place at the moment, so I am not able to provide the full timeline of events.

pingles commented 6 years ago

:(

Could you attach the logs from the servers and agents around that time, please (anonymising pod names, IPs etc. if you need to)? It'd be useful to have a scan through.

The pod announcement is used for the prefetcher but should happen after the k8s client's cache has been updated with the new state; it's very weird that that fails (although it could happen if you're running multiple servers, one of them is updated and the others aren't, and then the short timeout causes the agent to give up after checking the other servers that are missing the state). Hmmm, it might be time to speculatively look up the pod in all servers at the same time.

OK, I'll do some more thinking and see what we can do. If you could send over any of the log data, that would be useful too.

Thanks as ever for reporting and keeping us posted!


kevtaylor commented 6 years ago

We have the same issue: we rolled a cluster this morning and certain pods obtained credentials while others didn't.

When I was looking at the kiam server log, there were a lot of "pod not found" errors.

I'm sorry, but I then bounced the pods, so I wasn't able to capture the details. I'll keep an eye on it as we roll this out further.

pingles commented 6 years ago

If you could capture any of the server log data, that would be useful. It could potentially be slow updates from the API server (which would ultimately correct themselves after the client retries a few times), or there could be an error that the Kubernetes client code within kiam doesn't recover from, potentially as a result of restarting the API servers as part of a cluster upgrade.

#46 mentions a type coercion error that I need to fix (it's a trivial change to make), which may be related; if you still have access to your logs elsewhere, it'd be useful to see whether the same error is reported.

I'll try and sort #46 ASAP in case it happens to be the same cause, and I'll let you know when an update is ready to go, @kevtaylor.

Thanks to everyone for continuing to report these problems.

pingles commented 6 years ago

@kevtaylor I just pushed an update in 562b2e63e008a4733b71d1242804583ff127975a, which is currently on a branch; it should handle the "deleted state unknown" errors, if you want to give it a try and see whether it prevents the errors you saw.

@alexgaganov-fiverr I wonder if this might also help #47 that you reported- if you could try it out too that would be super helpful.

I've made a note to try and get more time to tidy and improve the cache handling code. I'm hoping there may be some other structures in the client libs that can avoid a lot of the boilerplate and improve its behaviour.

pingles commented 6 years ago

If you want to try it, it'll be available on quay.io:

quay.io/uswitch/kiam:562b2e63e008a4733b71d1242804583ff127975a
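If you're running the server as a DaemonSet, something like this should switch it over (the kube-system namespace and the kiam-server container name here are assumptions; adjust them to match your manifests):

    # point the kiam-server DaemonSet at the branch build
    kubectl -n kube-system set image daemonset/kiam-server \
      kiam-server=quay.io/uswitch/kiam:562b2e63e008a4733b71d1242804583ff127975a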

kevtaylor commented 6 years ago

@pingles Rolled a cluster with that version today and it seems to be behaving itself; I'll keep an eye out for any issues.

pingles commented 6 years ago

Cool, thank you. Keep us posted!

dgeorgievski commented 6 years ago

My apologies for the long absence. I've been busy with work.

Since my last post I've been performing regular tests with mixed results. I have yet to run a test with the image mentioned above; I'll try that next. I do, however, have some comments on the caching of IAM credentials.

From the limited experience I've had with kiam and kube2iam managing pods' access privileges in AWS, I can say that caching can improve the reliability of the managed services, but only under certain conditions.

In case of a conflict (a restarted pod appears to have a different IAM annotation than the cached one), I think the pod cache should discard its current entry and refresh the IAM role for the pod instead of throwing assume role errors. In other words, the pod always wins when asserting the correct role for it.

pingles commented 6 years ago

@dgeorgievski thanks. Please do try the updated image, as it fixed a bug which did result in the behaviour described.

Interesting idea about the state update, though. I'll have a think on it.