uswitch / kiam

Integrate AWS IAM with Kubernetes
Apache License 2.0

On-the-fly agent replacement #202

Open max-lobur opened 5 years ago

max-lobur commented 5 years ago

Environment: AWS, k8s 1.10, kops, calico v2.6.7, kiam helm chart from this PR: https://github.com/helm/charts/pull/9262, quay.io/uswitch/kiam:v3.0-rc1.

We ran into an issue when doing a kiam upgrade:

The server part went smoothly. After the agent part, the new agent pod became invisible to all existing pods: their metadata calls were not intercepted by the kiam agent, and they were getting the bare node role instead of the one set in the annotation. This condition persisted for at least 30 minutes; after that I re-created all affected pods and they started to work normally, seeing the annotated role.

In other words, the new kiam agent pod only works for pods created after it.

Right now we are doing node replacement to rotate an agent. Am I missing something in the docs? Is this expected, and can it be avoided?

This also means that a crashed kiam agent pod effectively means a crashed node, so we have to add custom node health checks and the like to make sure we do not get into this condition.

pingles commented 5 years ago

Thanks for reporting the problems you had. The agent command has a flag that will have it insert an iptables rule to intercept metadata traffic and it also removes it by default when the process terminates (which could leave a window where the iptables rule isn't inserted).

One option is to change your node configuration to insert the iptables rule as part of the system initialisation process. I'd also happily take a PR for an additional agent flag for something like --no-iptables-removal-on-termination to prevent Kiam from dropping the rule on termination.
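
For illustration, a minimal sketch of such a boot-time helper, written in Go since that's the project's language. The interface pattern (cali+), agent port (8181) and DNAT target are assumptions here and should match however your agent is actually configured:

```go
// Hypothetical boot-time helper (e.g. run from a systemd unit before the
// kubelet starts): inserts the metadata interception rule so there is no
// window where pod traffic reaches the real metadata API while the agent
// pod is being replaced.
package main

import (
	"log"
	"os"
	"os/exec"
)

func main() {
	hostIP := os.Getenv("HOST_IP") // node address the kiam agent listens on
	if hostIP == "" {
		log.Fatal("HOST_IP must be set")
	}

	spec := []string{
		"-d", "169.254.169.254/32", // EC2 metadata API
		"-i", "cali+", // pod-facing interfaces only; host traffic is untouched
		"-p", "tcp", "--dport", "80",
		"-j", "DNAT", "--to-destination", hostIP + ":8181",
	}

	// -C checks whether the rule already exists, so the helper stays idempotent.
	if exec.Command("iptables", append([]string{"-t", "nat", "-C", "PREROUTING"}, spec...)...).Run() == nil {
		return
	}
	out, err := exec.Command("iptables", append([]string{"-t", "nat", "-A", "PREROUTING"}, spec...)...).CombinedOutput()
	if err != nil {
		log.Fatalf("failed to insert rule: %v: %s", err, out)
	}
}
```

Because the rule matches only the pod-facing interfaces, the node's own processes keep direct access to the instance role.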

As for pods failing for longer, that feels like an issue we've seen in the past with poor application behaviour: the standard AWS SDKs appear to cause exceptions/errors as soon as they're unable to access the metadata API (or the kiam agent) even if they have credentials that remain valid. Not all applications have failure handling in the right places (so a call to send an SQS message could fail because the credentials are no longer valid but the application won't recreate an authenticated session, for example). Could you confirm whether this is a plausible explanation for what you experienced?
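
As a sketch of the kind of failure handling described above (assuming aws-sdk-go v1; the error codes and queue URL are purely illustrative), an application can drop its cached credentials and retry once when a call fails with an auth error, forcing the SDK to re-request credentials through the metadata path (i.e. via the kiam agent):

```go
// Illustrative only: retry an SQS send once with freshly fetched credentials
// when the cached ones stop working. Expire() marks the cached credentials
// stale so the next request goes back through the metadata path.
package main

import (
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/awserr"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/sqs"
)

func sendWithRefresh(sess *session.Session, queueURL, body string) error {
	client := sqs.New(sess)
	input := &sqs.SendMessageInput{
		QueueUrl:    aws.String(queueURL),
		MessageBody: aws.String(body),
	}
	_, err := client.SendMessage(input)
	if aerr, ok := err.(awserr.Error); ok {
		switch aerr.Code() {
		case "ExpiredToken", "ExpiredTokenException", "AccessDenied":
			sess.Config.Credentials.Expire() // discard cached credentials
			_, err = client.SendMessage(input)
		}
	}
	return err
}

func main() {
	sess := session.Must(session.NewSession())
	// Hypothetical queue URL, purely for illustration.
	if err := sendWithRefresh(sess, "https://sqs.eu-west-1.amazonaws.com/123456789012/example", "hello"); err != nil {
		log.Fatal(err)
	}
}
```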

Based on what you describe, it sounds like setting the iptables rules on node boot and preventing kiam from deleting them would help, although it probably won't guarantee you never see the problem again. Node replacement will guarantee you won't see the problem (and doesn't require application changes); alternatively, I think it's probably a case of reviewing the failure handling within applications.

max-lobur commented 5 years ago

@pingles ++ on system-level rule installation. Will implement that.

This is definitely not an app issue. We have used those apps for half a year already on kiam v2, with zero issues. I ran into this once I upgraded kiam to v3-rc1 and replaced the agent on a running node.

Steps to reproduce:

A system-level rule should close all the gaps anyway.

Worth mentioning in kiam docs?

pingles commented 5 years ago

My question was why those application pods aren't obtaining new roles once the rule was re-added: the iptables update should apply immediately for all traffic. Yes restarting the pods fixed it (because they'd restart and presumably recreate any AWS sessions within their client libs) but I'm not sure it's because the iptables interception wasn't taking effect.

As you say, a system-level rule should address it, though.

Yep, it'd be good to mention it in the docs somewhere; maybe it's worth starting something around production deployments in a ./docs/PRODUCTION_OPS.md (pulling together stuff on setting up IAM, why we run servers and agents on different nodes, inserting iptables rules on the host, etc.)? @uswitch/cloud any other stuff worth mentioning?

max-lobur commented 5 years ago

I will add more details once I have them. The app keeps trying to re-authenticate, but keeps getting 403 because, as far as I can see, it is doing so with the node role. This is an interception issue: initially it was the app role, then it changed to the node role quickly (when I killed the agent), and it never changed back to the app role.
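
For anyone debugging the same thing, a quick way to confirm which role the pod's current credentials resolve to is to call sts:GetCallerIdentity; a minimal sketch assuming aws-sdk-go v1 (nothing here is kiam-specific):

```go
// Prints the ARN the current credentials map to. If interception is broken,
// this shows the node's instance role instead of the annotated pod role.
package main

import (
	"fmt"
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/sts"
)

func main() {
	sess := session.Must(session.NewSession())
	out, err := sts.New(sess).GetCallerIdentity(&sts.GetCallerIdentityInput{})
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(aws.StringValue(out.Arn)) // e.g. arn:aws:sts::...:assumed-role/<role>/...
}
```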

I'd love to see PRODUCTION_OPS, I can contribute too.

bboreham commented 5 years ago

We had some issues on kiam agent restart: it appears that if a credentials request goes through to the real AWS metadata service while the iptables rule is absent, the pod gets the wrong credentials with an expiry time of 6 hours, so it will not re-request via kiam (using the AWS Go client).

+1 on moving the rule out of pod startup to system level, and +10 on documenting this.

pingles commented 5 years ago

> +1 on moving the rule out of pod startup to system level, and +10 on documenting this.

Could you maybe open a separate issue for just this? I.e. not removing the --iptables flag, but maybe creating a shell script that sets up the iptables rule and referencing it in the README? (Also totally open to better suggestions.) It'd be great to have something in.

bboreham commented 5 years ago

I (or one of my colleagues) will probably do a PR.

I'm unclear why you think it is a separate issue - the window during which requests are not intercepted seems to me to be the main issue.

pingles commented 5 years ago

You're totally right, ignore me. Please get a PR up and we can close this 😄

Nuru commented 5 years ago

I'm in favor of the --no-iptables-removal-on-termination option, even as a default, because it is failsafe from a security perspective as well as a possible solution to this problem. The promise of kiam is that it prevents the pod from assuming the instance role. If it fails, I would rather have no credentials available than be suddenly given access to the instance's credentials.

Nuru commented 5 years ago

Plus I would also like to see documentation (not just a link to Go code) on how to set up the iptables rules on the nodes outside of the kiam-agent, particularly in a way that allows the instance itself to have access to its instance role.

pingles commented 5 years ago

> I'm in favor of the --no-iptables-removal-on-termination option, even as a default, because it is failsafe from a security perspective as well as a possible solution to this problem. The promise of kiam is that it prevents the pod from assuming the instance role. If it fails, I would rather have no credentials available than be suddenly given access to the instance's credentials.

Thankfully @theatrus contributed this in #253.

Agreed on documentation for configuring on nodes. We'd be super happy to have more docs/production operations notes contributed.

project0 commented 4 years ago

> I'm in favor of the --no-iptables-removal-on-termination option, even as a default, because it is failsafe from a security perspective as well as a possible solution to this problem. The promise of kiam is that it prevents the pod from assuming the instance role. If it fails, I would rather have no credentials available than be suddenly given access to the instance's credentials.

Well, that prevents clients from retrieving the wrong credentials, which is already better than before. But updating the agents can still cause a short downtime and errors for some apps; I wish there were a good solution for updating with zero downtime. Does anyone have ideas regarding this? Currently we have disabled rolling updates and replace all nodes to force the update, so they can apply the new DaemonSet config.

Nuru commented 4 years ago

> > I'm in favor of the --no-iptables-removal-on-termination option, even as a default, because it is failsafe from a security perspective as well as a possible solution to this problem. The promise of kiam is that it prevents the pod from assuming the instance role. If it fails, I would rather have no credentials available than be suddenly given access to the instance's credentials.
>
> Well, that prevents clients from retrieving the wrong credentials, which is already better than before. But updating the agents can still cause a short downtime and errors for some apps; I wish there were a good solution for updating with zero downtime. Does anyone have ideas regarding this?

The apps should be caching the credentials rather than requesting them on every call, and the credentials should be good for an hour at least. Credential caching is supported by all the AWS SDKs as far as I know. So the solution is to fix the apps. I know that is not necessarily a practical solution, but I think the fact that credential caching is expected behavior means it is not worth trying to create some setup where the new agent comes up on the same instance as the old agent and takes over the local metadata network connection before the old agent shuts down. That is not something Kubernetes supports.
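
To illustrate the caching point (a sketch assuming aws-sdk-go v1; the 10-minute window is an arbitrary example), the EC2 role credentials provider can be given an ExpiryWindow so it refreshes well before the cached credentials lapse, which makes a brief agent restart largely invisible to the application:

```go
// Build credentials that are cached and refreshed ExpiryWindow before they
// expire, rather than being re-requested on every API call.
package main

import (
	"fmt"
	"time"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/credentials"
	"github.com/aws/aws-sdk-go/aws/credentials/ec2rolecreds"
	"github.com/aws/aws-sdk-go/aws/ec2metadata"
	"github.com/aws/aws-sdk-go/aws/session"
)

func main() {
	sess := session.Must(session.NewSession())

	creds := credentials.NewCredentials(&ec2rolecreds.EC2RoleProvider{
		Client:       ec2metadata.New(sess),
		ExpiryWindow: 10 * time.Minute, // refresh 10 minutes before expiry
	})

	cfg := aws.NewConfig().WithCredentials(creds)
	_ = cfg // pass cfg to service clients, e.g. sqs.New(sess, cfg)

	v, err := creds.Get()
	fmt.Println(v.ProviderName, err)
}
```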