uswitch / kiam

Integrate AWS IAM with Kubernetes
Apache License 2.0

Run kiam-server as deployment instead of daemonset #245

Open mikhailadvani opened 5 years ago

mikhailadvani commented 5 years ago

The manifests included in the deploy folder, as well as the stable/kiam helm chart, both run the kiam-server as a daemonset. When the kiam-server pods run on a dedicated set of nodes, we are left with nodes running only daemonsets, which makes recycling those nodes through cordoning/draining difficult.

Proposed Solution:

thoughts/concerns?

meringu commented 5 years ago

What do you find difficult about draining nodes? There is a flag kubectl drain <node> --ignore-daemonsets, which will drain the node successfully, but leave daemonsets running. The node can then safely be "recycled", in our case terminated.

iverberk commented 5 years ago

We are hitting a similar issue. The trouble is not with the draining, which works fine, but with the agents not detecting quickly enough that the kiam servers have been removed. We are upgrading our workers in a blue/green fashion, with two ASGs. During the drain, the kiam-servers are left untouched because they are running as a daemonset. Of course, on the new nodes they are automatically started. The trouble starts when the old EC2 nodes are terminated: the kiam-servers that were running on those nodes are hard-killed, and the kiam-agents seem to have trouble finding the new servers for a very long time.

We've switched the kiam-server from a daemonset to a deployment and applied anti-affinity rules to force the pods onto different nodes. There are no observable errors anymore during the upgrade.
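
For anyone wanting to try the same, a minimal sketch of such a deployment is below. The labels, namespace, replica count and image tag are only illustrative, and the remaining kiam-server flags (TLS certs etc.) are omitted:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: kiam-server
  namespace: kube-system
spec:
  replicas: 3
  selector:
    matchLabels:
      app: kiam-server
  template:
    metadata:
      labels:
        app: kiam-server
    spec:
      affinity:
        podAntiAffinity:
          # keep each replica on a different node so losing one node
          # never takes out more than one server
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchLabels:
                app: kiam-server
            topologyKey: kubernetes.io/hostname
      containers:
      - name: kiam-server
        image: quay.io/uswitch/kiam:v3.5   # tag is an example
        args: ["server"]                   # TLS/cert and other server flags omitted for brevity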

However, this still leaves the issue of failing EC2 nodes. Either we are not configuring the agents properly to detect server failure and fail over to the available ones, or the connectivity handling in the agents is just not coping with it. Either way, I would love to get some more insight/help on this.

meringu commented 5 years ago

I see. As kiam-agent communicates with the kiam-server through a Kubernetes service, the kiam-server pods need to be removed.

kiam-agent will also reach a different kiam-server each time, not necessarily on the same node.

Any reason Kiam server needs to run as a daemonset? The only reason I can think of is that the pods should scale proportionally with the cluster, but this could be solved more elegantly with an autoscaler. Changing Kiam server to run as a deployment would not only improve reliability, but also reduce cost for larger clusters.

iverberk commented 5 years ago

It seems that the load-balancing via the service is not working properly, at least in our tests. When the kiam-server pods get a hard termination via EC2 node removal, it looks like they are not removed immediately from the service endpoints or the agent cache(?). The agents start spitting out rpc 500 errors. Frankly, I'm puzzled that more users are not complaining about this. We are following all the default patterns for worker upgrades.

I don't think the daemonset is the right abstraction to deal with the scaling; that could indeed be solved far better by autoscaling the replicas based on some resource metric. Otherwise it's just a waste of resources and money.
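
To make that concrete: once the server is a deployment, scaling on a resource metric could be a plain HorizontalPodAutoscaler. This is only a sketch; the names and the 70% CPU target are arbitrary examples:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: kiam-server
  namespace: kube-system
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: kiam-server
  minReplicas: 3       # keep a baseline for availability
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70   # arbitrary example target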

pingles commented 5 years ago

I'd welcome a PR to change it to use Deployments.

Having said that, the gRPC client-side load-balancing uses DNS and so you'd need to make sure the agent resolves all individual IP addresses (rather than a Service IP address). If that happens then the clients should be continually resolving to update their pool. That's why the manifests set the service with clusterIP: None. I don't know whether that covers your use-case but worth checking?
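
For reference, the headless service pattern being described looks roughly like this (labels and port names are illustrative):

apiVersion: v1
kind: Service
metadata:
  name: kiam-server
  namespace: kube-system
spec:
  clusterIP: None        # headless: DNS returns the individual pod IPs, not a single service IP
  selector:
    app: kiam-server     # label is illustrative
  ports:
  - name: metrics
    port: 9620
    targetPort: 9620
  - name: grpc
    port: 443
    targetPort: 443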

iverberk commented 5 years ago

@pingles I'm not quite sure what you mean. We start the agent with --server-address=kiam-server:443. kiam-server in this case is the service in Kubernetes, which has the following properties:

NAME          TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)            AGE
kiam-server   ClusterIP   None         <none>        9620/TCP,443/TCP   29d

It shows three endpoints. Is there something we need to change or reconfigure to allow the kiam-agent to resolve all the IPs of the kiam-server pods? We need to do further testing, but it seems that 'hard' removing a kiam-server causes trouble for the kiam-agent.

meringu commented 5 years ago

Deployment vs daemonset is just a scheduling thing. It shouldn't change the networking as we are not using nodePort.

iverberk commented 5 years ago

I'm not sure we are talking about the same thing. Our problem probably deserves a separate issue; it's not a matter of scheduling anymore. I just want to make sure that we are properly connecting the agents to the servers. It seems that we are doing this correctly by providing the kiam-server service to the agents. They should resolve the IP addresses from that and load-balance across the servers. Switching from daemonset to deployment definitely benefited us in that we no longer see any agent failures during upgrades.

meringu commented 5 years ago

Sorry, let me try to be clearer, @iverberk.

I agree with you: the change from daemonsets to deployments shouldn't impact the networking.

daviddyball commented 5 years ago

Just to throw my 2 cents in here.

We recently started rolling our master nodes to upgrade our K8s version, and it took out a large chunk of services as it did so, because the kiam-server Service took N seconds to register that kops had deleted the node via EC2 and that the endpoint was invalid. During this window we happened to have a bunch of services refreshing their credentials, and the agents were getting connection errors and returning an empty role to our apps.

The process we used was a simple kops rolling-update cluster --yes and it did 1/3 of our master nodes before we started seeing issues. This led me down the Daemonset vs. Deployment path and I wound up here.

I'll raise a PR to make kiam-server a Deployment now if there is nothing preventing this on kiam's side.

pingles commented 5 years ago

Nope, nothing preventing it. The only constraint is that we want to ensure cluster operators consider how to segregate nodes/workloads so that the Server processes run apart from user workloads.
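
As an example of what that segregation might look like in the new deployment's pod spec (the master label/taint here is just one option; any dedicated node group would work):

spec:
  template:
    spec:
      # example: pin the servers to master nodes, away from user workloads
      nodeSelector:
        node-role.kubernetes.io/master: ""
      tolerations:
      - key: "node-role.kubernetes.io/master"
        operator: "Exists"
        effect: "NoSchedule"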

Thanks in advance for your PR! 😄

daviddyball commented 5 years ago

PR here

thejasbabu commented 5 years ago

The networking issue that @iverberk is facing might be because of gRPC load-balancing in Kubernetes with a headless service.

The kiam client needs to register with the Kubernetes API and listen for server pod status changes. kuberesolver is something we use to solve this issue.

mariusmarais commented 4 years ago

Just installed using the deployment YAML and the following tolerations are missing for running on masters:

spec:
  template:
    spec:
      tolerations:
      - key: "node-role.kubernetes.io/master"
        effect: "NoSchedule"
        operator: "Exists"
      # the tolerations below are added to daemonsets automatically, so for a
      # deployment we have to add them by hand:
      # https://kubernetes.io/docs/concepts/workloads/controllers/daemonset/#taints-and-tolerations
      - key: "node.kubernetes.io/not-ready"
        effect: "NoSchedule"
        operator: "Exists"
      - key: "node.kubernetes.io/unreachable"
        effect: "NoSchedule"
        operator: "Exists"
      - key: "node.kubernetes.io/disk-pressure"
        effect: "NoSchedule"
        operator: "Exists"
      - key: "node.kubernetes.io/memory-pressure"
        effect: "NoSchedule"
        operator: "Exists"
      - key: "node.kubernetes.io/unschedulable"
        effect: "NoSchedule"
        operator: "Exists"
      - key: "node.kubernetes.io/network-unavailable"
        effect: "NoSchedule"
        operator: "Exists"

Daemonsets automatically have extra tolerations added, so we need to add them to the deployment manually if we want the same behavior, see https://kubernetes.io/docs/concepts/workloads/controllers/daemonset/#taints-and-tolerations

Depending on the exact goals for moving to deployments, some of them can probably be removed, but I needed not-ready and unreachable on my new cluster.

Otherwise it seems to be working \o/