kubernetes-retired / kube-aws

[EOL] A command-line tool to declaratively manage Kubernetes clusters on AWS

Moving ETCD and controllers outside CloudFormation Nested Stacks #1112

Closed: camilb closed this issue 5 years ago

camilb commented 6 years ago

Hi @mumoshu, I want to discuss some possible changes to our CloudFormation usage, especially regarding nested stacks and DesiredCapacity.

I had all the clusters upgraded to 1.9.1 last week. All went fine, except on the last cluster, where the last nested stack (a node pool) failed to upgrade and started a rollback. The upgrade failed because a new instance was added but didn't respond in any way (no ssh, no ping, nothing in the AWS console's "Get System Logs"). Since CF expects a signal from the launched instances, the quickest solution was to terminate the instance from the console, but the ASG didn't launch a new one, so I increased the ASG capacity to be able to receive the signal before the timeout. Right after that I received an error in CloudFormation: "New SetDesiredCapacity value 20 is below min value 21 for the AutoScalingGroup", and the rollback started, rolling back all the node pools, controllers and etcd for ~2h. It's not the first time something like this has happened: once CloudFormation reported that a new instance was added but it didn't show up in the console, so I had to change the ASG manually to avoid the signal timeout.

On our CF stacks we never use DesiredCapacity for ASGs, because they can be resized due to a traffic increase and CF will roll back the update if the size differs. Regarding the nested stacks, I think etcd and controllers should be in separate stacks. If the update fails on workers, we should not roll back the etcd and controllers that were successfully upgraded. Sometimes a rollback will not even work for etcd or controllers between specific versions. For example, on test clusters I was unable to revert from etcd 3.2.x to etcd 3.0.x, or from Kubernetes 1.9.x to 1.7.x, and recovering required repeating the upgrade. Workers are quite safe to roll back.

Also, the rollback process for nested stacks takes a very long time for large clusters with multiple node pools. I know that using nested stacks is much cleaner, but rolling back everything whenever an update touches the etcd or Kubernetes version means repeating the upgrade to recover the cluster, causing a very long downtime.

I don't know how we can remove DesiredCapacity from CF right now, but at least maybe we can move etcd and controllers out of the nested stacks. What do you think?

Fsero commented 6 years ago

@camilb I've been maintaining and testing several k8s clusters using kube-aws for the last year and I agree with your idea. It would be good if we could pin a specific k8s version and AMI ID for the control plane and etcd, and another one for workers. That would tremendously help upgrades and updates.

redbaron commented 6 years ago

AFAIK the amiId can already be pinned per node pool

mumoshu commented 6 years ago

@camilb Thanks a lot for sharing your hard-won experience.

Yes - I believe this should be "fixed" somehow.

On our CF stacks we never use DesiredCapacity for ASGs, because they can be resized due to a traffic increase and CF will roll back the update if the size differs.

I agree DesiredCapacity should not be used anywhere. Actually, as far as I can remember, no kube-aws-generated stack template has included DesiredCapacity since #142.
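Not kube-aws code, but a minimal aws-sdk-go sketch of the kind of pre-update check this implies: read the ASG's live sizes and compare them with whatever MinSize the new template would set, so a conflict like the "SetDesiredCapacity value 20 is below min value 21" error above is caught before CloudFormation starts rolling back. The ASG name and template MinSize below are placeholders.

```go
package main

import (
	"fmt"
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/autoscaling"
)

func main() {
	svc := autoscaling.New(session.Must(session.NewSession()))

	// Placeholder values, for illustration only.
	const templateMinSize = 21 // MinSize the new stack template would set
	asgName := "nodepool-a-workers"

	out, err := svc.DescribeAutoScalingGroups(&autoscaling.DescribeAutoScalingGroupsInput{
		AutoScalingGroupNames: aws.StringSlice([]string{asgName}),
	})
	if err != nil {
		log.Fatal(err)
	}
	for _, asg := range out.AutoScalingGroups {
		fmt.Printf("%s: min=%d desired=%d max=%d\n",
			aws.StringValue(asg.AutoScalingGroupName), *asg.MinSize, *asg.DesiredCapacity, *asg.MaxSize)
		if *asg.DesiredCapacity < templateMinSize {
			// Warn instead of letting the stack update discover the conflict mid-rollout.
			log.Printf("live desired capacity %d is below the template MinSize %d; the update would likely fail",
				*asg.DesiredCapacity, templateMinSize)
		}
	}
}
```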

Sometimes a rollback will not even work for etcd or controllers between specific versions. For example, on test clusters I was unable to revert from etcd 3.2.x to etcd 3.0.x, or from Kubernetes 1.9.x to 1.7.x, and recovering required repeating the upgrade.

This is really worth noting!

So, we have two things to be considered separately, right?

  1. How to prevent uselessly rolling back updates to etcd and controller nodes when the cause of the failure was the worker nodes' config or AWS (possibly transient failures).
  2. How to prevent rolling back an upgrade that involves backward-incompatible change(s), especially for controller and etcd nodes.

For 1., we could split the root stack into three, as you've suggested. A downside would be that we may need another workflow-engine-like system to reliably update etcd, then controller, and finally worker nodes in this specific order. Otherwise, if we don't need reliability here, we could just do it in the kube-aws command while instructing the user to re-run kube-aws update in case kube-aws fails in the middle of an update.
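For illustration only, a minimal aws-sdk-go sketch of that "just do it in the kube-aws command" option: update three hypothetical stacks in a fixed order, wait for each to complete, and stop on the first failure instead of rolling anything back. Stack names, template URLs and capabilities are placeholders, not anything kube-aws defines today.

```go
package main

import (
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/cloudformation"
)

func main() {
	cfn := cloudformation.New(session.Must(session.NewSession()))

	// Hypothetical split into three stacks, updated in dependency order.
	stacks := []struct{ name, templateURL string }{
		{"mycluster-etcd", "https://s3.example.com/mycluster/etcd.json"},
		{"mycluster-control-plane", "https://s3.example.com/mycluster/control-plane.json"},
		{"mycluster-nodepool-a", "https://s3.example.com/mycluster/nodepool-a.json"},
	}

	for _, s := range stacks {
		_, err := cfn.UpdateStack(&cloudformation.UpdateStackInput{
			StackName:    aws.String(s.name),
			TemplateURL:  aws.String(s.templateURL),
			Capabilities: aws.StringSlice([]string{"CAPABILITY_NAMED_IAM"}),
		})
		if err != nil {
			// Stop here rather than rolling back the stacks that already succeeded;
			// the user fixes the problem and re-runs the command.
			log.Fatalf("update of %s failed: %v", s.name, err)
		}
		if err := cfn.WaitUntilStackUpdateComplete(&cloudformation.DescribeStacksInput{
			StackName: aws.String(s.name),
		}); err != nil {
			log.Fatalf("update of %s did not complete: %v", s.name, err)
		}
	}
}
```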

An alternative approach would be emitting a validation error when a user tries to update the whole cluster in one shot.

For 2., we'd need some domain-specific logic to determine whether an update is backward-incompatible or not. Otherwise, we can give up automating it and just allow users to disable rollback via a command-line flag or a specific setting in cluster.yaml.

Regarding amiId, yea - as @redbaron noted, we can configure it per pool. Would it be a matter of documentation?

mumoshu commented 6 years ago

Btw, my take on this problem is improving kube-aws by doing all of the below:

What are your thoughts? Thanks!

camilb commented 6 years ago

Hi @mumoshu, thanks for your fast response. In most upgrades I prefer not to touch the etcd nodes. Since 07/2016, when I started using kube-aws, I haven't had any issue with etcd on any cluster, and I prefer to update it less often than Kubernetes. So ideally I'm thinking of having 3 separate stacks, with the possibility to update etcd and Kubernetes separately.

Something like:

  1. kube-aws update etcd for etcd upgrades and kube-aws update to upgrade controllers and workers, or
  2. kube-aws update to update everything (with a warning), plus kube-aws update etcd, kube-aws update controllers and kube-aws update workers. In the future we could benefit from a command like kube-aws upgrade workers for people who choose to use AWS's EKS for controllers and etcd.

Also, maybe we can add a bool in cluster.yaml like RollbackOnFailure to control the "Rollback on failure" CF stack option. I have nothing against the rollback when something fails, but this option would allow the user to control when to run the rollback. In my case, only the last node pool failed and it already had the instances running, except one, which was not an issue.
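To make the idea concrete, a hedged sketch of how such a flag might be plumbed through: rollbackOnFailure is the hypothetical cluster.yaml field discussed here, and DisableRollback is the existing CreateStack option it would map to (how the same behaviour would be offered for stack updates is a separate question). Names below are placeholders.

```go
package main

import (
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/cloudformation"
)

type ClusterConfig struct {
	// Hypothetical cluster.yaml field being proposed in this thread.
	RollbackOnFailure bool `yaml:"rollbackOnFailure"`
}

func createStack(cfg ClusterConfig, name, templateURL string) error {
	cfn := cloudformation.New(session.Must(session.NewSession()))
	_, err := cfn.CreateStack(&cloudformation.CreateStackInput{
		StackName:   aws.String(name),
		TemplateURL: aws.String(templateURL),
		// When the user opts out of automatic rollback, leave the failed stack
		// (and its instances) around for inspection and a later manual rollback.
		DisableRollback: aws.Bool(!cfg.RollbackOnFailure),
		Capabilities:    aws.StringSlice([]string{"CAPABILITY_NAMED_IAM"}),
	})
	return err
}

func main() {
	cfg := ClusterConfig{RollbackOnFailure: false}
	if err := createStack(cfg, "mycluster", "https://s3.example.com/mycluster/stack.json"); err != nil {
		log.Fatal(err)
	}
}
```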

Fsero commented 6 years ago

thanks @redbaron, I hadn't noticed it :beer:

mumoshu commented 6 years ago

@camilb Thanks!

Would you also mind sharing your thoughts on how we can trigger a rollback when the new worker nodes fail due to the preceding updates to controller nodes?

Do you just leave already-updated controller nodes as-is and re-run the updates to worker nodes (with appropriate changes in worker config so that new workers can adapt to the recent changes in controller nodes)?

camilb commented 6 years ago

Would you also mind sharing your thoughts on how we can trigger a rollback when the new worker nodes fail due to the preceding updates to controller nodes?

@mumoshu When Rollback on failure=No and a stack fails, a user can trigger the rollback manually from the AWS console: Actions ==> Rollback Stack.

If the workers are in a separate stack, we can roll back the workers only, as we can use controllers with a greater version (I remember up to 3 releases) until we fix the issues and update the workers stack again.

mumoshu commented 6 years ago

@camilb Thanks! I believe I understood that part.

Excuse me, but what I wanted to sync up on was this:

In that case, as the stack for the controller nodes was successfully updated, you can't trigger a manual rollback, right?

camilb commented 6 years ago

@mumoshu I understand; unfortunately the only option in that case is to fix the controllers stack and update it again. But I think it can be observed as soon as the controllers finish the update: the existing workers will start to fail. If the stacks can be upgraded with separate commands, like on GKE where you have the option to update the controllers, all the workers, or just a node pool (gcloud container clusters upgrade --master, gcloud container clusters upgrade --cluster-version and gcloud container clusters upgrade --node-pool), then maybe we can add an option for the controllers CF stack to fail if existing nodes go NotReady. That way we can trigger the rollback for controllers before starting the workers upgrade.
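As a rough illustration of the "fail if existing nodes go NotReady" check (not an existing kube-aws feature), something like the following client-go sketch could run right after the controller stack finishes updating. The kubeconfig path is a placeholder and a recent client-go API is assumed.

```go
package main

import (
	"context"
	"fmt"
	"log"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Placeholder kubeconfig path; a controller-side hook would likely use in-cluster config.
	config, err := clientcmd.BuildConfigFromFlags("", "/home/user/.kube/config")
	if err != nil {
		log.Fatal(err)
	}
	clientset, err := kubernetes.NewForConfig(config)
	if err != nil {
		log.Fatal(err)
	}

	nodes, err := clientset.CoreV1().Nodes().List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		log.Fatal(err)
	}

	notReady := 0
	for _, n := range nodes.Items {
		for _, c := range n.Status.Conditions {
			if c.Type == corev1.NodeReady && c.Status != corev1.ConditionTrue {
				notReady++
				fmt.Printf("node %s is not ready: %s\n", n.Name, c.Message)
			}
		}
	}
	if notReady > 0 {
		// Signal failure so the controller stack update can be stopped or rolled back
		// before the worker stacks are touched.
		log.Fatalf("%d node(s) went NotReady after the controller update", notReady)
	}
}
```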

mumoshu commented 6 years ago

@camilb Thanks for the confirmation. I'd like to achieve a similar UX at least.

In any circumstance, there shouldn't be a surprise such as "why are my etcd/controller nodes being updated even though my changes in cluster.yaml are solely for workers?". Unfortunately this is the UX we have now. Let's take some action!

mumoshu commented 6 years ago

My updated suggestions for improvements:

Use-cases:

In any case, a spare cluster for faster disaster recovery or blue-green cluster deployment (my preference) is recommended.

Down-sides:

mumoshu commented 6 years ago

Perhaps we need to move VPC and subnet definitions from controlplane stack to the root stack?

mumoshu commented 6 years ago

@camilb @Fsero How about the above idea? To me, it seems to provide a smoother migration/development path while achieving your original goal. @redbaron @c-knowles @danielfm Hi! Do you have any comments? (A backward-incompatible change again!)

cknowles commented 6 years ago

Sounds reasonable to align to the UX of gcloud, we use that as well and it seems to work well.

I'm not keen on the overrides.json as it splits the config in two. I'd prefer to pin the AMI on cluster.yaml generation or in source and then add a command to update it using the existing code, i.e. fewer surprise updates but slightly more surprise if a user is expecting everything to auto-update.
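A hedged sketch of what such an "update the pinned AMI" helper might do: query EC2 for the newest matching image and print it so the user can pin it as amiId in cluster.yaml. The owner ID and name filter below are assumptions for Container Linux stable HVM images, not values taken from kube-aws.

```go
package main

import (
	"fmt"
	"log"
	"sort"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/ec2"
)

func main() {
	svc := ec2.New(session.Must(session.NewSession()))

	// Assumed owner ID and name pattern; adjust for the OS image the cluster actually uses.
	out, err := svc.DescribeImages(&ec2.DescribeImagesInput{
		Owners: aws.StringSlice([]string{"595879546273"}),
		Filters: []*ec2.Filter{
			{Name: aws.String("name"), Values: aws.StringSlice([]string{"CoreOS-stable-*-hvm"})},
		},
	})
	if err != nil {
		log.Fatal(err)
	}

	images := out.Images
	// Newest image first, by creation date.
	sort.Slice(images, func(i, j int) bool {
		return aws.StringValue(images[i].CreationDate) > aws.StringValue(images[j].CreationDate)
	})
	if len(images) > 0 {
		// Print the newest AMI ID so the user can pin it as amiId in cluster.yaml.
		fmt.Printf("latest: %s (%s)\n",
			aws.StringValue(images[0].ImageId), aws.StringValue(images[0].Name))
	}
}
```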

camilb commented 6 years ago

@mumoshu From my point of view your proposal is enough to avoid CF issues in the future. In the last one and a half years I didn't have any major issues with kube-aws except the CF updates, and almost every time those were on worker updates.

Add rollbackOnFailure for controller and etcd stack, respectively. We won't need one for workers?

I don't see any downside to adding rollbackOnFailure to all stacks, as the user can trigger the rollback whenever they want, but in some situations, in case of a failure, it allows user intervention first (resizing a pool, launching separate controllers/workers, etc.).

Perhaps we need to move VPC and subnet definitions from the control-plane stack to the root stack? I think it would be safer.

mumoshu commented 6 years ago

@c-knowles Thanks for your comment!

Yes, that would work too, as long as we give up having kube-aws itself automatically set the latest k8s version and the latest AMI ID at runtime of kube-aws update for ease of use.

So, we could also enhance kube-aws here by:

mumoshu commented 6 years ago

In case someone is actually relying on kube-aws update to automatically update the k8s and etcd versions and the AMI ID, I'd suggest writing a wrapper which perhaps runs sed to replace the version numbers automatically? 😄 Do you have something like that you could share with us?

mumoshu commented 6 years ago

Implementation note: it would be possible to detect which stack(s) are being updated by creating a cfn changeset and inspecting the result, so that we can emit a validation error accordingly.
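A minimal aws-sdk-go sketch of that idea, with placeholder stack/changeset names and template URL: create a changeset against the root stack and list which nested-stack resources (AWS::CloudFormation::Stack) would actually change, which is enough to emit a validation error when more stacks than expected are touched.

```go
package main

import (
	"fmt"
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/cloudformation"
)

func main() {
	cfn := cloudformation.New(session.Must(session.NewSession()))
	stack, changeSet := "mycluster", "kube-aws-preflight" // placeholder names

	_, err := cfn.CreateChangeSet(&cloudformation.CreateChangeSetInput{
		StackName:     aws.String(stack),
		ChangeSetName: aws.String(changeSet),
		TemplateURL:   aws.String("https://s3.example.com/mycluster/root.json"),
		Capabilities:  aws.StringSlice([]string{"CAPABILITY_NAMED_IAM"}),
	})
	if err != nil {
		log.Fatal(err)
	}
	if err := cfn.WaitUntilChangeSetCreateComplete(&cloudformation.DescribeChangeSetInput{
		StackName:     aws.String(stack),
		ChangeSetName: aws.String(changeSet),
	}); err != nil {
		log.Fatal(err)
	}

	out, err := cfn.DescribeChangeSet(&cloudformation.DescribeChangeSetInput{
		StackName:     aws.String(stack),
		ChangeSetName: aws.String(changeSet),
	})
	if err != nil {
		log.Fatal(err)
	}
	for _, c := range out.Changes {
		rc := c.ResourceChange
		// Nested stacks show up as AWS::CloudFormation::Stack resources in the root
		// stack's changeset, so this reveals which sub-stacks an update would touch.
		if aws.StringValue(rc.ResourceType) == "AWS::CloudFormation::Stack" {
			fmt.Printf("%s would be %s\n",
				aws.StringValue(rc.LogicalResourceId), aws.StringValue(rc.Action))
		}
	}
}
```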

cknowles commented 6 years ago

What about something like:

Not sure about including k8s and etcd versions, as then what about container image versions etc.? There's usually quite a bit more to updating than just those versions, and a lot of different incompatibilities, but the AMI is a little removed from that, other than the docker version.

mumoshu commented 6 years ago

Hi @c-knowles! I generally agree with the direction.

Not sure about including k8s and etcd versions, as then what about container image versions etc.? There's usually quite a bit more to updating than just those versions, and a lot of different incompatibilities, but the AMI is a little removed from that, other than the docker version.

Regarding etcd and k8s versions, I wanted to include them for auto-population as they do result in node replacement once you update the kube-aws binary, even if no cluster.yaml or stack templates are changed.

Add a new command to update cluster.yaml to latest version.

How would you like kube-aws to achieve it? There's no yaml parser capable of preserving blank lines and comments afaik :cry: I'm afraid that a simple text replacement would break in various ways.

mumoshu commented 6 years ago

Updated proposal for improving kube-aws for fewer surprises while upgrading

Use-cases:

In any case, a spare cluster for faster disaster recovery or blue-green cluster deployment (my preference) is recommended.

Down-sides:

whereisaaron commented 6 years ago

Interesting discussion and some good ideas. The ones that jumped out as great to me were:

  1. Being able to disable the automatic rollback: I've been doing a lot of deployments with 0.9.9, and when something is wrong, it becomes a race between me and CF to diagnose the root cause before CF destroys the evidence. I'd love the option to disable it sometimes.

  2. Inject amiId's into cluster.yaml on init: I agree with @c-knowles that the separate json file sounds clumsy and maybe mysterious to users. I'd rather explicit pinned AMIs be added to cluster.yaml, even if I have to manually update those from time to time. It would be a nice assist if kube-aws could gather and list the latest relevant amiId and k8s versions available to help me with that. Later, if a reliable YAML patcher becomes available, we could integrate the option to auto-update cluster.yaml.

  3. Separate stacks: Compared with the old node-pool system (circa 0.9.3), I love being able to have all the settings in one cluster.yaml. But I don't actually enjoy the nested CF stacks. It scares the bacon off me, when updating worker node tags or instance type or something, that I'm going anywhere near the control plane. And when deploying many times to work out the kinks with 0.9.9 configs, I wasted a lot of time watching the etcd cluster being slowly made and unnecessarily destroyed while I worked out kinks with the controller and worker deployment.

whereisaaron commented 6 years ago

@camilb I'd love to hear your experience with upgrading k8s versions in kube-aws clusters. It has never really been a topic in the kube-aws documentation and often it has been noted as 'not yet supported', 'won't include etcd', or 'might not work'.

Are you able to deploy 1.9 clusters with kube-aws 0.9.9?

Are you able to upgrade 1.8 clusters to 1.9 with kube-aws 0.9.9?

Do you just update the hypercube version and go for it?

Any tips or gotchas you can share or add to the docs somewhere?

mumoshu commented 6 years ago

Thanks for your feedback!

mumoshu commented 6 years ago

Do we need backward-compatibility for this, e.g. a flag to split stacks or not?

Also, running kube-aws update on existing clusters after this change would recreate the whole control plane and the kube-aws-managed VPC/subnets/etc., which indeed introduces downtime.

I'm ok with recreating every cluster cuz I consider my clusters cattle rather than pets. How about you, everyone? Any ideas on whether you need a better update path?

camilb commented 6 years ago

@whereisaaron I'm using the master branch to build kube-aws most of the time. Currently all of my clusters are running the latest changes in master. Some were upgraded from 1.8.x to 1.9.1 and the largest one from 1.7.8 to 1.9.1. Most of the time everything works fine, but I've had situations where I had to regenerate all the service account tokens and restart most of the pods in kube-system. Except for this and the CF errors, I never had another issue upgrading kube-aws in almost 2 years. We still don't have a method to upgrade only the Kubernetes version, so all the instances are replaced by kube-aws update. And the network is managed by a separate stack, similar to this one

camilb commented 6 years ago

@mumoshu I'm planning to migrate existing clusters to new ones, without updating.

whereisaaron commented 6 years ago

@mumoshu when you have a large collection of heterogeneous applications in a cluster, the clusters may be cattle, but the applications on them are like pet ticks on the cow: you have to locate and transplant them to the next beast 😄

I hope that at least minor (x.y.z) version in-place upgrades will be possible/supported, again trying to match the ease of upgrade GKE and the like give you. Separate stacks do make this a little easier, since node pools and controllers are pretty much cattle within a cluster.

whereisaaron commented 6 years ago

@camilb thank you for the pointers. @c-knowles had the same issue with Service Accounts when upgrading. Looking at kube-aws-secure, I am doing almost the identical thing to you, except I use terraform to create the VPCs, NAT, LBs, etc. and then kube-aws to add cluster(s) within the VPC. You use kube2iam though; does that mean you maintain IAM roles and kube2iam annotations to keep kube-aws add-ons like the node-drainer (#1105) and resources-auto-backup (#1084) working? They break for me with 0.9.9 as soon as I enable kube2iam, and even more breaks if you enable namespace restrictions (which is the default for kiam).

mumoshu commented 6 years ago

@whereisaaron Thanks for the comment, as always!

the applications on them are like pet ticks on the cow: you have to locate and transplant them to the next beast

Good point 👍

I'd rather split clusters into two or more groups, and connect services across clusters via e.g. NLB or a kind of service mesh then 😉 Sounds like going backward, but pretty much like what we had been doing to avoid complicating/exploiting a single EC2 instance too much in the good old days.

Also, I don't have very much confidence in rolling-updating the K8S version like GKE does yet, especially after I heard that GKE suddenly started restricting egress to VMs within the same zone after an update! I can't just bring my service down every time I step on an edge case (but is it really an edge?) like that.

But anyway -

I hope that at least minor (x.y.z) version in-place upgrades will be possible/supported, again trying to match the ease of upgrade GKE and the like give you.

Yeah, ease of upgrade is certainly what I would love, too. Let's keep the discussion going!

paalkr commented 6 years ago

Sorry for jumping into the discussion a little late.

I have been managing a kube-aws cluster for almost a year now, and I agree with most of what has been discussed so far. We are heavy users of spot fleets, and upgrading a spot fleet worker pool is especially messy. A pool of on-demand resources will (usually) perform a nice rolling update, replacing nodes at a pace that our deployments can cope with. Pods are moved around as nodes are terminated and added, one by one.

But an update of a spot fleet is not a very nice experience, due to how AWS CloudFormation integrates with spot fleets. Basically the complete fleet is terminated and a new one created, leading to all nodes being terminated simultaneously and basically removing all resources from the cluster.

In normal operation our on-demand worker pools have ZERO resources, and we run only spot fleet workers. We do have a good mix of zones and instance types to minimize the risk of too many resources being terminated. Usually we see a couple of terminations a week, and this has no effect on our stateless running pods :) They just move to a new node while the spot fleet replaces the terminated instance with another instance type/zone combo.

So to overcome the very bad way AWS CF replaces spot fleets I

whereisaaron commented 6 years ago

@paalkr I've had the same issues, and basically we made multiple spot fleet node pools, because back around 0.9.3 node pools could be upgraded individually, and independently of the control plane, since each was its own CF stack. That worked fine because we would upgrade and scale up the unused spot fleet node pool, then drain, scale down and upgrade the previous spot fleet node pool.

Now that it is all one stack it is more difficult to do this cleanly, though many settings are per-node-pool, so you can, fingers crossed, still upgrade one node pool at a time.

mumoshu commented 6 years ago

@camilb @whereisaaron I have been putting some time towards implementing this.

Well, is it really necessary to give up nested stacks just for this use-case?

My question here is basically: what if we render/upload the cfn templates just for the update targets (specified via the proposed --only flag)?

I guess that would just work for our use-case by updating the selected nested stack(s) only. Then we don't necessarily have to give up nested stacks, right?

paalkr commented 6 years ago

I don't have any strong feelings about how this is implemented, but it's important that it is possible (somehow) to ensure 100% that only a specified worker pool (and nothing else) is affected when performing a stack update.

mumoshu commented 6 years ago

@paalkr Thx!

Certainly.

Just curious though, what is "affected" and what is the definition of "100%" in this case?

It is definitely feasible to limit kube-aws update to actually update only the targeted stack, as proposed in this thread. But it doesn't mean other nodes are never affected.

Say you've updated just the etcd stack with wrong SGs - it is possible that the updated etcd cluster breaks the apiserver due to connection issues. The proposed solution in this thread alone forces you to manually roll back etcd in this case.

Of course you could implement every possible reverse healthcheck for every possible communication path, like "are all the apiserver pods on controller nodes able to communicate with me, an updated etcd node?", so that you can stop rolling etcd nodes in case of such a failure. It is, though, really cumbersome to implement, and nothing guarantees that we have all the possible reverse healthchecks implemented.
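For example, one such check could probe the apiserver's /healthz/etcd endpoint after rolling an etcd node, to confirm the apiserver can still reach the updated member. This is only a sketch, with a placeholder URL and all auth/TLS handling elided.

```go
package main

import (
	"fmt"
	"io/ioutil"
	"log"
	"net/http"
	"time"
)

func main() {
	// Placeholder apiserver URL; a real check needs proper client certs/tokens
	// and TLS configuration, which are omitted here.
	url := "https://controller.example.com:443/healthz/etcd"

	client := &http.Client{Timeout: 5 * time.Second}
	resp, err := client.Get(url)
	if err != nil {
		log.Fatalf("apiserver unreachable after etcd update: %v", err)
	}
	defer resp.Body.Close()

	body, _ := ioutil.ReadAll(resp.Body)
	if resp.StatusCode != http.StatusOK {
		// Treat this as a signal to stop rolling further etcd nodes.
		log.Fatalf("apiserver cannot reach the updated etcd: %d %s", resp.StatusCode, string(body))
	}
	fmt.Println("apiserver -> etcd looks healthy:", string(body))
}
```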

Revisiting what we have today: just updating the whole cfn stack along with all the nested ones has been freeing us from such concerns, because e.g. an update failure in a controller node triggered a rollback on etcd nodes as long as both were involved in a single kube-aws update.

So, from my viewpoint it seems like there's no perfect solution like the one you seem to expect. Therefore I'd like to ask you more about the expectations and reasonable trade-offs in your mind! Thx.

mumoshu commented 6 years ago

Hmm, I guess my specific example doesn't make sense in itself, but I hope it still conveys the substance of my concern.

paalkr commented 6 years ago

Sure. I'm mostly concerned about controlling which nested stack is updated, not the internals of the control-plane nested stack itself - I will never upgrade a cluster to a new K8s version anyway; that I will always do by creating a second cluster and redeploying everything.

My main problem is that updating just one pool separately is not currently possible, at least not with 100% confidence that another pool is not rolled back or changed.

mumoshu commented 6 years ago

@paalkr Thanks for the confirmation 👍 Noted. And yes, your practice for production K8S upgrades is what I'd recommend in most cases. In case you missed it, #455 adds more context on that topic.

whereisaaron commented 6 years ago

Nested stacks sounded like a great idea, but in practice it has some drawbacks:

When we had separate stacks I could make whatever changes to a node pool, confident the risk was contained to that highly disposable node pool. Nesting the control plane with the node pools makes the stakes much higher. I guess you could say 'git gud' and don't make mistakes 😄

It is really just about being able to change a node pool in isolation. Can we do that while keeping nested stacks, via some kube-aws option or masking of updates?

mumoshu commented 6 years ago

Thx for the detailed comment! I'm saying that we can still ensure only one stack is updated, without throwing away nested stacks :smile:

mumoshu commented 6 years ago

The implementation could be something like: kube-aws uploads just the re-rendered control-plane stack template to S3 when --only controller is provided, so that cfn updates the nested control-plane stack only.
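Roughly, with placeholder bucket/key/stack names and assuming the root stack already references the child templates by their S3 TemplateURL (as the nested stacks are wired today), the --only controller path could look like this sketch:

```go
package main

import (
	"bytes"
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/cloudformation"
	"github.com/aws/aws-sdk-go/service/s3"
)

func main() {
	sess := session.Must(session.NewSession())

	// Placeholder: the freshly rendered control-plane template; every other child
	// template is left untouched at its previously uploaded content.
	rendered := []byte(`{"Resources": {"...": "..."}}`)

	_, err := s3.New(sess).PutObject(&s3.PutObjectInput{
		Bucket: aws.String("my-kube-aws-bucket"),                          // placeholder bucket
		Key:    aws.String("mycluster/stack-templates/control-plane.json"), // placeholder key the root stack references
		Body:   bytes.NewReader(rendered),
	})
	if err != nil {
		log.Fatal(err)
	}

	// The root template itself is unchanged, so only the nested control-plane
	// stack (whose TemplateURL content just changed) should be updated.
	_, err = cloudformation.New(sess).UpdateStack(&cloudformation.UpdateStackInput{
		StackName:           aws.String("mycluster"),
		UsePreviousTemplate: aws.Bool(true),
		Capabilities:        aws.StringSlice([]string{"CAPABILITY_NAMED_IAM"}),
	})
	if err != nil {
		log.Fatal(err)
	}
}
```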

whereisaaron commented 6 years ago

Sounds good @mumoshu 👍

mumoshu commented 6 years ago

@camilb @redbaron @c-knowles @whereisaaron @paalkr

TL;DR: This is getting harder than I had thought 😉

Turns out this change requires us to split the stack into:

We need the network stack (named "infrastructure" before my edit 😉) for the managed VPC, IGW, NAT gateways, subnets, and security groups.

Security groups are especially important because, without moving them from the control-plane to a shared cfn stack, we have no way to "glue" the various SGs together without falling into cyclic dependencies between the stack inputs/outputs of the control-plane and etcd stacks.

This is similar to the reason why the worker node pool stack(s) currently don't include those networking-related resources but import them from the control-plane stack.

mumoshu commented 6 years ago

So, would anyone mind updating all the container images to those for the latest version of Kubernetes v1.9.x, so that we can cut the new kube-aws version without my huge change?

kevtaylor commented 6 years ago

@mumoshu Just to make you and anyone else aware of potential pitfalls regarding subpaths (https://github.com/kubernetes/kubernetes/issues/60813): there are some breaking changes in 1.9.4 and above which have prevented us from moving at present, mainly due to the limitations on subpaths detailed in https://github.com/kubernetes/kubernetes/issues/48677

mumoshu commented 6 years ago

@kevtaylor Thanks for the valuable info!

Would you suggest that the next kube-aws release should feature K8S v1.9.3?

Personally, I'm ok with that because I already apply the mitigation noted in https://github.com/kubernetes/kubernetes/issues/60813 😉

Prevent untrusted users from creating pods

kevtaylor commented 6 years ago

@mumoshu To be quite fair, I am not sure at the moment what our plan of action is regarding our init containers and subpaths, which all broke when I tried to go to 1.9.5 - all I can say is that for us 1.9.3 seems to be working and behaving with the latest kube-aws version, but this might be restrictive for other users who don't use the subpath features

mumoshu commented 6 years ago

@kevtaylor Thanks for the response and the info 👍 TL;DR: v1.9.3 by default would be ok.

My take is that the subpath issue is a regression in K8S. I'd be happy to just stick with a relatively old version without any known major regression like that.

So, my personal preference at this point is that kube-aws sticks with K8S v1.9.3, notes the reason we made it so, and provides some guidance for users to configure their clusters to use v1.9.4+.

whereisaaron commented 6 years ago

Would it be better to release with a version like v1.9.6, after both the subpath security vulnerability and the regressions were fixed?

The problem release is v1.9.4, since if you upgrade to that release, you'll have a hard time upgrading from it. So if we release with v1.9.3, then the advice to properly fix the vulnerability should probably be to upgrade directly to v1.9.5/6: do not pass 'Go', but do at all costs pass by v1.9.4 😄

v1.10.0 just hit stable, and the release notes caution that upgrading from v1.9.4 is not supported if you use subpaths, and also recommend downgrading to no earlier than v1.9.6.

Since we have the luxury of not having had a v1.9.x release yet, we could go straight to the fixed version. On the other hand, if v1.9.3 is tested and stable and can be released right now, I'd rather have that version and wear the mitigation than have no release.