Closed camilb closed 5 years ago
@camilb I've been maintaining and testing several k8s clusters using kube-aws for the last year and I do agree with your idea; it would be good if we could pin a specific k8s version and amiID for the control plane and etcd, and another one for workers. That would tremendously help upgrades and updates
AFAIK `amiId` can be pinned per node pool already
@camilb Thanks a lot for sharing your hard-won experience.
Yes - I believe this should be "fixed" somehow.
> On our CF stacks we never use DesiredCapacity for ASGs because they can be resized due to a traffic increase and CF will rollback the update if the size differs.
I agree that DesiredCapacity should not be used anywhere. Actually, no kube-aws-generated stack template has included DesiredCapacity since #142, as far as I can remember.
> Sometimes a rollback will not work for ETCD or controllers with specific versions. For example, on test clusters I was unable to revert from ETCD 3.2.X to ETCD 3.0.X or from Kubernetes 1.9.X to 1.7.X, and this required repeating the upgrade.
This is really worth noting!
So, we have two things to be considered separately, right?
For 1., we could split the root stack into three as you've suggested. A downside would be that we might need another workflow-engine-like system to reliably update etcd, then controllers, and finally worker nodes, in this specific order.
Otherwise, if we don't need reliability here, we could just do it in the kube-aws command while instructing the user to re-run `kube-aws update` in case kube-aws failed in the middle of an update.
An alternative approach would be emitting a validation error when a user tries to update the whole cluster in one shot.
For 2., we'd need some domain-specific logic to determine whether an update is backward-incompatible or not. Otherwise, we can just give up automating it and allow users to disable rollback via a command-line flag or a specific setting in cluster.yaml.
Regarding `amiId`, yeah - as @redbaron noted, we can configure it per pool. Would it be a matter of documentation?
Btw, my take on this problem is improving kube-aws by doing all of the below:
What are your thoughts? Thanks!
Hi @mumoshu, thanks for your fast response. In most of the upgrades I prefer not touching the ETCD nodes. Since 07/2016, when I started using kube-aws, I haven't had any issue with ETCD on any cluster, and I prefer to update it less often than Kubernetes. So ideally I'm thinking of having 3 separate stacks, with the possibility to update ETCD and Kubernetes separately.
Something like: `kube-aws update etcd` for the ETCD upgrade and `kube-aws update` to upgrade controllers and workers. Or `kube-aws update` to update everything with a warning, plus `kube-aws update etcd`, `kube-aws update controllers` and `kube-aws update workers`.
In the future we could benefit from a command like `kube-aws upgrade workers` for people that choose to use AWS's EKS for controllers and ETCD. Also, maybe we can add a bool in `cluster.yaml`, like `RollbackOnFailure`, to control the "Rollback on failure" CF stack option. I have nothing against the rollback when something fails, but this option would allow the user to control when to run the rollback. In my case, only the last node pool failed, and it already had the instances running, except one, which was not an issue.
Thanks @redbaron, I hadn't noticed it :beer:
@camilb Thanks!
Would you also mind sharing your thoughts on how we can trigger a rollback when the new worker nodes fail due to the preceding updates to controller nodes?
Do you just leave the already-updated controller nodes as-is and re-run the updates to worker nodes (with appropriate changes in the worker config so that new workers can adapt to the recent changes in controller nodes)?
> Would you also mind sharing your thoughts on how we can trigger a rollback when the new worker nodes fail due to the preceding updates to controller nodes?
@mumoshu When "Rollback on failure" is set to `No` and a stack fails, a user can trigger the rollback manually from the AWS console: Actions ==> Rollback Stack.
If the workers are in a separate stack, we can roll back only the workers, as we can keep using controllers with a greater version (up to 3 releases, if I remember correctly) until we fix the issues and update the workers stack again.
@camilb Thanks! I believe I understood that part.
Excuse me, but what I wanted to sync up on was this:
In that case, as the stack for controller nodes was successfully updated, you can't trigger a manual rollback, right?
@mumoshu I understand; unfortunately the only option in that case is to fix the controllers stack and update it again. But I think it can be observed as soon as the controllers finish the update: the existing workers will start to fail. And if the stacks can be upgraded in separate commands, like on GKE where you have the option to update the controllers, all the workers, or just a node pool (`gcloud container clusters upgrade --master`, `gcloud container clusters upgrade --cluster-version` and `gcloud container clusters upgrade --node-pool`), then maybe we can add an option for the controllers CF stack to fail if existing nodes go `NodeNotReady`. In that case we can trigger the rollback for the controllers before starting the workers upgrade.
@camilb Thanks for the confirmation. I'd like to achieve a similar UX at least.
Under no circumstances should there be a surprise such as "why are my etcd/controller nodes being updated even though my changes in cluster.yaml are solely for workers?". Unfortunately this is the UX we have now. Let's take some action!
My updated suggestions for improvements:

- Add `kube-aws update --only controller,worker`. If the etcd stack is being updated in this case, emit a validation error and tell the user to either include 'etcd' in the flag or revert the recent modifications in cluster.yaml affecting the etcd stack.
- Make `kube-aws update` persist the AMI IDs used across worker, etcd and controller nodes, in something like the overrides.json suggested in https://github.com/kubernetes-incubator/kube-aws/issues/1049#issuecomment-349010766 It should be named differently - maybe defaults.json?
- `kube-aws update --only worker` updates the AMI ID just for worker.
- Add `--yes` (`-y` in short). A `kube-aws update` without the flag prompts the user to approve changes to stacks. Example: "3 stacks (etcd, worker, controller) are being updated. It may take a long time to roll back if this update fails at the very end of the process. Do you want to proceed? [y/n]". Exit with the same message but without a prompt in case a tty is missing.
- Add `rollbackOnFailure` for the controller and etcd stacks, respectively. We won't need one for workers?

Use-cases:

- `kube-aws update -y` to retain the existing behavior. Good for GitOps.
- `kube-aws update --only controller,worker` to never touch the etcd stack; possibly set `controller.rollbackOnFailure: false` in cluster.yaml (@camilb)
- `kube-aws update` for other uses.

In any case, a spare cluster for faster disaster recovery, or blue-green cluster deployment (my preference), is recommended.

Down-sides:

- A `kube-aws update` on the two-nested-stacks cluster would incur downtime from the start of the etcd replacement until the controller stack comes up.

Perhaps we need to move the VPC and subnet definitions from the controlplane stack to the root stack?
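The proposed `--yes` prompting behavior could be sketched roughly like this (an illustrative shell sketch only - kube-aws itself is written in Go, and the function name and message here are made up):

```shell
# Illustrative sketch of the proposed behavior, not kube-aws code: prompt for
# approval only when stdin is a tty; otherwise print the same message and fail.
confirm_update() {
  msg='3 stacks (etcd, worker, controller) are being updated. It may take a long time to roll back if this fails at the very end. Proceed? [y/n] '
  if [ -t 0 ]; then
    printf '%s' "$msg"
    read -r answer
    [ "$answer" = "y" ]
  else
    printf '%s\n' "${msg}(no tty detected; pass --yes/-y to approve)" >&2
    return 1
  fi
}
```

A `kube-aws update -y` would then skip the prompt entirely, keeping today's non-interactive behavior for GitOps pipelines.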
@camilb @Fsero How about the above idea? For me, it seems to provide a smoother migration/development path while achieving your original goal. @redbaron @c-knowles @danielfm Hi! Do you have any comments? (A backward-incompatible change, again!)
Sounds reasonable to align to the UX of gcloud; we use that as well and it seems to work well.
I'm not keen on the overrides.json as it splits the config in two. I'd prefer to pin the AMI on cluster.yaml generation or in source, and then add a command to update it using the existing code. i.e. less surprise updates, but slightly more surprise if a user is expecting everything to auto update.
@mumoshu From my point of view your proposal is enough to avoid CF issues in the future. In the last one and a half years I didn't have any major issues with kube-aws except the CF updates, and almost every time on worker updates.
> Add rollbackOnFailure for controller and etcd stack, respectively. We won't need one for workers?
I don't see any downside to adding `rollbackOnFailure` on all stacks, as the user can trigger the rollback whenever they want, but in some situations, in case of a failure, it can allow user intervention first (resizing a pool, launching separate controllers/workers, etc.).
> Perhaps we need to move VPC and subnet definitions from controlplane stack to the root stack?

I think it will be safer.
@c-knowles Thanks for your comment!
Yes, it would work too, as long as we give up having kube-aws itself automatically set the latest k8s version and the latest AMI ID at `kube-aws update` time, for ease of use.
So, we could also enhance kube-aws here by:

- Not introducing `overrides.json`, but instead including `kubernetesVersion` and `etcd.version`, with the respective default values for the specific kube-aws version, in the generated cluster.yaml. kube-aws would never update them automatically.
- Letting the user edit `kubernetesVersion`, `amiId` and/or `etcd.version` freely.
- Adding a command to update `kubernetesVersion` and/or `etcd.version`, while printing the default versions embedded in the kube-aws binary being run.

In case someone is actually relying on `kube-aws update` to automatically update the k8s and etcd versions and the AMI ID, I'd suggest writing a wrapper which perhaps runs `sed` to replace the version numbers automatically? :smile:
Do you have something like that you could share with us?
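Such a wrapper could be as small as the following sketch (the file contents and target version are made up for illustration; only the `kubernetesVersion` key name comes from cluster.yaml):

```shell
# Hypothetical wrapper around sed: pin a new Kubernetes version in cluster.yaml
# yourself, instead of relying on kube-aws to bump it automatically.
NEW_K8S_VERSION="v1.9.1"

# Sample cluster.yaml for illustration; in practice the file already exists.
printf 'kubernetesVersion: v1.8.4\n' > cluster.yaml

# Rewrite only the top-level kubernetesVersion line, keeping a .bak backup.
sed -i.bak "s/^kubernetesVersion: .*/kubernetesVersion: ${NEW_K8S_VERSION}/" cluster.yaml
cat cluster.yaml   # kubernetesVersion: v1.9.1
```

Anchoring the pattern to the start of the line avoids touching nested keys or comments that mention a version elsewhere in the file.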
Implementation note: it would be possible to detect which stack(s) are being updated by creating a cfn change set and inspecting the result, so that we can emit a validation error accordingly.
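The nested stacks touched by an update show up as `AWS::CloudFormation::Stack` resource changes in such a change set. Sketched below with a hand-written `DescribeChangeSet` response (in practice the JSON would come from `aws cloudformation create-change-set` followed by `describe-change-set` against the root stack; the logical IDs are illustrative):

```shell
# Hand-written sample of a DescribeChangeSet response for a root stack whose
# nested control-plane and one node pool are about to change.
cat > changeset.json <<'EOF'
{"Changes": [
  {"ResourceChange": {"Action": "Modify",
                      "ResourceType": "AWS::CloudFormation::Stack",
                      "LogicalResourceId": "Controlplane"}},
  {"ResourceChange": {"Action": "Modify",
                      "ResourceType": "AWS::CloudFormation::Stack",
                      "LogicalResourceId": "Nodepool1"}}
]}
EOF

# List the nested stacks the update would touch; kube-aws could diff this
# against the --only targets and emit a validation error on any mismatch.
grep -o '"LogicalResourceId": "[^"]*"' changeset.json | cut -d'"' -f4
```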
What about something like:
Not sure about including k8s and etcd versions as then what about container image versions etc? There's usually quite a bit more to updating than just those versions and a lot of different incompatibilities but the AMI is a little removed from that other than the docker version.
Hi @c-knowles! I generally agree with the direction.
> Not sure about including k8s and etcd versions as then what about container image versions etc? There's usually quite a bit more to updating than just those versions and a lot of different incompatibilities but the AMI is a little removed from that other than the docker version.
Regarding the etcd and k8s versions, I wanted to include them for auto-population, as they do result in node replacement once you update the kube-aws binary, even if no cluster.yaml or stack templates are changed.
> Add a new command to update cluster.yaml to latest version.
How would you like kube-aws to achieve it? There's no YAML parser capable of preserving blank lines and comments AFAIK :cry: I'm afraid that a simple text replacement would break in various ways.
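A contrived illustration of that fragility (file contents are made up): a blanket substitution of the version string also rewrites comments that happen to contain it, which a comment-preserving YAML patcher would have to avoid:

```shell
# Made-up cluster.yaml with a comment mentioning the same version string.
cat > cluster.yaml <<'EOF'
# Last known-good release was v1.8.4 - keep this note!
kubernetesVersion: v1.8.4
EOF

# A naive global replacement clobbers the comment along with the real key:
sed 's/v1.8.4/v1.9.1/g' cluster.yaml
```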
- `kube-aws update --only controller,worker`.
- Make `kube-aws init`, rather than `kube-aws update`, persist the AMI IDs used across worker, etcd and controller nodes.
- `kube-aws update --only worker` updates the AMI ID just for worker. Stop automatically updating AMI IDs in `kube-aws update`, as there's no reliable way to do so. See https://github.com/kubernetes-incubator/kube-aws/issues/1112#issuecomment-358184892
- `--yes` (`-y` in short). A `kube-aws update` without the flag prompts the user to approve changes to stacks. Example: "3 stacks (etcd, worker, controller) are being updated. It may take a long time to roll back if this update fails at the very end of the process. Would you like to proceed? [y/n]". Exit with the same message but without a prompt in case a tty is missing.
- `rollbackOnFailure` for the controller and etcd stacks, respectively.

Use-cases:

- `kube-aws update -y` to retain the existing behavior. Good for GitOps.
- `kube-aws update --only controller,worker` to never touch the etcd stack; possibly set `controller.rollbackOnFailure: false` in cluster.yaml (@camilb)
- `kube-aws update` for other uses.

In any case, a spare cluster for faster disaster recovery, or blue-green cluster deployment (my preference), is recommended.

Down-sides:

- A `kube-aws update` on the two-nested-stacks cluster would incur downtime from the start of the etcd replacement until the controller stack comes up.

Interesting discussion and some good ideas. The ideas that jumped out as great to me were:
Being able to disable the automatic rollback: I've been doing a lot of deployments with 0.9.9, and when something is wrong, it becomes a race between me and CF to diagnose the root cause before CF destroys the evidence. I'd love the option to disable it sometimes.
Inject AMI IDs into cluster.yaml on init: I agree with @c-knowles that the separate json file sounds clumsy and maybe mysterious to users. I'd rather have explicit pinned AMIs added to `cluster.yaml`, even if I have to manually update those from time to time. It would be a nice assist if `kubectl advisor` would gather and list the latest relevant AMI IDs and k8s versions available to help me with that. Later, if a reliable YAML patcher becomes available, we could integrate the option to auto-update the `cluster.yaml`.
Separate stacks: Compared with the old node-pool system (circa 0.9.3), I love being able to have all the settings in one `cluster.yaml`. But I don't actually enjoy the nested CF stacks. It scares the bacon off me that, when updating worker node tags or instance type or something, I'm going anywhere near the control plane. And when deploying many times to work out the kinks with 0.9.9 configs, I wasted a lot of time watching the `etcd` cluster being slowly made and unnecessarily destroyed, while I worked out kinks with the controller and worker deployment.
@camilb I'd love to hear your experience with upgrading k8s versions in `kube-aws` clusters. It has never really been a topic in the `kube-aws` documentation, and often it has been noted as 'not yet supported', 'won't include etcd', or 'might not work'.
Are you able to deploy 1.9 clusters with `kube-aws` 0.9.9?
Are you able to upgrade 1.8 clusters to 1.9 with `kube-aws` 0.9.9?
Do you just update the hyperkube version and go for it?
Any tips or gotchas you can share or add to the docs somewhere?
Thanks for your feedback!
Do we need backward-compatibility for this, e.g. a flag to split stacks or not?
Also, running `kube-aws update` on existing clusters after this change would recreate the whole control plane and the kube-aws-managed VPC/subnets/etc., which indeed introduces downtime.
I'm ok with recreating every cluster, because I consider my clusters cattle rather than pets. How about you, everyone? Any ideas on whether you need a better update path?
@whereisaaron I'm using the master branch to build kube-aws most of the time. Currently all of my clusters are running the latest changes in master. Some were upgraded from 1.8.x to 1.9.1, and the largest one from 1.7.8 to 1.9.1.
Most of the time everything works fine, but I've had situations where I had to regenerate all the service account tokens and restart most of the pods in `kube-system`. Except for this and the CF errors, I never had another issue upgrading kube-aws in almost 2 years.
We still don't have a method to upgrade the Kubernetes version only, so all the instances are replaced by `kube-aws update`. And the network is managed by a separate stack, similar to this one.
@mumoshu I'm planning to migrate existing clusters to new ones, without updating.
@mumoshu when you have a large collection of heterogeneous applications in a cluster, the clusters may be cattle, but the applications on them are like pet ticks on the cow: you have to locate and transplant them to the next beast :smile:
I hope that at least minor (x.y.z) version in-place upgrades will be possible/supported, again, trying to match the ease of upgrade GKE and the like give you. Separate stacks make this a little easier, since node pools and controllers are pretty much cattle within a cluster.
@camilb thank you for the pointers. @c-knowles had the same issue with Service Accounts when upgrading. Looking at kube-aws-secure, I am doing almost the identical thing to you, except I use `terraform` to create the VPCs, NAT, LBs, etc. and then `kube-aws` to add cluster(s) within the VPC. You use `kube2iam` though; does that mean you have IAM roles and `kube2iam` annotations to keep `kube-aws` add-ons like the node-drainer (#1105) and resources-auto-backup (#1084) working? They break for me with 0.9.9 as soon as I enable `kube2iam`, and even more breaks if you enable namespace restrictions (which is the default for `kiam`).
@whereisaaron Thanks for the comment, as always!
> the applications on them are like pet ticks on the cow you have to locate and transplant to the next beast
Good point :smile:
I'd rather split clusters into two or more groups, and connect services across clusters via e.g. NLB or a kind of service mesh, then :smile: It sounds like going backwards, but it's pretty much like what we have been doing to avoid complicating/exploiting a single EC2 instance too much in the good old days.
Also, I don't have very much confidence in rolling-updating the K8S version like GKE does yet, especially after I heard that GKE suddenly started restricting egress to VMs within the same zone after an update! I can't just bring my service down every time I step on an edge case (but is it really an edge?) like that.
But anyway -
> I hope that at least minor (x.y.z) version in-place upgrades will be possible/supported, again, trying to match the ease of upgrade GKE and the like give you.
Yeah, ease of upgrade is certainly what I would love, too. Let's keep discussing!
Sorry for jumping into the discussion a little late.
I have been managing a kube-aws cluster for almost a year now, and I agree with most of what has been discussed so far. We are heavy users of spot fleet, and upgrading a spot fleet worker pool is especially messy. A pool of on-demand resources will (usually) perform a nice rolling update, replacing nodes at a pace that our deployments can cope with. Pods are moved around as nodes are terminated and added - one by one.
But an update of a spot fleet is not a very nice experience, due to how AWS CloudFormation integrates with spot fleet. Basically the complete fleet is terminated and a new one created, leading to all nodes being terminated simultaneously and basically removing all resources from the cluster.
In normal operation our on-demand worker pools have ZERO resources, and we run only spot fleet workers. We have a good mix of zones and instance types to minimize the risk of too many resources being terminated. Usually we see a couple of terminations a week, and this has no effect on our stateless running pods :) They just move to a new node while the spot fleet replaces the terminated instance with another instance type/zone combo.
So to overcome the very bad way AWS CF replaces spot fleets, I…
@paalkr I've had the same issues and basically we made multiple spot fleet node pools, because back about 0.9.3 node pools could be upgraded individually, and independent of the control plane, since each was its own CF stack. That worked fine because we upgrade and scale up the unused spot fleet node pool, then drain scale down and upgrade the previous spot fleet node pool.
Now that it is all one stack, it is more difficult to do this cleanly, though many settings are per-node-pool, so you can, fingers crossed, still upgrade one node pool at a time.
@camilb @whereisaaron I have been putting some time towards implementing this.
Well, is it really necessary to break up the nested stacks just for this use-case?
My question here is basically: what if we render/upload the cfn template just for the update targets (specified via the proposed `--only` flag)?
I guess it would just work for our use-case by updating the selected nested stack(s) only. Then we don't necessarily need to break up the nested stacks, right?
I don't have any strong feelings towards how this is implemented, but it's important that it is possible (somehow) to 100% ensure that only a specified worker pool (and nothing else) is affected when performing a stack update.
@paalkr Thx!
Certainly.
Just curious though, what does "affected" mean, and what is the definition of "100%", in this case?
It is definitely feasible to limit `kube-aws update` to actually update only the targeted stack, as proposed in this thread. But it doesn't mean other nodes are never affected.
Say you've updated just the etcd stack with wrong SGs - it is possible that the updated etcd cluster breaks the apiserver due to a connection issue. The proposed solution in this thread alone forces you to manually roll back etcd in this case.
Of course, you could implement every possible reverse healthcheck for every possible communication path, like "are all the apiserver pods on controller nodes able to communicate with me - an updated etcd node?", so that you can stop rolling etcd nodes in case of such a failure. It is, though, really cumbersome to implement, and nothing guarantees that we have all the possible reverse healthchecks implemented.
Revisiting what we have today: just updating the whole cfn stack along with all the nested ones has been freeing us from such concerns, because e.g. an update failure on a controller node triggered a rollback on etcd nodes, as long as both were involved in a single `kube-aws update`.
So, from my viewpoint there seems to be no perfect solution of the kind you seem to expect. Therefore I'd like to ask you more about the expectations and reasonable trade-offs you have in mind! Thx.
Hmm, I guess my specific example doesn't make sense in itself, but I hope it still conveys the substance of my concern.
Sure. I'm mostly concerned about controlling which nested stack is updated, not the internals of the control plane nested stack itself - I will never upgrade a cluster to a new K8s version anyway; that I will always do by creating a second cluster and redeploying everything.
My main problem is that updating just one pool separately is not currently possible, at least not with 100% confidence that another pool is not rolled back or changed.
@paalkr Thanks for the confirmation :smile: Noted. And yes, your practice for production K8S upgrades is what I'd recommend in most cases. In case you missed it, #455 adds more context on that topic.
Nested stacks sounded like a great idea, but in practice they have some drawbacks:

- There is no way (like `terraform plan`) to check that I'm not going to impact the control plane (because CF nested stacks don't support change sets).

When we had separate stacks I could make whatever changes to a node pool, confident the risk was contained to that highly disposable node pool. Nesting the control plane with the node pools makes the stakes much higher. I guess you could say 'git gud' and don't make mistakes :smile:
It is really just about being able to change a node pool in isolation. Can we do that while keeping nested stacks, via some `kube-aws` option or masking of updates?
Thx for the detailed comment! I'm saying that we can still ensure only one stack is updated, without throwing away nested stacks :smile:
The implementation could be something like: kube-aws uploads just the rendered control-plane stack to S3 when `--only controller` is provided, thus cfn updates the nested control-plane stack only.
Sounds good @mumoshu :smile:
@camilb @redbaron @c-knowles @whereisaaron @paalkr
TL;DR: this is getting harder than I had thought :smile:
Turns out this change requires us to split the stack into:
We need the `network` (named `infrastructure` before my edit :smile:) stack for the managed VPC, IGW, NAT gateways, subnets, and security groups.
Security groups are especially important because without moving those from the control-plane to a shared cfn stack, we have no way to "glue" various SGs without falling into cyclic dependencies of stack inputs/outputs between control-plane and etcd.
This is similar to the reason why the worker node pool stack(s) currently don't include those networking-related resources, but instead import them from the control-plane stack.
So, would anyone mind updating all the container images to those for the latest version of Kubernetes v1.9.x, so that we can cut the new kube-aws version without my huge change?
@mumoshu Just to make you and anyone else aware of potential pitfalls with subpaths: https://github.com/kubernetes/kubernetes/issues/60813 There are some breaking changes in 1.9.4 and above which have prevented us from moving at present, mainly due to limitations in subpaths, as detailed here: https://github.com/kubernetes/kubernetes/issues/48677
@kevtaylor Thanks for the valuable info!
Would you suggest that the next kube-aws release should feature K8S v1.9.3?
Personally, I'm ok with that because I already ensure this mitigation, which is noted in https://github.com/kubernetes/kubernetes/issues/60813 :smile:

> Prevent untrusted users from creating pods
@mumoshu To be quite fair, I am not sure at the moment what our plan of action is regarding our init containers and subpaths, which all broke when I tried to go to 1.9.5. All I can say is that, for us, 1.9.3 seems to be working and behaving using the latest kube-aws version, but this might be restrictive for other users who don't use the subpath features.
@kevtaylor Thanks for the response and the info :smile:
TL;DR: v1.9.3 by default would be ok.
My take is that the subpath issue is a regression in K8S. I'd be happy to just stick with a relatively old version without any known major regressions like that.
So, my personal preference at this point is that kube-aws sticks with K8S v1.9.3 and just note about the reason we made it so, and provide some guidance for users to configure their clusters to use v1.9.4+.
Would it be better to release with a version like v1.9.6, after both the subpath security vulnerability and the regressions were fixed?
The problem release is v1.9.4, since if you upgrade to that release, you'll have a hard time upgrading from it. So if we release with v1.9.3, then the advice to properly fix the vulnerability should probably be to upgrade directly to v1.9.5/6 - do not pass 'Go', but do at all costs pass by v1.9.4 :smile:
v1.10.0 just hit stable, and the release notes caution that upgrading from v1.9.4 is not supported if you use subpaths, and also recommend downgrading to no earlier than v1.9.6.
Since we have the luxury of not having had a v1.9.x release yet, we could go straight to the fixed version. On the other hand, if v1.9.3 is tested and stable and can be released right now, I'd rather have that version and wear the mitigation than have no release.
Hi @mumoshu, I want to discuss some possible changes in CloudFormation especially regarding the NestedStacks and DesiredCapacity.
Had all the clusters upgraded to 1.9.1 last week. All went fine, except that on the last cluster, the last NestedStack (a nodepool) failed to upgrade and started a rollback. The upgrade failed because a new instance was added and didn't respond in any form (no ssh, no ping, nothing in the AWS console's "Get System Logs"). Since CF was expecting a signal from the launched instances, the quickest solution was to terminate the instance from the console, but it didn't launch a new one, so I increased the ASG capacity to be able to receive the signal before a timeout, but right after that I received an error in CloudFormation:
New SetDesiredCapacity value 20 is below min value 21 for the AutoScalingGroup.
And the rollback started, rolling back all the node pools, controllers and ETCD for ~2h. It's not the first time something like this has happened; once, CloudFormation displayed a message that a new instance was added but it didn't show in the console, so I had to change the ASG manually to avoid a signal timeout.

On our CF stacks we never use `DesiredCapacity` for ASGs, because they can be resized due to a traffic increase and CF will roll back the update if the size differs. Then, regarding the NestedStacks, I think ETCD and the controllers should be in separate stacks. If the update fails on workers, we should not roll back the ETCD and controllers that were successfully upgraded. Sometimes a rollback will not work for ETCD or controllers with specific versions. For example, on test clusters I was unable to revert from ETCD 3.2.X to ETCD 3.0.X or from Kubernetes 1.9.X to 1.7.X, and this required repeating the upgrade. Workers are quite safe to roll back.

Also, the rollback process for NestedStacks takes a very long time for large clusters with multiple node pools. I know that using Nested Stacks is much cleaner, but rolling back everything when we have updates to the ETCD or Kubernetes version requires repeating the upgrade to recover the cluster, causing a very long downtime.
I don't know how we can remove the `DesiredCapacity` from CF right now, but at least maybe we can move ETCD and the controllers out of the Nested Stacks. What do you think?