rancherfederal / rke2-aws-tf


Non-backwards compatible change: CP using NLB #90

Open bellis-ai opened 1 year ago

bellis-ai commented 1 year ago

In a recent update, the control plane was changed to use an NLB instead of a classic load balancer. Those upgrading the module to the latest version will run into the following error:

Not sure how to fix.

bellis-ai commented 1 year ago

Looks like it's a matter of just changing the Autoscaling group to use the new NLB, importing it back into state, and then adding the security groups (names ending in -cp) to each control plane instance.
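For anyone else doing this by hand, a rough sketch of those steps is below. This is a hypothetical outline, not the module's documented procedure: the ASG name, target group ARN, Terraform resource address, and security group IDs are all placeholders, so check `terraform state list` and the AWS console for the real values in your workspace.

```bash
# Hypothetical sketch only; names, ARNs, and Terraform addresses are placeholders.

# 1. Register the control plane autoscaling group with the new NLB's target group.
aws autoscaling attach-load-balancer-target-groups \
  --auto-scaling-group-name my-cluster-cp \
  --target-group-arns "arn:aws:elasticloadbalancing:us-east-1:111122223333:targetgroup/my-cluster-cp/0123456789abcdef"

# 2. Re-import the autoscaling group into Terraform state (the address is a guess;
#    use whatever `terraform state list` reports for your module).
terraform import 'module.rke2.aws_autoscaling_group.this' my-cluster-cp

# 3. Add the control plane security group (the one whose name ends in -cp) to each
#    control plane instance. Note that --groups replaces the whole list, so include
#    the groups the instance already has.
aws ec2 modify-instance-attribute \
  --instance-id i-0123456789abcdef0 \
  --groups sg-0aaaaaaaaaaaaaaaa sg-0bbbbbbbbbbbbbbbb
```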

adamacosta commented 1 year ago

I haven't yet worked out how to handle a graceful migration to the NLB, but beware that if you just update in place, the new load balancer will have a different DNS name from the old one. That will invalidate the server certificate served by the kube-apiserver, which has the old load balancer's DNS name in its SAN list, placed there automatically by our module.
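If you want to confirm what's actually in the SAN list before touching anything, you can dump it straight from the serving cert on a control plane node (the path below is the RKE2 default):

```bash
# Print the SAN list of the kube-apiserver serving certificate on a control plane node.
# Only the old load balancer's DNS name will appear until the cert is regenerated.
sudo openssl x509 \
  -in /var/lib/rancher/rke2/server/tls/serving-kube-apiserver.crt \
  -noout -text | grep -A1 'Subject Alternative Name'
```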

I believe, but have not yet tested, that what you have to do is:

1. Add the new load balancer's DNS name to the tls-san list in the RKE2 config on each control plane node.
2. Cycle rke2 on the control plane nodes to generate a new certificate that will include this.

Alternatively, if you have a custom URL and DNS record for the api-server and already included that in the TLS san list, none of this will matter.
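As a concrete (hypothetical) example of what that config change looks like: the new NLB's DNS name goes into the `tls-san` list in `/etc/rancher/rke2/config.yaml` on each control plane node. The hostname below is a placeholder; use your NLB's DNS name, or your custom api-server record if you have one.

```bash
# Append a tls-san entry to the RKE2 server config on each control plane node.
# If config.yaml already has a tls-san key, merge the new name into that list
# instead of appending a second key.
cat <<'EOF' | sudo tee -a /etc/rancher/rke2/config.yaml
tls-san:
  - my-cluster-cp-0123456789.elb.us-east-1.amazonaws.com
EOF

# Then cycle rke2 so the serving certificate is regenerated with the new SAN.
sudo systemctl restart rke2-server
```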

bellis-ai commented 1 year ago

Thank you so much! I was just encountering this problem when trying to cycle out the old master nodes -- none were joining the cluster! I'll try this now.

bellis-ai commented 1 year ago

@adamacosta When you say "Cycle rke2 on the control plane nodes to generate a new certificate that will include this", what do you mean exactly? Restart the systemd service? How do I cycle rke2? I am not very experienced in manual deployment of RKE2, so I'd like to know what I need to restart.

adamacosta commented 1 year ago

Yes, run `systemctl restart rke2-server` on each control plane node after editing the config.yaml file. That should generate a new certificate with the added TLS SAN for the new load balancer. New nodes should be able to join after that.

bellis-ai commented 1 year ago

I feel like something's missing. Any connection to 9345 after the config change and the rke2-server restart fails with a TLS error ("SSL23_GET_SERVER_HELLO"), while connections to 6443 still go through. I feel like there's a cert I'm missing here...
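One way to see what is going on (the node address below is a placeholder) is to compare the certificate the supervisor presents on 9345 with the one the api-server presents on 6443:

```bash
# Compare the certificates presented on the supervisor port (9345) and the
# kube-apiserver port (6443). NODE is a placeholder for a control plane address.
NODE=10.0.0.10
for PORT in 9345 6443; do
  echo "== port $PORT =="
  openssl s_client -connect "$NODE:$PORT" </dev/null 2>/dev/null \
    | openssl x509 -noout -text | grep -A1 'Subject Alternative Name'
done
```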

bellis-ai commented 1 year ago

So it looks like the changes are indeed propagated to `serving-kube-apiserver.crt`, but whatever cert the supervisor is using does not get swapped out. I have no idea how to force it to change.

bellis-ai commented 1 year ago

Figured it out. You have to invalidate the cached certificate data by deleting `/var/lib/rancher/rke2/server/tls/dynamic-cert.json`. No idea why this isn't done automatically when the certificate data is different.
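So, assuming the same default paths, the per-node fix ends up being:

```bash
# On each control plane node, after adding the new tls-san entry to config.yaml:
# remove the cached dynamic certificate so rke2 regenerates it, then restart.
sudo rm /var/lib/rancher/rke2/server/tls/dynamic-cert.json
sudo systemctl restart rke2-server
```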

adamacosta commented 1 year ago

Hey, thanks for figuring that out. Apologies for not following this better. I did get around to trying this out and it worked fine for me in terms of hitting the api server, but I only ran it on a single host, so the supervisor process would have been unused anyway. I'm not going to close this right away because we should put this in a real migration doc.