rancherfederal / rke2-aws-tf


Non-backwards compatible change: CP using NLB #90

Open bellis-ai opened 1 year ago

bellis-ai commented 1 year ago

In a recent update, the control plane was changed to use an NLB instead of a classic load balancer. Those upgrading the module to the latest version will encounter the following error:

Not sure how to fix.

bellis-ai commented 1 year ago

Looks like it's a matter of just changing the Auto Scaling group to use the new NLB, importing it back into state, and then adding the security groups (the ones ending in -cp) to each control plane instance.
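
Something along these lines should cover the import step; the resource address and ARN below are placeholders (not the module's actual names), so check terraform state list for the real address in your setup:

```bash
# Sketch only: substitute the real resource address and the new NLB's ARN.
terraform state list | grep -i 'aws_lb'               # find how the module names the load balancer
terraform import 'module.rke2.aws_lb.controlplane' \
  arn:aws:elasticloadbalancing:us-east-1:123456789012:loadbalancer/net/example-cp/0123456789abcdef
terraform plan                                         # confirm only the expected in-place changes remain
```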

adamacosta commented 1 year ago

I haven't yet worked out how to handle a graceful migration to the NLB, but beware that if you just update in place, the new load balancer will have a different DNS name than the old one. That will invalidate the server certificate served by the kube-apiserver, which has the old load balancer's DNS name in its SAN list, placed there automatically by our module.

I believe, but have not yet tested, that what you have to do is:

- Add the new load balancer's DNS name to the tls-san list in /etc/rancher/rke2/config.yaml on each control plane node
- Cycle rke2 on the control plane nodes to generate a new certificate that will include this

Alternatively, if you have a custom URL and DNS record for the api-server and already included that in the TLS SAN list, none of this will matter.
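
For reference, the config edit on each control plane node would look roughly like this; the DNS names below are placeholders for your old and new load balancers and any custom record you use:

```yaml
# /etc/rancher/rke2/config.yaml (sketch; keep whatever other keys you already have)
tls-san:
  - old-classic-elb-1234567890.us-east-1.elb.amazonaws.com   # existing entry for the old load balancer
  - new-cp-nlb-0123456789abcdef.elb.us-east-1.amazonaws.com  # add the new NLB's DNS name
  - rke2.example.com                                         # optional custom api-server record
```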

bellis-ai commented 1 year ago

Thank you so much! I was just encountering this problem when trying to cycle out the old master nodes -- none were joining the cluster! I'll try this now.

bellis-ai commented 1 year ago

@adamacosta When you say "Cycle rke2 on the control plane nodes to generate a new certificate that will include this", what do you mean exactly? Restart the systemd service? How do I cycle rke2? I'm not very experienced with manual RKE2 deployments, so I'd like to know exactly what I need to restart.

adamacosta commented 1 year ago

Yes, run systemctl restart rke2-server on each control plane node after editing the config.yaml file. That should generate a new certificate with the added TLS SAN for the new load balancer, and new nodes should be able to join after that.
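
Concretely, on each control plane node (the openssl line is just a sanity check; the cert path is the default RKE2 data dir):

```bash
# Restart the server after editing /etc/rancher/rke2/config.yaml
sudo systemctl restart rke2-server

# Sanity check: the new load balancer name should now appear in the api-server serving cert's SANs
sudo openssl x509 -noout -text \
  -in /var/lib/rancher/rke2/server/tls/serving-kube-apiserver.crt \
  | grep -A1 'Subject Alternative Name'
```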

bellis-ai commented 1 year ago

I feel like something's missing. Any request to port 9345 after the config change and rke2-server restart results in a TLS error ("SSL23_GET_SERVER_HELLO"), while requests to 6443 still go through. I feel like there's a cert I'm missing here...

bellis-ai commented 1 year ago

So it looks like the changes are indeed propagated to serving-kube-apiserver.crt, but whatever cert is being used for the supervisor does not change. I have no idea how to force it to change.
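
One way to see what the supervisor is actually serving is to compare the certificates on the two ports directly (replace the hostname with your new NLB's DNS name):

```bash
# Compare SANs served on the supervisor port (9345) vs the api-server port (6443)
openssl s_client -connect new-cp-nlb.elb.us-east-1.amazonaws.com:9345 </dev/null 2>/dev/null \
  | openssl x509 -noout -text | grep -A1 'Subject Alternative Name'
openssl s_client -connect new-cp-nlb.elb.us-east-1.amazonaws.com:6443 </dev/null 2>/dev/null \
  | openssl x509 -noout -text | grep -A1 'Subject Alternative Name'
```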

bellis-ai commented 1 year ago

Figured it out. You have to invalidate the cached certificate data by deleting /var/lib/rancher/rke2/server/tls/dynamic-cert.json. No idea why this isn't done automatically when the certificate data is different.
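
For anyone else hitting this, the full sequence on each control plane node ended up being roughly:

```bash
# 1. Add the new NLB's DNS name to the tls-san list in /etc/rancher/rke2/config.yaml
# 2. Drop the cached dynamic listener cert so the supervisor re-issues it with the new SAN
sudo rm /var/lib/rancher/rke2/server/tls/dynamic-cert.json
# 3. Restart so both the api-server cert and the supervisor cert are regenerated
sudo systemctl restart rke2-server
```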

adamacosta commented 1 year ago

Hey, thanks for figuring that out. Apologies for not following this more closely. I did get around to trying it out, and it worked fine for me in terms of hitting the API server, but I only ran it on a single host, so the supervisor process would have been unused anyway. I'm not going to close this right away, because we should put this in a proper migration doc.