Open bellis-ai opened 1 year ago
Looks like it's a matter of just changing the Autoscaling group to use the new NLB, importing it back into state, and then adding the security groups (the ones ending in -cp) to each control plane instance.
I haven't yet worked out how to handle a migration to the NLB gracefully. Beware that if you just update in place, the new load balancer will have a different DNS name from the old one. That will invalidate the server certificate served by the Kube api-server, because our module automatically puts the DNS name of the old load balancer in the certificate's SAN list.
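To make the SAN problem concrete, here is a minimal Python sketch of the check a client effectively performs. It assumes the certificate is available in the dict shape that `ssl.SSLSocket.getpeercert()` returns for a validated connection. The load balancer names and the IP address are hypothetical placeholders, and wildcard entries are not handled.

```python
def san_entries(cert: dict) -> set[str]:
    """Return the DNS names and IP addresses listed in a certificate's SAN field.

    `cert` is expected in the dict shape returned by ssl.SSLSocket.getpeercert().
    """
    return {
        value.lower()
        for kind, value in cert.get("subjectAltName", ())
        if kind in ("DNS", "IP Address")
    }


def san_covers(cert: dict, hostname: str) -> bool:
    """True if `hostname` appears verbatim in the SAN list (no wildcard handling)."""
    return hostname.lower() in san_entries(cert)


# Hypothetical names, for illustration only.
old_lb = "old-clb.example.elb.amazonaws.com"
new_lb = "new-nlb.example.elb.amazonaws.com"
cert = {
    "subjectAltName": (
        ("DNS", "kubernetes"),
        ("DNS", old_lb),
        ("IP Address", "10.0.0.10"),
    )
}

print(san_covers(cert, old_lb))  # True
print(san_covers(cert, new_lb))  # False
```

A certificate issued while the old load balancer was in place passes the first check and fails the second, which is why clients going through the new NLB reject it.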
I believe, but have not yet tested, that what you have to do is:
Alternatively, if you have a custom URL and DNS record for the api-server and already included that in the TLS SAN list, none of this will matter.
Thank you so much! I was just encountering this problem when trying to cycle out the old master nodes -- none were joining the cluster! I'll try this now.
@adamacosta When you say
Cycle rke2 on the control plane nodes to generate a new certificate that will include this
What do you mean exactly? Restart the systemctl service? How do I cycle rke2? I am not very experienced in manual deployment of RKE2, so I'd like to know what I need to restart
Yes, run systemctl restart rke2-server on each control plane node, after editing the config.yaml file. That should generate a new certificate with the added TLS SAN for the new load balancer in it. New nodes should be able to join after that.
I feel like something's missing. Any ping to 9345 after the config change and rke2-server restart results in a TLS error for "SSL23_GET_SERVER_HELLO" (pings to 6443 still go through). I feel like there's a cert I'm missing here...
So it looks like the changes are indeed propagated to serving-kube-apiserver.crt, but whatever cert is being used for the supervisor does not get replaced. I have no idea how to force it to change.
Figured it out. You have to invalidate the cached certificate data by deleting /var/lib/rancher/rke2/server/tls/dynamic-cert.json. No idea why this isn't done automatically when the certificate data is different.
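For anyone scripting this across several control plane nodes, a rough Python sketch of the two steps follows. The file path and the service name come from this thread. The function names are made up for illustration, and the thread does not say whether the service needs to be stopped before the file is removed, so deleting first and then restarting is an assumption.

```python
import subprocess
from pathlib import Path

# Path taken from the comment above.
DYNAMIC_CERT_CACHE = Path("/var/lib/rancher/rke2/server/tls/dynamic-cert.json")


def invalidate_cached_cert(cache: Path = DYNAMIC_CERT_CACHE) -> bool:
    """Delete the cached certificate file; return True if a file was removed."""
    if cache.exists():
        cache.unlink()
        return True
    return False


def restart_rke2_server() -> None:
    """Restart the service so it regenerates its certificates (needs root)."""
    subprocess.run(["systemctl", "restart", "rke2-server"], check=True)


# Intended use on each control plane node, after the new load balancer's DNS
# name has been added to the TLS SAN list in config.yaml:
#   invalidate_cached_cert()
#   restart_rke2_server()
```

Nothing runs on import. The two calls in the trailing comment are meant to be run as root on each control plane node in turn.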
Hey, thanks for figuring that out. Apologies for not following this better. I did get around to trying this out and it worked fine for me in terms of hitting the api server, but I only ran it on a single host, so the supervisor process would have been unused anyway. I'm not going to close this right away because we should put this in a real migration doc.
In a recent update, the control plane was changed to use an NLB instead of a classic load balancer. Those upgrading the module to the latest version will run into the following error:
Not sure how to fix.