okd-project / okd

The self-managing, auto-upgrading, Kubernetes distribution for everyone
https://okd.io
Apache License 2.0

OKD4 upgrades failing in AWS when DHCP Option Set uses user defined 'domain-name' #1921

Open brynjellis-iit opened 2 months ago

brynjellis-iit commented 2 months ago

Two days ago I had to rebuild a cluster (thankfully I hadn't yet released it for live deployments). The reason: during an upgrade from 4.13.0-0.okd-2023-10-28-065448 to 4.14.0-0.okd-2023-11-12-042703, one of the masters started complaining that it was unable to register the node with the API server.

On inspection, I could see that the name of the node had changed. Except it hadn't!!!

I hit this problem a year ago but it wasn't as serious because it was on a worker, not a master.

I'm running on AWS using IPI, and when the initial install is complete all my nodes have names of the form ip-x-x-x-x.eu-west-2.compute.internal. I should also say that my VPC uses a DHCP option set where the 'domain-name' is set to some.domain.
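For reference, the VPC's DHCP option set and its 'domain-name' value can be checked from the AWS CLI (a sketch; the VPC and option set IDs below are placeholders):

# Look up which DHCP option set is attached to the VPC (vpc-0abc123 is a placeholder)
aws ec2 describe-vpcs --vpc-ids vpc-0abc123 --query 'Vpcs[0].DhcpOptionsId' --output text

# Show the options it hands out, including 'domain-name' (dopt-0def456 is a placeholder)
aws ec2 describe-dhcp-options --dhcp-options-ids dopt-0def456 --query 'DhcpOptions[0].DhcpConfigurations' --output table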

When I look at the kubernetes.io/hostname label on the nodes, it is set to ip-x-x-x-x.some.domain. When the issue arises and the node can't register with the API server, the error message says:

kubelet_node_status.go:72] "Attempting to register node" node="ip-10-38-20-100.some.domain"
Jan 17 12:00:32.044690 ip-10-38-20-100 kubenswrapper[1408]: E0117 12:00:32.044672 1408 kubelet_node_status.go:94] "Unable to register node with API server" err="nodes \"ip-10-38-20-100.some.domain\" is forbidden: node \"ip-10-38-20-100.eu-west-2.compute.internal\" is not allowed to modify node \"ip-10-38-20-100.some.domain\"" node="ip-10-38-20-100.some.domain"

So, suddenly, a node that was clearly registered at install with the 'eu-west-2.compute.internal' suffix has picked up the domain from the DHCP option set ('some.domain') and can't register, presumably because all the configuration held in etcd and the config files refers to the original name it was given.
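For anyone else checking for the same mismatch, a couple of read-only commands show the names involved (a sketch; the node name is an example):

# Addresses (Hostname / InternalDNS / InternalIP) recorded on the node object
oc get node ip-10-38-20-100.eu-west-2.compute.internal -o jsonpath='{.status.addresses}{"\n"}'

# kubernetes.io/hostname label on every node, shown as an extra column
oc get nodes -L kubernetes.io/hostname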

I should also mention that since its original installation I performed three 4.13 upgrades, taking it from 4.13.0-0.okd-2023-06-04-080300 to 4.13.0-0.okd-2023-10-28-065448, and they were all successful.

During testing, I tried the same install (of 4.13.0-0.okd-2023-10-28-065448) and upgrade (to 4.14.0-0.okd-2023-11-12-042703) in a VPC that uses the default 'domain-name' value in the DHCP option set. In that case, the upgrades all completed successfully, all the way up to the latest 4.15 on the stable channel.

That confirmed to me that the issue comes from a change in the way the hostname for masters/workers is obtained in the later versions.
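As a rough way to compare the two hostname sources on an affected node (a sketch; the node name is an example, and the metadata call may need an IMDSv2 token depending on the instance settings):

# hostname -f is the FQDN the OS resolves, which picks up the DHCP 'domain-name';
# the EC2 metadata 'local-hostname' is the ip-x-x-x-x.<region>.compute.internal private DNS name
oc debug node/ip-10-38-20-100.eu-west-2.compute.internal -- chroot /host bash -c 'hostname -f; curl -s http://169.254.169.254/latest/meta-data/local-hostname; echo'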

Version: 4.14.0-0.okd-2023-11-12-042703 on AWS using IPI

How reproducible: 100%

Log bundle: I've destroyed the cluster (for cost reasons), so I will build a new one, reproduce the problem, and attach a must-gather here when done.

brynjellis-iit commented 2 months ago

Adding this image (6665dab0d2781f39a26f5dd800350e95.jpg), which shows the changes to the master node Hostname and InternalDNS values during the upgrade from 4.13.0-0.okd-2023-10-28-065448 to 4.14.0-0.okd-2023-11-12-042703.


brynjellis-iit commented 2 months ago

I spun up a new cluster to get the must-gather and noticed something. During the upgrade to 4.14.0-0.okd-2023-11-12-042703 there are some pending CSRs for a new control plane node! When I accepted them (the approval commands are sketched after the node list below), it ended up creating a fourth control-plane node whose name has the DHCP option set 'domain-name' value as its suffix.

oc get nodes
NAME                                          STATUS                        ROLES                  AGE     VERSION
ip-10-115-20-120.eu-west-2.compute.internal   Ready                         worker                 3h2m    v1.26.9+636f2be
ip-10-115-20-24.eu-west-2.compute.internal    NotReady,SchedulingDisabled   control-plane,master   3h13m   v1.26.9+636f2be
ip-10-115-20-24.ice-aws.cloud                 Ready                         control-plane,master   2m58s   v1.27.6+b49f9d1
ip-10-115-21-158.eu-west-2.compute.internal   Ready                         worker                 3h4m    v1.26.9+636f2be
ip-10-115-21-62.eu-west-2.compute.internal    Ready                         control-plane,master   3h13m   v1.26.9+636f2be
ip-10-115-22-124.eu-west-2.compute.internal   Ready                         control-plane,master   3h13m   v1.26.9+636f2be
ip-10-115-22-137.eu-west-2.compute.internal   Ready                         worker                 3h4m    v1.26.9+636f2be
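For reference, this is the usual way to list and approve pending CSRs (a sketch, not a paste of my exact session):

# Show outstanding CSRs
oc get csr

# Approve everything that has no status yet (i.e. still pending)
oc get csr -o go-template='{{range .items}}{{if not .status}}{{.metadata.name}}{{"\n"}}{{end}}{{end}}' | xargs oc adm certificate approve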

I'll continue trying to get the must-gather output, but I have a feeling it won't capture everything properly because of the incomplete state of the upgrade.

brynjellis-iit commented 2 months ago

Attaching two must-gathers: one after a fresh install of 4.13.0-0.okd-2023-10-28-065448 and one after an attempted upgrade to 4.14.0-0.okd-2023-11-12-042703. Their filenames should be self-explanatory: must-gather-4.13-pre-upgrade.tgz and must-gather-4.14-failed-upgrade.tgz
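For anyone reproducing this, a sketch of how such a bundle is typically produced (the directory and archive names are just examples matching the attachments):

# Collect a must-gather into a local directory, then tar it for attaching
oc adm must-gather --dest-dir=./must-gather-4.14-failed-upgrade
tar czf must-gather-4.14-failed-upgrade.tgz must-gather-4.14-failed-upgrade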