brynjellis-iit closed this issue 3 months ago
Adding this image (6665dab0d2781f39a26f5dd800350e95.jpg), which shows the changes to the master nodes' Hostname and InternalDNS values during the upgrade from 4.13.0-0.okd-2023-10-28-065448 to 4.14.0-0.okd-2023-11-12-042703.
I spun up a new cluster to get the must-gather and noticed something. During the upgrade to 4.14.0-0.okd-2023-11-12-042703 there were pending CSRs for a new control plane node! When I accepted them, a fourth control-plane node appeared whose name carries the suffix from the DHCP option set 'domain-name' value (see the commands after the node listing below).
oc get nodes
NAME                                          STATUS                        ROLES                  AGE     VERSION
ip-10-115-20-120.eu-west-2.compute.internal   Ready                         worker                 3h2m    v1.26.9+636f2be
ip-10-115-20-24.eu-west-2.compute.internal    NotReady,SchedulingDisabled   control-plane,master   3h13m   v1.26.9+636f2be
ip-10-115-20-24.ice-aws.cloud                 Ready                         control-plane,master   2m58s   v1.27.6+b49f9d1
ip-10-115-21-158.eu-west-2.compute.internal   Ready                         worker                 3h4m    v1.26.9+636f2be
ip-10-115-21-62.eu-west-2.compute.internal    Ready                         control-plane,master   3h13m   v1.26.9+636f2be
ip-10-115-22-124.eu-west-2.compute.internal   Ready                         control-plane,master   3h13m   v1.26.9+636f2be
ip-10-115-22-137.eu-west-2.compute.internal   Ready                         worker                 3h4m    v1.26.9+636f2be
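For reference, this is roughly how I listed and approved the pending CSRs (standard oc commands; the go-template one-liner is just the usual pattern for approving everything pending, and the CSR name is a placeholder):

# list certificate signing requests and look for ones in Pending state
oc get csr

# approve a single pending CSR (csr-xxxxx is a placeholder)
oc adm certificate approve csr-xxxxx

# or approve every currently pending CSR in one go
oc get csr -o go-template='{{range .items}}{{if not .status}}{{.metadata.name}}{{"\n"}}{{end}}{{end}}' | xargs --no-run-if-empty oc adm certificate approve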
I'll continue trying to get the must-gather output, but I have a feeling it won't be able to collect everything properly because of the incomplete state of the upgrade.
Attaching 2 must-gathers: one after a fresh install of 4.13.0-0.okd-2023-10-28-065448 and one after an attempt to upgrade to 4.14.0-0.okd-2023-11-12-042703. Their filenames should be self-explanatory. must-gather-4.13-pre-upgrade.tgz must-gather-4.14-failed-upgrade.tgz
Hi,
We are not working on FCOS builds of OKD any more. Please see these documents:
https://okd.io/blog/2024/06/01/okd-future-statement
https://okd.io/blog/2024/07/30/okd-pre-release-testing
Please test with the OKD SCOS nightlies and file a new issue as needed.
Many thanks,
Jaime
Two days ago I had to rebuild a cluster (thankfully I hadn't yet released it for live deployments). The reason: during an upgrade from 4.13.0-0.okd-2023-10-28-065448 to 4.14.0-0.okd-2023-11-12-042703, one of the masters started complaining that it was unable to register the node with the API server.
On inspection, I could see that the name of the node had changed. Except it hadn't!
I hit this problem a year ago but it wasn't as serious because it was on a worker, not a master.
I'm running on AWS using IPI, and when the initial install completes all my nodes have names of the form ip-x-x-x-x.eu-west-2.compute.internal. I should also say that my VPC uses a DHCP option set where the 'domain-name' is set to some.domain.
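For anyone wanting to check their own VPC, this is roughly how the 'domain-name' value can be confirmed (a sketch using the AWS CLI; the VPC and DHCP options IDs are placeholders):

# find the DHCP options set attached to the VPC (vpc-0123456789abcdef0 is a placeholder)
aws ec2 describe-vpcs --vpc-ids vpc-0123456789abcdef0 --query 'Vpcs[0].DhcpOptionsId' --output text

# then inspect its domain-name value (dopt-0123456789abcdef0 is a placeholder)
aws ec2 describe-dhcp-options --dhcp-options-ids dopt-0123456789abcdef0 --query 'DhcpOptions[0].DhcpConfigurations'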
When I look at the kubernetes.io/hostname label on the nodes, it is set to ip-x-x-x-x.some.domain. When the issue arises and the node can't register with the API server, the error message says:
kubelet_node_status.go:72] "Attempting to register node" node="ip-10-38-20-100.some.domain"
Jan 17 12:00:32.044690 ip-10-38-20-100 kubenswrapper[1408]: E0117 12:00:32.044672 1408 kubelet_node_status.go:94] "Unable to register node with API server" err="nodes \"ip-10-38-20-100.some.domain\" is forbidden: node \"ip-10-38-20-100.eu-west-2.compute.internal\" is not allowed to modify node \"ip-10-38-20-100.some.domain\"" node="ip-10-38-20-100.some.domain"
So, suddenly, a node that was clearly registered at install time with a name ending in eu-west-2.compute.internal has picked up the domain from the DHCP option set (some.domain) and can't register, presumably because all the configuration held in etcd and in the config files refers to the name it was originally given.
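An easy way to see the mismatch side by side (just a sketch of the kind of check I ran):

# show each node's API object name with its kubernetes.io/hostname label as an extra column
oc get nodes -L kubernetes.io/hostname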
I should also mention that since its original installation I performed 3 x 4.13 upgrades to take it from 4.13.0-0.okd-2023-06-04-080300 to 4.13.0-0.okd-2023-10-28-065448, and they were all successful.
During testing, I decided to try the install (of 4.13.0-0.okd-2023-10-28-065448) and the upgrade (to 4.14.0-0.okd-2023-11-12-042703) in a VPC that uses the default 'domain-name' value in the DHCP option set. When I did this, all the upgrades completed successfully, all the way up to the latest 4.15 on the stable channel.
That confirmed to me that the issue is around a change in the way the hostname for masters/workers is obtained in the later versions.
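To dig into that on a live node, something along these lines shows what the host itself thinks its hostname is (a sketch; the node name is a placeholder and the exact behaviour may differ between FCOS releases):

# open a debug shell on an affected master (node name is a placeholder)
oc debug node/ip-10-38-20-100.eu-west-2.compute.internal

# inside the debug pod, inspect the host's view of its hostname
chroot /host hostnamectl status
# /etc/hostname is often empty or absent on FCOS, in which case the transient hostname comes from DHCP/cloud metadata
chroot /host cat /etc/hostname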
Version: 4.14.0-0.okd-2023-11-12-042703 on AWS using IPI
How reproducible: 100%
Log bundle: I've destroyed the cluster (for cost reasons), so I'll build a new one, reproduce the problem, and attach a must-gather here when done.
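For completeness, I'll be collecting the bundle with the usual command (the destination directory is just an example):

# collect the log bundle into a local directory
oc adm must-gather --dest-dir=./must-gather-4.14-failed-upgrade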