Closed kai-uwe-rommel closed 3 months ago
On Jan. 31st I upgraded an OKD cluster from 4.14.0-0.okd-2024-01-06-084517 to 4.14.0-0.okd-2024-01-26-175629 and, later the same day, to 4.15.0-0.okd-2024-01-27-070424. Yesterday I noticed that it can no longer start any new pods. These hang, and in the pod descriptions I can see:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedCreatePodSandBox 3m50s (x701 over 3h6m) kubelet (combined from similar events): Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_nova-gui-mr-496-was9-69559f9677-wmdqm_nova-gui_39f0159e-47ee-4c85-9b65-fc6e07bfd5ab_0(033bcd45dae1ec5a17a83dc895f73211ce28314329c3d79d690648bba8db89bc): error adding pod nova-gui_nova-gui-mr-496-was9-69559f9677-wmdqm to CNI network "multus-cni-network": plugin type="multus-shim" name="multus-cni-network" failed (add): CmdAdd (shim): CNI request failed with status 400: '&{ContainerID:033bcd45dae1ec5a17a83dc895f73211ce28314329c3d79d690648bba8db89bc Netns:/var/run/netns/ebc56b87-5fdb-4175-9b3a-254397058f27 IfName:eth0 Args:IgnoreUnknown=1;K8S_POD_NAMESPACE=nova-gui;K8S_POD_NAME=nova-gui-mr-496-was9-69559f9677-wmdqm;K8S_POD_INFRA_CONTAINER_ID=033bcd45dae1ec5a17a83dc895f73211ce28314329c3d79d690648bba8db89bc;K8S_POD_UID=39f0159e-47ee-4c85-9b65-fc6e07bfd5ab Path: StdinData:........deleted............]} ContainerID:"033bcd45dae1ec5a17a83dc895f73211ce28314329c3d79d690648bba8db89bc" Netns:"/var/run/netns/ebc56b87-5fdb-4175-9b3a-254397058f27" IfName:"eth0" Args:"IgnoreUnknown=1;K8S_POD_NAMESPACE=nova-gui;K8S_POD_NAME=nova-gui-mr-496-was9-69559f9677-wmdqm;K8S_POD_INFRA_CONTAINER_ID=033bcd45dae1ec5a17a83dc895f73211ce28314329c3d79d690648bba8db89bc;K8S_POD_UID=39f0159e-47ee-4c85-9b65-fc6e07bfd5ab" Path:"" ERRORED: error configuring pod [nova-gui/nova-gui-mr-496-was9-69559f9677-wmdqm] networking: Multus: [nova-gui/nova-gui-mr-496-was9-69559f9677-wmdqm/39f0159e-47ee-4c85-9b65-fc6e07bfd5ab]: error waiting for pod: Get "https://api-int.devqs.ars.de:6443/api/v1/namespaces/nova-gui/pods/nova-gui-mr-496-was9-69559f9677-wmdqm?timeout=1m0s": tls: failed to verify certificate: x509: certificate signed by unknown authority'
Something about certificate trust. Most of the failed pods were on one worker node, so I rebooted it to see whether restarting everything on it would refresh something and fix the issue. The node stayed NotReady, and in the kubelet logs I see many lines like this:
Feb 09 15:53:12 worker-03.devqs.ars.de kubenswrapper[2216]: I0209 15:53:12.367814 2216 csi_plugin.go:913] Failed to contact API server when waiting for CSINode publishing: Get "https://api-int.devqs.ars.de:6443/apis/storage.k8s.io/v1/csinodes/worker-03.devqs.ars.de": tls: failed to verify certificate: x509: certificate signed by unknown authority
And no containers are started on the node at all. I checked the certificate chain of the api-int.devqs.ars.de endpoint and found this:
Certificate chain
0 s:CN = api-int.devqs.ars.de
i:CN = openshift-kube-apiserver-operator_loadbalancer-serving-signer@1706695931
a:PKEY: rsaEncryption, 2048 (bit); sigalg: RSA-SHA256
v:NotBefore: Feb 8 16:18:43 2024 GMT; NotAfter: Mar 9 16:18:44 2024 GMT
-----BEGIN CERTIFICATE-----
MIIDjDCCAnSgAwIBAgIIdCb6ZJTU0fgwDQYJKoZIhvcNAQELBQAwUzFRME8GA1UE
...
sZBei3CpA9g1Eb19koHnijTJUGx66UF+jZSBKayohqJ1MRVcoVw32jsk600U904/
-----END CERTIFICATE-----
1 s:CN = openshift-kube-apiserver-operator_loadbalancer-serving-signer@1706695931
i:CN = openshift-kube-apiserver-operator_loadbalancer-serving-signer@1706695931
a:PKEY: rsaEncryption, 2048 (bit); sigalg: RSA-SHA256
v:NotBefore: Jan 31 10:12:10 2024 GMT; NotAfter: Jan 28 10:12:11 2034 GMT
-----BEGIN CERTIFICATE-----
MIIDizCCAnOgAwIBAgIIFez1jUvGUT4wDQYJKoZIhvcNAQELBQAwUzFRME8GA1UE
...
9eBgx/50FUYPrptnlwxK+R8qzspMgTBO2stomgUQgTfKHGF61JOYcbz6hBE1cjA=
-----END CERTIFICATE-----
---
Server certificate
subject=CN = api-int.devqs.ars.de
issuer=CN = openshift-kube-apiserver-operator_loadbalancer-serving-signer@1706695931
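For reference, the fields above (subject, issuer, validity window) can be read off a saved PEM with openssl. A minimal sketch, run here against a throwaway self-signed certificate since the real api-int endpoint is not reachable from everywhere; the CN is a placeholder:

```shell
# Generate a throwaway self-signed cert to stand in for the serving cert
# saved from api-int.devqs.ars.de (CN below is a placeholder).
openssl req -x509 -newkey rsa:2048 -nodes -keyout /tmp/apiint.key \
  -out /tmp/apiint.pem -subj "/CN=api-int.example.test" -days 30 2>/dev/null

# Print the same fields shown in the chain dump above:
# subject, issuer and the NotBefore/NotAfter validity window.
openssl x509 -in /tmp/apiint.pem -noout -subject -issuer -dates
```

Against the live endpoint one would typically feed the output of `openssl s_client -connect api-int.devqs.ars.de:6443 -showcerts` into `openssl x509` instead (which parses the first certificate in the stream).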
So the certificate of the endpoint was created yesterday (which is about when the trouble began). And the (internal) CA certificate is dated Jan. 31st, i.e. when the cluster was upgraded. So perhaps this was the first rollover of the api-int certificate after the cluster upgrade.
That looks a bit different - in the OCP bugs the kubelet had the wrong api-int CA, so the nodes were NotReady.
Check whether removing the multus certs on disk in /etc/cni/net.d makes it issue new, correct ones.
It's weird that the api-int CA was reissued at all - its duration is 10 years and it is meant to be refreshed at 8.
The worker node that I rebooted is NotReady. And I fear that the other nodes will become NotReady as well once I reboot them.
In /etc/cni/net.d on the node I only see a few JSON files, not certs. Did you mean /etc/cni/multus/certs instead? I can remove all files there ... but the multus pod/container is not running, so that will not do anything, right?
The primary problem at the moment is that the kubelet on the dead node does not work; see the message above. It cannot register itself with the API server and thus does not start any CRI-O containers.
With my limited knowledge of the internals, I think the kubelet needs to get the new api-int CA into its trust store, i.e. on the node OS. How could I do this?
> Did you mean /etc/cni/multus/certs instead? I can remove all files there
Yes, correct. Let's not remove anything yet; in the referenced OCP bug it was sufficient to update the CA in the kubelet's kubeconfig.
> I think the kubelet needs to get the new api-int CA into its trust store
Yes, https://github.com/openshift/machine-config-operator/pull/4106
Overnight, one more node (one of the master nodes) became NotReady as well.
@vrutkovs reading that PR I think I sort of understand what they are talking about, but I can't work out what to do with my failed cluster right now... The /var/lib/kubelet/kubeconfig on the node is still the one from the initial cluster creation on March 10, 2021. So I would need to replace the certificate-authority-data in this file. I guess some MachineConfig was supposed to do this but didn't. Where is the api-int CA certificate stored that I would need to put there?
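For what it's worth, the certificate-authority-data swap can be scripted roughly like this. This is only a sketch against a stand-in file: on the node the real file is /var/lib/kubelet/kubeconfig, and /tmp/new-ca.pem is a placeholder for wherever the new signing CA was extracted to. (`base64 -w0` is the GNU coreutils flag for unwrapped output.)

```shell
# Stand-in for /var/lib/kubelet/kubeconfig; the real file carries the same
# certificate-authority-data field (base64 of the CA bundle PEM).
cat > /tmp/kubeconfig <<'EOF'
apiVersion: v1
kind: Config
clusters:
- name: local
  cluster:
    server: https://api-int.devqs.ars.de:6443
    certificate-authority-data: T0xEIENBIFBFTQo=
EOF

# Placeholder content for the new signing CA extracted from the cluster.
printf 'NEW CA PEM\n' > /tmp/new-ca.pem

# Decode the current bundle, append the new CA, re-encode, write back.
grep certificate-authority-data /tmp/kubeconfig | awk '{print $2}' \
  | base64 -d > /tmp/bundle.pem
cat /tmp/new-ca.pem >> /tmp/bundle.pem
NEW=$(base64 -w0 < /tmp/bundle.pem)
sed -i "s|certificate-authority-data:.*|certificate-authority-data: ${NEW}|" /tmp/kubeconfig
```

After this, decoding the field again should show the old bundle with the new CA appended; on a real node the kubelet would then be restarted.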
Overall there seem to be two problems:
Looking at the CMs in openshift-kube-apiserver namespace I see:
kube-apiserver-server-ca 1 2y337d
kube-apiserver-server-ca-622 1 10d
kube-apiserver-server-ca-623 1 7d17h
kube-apiserver-server-ca-624 1 7d17h
kube-apiserver-server-ca-625 1 7d17h
kube-apiserver-server-ca-626 1 5d7h
kube-apiserver-server-ca-627 1 17h
kube-apiserver-server-ca-628 1 8h
kube-root-ca.crt 1 2y327d
kubelet-serving-ca 1 2y337d
kubelet-serving-ca-622 1 10d
kubelet-serving-ca-623 1 7d17h
kubelet-serving-ca-624 1 7d17h
kubelet-serving-ca-625 1 7d17h
kubelet-serving-ca-626 1 5d7h
kubelet-serving-ca-627 1 17h
kubelet-serving-ca-628 1 8h
So it is one of these two? The 2y337d ones are from the initial cluster creation. Then I see the rolled-over ones from 10 days ago, when the upgrade to 4.15 was done. But since then it seems to have rolled over several more times. So is there a third problem? Even if I pick the right one of these now - wouldn't the problem quickly come back if it keeps rolling over?
> Where is the api-int CA certificate stored that I would need to put there?
I checked the loadbalancer-serving-ca CM and it contains five (!) CA certificates. One is the original one from 03/10/2021, and the other four are all dated 01/31/2024 10:12:00, i.e. when I upgraded the cluster. These four all differ ... it's the topmost one that signed the current api-int certificate. I can give this a try and put that one into the kubeconfig file.
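Finding which of several bundled CAs signed a given certificate can be done mechanically: split the bundle into individual certs and try `openssl verify` with each piece. A sketch using generated stand-in certs (the real inputs would be the loadbalancer-serving-ca bundle and the serving cert saved from api-int):

```shell
# Stand-ins: one CA that actually signed a leaf cert, plus an unrelated CA,
# concatenated the way the loadbalancer-serving-ca CM bundles them.
openssl req -x509 -newkey rsa:2048 -nodes -keyout /tmp/signer.key \
  -out /tmp/signer.pem -subj "/CN=signer" -days 30 2>/dev/null
openssl req -x509 -newkey rsa:2048 -nodes -keyout /tmp/other.key \
  -out /tmp/other.pem -subj "/CN=other" -days 30 2>/dev/null
openssl req -newkey rsa:2048 -nodes -keyout /tmp/leaf.key \
  -out /tmp/leaf.csr -subj "/CN=api-int.example.test" 2>/dev/null
openssl x509 -req -in /tmp/leaf.csr -CA /tmp/signer.pem -CAkey /tmp/signer.key \
  -CAcreateserial -out /tmp/leaf.pem -days 30 2>/dev/null

cat /tmp/other.pem /tmp/signer.pem > /tmp/ca-bundle.pem

# Split the bundle into one file per certificate and test each one.
csplit -sz -f /tmp/piece- /tmp/ca-bundle.pem '/BEGIN CERTIFICATE/' '{*}'
for c in /tmp/piece-*; do
  openssl verify -CAfile "$c" /tmp/leaf.pem >/dev/null 2>&1 && echo "signed by: $c"
done
```

Only the piece containing the actual signer verifies the leaf; the same loop run against the real bundle would identify which of the five CAs to put into the kubeconfig.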
I had the same problem. I added the new api-int CA cert (the one currently in use) to /var/lib/kubelet/kubeconfig and restarted the kubelet. Then I had to add this new CA cert to the kube-root-ca.crt ConfigMap in the openshift-multus namespace.
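The ConfigMap edit described above would look roughly like this after the change (a hypothetical sketch; the PEM blocks are placeholders for the actual certificates):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: kube-root-ca.crt
  namespace: openshift-multus
data:
  ca.crt: |
    -----BEGIN CERTIFICATE-----
    ...existing root CA, unchanged...
    -----END CERTIFICATE-----
    -----BEGIN CERTIFICATE-----
    ...new loadbalancer-serving-signer CA, appended...
    -----END CERTIFICATE-----
```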
After adding the new api-int CA to /var/lib/kubelet/kubeconfig on the one NotReady worker node, that node came back Ready. Meanwhile I had two NotReady master nodes and one more NotReady worker node. Doing the same procedure on them did not help, unfortunately; the kubelet on these still logs "unknown authority". This is very strange, because the ca-bundle.crt in the kubeconfig definitely contains the issuing CA.
One of the two NotReady master nodes became Ready again. But one master and one worker are still stuck in NotReady, despite the new api-int signing CA being present in the kubeconfig CA bundle. The cluster also has over 100 pods stuck in either Terminating or ContainerCreating state. Those which do not start show the FailedCreatePodSandBox (Multus) error. I have already added the new CA to the kube-root-ca.crt CM in openshift-multus; this seems to have triggered a restart of the multus pods, but those are now stuck themselves. It looks like my cluster may already be damaged too much. :-(
The other NotReady worker node is also Ready again. But one master insists on remaining NotReady. The reason is that whenever I restart the kubelet on it, it resets /var/lib/kubelet/kubeconfig to the previous content: I add the new CA, restart the kubelet, and the file is reset - regardless of whether I also do a systemctl daemon-reload. What could cause this? Where does it get the old file content from?
It looks like it restores /var/lib/kubelet/kubeconfig from /etc/kubernetes/kubeconfig, and the new CA was not yet present there. I added it there as well, and now the change persists. Unfortunately the master node still stays NotReady. The kubelet now logs (among many other things, but I think this is the relevant part):
Feb 11 12:12:40 master-01.devqs.ars.de kubenswrapper[3873]: E0211 12:12:40.265448 3873 kubelet_node_status.go:95] "Unable to register node with API server" err="nodes is forbidden: User \"system:anonymous\" cannot create resource \"nodes\" in API group \"\" at the cluster scope" node="master-01.devqs.ars.de"
I'm stumped now.
The good news is that the cluster can now start new pods on all other nodes. If only I could get this master node working again ... does anyone have an idea?
The solution is here: https://issues.redhat.com/browse/OCPBUGS-25821
It is still unclear to me whether the now released OCP 4.15 contains fixes for the two issues that led to this mess, and whether the two subsequent OKD releases (or at least the latest one) contain fixes.
Here is a must-gather from the now fixed cluster: https://domino.ars.de/file/must-gather.local.4696044669297168135.tar.gz The download link is valid for 7 days from now.
> It is unclear to me yet, if the now released OCP 4.15 contains fixes for the two issues that led to the mess and if the two following OKD release (or at least the latest) do contain fixes.
I upgraded an older (repeatedly upgraded) cluster last week from 4.14.0-0.okd-2024-01-26-175629 to 4.15.0-0.okd-2024-02-23-163410, and it was affected by the problem - so at least that version was not fully fixed yet. At least the manual steps from https://issues.redhat.com/browse/OCPBUGS-25821?focusedId=24126829&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-24126829 seemed to work without further issues.
This is fixed by https://github.com/openshift/cluster-kube-apiserver-operator/pull/1653 in https://github.com/okd-project/okd/releases/tag/4.15.0-0.okd-2024-03-10-010116
Luckily, it seems to affect only clusters originally created before 4.7.
@vrutkovs, interesting. Indeed of our two own OKD clusters one was originally created with 4.6 (affected) and the other with 4.7 (not affected).
Describe the bug
Looks like we hit https://issues.redhat.com/browse/OCPBUGS-25821. I will enter more details in a following comment; they would not fit properly here otherwise. Basically, there seems to be a trust anchor problem with the internally created certificate for the api-int endpoint. Apparently there is a new internal CA after the upgrade to 4.15, and it is not trusted.
Version
4.15.0-0.okd-2024-01-27-070424

How reproducible
One cluster upgraded so far, and that one now failed - so 100% ...

Log bundle
A must-gather does not run, unfortunately. Let me know if/which logs to collect manually.