okd-project / okd

The self-managing, auto-upgrading, Kubernetes distribution for everyone
https://okd.io
Apache License 2.0

cluster dead several days after upgrading to 4.15.0-0.okd-2024-01-27-070424 #1881

Closed kai-uwe-rommel closed 3 months ago

kai-uwe-rommel commented 4 months ago

Describe the bug
Looks like we hit https://issues.redhat.com/browse/OCPBUGS-25821. I will add more details in a following comment, as they would not fit properly here. Basically, there seems to be a trust anchor problem with the internally created certificate for the api-int endpoint. Apparently there is a new internal CA after the upgrade to 4.15, and it is not trusted.

Version
4.15.0-0.okd-2024-01-27-070424

How reproducible
One cluster upgraded so far, and that one has now failed - so 100% ...

Log bundle
A must-gather does not run, unfortunately. Let me know if/which logs to collect manually.

kai-uwe-rommel commented 4 months ago

On Jan. 31st I upgraded an OKD cluster from 4.14.0-0.okd-2024-01-06-084517 to 4.14.0-0.okd-2024-01-26-175629, and later on the same day to 4.15.0-0.okd-2024-01-27-070424. I just noticed that since yesterday it has not been able to start any new pods. They hang, and in the pod descriptions I see:

  Type     Reason                  Age                     From     Message
  ----     ------                  ----                    ----     -------
  Warning  FailedCreatePodSandBox  3m50s (x701 over 3h6m)  kubelet  (combined from similar events): Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_nova-gui-mr-496-was9-69559f9677-wmdqm_nova-gui_39f0159e-47ee-4c85-9b65-fc6e07bfd5ab_0(033bcd45dae1ec5a17a83dc895f73211ce28314329c3d79d690648bba8db89bc): error adding pod nova-gui_nova-gui-mr-496-was9-69559f9677-wmdqm to CNI network "multus-cni-network": plugin type="multus-shim" name="multus-cni-network" failed (add): CmdAdd (shim): CNI request failed with status 400: '&{ContainerID:033bcd45dae1ec5a17a83dc895f73211ce28314329c3d79d690648bba8db89bc Netns:/var/run/netns/ebc56b87-5fdb-4175-9b3a-254397058f27 IfName:eth0 Args:IgnoreUnknown=1;K8S_POD_NAMESPACE=nova-gui;K8S_POD_NAME=nova-gui-mr-496-was9-69559f9677-wmdqm;K8S_POD_INFRA_CONTAINER_ID=033bcd45dae1ec5a17a83dc895f73211ce28314329c3d79d690648bba8db89bc;K8S_POD_UID=39f0159e-47ee-4c85-9b65-fc6e07bfd5ab Path: StdinData:........deleted............]} ContainerID:"033bcd45dae1ec5a17a83dc895f73211ce28314329c3d79d690648bba8db89bc" Netns:"/var/run/netns/ebc56b87-5fdb-4175-9b3a-254397058f27" IfName:"eth0" Args:"IgnoreUnknown=1;K8S_POD_NAMESPACE=nova-gui;K8S_POD_NAME=nova-gui-mr-496-was9-69559f9677-wmdqm;K8S_POD_INFRA_CONTAINER_ID=033bcd45dae1ec5a17a83dc895f73211ce28314329c3d79d690648bba8db89bc;K8S_POD_UID=39f0159e-47ee-4c85-9b65-fc6e07bfd5ab" Path:"" ERRORED: error configuring pod [nova-gui/nova-gui-mr-496-was9-69559f9677-wmdqm] networking: Multus: [nova-gui/nova-gui-mr-496-was9-69559f9677-wmdqm/39f0159e-47ee-4c85-9b65-fc6e07bfd5ab]: error waiting for pod: Get "https://api-int.devqs.ars.de:6443/api/v1/namespaces/nova-gui/pods/nova-gui-mr-496-was9-69559f9677-wmdqm?timeout=1m0s": tls: failed to verify certificate: x509: certificate signed by unknown authority'

Something about certificate trust. Most of the failed pods were on one worker node, so I rebooted it to see whether restarting everything on it would refresh whatever was broken. The node stayed NotReady, and in the kubelet logs I see many lines like this:

Feb 09 15:53:12 worker-03.devqs.ars.de kubenswrapper[2216]: I0209 15:53:12.367814    2216 csi_plugin.go:913] Failed to contact API server when waiting for CSINode publishing: Get "https://api-int.devqs.ars.de:6443/apis/storage.k8s.io/v1/csinodes/worker-03.devqs.ars.de": tls: failed to verify certificate: x509: certificate signed by unknown authority

And no containers are started on the node at all. I checked the certificate chain of the api-int.devqs.ars.de endpoint and found this:

Certificate chain
 0 s:CN = api-int.devqs.ars.de
   i:CN = openshift-kube-apiserver-operator_loadbalancer-serving-signer@1706695931
   a:PKEY: rsaEncryption, 2048 (bit); sigalg: RSA-SHA256
   v:NotBefore: Feb  8 16:18:43 2024 GMT; NotAfter: Mar  9 16:18:44 2024 GMT
-----BEGIN CERTIFICATE-----
MIIDjDCCAnSgAwIBAgIIdCb6ZJTU0fgwDQYJKoZIhvcNAQELBQAwUzFRME8GA1UE
...
sZBei3CpA9g1Eb19koHnijTJUGx66UF+jZSBKayohqJ1MRVcoVw32jsk600U904/
-----END CERTIFICATE-----
 1 s:CN = openshift-kube-apiserver-operator_loadbalancer-serving-signer@1706695931
   i:CN = openshift-kube-apiserver-operator_loadbalancer-serving-signer@1706695931
   a:PKEY: rsaEncryption, 2048 (bit); sigalg: RSA-SHA256
   v:NotBefore: Jan 31 10:12:10 2024 GMT; NotAfter: Jan 28 10:12:11 2034 GMT
-----BEGIN CERTIFICATE-----
MIIDizCCAnOgAwIBAgIIFez1jUvGUT4wDQYJKoZIhvcNAQELBQAwUzFRME8GA1UE
...
9eBgx/50FUYPrptnlwxK+R8qzspMgTBO2stomgUQgTfKHGF61JOYcbz6hBE1cjA=
-----END CERTIFICATE-----
---
Server certificate
subject=CN = api-int.devqs.ars.de
issuer=CN = openshift-kube-apiserver-operator_loadbalancer-serving-signer@1706695931

So the certificate of the endpoint was created yesterday (and that is about when the trouble began). And the (internal) CA certificate is dated Jan. 31st, i.e. when the cluster was upgraded. So perhaps this was the first rollover of the api-int certificate after the cluster upgrade.
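
(For anyone checking their own cluster: a chain like the one above can be dumped with openssl. The hostname below is this cluster's api-int endpoint, and the exact flags are only an assumption about how the output above was captured.)

# Print the full serving chain presented by the internal API endpoint
echo | openssl s_client -connect api-int.devqs.ars.de:6443 -showcerts 2>/dev/null

# Just the leaf certificate's subject, issuer and validity window
echo | openssl s_client -connect api-int.devqs.ars.de:6443 2>/dev/null \
  | openssl x509 -noout -subject -issuer -dates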

vrutkovs commented 4 months ago

That looks a bit different - in the OCP bug the kubelet had the wrong api-int CA and so the nodes were NotReady. Check whether removing the multus certs - on disk in /etc/cni/net.d - makes it issue new, correct ones.

It's weird that the api-int CA was reissued at all - its duration is 10 years and it is meant to be refreshed at 8.

kai-uwe-rommel commented 4 months ago

The worker node that I rebooted is NotReady. And I fear that other nodes will become NotReady as well once I reboot them.

In /etc/cni/net.d on the node I only see a few JSON files, no certs. Did you mean /etc/cni/multus/certs instead? I can remove all files there ... but the multus pod/container does not run, so it will not do anything, right?

The primary problem at the moment is that on the dead node the kubelet does not work. See the message above. It can't register itself with the API server and thus does not start any CRI-O containers.

kai-uwe-rommel commented 4 months ago

With my limited knowledge of the internals, I think the kubelet needs to get the new api-int CA into its trust store, i.e. onto the node OS. How can I do this?

vrutkovs commented 4 months ago

Did you mean /etc/cni/multus/certs instead? I can remove all files there

Yes, correct. Let's not remove anything yet; in the referenced OCP bug it was sufficient to update the kubelet kubeconfig's CA.

I think the kubelet needs to get the new api-int CA into its trust store

Yes, https://github.com/openshift/machine-config-operator/pull/4106
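
(A quick way to see which CA certificates the kubelet kubeconfig currently carries - a sketch that assumes the CA bundle is embedded as certificate-authority-data; if the kubeconfig instead references a file via certificate-authority, inspect that file directly.)

# Decode the embedded CA bundle and split it into individual certificates
grep certificate-authority-data /var/lib/kubelet/kubeconfig \
  | awk '{print $2}' | base64 -d \
  | awk '/BEGIN CERTIFICATE/{n++} n{print > ("/tmp/kubelet-ca-" n ".pem")}'

# Show subject and expiry of each CA the kubelet trusts for api-int
for f in /tmp/kubelet-ca-*.pem; do openssl x509 -in "$f" -noout -subject -enddate; done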

kai-uwe-rommel commented 4 months ago

Overnight one more node (one of the master nodes) became NotReady as well.

kai-uwe-rommel commented 4 months ago

@vrutkovs reading that PR I think I sort of understand what they are talking about, but I can't work out what to do with my failed cluster right now... The /var/lib/kubelet/kubeconfig on the node is still the one from the initial cluster creation on March 10, 2021. So I would need to replace the certificate-authority-data in this file. I guess some MachineConfig was supposed to do this but didn't. Where is the api-int CA certificate stored that I would need to put there?

Overall there seem to be two problems: the api-int signing CA was reissued at all during the upgrade (even though it should be valid for 10 years), and the new CA was not propagated to the kubelet kubeconfig on the nodes.

kai-uwe-rommel commented 4 months ago

Looking at the CMs in the openshift-kube-apiserver namespace I see:

kube-apiserver-server-ca                    1      2y337d  
kube-apiserver-server-ca-622                1      10d     
kube-apiserver-server-ca-623                1      7d17h   
kube-apiserver-server-ca-624                1      7d17h   
kube-apiserver-server-ca-625                1      7d17h   
kube-apiserver-server-ca-626                1      5d7h    
kube-apiserver-server-ca-627                1      17h     
kube-apiserver-server-ca-628                1      8h      
kube-root-ca.crt                            1      2y327d  
kubelet-serving-ca                          1      2y337d  
kubelet-serving-ca-622                      1      10d     
kubelet-serving-ca-623                      1      7d17h   
kubelet-serving-ca-624                      1      7d17h   
kubelet-serving-ca-625                      1      7d17h   
kubelet-serving-ca-626                      1      5d7h    
kubelet-serving-ca-627                      1      17h     
kubelet-serving-ca-628                      1      8h      

So it is one of these two? The 2y337d ones are from the initial cluster creation. Then I see the rolled-over ones from 10 days ago, when the upgrade to 4.15 was done. But then it seems to have rolled over several times again. So is there a third problem? If I pick the right one of these now - wouldn't the problem come back quickly if it keeps rolling over?
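
(One way to narrow this down is to dump the subject, issuer and validity dates of every certificate in such a bundle and compare them against the issuer shown by the api-int endpoint - a sketch using oc extract so the exact data key name does not matter.)

# Split the server CA bundle into individual PEM files
oc extract configmap/kube-apiserver-server-ca -n openshift-kube-apiserver --to=- \
  | awk '/BEGIN CERTIFICATE/{n++} n{print > ("/tmp/server-ca-" n ".pem")}'

# Print subject, issuer and validity window of each certificate in the bundle
for f in /tmp/server-ca-*.pem; do openssl x509 -in "$f" -noout -subject -issuer -dates; done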

vrutkovs commented 4 months ago

Where is the api-int CA certificate stored that I would need to put there?

See kube-apiserver-internal-load-balancer-serving

kai-uwe-rommel commented 4 months ago

I checked the loadbalancer-serving-ca CM and it contains five (!) CA certificates. One is the original one from 03/10/2021, and the other four are all dated 01/31/2024 10:12:00, i.e. when I upgraded the cluster. These four all differ ... it's the topmost one that signed the current api-int certificate. I can give this a try and put that one into the kubeconfig file.
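
(To confirm which of those five CA certificates actually signed the current api-int serving certificate, the serving certificate can be verified against each of them. A sketch; the namespace openshift-kube-apiserver-operator for the loadbalancer-serving-ca configmap is an assumption, it is not stated in this thread.)

# Fetch the currently served api-int certificate
echo | openssl s_client -connect api-int.devqs.ars.de:6443 2>/dev/null \
  | openssl x509 > /tmp/api-int.pem

# Split the loadbalancer-serving-ca bundle into individual CA certificates
oc extract configmap/loadbalancer-serving-ca -n openshift-kube-apiserver-operator --to=- \
  | awk '/BEGIN CERTIFICATE/{n++} n{print > ("/tmp/lb-ca-" n ".pem")}'

# The CA that reports "OK" here is the one that signed the serving certificate
for f in /tmp/lb-ca-*.pem; do echo "== $f"; openssl verify -CAfile "$f" /tmp/api-int.pem; done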

sitzm commented 4 months ago

I had the same problem. I added the new api-int CA cert (the one currently in use) to /var/lib/kubelet/kubeconfig and restarted the kubelet. Then I had to add this new CA cert to the kube-root-ca.crt configmap in the openshift-multus namespace.
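
(A rough sketch of those manual steps, assuming the CA bundle is embedded in the kubeconfig as certificate-authority-data and that the new signing CA has already been saved to the hypothetical file /tmp/new-api-int-ca.pem. Back up both files before editing.)

# 1) On the affected node: append the new CA to the kubelet kubeconfig's CA bundle.
#    Note: a later comment in this thread shows that /var/lib/kubelet/kubeconfig is
#    reset from /etc/kubernetes/kubeconfig on kubelet restart, so the same edit is
#    needed there as well.
grep certificate-authority-data /var/lib/kubelet/kubeconfig \
  | awk '{print $2}' | base64 -d > /tmp/kubelet-ca.pem
cat /tmp/new-api-int-ca.pem >> /tmp/kubelet-ca.pem
NEW_DATA=$(base64 -w0 /tmp/kubelet-ca.pem)
sed -i "s|certificate-authority-data: .*|certificate-authority-data: ${NEW_DATA}|" \
  /var/lib/kubelet/kubeconfig
systemctl restart kubelet

# 2) From a working client: append the new CA to the ca.crt key of the
#    kube-root-ca.crt configmap in the openshift-multus namespace.
oc edit configmap kube-root-ca.crt -n openshift-multus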

kai-uwe-rommel commented 4 months ago

After adding the new api-int CA to /var/lib/kubelet/kubeconfig on the one NotReady worker node, that node came up as Ready again. Meanwhile I had two NotReady master nodes and one more NotReady worker node. Doing the same procedure on them did not help, unfortunately. The kubelet on these still logs "unknown authority". This is very strange because the ca-bundle.crt in the kubeconfig definitely contains the issuing CA.

kai-uwe-rommel commented 4 months ago

One of the two NotReady master nodes became Ready again. But one master and one worker are still stuck in NotReady despite the new api-int signing CA being present in the kubeconfig CA bundle. The cluster also has over 100 pods stuck in either Terminating or ContainerCreating state. The ones that do not start show the FailedCreatePodSandBox error (Multus). I have already added the new CA to the kube-root-ca.crt CM in openshift-multus. It looks like this triggered a restart of the multus pods, but those are now stuck themselves. It looks like my cluster may already be too badly damaged. :-(

kai-uwe-rommel commented 4 months ago

The other NotReady worker node is also Ready again. But one master insists on remaining NotReady. The reason is that whenever I restart the kubelet on it, the /var/lib/kubelet/kubeconfig file is reset to its previous content. So I add the new CA to it, restart the kubelet, and that resets the file - regardless of whether I also do a systemctl daemon-reload. What could cause this? Where does it get the old file content from?

kai-uwe-rommel commented 4 months ago

It looks like /var/lib/kubelet/kubeconfig is reset from /etc/kubernetes/kubeconfig, and there the new CA was not yet present. I added it there and now the change persists. Unfortunately, the master node still stays NotReady. Now the kubelet logs (among many other things, but I think this is the relevant part):

Feb 11 12:12:40 master-01.devqs.ars.de kubenswrapper[3873]: E0211 12:12:40.265448 3873 kubelet_node_status.go:95] "Unable to register node with API server" err="nodes is forbidden: User \"system:anonymous\" cannot create resource \"nodes\" in API group \"\" at the cluster scope" node="master-01.devqs.ars.de"

I'm stumped now.

kai-uwe-rommel commented 4 months ago

The good news is that the cluster can now start new pods on all other nodes. If only I could get this master node working again ... Anyone have an idea?

kai-uwe-rommel commented 4 months ago

The solution is described there: https://issues.redhat.com/browse/OCPBUGS-25821

kai-uwe-rommel commented 4 months ago

It is still unclear to me whether the now-released OCP 4.15 contains fixes for the two issues that led to this mess, and whether the two following OKD releases (or at least the latest one) contain these fixes.

kai-uwe-rommel commented 4 months ago

Here is a must-gather from the now-fixed cluster: https://domino.ars.de/file/must-gather.local.4696044669297168135.tar.gz (the download link is valid for 7 days from now).

ssams commented 3 months ago

It is still unclear to me whether the now-released OCP 4.15 contains fixes for the two issues that led to this mess, and whether the two following OKD releases (or at least the latest one) contain these fixes.

I upgraded an older (repeatedly upgraded) cluster last week from 4.14.0-0.okd-2024-01-26-175629 to 4.15.0-0.okd-2024-02-23-163410, and it was affected by the problem - so at least that version was not fully fixed yet. The manual steps from https://issues.redhat.com/browse/OCPBUGS-25821?focusedId=24126829&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-24126829 did, however, seem to work without further issues.

vrutkovs commented 3 months ago

This is fixed by https://github.com/openshift/cluster-kube-apiserver-operator/pull/1653 in https://github.com/okd-project/okd/releases/tag/4.15.0-0.okd-2024-03-10-010116

Luckily, it seems to affect only clusters created before 4.7.

kai-uwe-rommel commented 3 months ago

@vrutkovs, interesting. Indeed, of our two own OKD clusters, one was originally created with 4.6 (affected) and the other with 4.7 (not affected).