okd-project / okd

The self-managing, auto-upgrading, Kubernetes distribution for everyone
https://okd.io
Apache License 2.0
1.74k stars, 295 forks

node hangs (FCOS) during cluster deployment and later node reboots #536

Closed kai-uwe-rommel closed 3 years ago

kai-uwe-rommel commented 3 years ago

Describe the bug Today I installed my first OKD 4.7 cluster. I had installed many 4.5 and 4.6 clusters before, and also several OCP clusters. I needed two attempts for the installation, and even the second attempt succeeded only after manual intervention.

My first attempt timed out in the bootstrap process, and looking for the reason I found that one of the master nodes had somehow hung. It was reachable via IP, but I could neither ssh into it nor log in via the VM console; any login was refused because of the "nologin" state. I reset the VM and after it booted up, I could ssh into it. However, by that time the cluster bootstrap had already timed out.

For the second attempt I started over from scratch (well, from initial VM snapshots) and monitored the progress closely. After the FCOS rebase and reboot, once again one of the master nodes (this time a different one) was stuck in the "nologin" state and did not progress. So I again reset the VM, and after it came up it was no longer locked and progressed, so the cluster bootstrap finally succeeded.

After the cluster installation was complete (3 workers), I always run a defined automated customization process that includes various items such as replacing API and app ingress certificates, configuring LDAP authentication, and so on. Some of the steps (like the certificate replacement) cause the machine config operator to update and then reboot all nodes. During such a process, one worker node also hung in the "nologin" state after rebooting. It became accessible and progressed only after I again reset this VM.

So I assume such "nologin" hangs/lockouts of nodes could happen anytime later, whenever node reboots are performed.

Of course, I cannot predictably reproduce the problem, given this experience... but the problem appeared way too often.

What log files do you need? Will a must-gather help for this at all?

Version 4.7.0-0.okd-2021-02-25 vSphere UPI

How reproducible Not explicitly reproducible, but often. See above.

Log bundle Please advise if this would be helpful.

vrutkovs commented 3 years ago

also several OCP clusters

Is it the same error in OCP?

kai-uwe-rommel commented 3 years ago

No. I had done an OCP 4.7 installation a couple of days ago and it went flawlessly.

vrutkovs commented 3 years ago

Please attach log bundle

kai-uwe-rommel commented 3 years ago

(I have to find some spare time first ...)

bobby0724 commented 3 years ago

I am having issues with 4.6 and 4.7; the installation just won't finish. If I use 4.5, the install completes flawlessly every time.

bobby0724 commented 3 years ago

I was able to finally install OKD 4.7 yesterday, and I am wondering if the fedora repos have some kind of download limits. What I did was wait 24 hours after my last install, and after doing that I was able to successfully install 4.7 without intervention.

kai-uwe-rommel commented 3 years ago

I did another deployment today ... and could not reproduce the problem. Either I was just lucky, or there was some outage in the quay.io repository services when I installed on Tuesday, so the deployment hung because of that? I can do another test tomorrow.

kai-uwe-rommel commented 3 years ago

I did yet another deployment today and now have the problem again. I'll attempt to get the logs now ...

But: I fear we won't get the node log from the one that failed because it does not allow any ssh access which I assume is also needed to gather logs:

"System is booting up. Unprivileged users are not permitted to log in yet. Please come back later. For technical details, see pam_nologin(8)."
Connection closed by 193.149.36.204 port 22

Any suggestion what I should do to get more log data?

kai-uwe-rommel commented 3 years ago

Unfortunately - no go. The "gather bootstrap" simply hangs and does nothing.

[root@nfs install]# openshift-install gather bootstrap --dir=/opt/install --bootstrap bootstrap.kur-test.ars.de --master master-01.kur-test.ars.de
INFO Pulling debug logs from the bootstrap machine

For over an hour now ... I can access the bootstrap node and two masters via SSH but the failed master would be the interesting one ... :-(

kai-uwe-rommel commented 3 years ago

I have reset the hanging master and got its previous boot's journalctl (--boot=-1). It boils down to millions of lines like this:

Mar 05 17:24:46 master-03.kur-test.ars.de hyperkube[1132]: E0305 17:24:46.477219 1132 pod_workers.go:191] Error syncing pod 05435c0e-1367-47b6-806b-94fbea96046a ("installer-2-master-03.kur-test.ars.de_openshift-kube-scheduler(05435c0e-1367-47b6-806b-94fbea96046a)"), skipping: network is not ready: runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: No CNI configuration file in /etc/kubernetes/cni/net.d/. Has your network provider started?

So something went wrong there.
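When a journal is flooded with millions of copies of one message, filtering that message out first makes any other failure visible. A small sketch of the idea (the three-line sample log below is made up for illustration; in practice, point the grep at the real saved journalctl dump):

```shell
# Filter the repeated kubelet error out of a saved journal dump so that
# any other messages stand out. The sample content here is hypothetical.
cat > /tmp/journal-sample.log <<'EOF'
Mar 05 17:24:46 master-03 hyperkube[1132]: E0305 ... No CNI configuration file in /etc/kubernetes/cni/net.d/.
Mar 05 17:24:47 master-03 hyperkube[1132]: E0305 ... No CNI configuration file in /etc/kubernetes/cni/net.d/.
Mar 05 17:24:48 master-03 systemd[1]: Started some-other-unit.service.
EOF
grep -v 'No CNI configuration file' /tmp/journal-sample.log
```

Only the lines that are not the repeated CNI error survive the filter.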

kai-uwe-rommel commented 3 years ago

journalctl-of-failed-master.zip

kai-uwe-rommel commented 3 years ago

I then got the control plane up and running after resetting master-03. Then the workers started configuring, and again one of the three hung. I reset that one as well and got it done. Afterwards, I also pulled the previous boot's journalctl log and found the same error messages in it as for the master above.

kai-uwe-rommel commented 3 years ago

I did another deployment test today (with the 03-06 release). This time one master hung in a different way. I was able to ssh into it but it did not progress. After logging in, I saw:

Last login: Sun Mar  7 07:12:37 2021 from 192.168.53.72            
[systemd]                                                          
Failed Units: 3                                                    
  etc-NetworkManager-system\x2dconnections\x2dmerged.mount         
  console-login-helper-messages-gensnippet-os-release.service      
  console-login-helper-messages-gensnippet-ssh-keys.service      

I rebooted it then it came up without these errors and configured itself successfully. I have a journalctl output log from the previous boot if you are interested.

kai-uwe-rommel commented 3 years ago

On version 4.7.0-0.okd-2021-02-25-144700, I just updated my NTP servers, all nodes got rebooted, and master-1 is just stuck there with no IP. I have tried 20 different things (reset, restart, power off, snapshot, power off again), nothing works, and the cluster is unusable because of this. This DHCP thing is really bad.

@bobby0724 your issue sounds like something very different than what I reported here. I suggest you open a separate issue.

vrutkovs commented 3 years ago

For over an hour now ... I can access the bootstrap node and two masters via SSH but the failed master would be the interesting one ... :-(

Does

--bootstrap <bootstrap_address> \ 
    --master <master_1_address> \ 

make any difference here?

kai-uwe-rommel commented 3 years ago

But that's what I did: pass it the address of the bootstrap node and of one of the master nodes.

vrutkovs commented 3 years ago

I can access the bootstrap node and two masters

Was it one of the masters you can reach via ssh? Also, make sure your install has 3 masters (it won't work with 2 masters)

kai-uwe-rommel commented 3 years ago

Yes, the master I tried was one of the two working masters. I wanted to try this first, as I suspect that log gathering will not work to/from the hanging master anyway. And yes, I have three masters.

Did you see my message from yesterday about the case where I was able to access the master but it was still stuck, with the three failed systemd units? I captured its journal log. It may give us a hint about what's happening. logs20210307.zip

vrutkovs commented 3 years ago

it was still stuck and had the three failed systemd units?

This doesn't tell me anything without knowing general info

kai-uwe-rommel commented 3 years ago

Ok, but what can we do? Any suggestions? Then I can try another deployment and see what I can get. But if "gather bootstrap" does not work in general, I see no other option.

vrutkovs commented 3 years ago

https://docs.okd.io/latest/installing/installing-troubleshooting.html#installation-manually-gathering-logs-with-SSH_installing-troubleshooting

kai-uwe-rommel commented 3 years ago

Yes, I know this page. While I can use this for getting the data from the bootstrap host and another (accessible) master node, I cannot do this directly on the hanging master host because I cannot SSH into it. All I can do is reset that VM, and when I am able to access it again, pick up the previous boot's journalctl log. Would that help?

vrutkovs commented 3 years ago

Yes, I know this page.

Would you mind providing this information then?

I can not SSH into it.

MCS logs from the bootstrap would tell us if it has requested an ignition config or if it's just stuck there

kai-uwe-rommel commented 3 years ago

I'll start a new deployment tonight or tomorrow and then try to get what is possible. The cluster I was referring to in my previous messages from the weekend is already gone by now (24 hour limit).

When I encountered such a stuck master node, it was already beyond the ignition. The node did definitely boot and get the ignition. Then it was stuck after the reboot following the FCOS rebase (I mentioned this in my opening description).

Also, I had the same problem even much later with some arbitrary nodes (workers) after full and successful installation, on some subsequent reboot.

kai-uwe-rommel commented 3 years ago

I did a new installation attempt today. This time I used the 03-07 release (the last attempt above last weekend was done with the 03-06 release). Unfortunately, I have to report that things got worse with that release again. The problem I have reported with this issue still persists. But there is a new one. Apparently static IP configuration is now broken with the 03-07 release.

I had done the above installations all with static IP configuration (because that is what all my teams need). That means I used Afterburn for the initial boot IP configuration through vSphere VM advanced config settings, and through ignition created a NetworkManager config file as well as /etc/hostname and /etc/hosts files.
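For reference, the kind of NetworkManager keyfile dropped in via ignition would look roughly like this (illustrative only; the interface name, addresses, and DNS below are placeholders, not the reporter's actual values, and the file needs mode 0600):

```ini
# /etc/NetworkManager/system-connections/ens192.nmconnection (sketch)
[connection]
id=ens192
type=ethernet
interface-name=ens192

[ipv4]
method=manual
# address1 is "IP/prefix,gateway" in keyfile syntax
address1=192.168.53.10/24,192.168.53.1
dns=192.168.53.1;
```

The symptom described (file present after rebase+reboot but not applied) would then point at NetworkManager not loading the profile rather than at the file being missing.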

I actually did not do just one installation attempt today. With the first one, bootstrap and all three master nodes initially came up with their ignition config fine. After the master nodes had rebased their FCOS and rebooted, only one of them had a working network config. The other two master nodes had not configured their ens192 and were not accessible via network. I could log in through the VM console, though. I first thought about some external problem and redid the complete installation. This second time the same happened, but now all three masters had no networking after the rebase+reboot. And yet the config file in /etc/NetworkManager/system-connections was there; no idea why it was not applied. I had no time to waste on this, so I did a third installation attempt, this time with DHCP for IP configuration. This worked insofar as the nodes at least always had a working network. Then I finally got to the actual problem we are working on here ...

So then the same happened. One of the masters (master-02) hung in this state where it did not allow any login. As before, "openshift-install gather bootstrap" just hung and did nothing; I assume this tool is just broken. So I collected the listed logs manually via SSH from the bootstrap, master-01, and master-03 nodes. I then reset the master-02 VM and was afterwards able to SSH into master-02. I could then only collect the journalctl log from the previous boot; otherwise the machine finally progressed and completed successfully. Later on, when the worker nodes initialized, one out of four also hung after the rebase+reboot and I had to reset the VM. Then this also completed successfully.

I'm attaching the logs I was able to collect here.

I assume I should open another issue for the now broken static IP configuration that is new with the OKD 4.7 release of 03-07. logs20210309.zip

vrutkovs commented 3 years ago

through ignition created a NetworkManager config file as well as /etc/hostname and /etc/hosts files.

That's wrong, you should be configuring NM to create these files for you. Why would you need to amend /etc/hosts?

vrutkovs commented 3 years ago

So then the same happened. One of the masters (master-02) hung in this state where it did not allow any login. ... I then reset the master-02 VM and was then able to SSH into master-02

So, the issue is transient and doesn't happen on every boot?

kai-uwe-rommel commented 3 years ago

through ignition created a NetworkManager config file as well as /etc/hostname and /etc/hosts files.

That's wrong, you should be configuring NM to create these files for you. Why would you need to amend /etc/hosts?

I do have to set the hostname statically in /etc/hostname. As for /etc/hosts ... well, this is debatable. But it certainly does not cause problems to have static lines for the local system in it.

kai-uwe-rommel commented 3 years ago

So then the same happened. One of the masters (master-02) hung in this state where it did not allow any login. ... I then reset the master-02 VM and was then able to SSH into master-02

So, the issue is transient and doesn't happen on every boot?

Well, that's what I was trying to express with so many words in the previous comments. :-) It randomly happens to SOME of the nodes (be it masters or workers), and a reboot usually cures it. But it has only happened since the release of OKD 4.7, and it happens often enough that it has affected all my deployments so far!

vrutkovs commented 3 years ago

well, this is debatable.

But does this method work? Do living masters get correct IP and hostname?

It randomly happens to SOME of the nodes

You'd need to find out more about this before we can claim it's an OKD bug and not something specific to your infra

kai-uwe-rommel commented 3 years ago

well, this is debatable.

But does this method work? Do living masters get correct IP and hostname?

I don't understand what you mean. Whether or not I enter the static IPs with the local hostname into /etc/hosts does not matter. It's just a habit, to always guarantee that the system can resolve its own name locally even during an intermittent DNS outage. The /etc/hosts file has nothing to do with whether a node gets an IP or not. The IP comes from the NetworkManager config file, and the hostname comes from /etc/hostname. I had previously relied on the hostname coming from a reverse DNS lookup of the node's IP address, but some later FCOS release broke this, so I added automatic creation of /etc/hostname through ignition. I don't know if that is fixed by now; I'd rather not rely on it, given that I encounter ever more bugs the more I work with the stuff ... :-(

kai-uwe-rommel commented 3 years ago

It randomly happens to SOME of the nodes

You'd need to find out more about this before we can claim it's an OKD bug and not something specific to your infra

What I can say for sure is that I NEVER had this problem (the one we are handling in this issue) with OKD 4.5 or 4.6, but encountered it AS SOON as I did the first OKD 4.7 installation.

It may well be a problem of the FCOS stream now underlying OKD 4.7, if it is different from the one that came with OKD 4.6. But how should I tell, with the limited diagnostic tools I have for OKD and FCOS? Help me figure this out and I will gladly do so ...

And the infrastructure here is pretty basic. Just a simple vSphere cluster and a LAN with DHCP (if wanted) or without it.

jomeier commented 3 years ago

@kai-uwe-rommel: I had huge problems on vSphere with promiscuous mode turned on with OVNKubernetes since OKD 4.6. The NetworkManager seems to be not happy with that.

Maybe it's simpler if we try to look at your setup in a live debug session?

kai-uwe-rommel commented 3 years ago

Yes, we can do so! I have a test environment where I can quickly start deployment of a new cluster where we hopefully can see the problem at the first attempt. I'll send you an e-mail message.

kai-uwe-rommel commented 3 years ago

During a debug session I had with @jomeier yesterday, we saw the problem twice. Interestingly, once it happened even before the actual OKD deployment of the node (i.e. before the rpm-ostree rebase). So it is probably a problem of the FCOS release. But since it also happens after the OKD stream for FCOS is installed, it seems to be a more generic problem, not one of a single release. Any idea how we could tackle this? I'm attaching a journalctl log of such a case. I can see in there that an SSH login for "core" is denied, but not why ... Also, lots of networking errors.

journalctl-no-login.txt.gz

lvlts commented 3 years ago

This happens on vSphere IPI as well, either during installation or even after the cluster is deployed. If a node (e.g. master1; worker nodes too) is rebooted, half of the time it will not manage to get an IP address from the DHCP server. tcpdump shows the DHCP server processing and replying to the discover request, and assigning an IP.

My workaround is to simply "reset" or reboot the node until I can ping it. Then everything goes back to normal.

Edit: what I am experiencing is exactly like the issue described in https://github.com/coreos/fedora-coreos-tracker/issues/757

kai-uwe-rommel commented 3 years ago

@lvlts, I don't think you see the problem that I described in this issue. In my case, the nodes are either getting DHCP leases perfectly or are even configured with static IP addresses.

The symptom of my issue is that the node boots up, gets its correct IP address (either a DHCP reservation or static configuration), is pingable, and I can connect to it via SSH (i.e. sshd is running and answering), but I cannot log in; instead I am greeted with this message: "System is booting up. Unprivileged users are not permitted to log in yet. Please come back later. For technical details, see pam_nologin(8)." and then my ssh session is dropped. Also, if the node is still in the deployment process, it will not progress. So more things inside it don't work, but I cannot check what's up, because I cannot ssh into it and console login fails with the same message.
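For background on the message itself: pam_nologin(8) refuses non-root logins as long as /run/nologin (or /etc/nologin) exists. systemd creates that file early in boot, and systemd-user-sessions.service deletes it once startup has progressed far enough for logins, so a node stuck at this message has never reached that point. The mechanism can be simulated with a temp path so it is safe to run anywhere:

```shell
# Stand-in for the pam_nologin check: logins are refused while the flag
# file exists. /tmp/fake-nologin substitutes for the real /run/nologin.
NOLOGIN=/tmp/fake-nologin
touch "$NOLOGIN"
[ -e "$NOLOGIN" ] && echo "logins blocked" || echo "logins allowed"
rm -f "$NOLOGIN"   # what systemd-user-sessions.service does at the real path
[ -e "$NOLOGIN" ] && echo "logins blocked" || echo "logins allowed"
```

This prints "logins blocked" and then "logins allowed", mirroring the transition a healthy boot makes and a hung one never does.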

jomeier commented 3 years ago

@kai-uwe-rommel @lvlts Could you try this:

In my setup this seems to solve the problem:

 # sudo vi /etc/systemd/network/98-ovs-mac.link
 [Match]
 Driver=openvswitch

 [Link]
 MACAddressPolicy=none

https://bugzilla.redhat.com/show_bug.cgi?id=1936961

kai-uwe-rommel commented 3 years ago

@jomeier we should make sure we don't mix issues. I don't see any DHCP problems, and I get such "hanging nodes" even with statically configured IP addresses. Whenever I see the problem, the nodes have perfect network connectivity. To me it looks like systemd does not progress starting up the system, so it never reaches the "multi-user" target. Unfortunately, since I can't get into such a system yet, I don't know where exactly it hangs. I can only reboot, and then it usually works (I have had only one case so far where it happened twice in a row).

kai-uwe-rommel commented 3 years ago

I just tried, and it is possible to set a root password in FCOS and log in with it on the console. I'll try to get a root password into my ignition setup so that I can debug when the problem happens the next time.
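Root is exempt from pam_nologin, which is presumably why a root console login still works while "core" is refused. A minimal sketch of getting a password hash into the ignition config via a Butane/FCCT file (illustrative only, not the reporter's actual setup; the hash is a placeholder to be generated with e.g. mkpasswd):

```yaml
# Sketch only -- transpile to ignition JSON with butane/fcct.
variant: fcos
version: 1.3.0
passwd:
  users:
    - name: root
      password_hash: "$y$j9T$REPLACE_WITH_REAL_HASH"
```

With that in place, a VM-console root login should be possible even while the node is still in the "nologin" state, allowing live inspection of the hang.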

fortinj66 commented 3 years ago

As discussed in slack, hopefully https://origin-release.apps.ci.l2s4.p1.openshiftapps.com/releasestream/4.7.0-0.okd/release/4.7.0-0.okd-2021-03-22-172926 and later have the fix for this

kai-uwe-rommel commented 3 years ago

I did an installation with this release yesterday and had no hangs. Usually I have at least one such hang during each cluster installation, but occasionally I got through an installation without one, too. So it may be too early to definitively declare this solved, but the probability is quite high. :-)

kai-uwe-rommel commented 3 years ago

I'd say this is solved.