okd-project / okd

The self-managing, auto-upgrading, Kubernetes distribution for everyone
https://okd.io
Apache License 2.0

Nodes become unhealthy after upgrading from 4.11 to 4.12 #1961

Closed by hamidostad 2 months ago

hamidostad commented 4 months ago

The previous cluster version was 4.11.0-0.okd-2022-12-02-145640. After upgrading the cluster to any 4.12 version, the nodes become unhealthy. When we checked the nodes, we found that the EC2 instances were unhealthy too. Inspecting an EC2 instance and its services, we found an error in NetworkManager, which does not assign an IP address to the instance, and the kubelet service is not running. In the end, the errors point to ovsdb-server: the user and group "openvswitch:hugetlbfs" do not exist on the instance, which causes ovsdb-server and openvswitch to fail. When we create the user and group, the problem is solved. The question is: why does upgrading to 4.12 cause this problem? The cluster does not have this issue when upgrading between patch releases of 4.11.
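For anyone checking a node for the same condition, here is a minimal sketch of the check and the workaround described above. It only reports what is missing and prints the commands that would create it; it does not change the system. The `groupadd` and `useradd` options shown are an assumption, since the exact commands used for the workaround are not stated here, and the function names are made up for illustration.

```shell
#!/bin/sh
# Sketch only: reports whether the account ovsdb-server expects is present
# and prints (does not run) commands that would create what is missing.

OVS_USER="openvswitch"   # user named in the chown/id errors
OVS_GROUP="hugetlbfs"    # group named in the chown errors

# Succeeds if the given user name resolves on this host.
has_user() {
    id -u "$1" >/dev/null 2>&1
}

# Succeeds if the given group name resolves on this host.
has_group() {
    getent group "$1" >/dev/null 2>&1 || grep -q "^$1:" /etc/group 2>/dev/null
}

# Prints one command per missing item. $1 = user, $2 = group.
# The flags below are an assumption, not the reporter's exact commands.
plan_fix() {
    has_group "$2" || echo "groupadd --system $2"
    has_user "$1" || echo "useradd --system --no-create-home --groups $2 $1"
}

plan_fix "$OVS_USER" "$OVS_GROUP"
```

If the script prints nothing, both the user and the group already exist. Any printed commands would need to be reviewed and run as root on the affected node.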

ovsdb-server log

Jul 01 08:01:48 localhost.localdomain sh[1726]: /usr/bin/chown: invalid user: ‘openvswitch:hugetlbfs’
Jul 01 08:01:48 localhost.localdomain sh[1731]: /usr/bin/chown: invalid user: ‘openvswitch:hugetlbfs’
Jul 01 08:01:48 localhost.localdomain sh[1732]: /usr/bin/chown: invalid user: ‘openvswitch:hugetlbfs’
Jul 01 08:01:48 localhost.localdomain ovs-ctl[1763]: id: 'openvswitch': no such user
Jul 01 08:01:48 localhost.localdomain ovs-ctl[1764]: id: 'openvswitch': no such user
Jul 01 08:01:48 localhost.localdomain ovs-ctl[1766]: id: 'openvswitch': no such user
Jul 01 08:01:48 localhost.localdomain ovs-ctl[1768]: setpriv: failed to parse reuid: ''
Jul 01 08:01:48 localhost.localdomain ovs-ctl[1770]: id: 'openvswitch': no such user
Jul 01 08:01:48 localhost.localdomain ovs-ctl[1771]: id: 'openvswitch': no such user
Jul 01 08:01:48 localhost.localdomain ovs-ctl[1773]: id: 'openvswitch': no such user
Jul 01 08:01:48 localhost.localdomain ovs-ctl[1775]: setpriv: failed to parse reuid: ''
Jul 01 08:01:48 localhost.localdomain ovs-ctl[1776]: install: invalid user 'openvswitch'
Jul 01 08:01:48 localhost.localdomain ovsdb-server[1778]: ovs|00001|daemon_unix|EMER|(null): user openvswitch not found, abort>
Jul 01 08:01:48 localhost.localdomain ovs-ctl[1778]: ovsdb-server: (null): user openvswitch not found, aborting.
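When a log like the one above is long, it can help to reduce it to the distinct account names it complains about. A small sketch, assuming the log text is available on standard input; `missing_users` is a hypothetical helper name, and it only matches the `id: '<name>': no such user` form seen above.

```shell
#!/bin/sh
# Reads journal text on stdin and prints, once each, every account named
# in an "id: '<name>': no such user" line.
missing_users() {
    sed -n "s/.*id: '\([^']*\)': no such user.*/\1/p" | sort -u
}

# Example with two lines taken from the log above.
printf '%s\n' \
    "Jul 01 08:01:48 localhost.localdomain ovs-ctl[1763]: id: 'openvswitch': no such user" \
    "Jul 01 08:01:48 localhost.localdomain ovs-ctl[1768]: setpriv: failed to parse reuid: ''" \
    | missing_users
# prints: openvswitch
```

On a live node the input could come from something like `journalctl -b -u ovsdb-server | missing_users`, assuming the systemd unit is named `ovsdb-server`.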

Cluster upgrade history

[Image: Screenshot 2024-07-02 at 15 56 24]

Version

from: 4.11.0-0.okd-2022-12-02-145640 to: 4.12.0-0.okd-2023-03-18-084815

How to reproduce

oc adm upgrade --to="4.12.0-0.okd-2023-03-18-084815"

JaimeMagiera commented 2 months ago

Hi,

We are not working on FCOS builds of OKD any more. Please see these documents...

https://okd.io/blog/2024/06/01/okd-future-statement
https://okd.io/blog/2024/07/30/okd-pre-release-testing

We will be providing documentation on upgrading clusters from 4.15 FCOS to 4.16 SCOS. In terms of clusters that are older, you may be able to get help from community members. I'll convert this to a discussion to facilitate that.

Many thanks,

Jaime