okd-project / okd

The self-managing, auto-upgrading, Kubernetes distribution for everyone
https://okd.io
Apache License 2.0

static IP configuration apparently broken in 4.7.0-2021-03-07 #547

Closed kai-uwe-rommel closed 3 years ago

kai-uwe-rommel commented 3 years ago

Describe the bug I stumbled upon this new bug while working on another issue: https://github.com/openshift/okd/issues/536

I had done all my previous installations of 4.7.0 with static IP configuration (because that is what all my teams need). That means I used Afterburn for the initial boot IP configuration through vSphere VM advanced config settings, and through ignition I created a NetworkManager config file as well as /etc/hostname and /etc/hosts files. This worked correctly with the OKD 4.7 releases of 02-25 and 03-06.
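For reference, a minimal sketch of that setup (all addresses and names below are placeholders, not taken from this report). Per the Afterburn/OpenShift docs linked later in this thread, Afterburn on VMware reads its first-boot network kernel arguments from the `guestinfo.afterburn.initrd.network-kargs` VM property; the snippet only assembles and prints the string, and the govc call is shown as a comment:

```shell
#!/bin/sh
# Assemble the dracut-style network kargs that Afterburn injects at first
# boot (placeholder values, not from the original report).
NODE_IP=192.0.2.10
GATEWAY=192.0.2.1
NETMASK=255.255.255.0
HOSTNAME_FQDN=master-01.example.com
IFACE=ens192
DNS=192.0.2.53

KARGS="ip=${NODE_IP}::${GATEWAY}:${NETMASK}:${HOSTNAME_FQDN}:${IFACE}:none nameserver=${DNS}"
echo "$KARGS"

# Hypothetical govc invocation to store the string in the VM's advanced
# config settings (not executed here):
#   govc vm.change -vm master-01 -e "guestinfo.afterburn.initrd.network-kargs=${KARGS}"
```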

I did installation attempts with the 03-07 release of OKD 4.7 today. On the first one, the bootstrap and all three master nodes initially came up fine with their ignition config. After the master nodes had rebased their FCOS and rebooted, only one of them had a working network config. The other two master nodes had not configured their ens192 and were not accessible via network. I could log in through the VM console, though. I first suspected some external problem and redid the complete installation. The second time the same happened, but now all three masters had no networking after the rebase+reboot. The config file in /etc/NetworkManager/system-connections was there; no idea why it was not applied. I had no time to waste on this (as I was actually working on another issue), so I did a third installation attempt, this time with DHCP for IP configuration. This worked insofar as the nodes at least always had a working network.

Version 4.7.0-0.okd-2021-03-07-090821

How reproducible Apparently, always, see above.

Log bundle Difficult, because at this point no log gathering is possible... But just let me know what I should gather and I will try again. Of course, it will be difficult to get the logs out of machines with no working networking ...

vrutkovs commented 3 years ago

Dupe of #536

kai-uwe-rommel commented 3 years ago

@vrutkovs, I am pretty sure we are talking about two different issues!

vrutkovs commented 3 years ago

Let's figure out #536 in any case.

kai-uwe-rommel commented 3 years ago

But ... so far I was able to work around the #536 problem, so I would be able to deploy an OKD 4.7 cluster if someone requested it from me, although I would currently rather stay with OKD 4.6.

However, the broken static IP config is a showstopper. I have not found the reason yet and have no workaround.

kai-uwe-rommel commented 3 years ago

/reopen

openshift-ci-robot commented 3 years ago

@kai-uwe-rommel: Reopened this issue.

In response to [this](https://github.com/openshift/okd/issues/547#issuecomment-794099980):

> /reopen

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.

kai-uwe-rommel commented 3 years ago

Just for the record, again: the static IP configuration works fine initially, when the VMs are created, ignited, and come up for the first time. But it breaks immediately after the rebase+reboot that happens when OKD deployment starts.

devzeronull commented 3 years ago

Hi,

your issue sounds interesting, could you explain it in more detail, please?

I was able to set up 4.7 2021-02-25 with static network configuration without any problems, and could upgrade to 03-07 seamlessly. I am installing on VMware UPI in bare-metal mode using prepared Fedora CoreOS ISOs or PXE.

I also use the more stable OVNKubernetes CNI and not OpenShiftSDN...

To configure static networking I am appending kernel parameters to each machine in the bootloader during the initial provisioning boot of the machines: rd.neednet=1 ip=NODE_IP::GATEWAY_IP:NETMASK:NODE_FQDN_HOSTNAME:ens192:none nameserver=DNS_SERVER_1 nameserver=DNS_SERVER_2
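The field order in that ip= argument can be made explicit by splitting it at the colons; dracut's format is `ip=<ip>:<peer>:<gateway>:<netmask>:<hostname>:<interface>:<autoconf>`. A small demonstration with placeholder values (not from either reporter's environment):

```shell
#!/bin/sh
# Split an ip= value into its positional dracut fields (placeholder values).
# Note the empty <peer> field between the two consecutive colons.
ARG='192.0.2.11::192.0.2.1:255.255.255.0:master-02.example.com:ens192:none'
IFS=: read -r ip peer gw mask host iface autoconf <<EOF
$ARG
EOF
echo "ip=$ip gw=$gw host=$host iface=$iface autoconf=$autoconf"
```

With `autoconf=none`, NetworkManager treats the address as static rather than falling back to DHCP.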

Would be nice to keep your problems in mind for our automated cluster deployments. Do you have any logs or can say what was exactly breaking and how your setup differs from ours?

Why are you doing it in such a complicated way, if you could just append kernel params on initial boot for static network configuration? You wrote: "That means, I used Afterburn for the initial boot IP configuration through vSphere VM advanced config settings and through ignition created a NetworkManager config file as well as /etc/hostname and /etc/hosts files."

Thanks for reporting your problems!

kai-uwe-rommel commented 3 years ago

Hello @devzeronull, thanks for your reply. Actually, it is not "so complicated"; Afterburn does nothing other than append kernel parameters. I was also using modified boot images before, until I discovered that using Afterburn is much easier and relieves me from modifying the boot images: https://docs.openshift.com/container-platform/4.6/release_notes/ocp-4-6-release-notes.html#ocp-4-6-static-ip-config-with-ova That link came from this post, where it is described in more detail: https://www.openshift.com/blog/how-to-install-openshift-4.6-on-vmware-with-upi This works fine in RHCOS with 4.6 and newer, as well as in FCOS since later version 32 builds.

Now, what I am not sure about yet: does the kernel parameter string for IP config only work during the initial (ignition) boot, or does it keep working afterwards, too? I guess it will continue to work later, but it is a bit limited; you cannot, for example, specify DNS search suffixes. That's why, during ignition, I simply create a config file in /etc/NetworkManager/system-connections for the ens192 interface (this is very easy). Also, I set the hostname statically in /etc/hostname because some previous FCOS release had broken this. That's all there is to it. It is easy to do and easily integrated (automated) into ignition and deployment scripts.
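A minimal example of such a keyfile, including the DNS search suffix that the kernel-args route cannot express (all values are placeholders; on a real node the file would go to /etc/NetworkManager/system-connections/, here it is written to a temp dir just to show the format):

```shell
#!/bin/sh
# Write a sample NetworkManager keyfile for ens192 (placeholder values).
DIR=$(mktemp -d)
cat > "$DIR/ens192.nmconnection" <<'EOF'
[connection]
id=ens192
type=ethernet
interface-name=ens192

[ipv4]
method=manual
address1=192.0.2.10/24,192.0.2.1
dns=192.0.2.53;192.0.2.54;
dns-search=example.com;
EOF
# NetworkManager ignores world-readable keyfiles, so restrict the mode:
chmod 600 "$DIR/ens192.nmconnection"
echo "wrote $DIR/ens192.nmconnection"
```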

The problem I now have with OKD 4.7 on FCOS 33, at least with the 2021-03-07 build of OKD: after the FCOS rebase and reboot, which is the first step of the OKD deployment on the FCOS nodes, my static IPs are simply ignored. That is, the VM comes up but has no network connectivity. I can log in via the VM console, but no IP config is set. With the 2021-02-25 release I was able to set up OKD 4.7 with my way of configuring static IPs.

It may have something to do with a change in OKD 4.7's overlay networking (of course I use OVNKubernetes). I looked at an OKD cluster that is configured with DHCP and that I upgraded to OKD 4.7. There I saw that the DHCP-assigned IP is no longer set on "ens192" but has magically moved to a bridge interface "br-ex". I have not seen this change documented anywhere yet. Do you know more about it?

Your sample kernel param string also configures ens192. Are you sure this still works with your 03-07 cluster?

kai-uwe-rommel commented 3 years ago

I did another deployment attempt just now: I removed my code that generates a .nmconnection file for ens192 via ignition and only added the static IP config via kernel args (i.e. Afterburn). The result: in the absence of any .nmconnection file, a default_connection.nmconnection file is apparently generated automatically from the kernel args. But it does not change the problem. After the FCOS rebase step and the reboot, two out of my three master nodes have no network configuration. I'm puzzled that one has it (all three are created exactly identically) and have no explanation why.

I can only report the problem here and hope that someone from the developer team can pick it up or give me hints, ask for checks, etc. ...

kai-uwe-rommel commented 3 years ago

BTW, even on the master node where the IP config survived, I get this (on the other two masters as well): [systemd] Failed Units: 1 ovs-configuration.service

And even on that one master node that has IP connectivity, the deployment does not progress but hangs, too.

The journalctl log for this service shows:

-- Reboot --
Mar 11 10:53:59 master-03.kur-test.ars.de systemd[1]: Starting Configures OVS with proper host networking configuration...
Mar 11 10:53:59 master-03.kur-test.ars.de configure-ovs.sh[920]: + rpm -qa
Mar 11 10:53:59 master-03.kur-test.ars.de configure-ovs.sh[921]: + grep -q openvswitch
Mar 11 10:53:59 master-03.kur-test.ars.de configure-ovs.sh[918]: + '[' OVNKubernetes == OVNKubernetes ']'
Mar 11 10:53:59 master-03.kur-test.ars.de configure-ovs.sh[918]: + '[' -d /etc/NetworkManager/system-connections-merged ']'
Mar 11 10:53:59 master-03.kur-test.ars.de configure-ovs.sh[918]: + NM_CONN_PATH=/etc/NetworkManager/system-connections-merged
Mar 11 10:53:59 master-03.kur-test.ars.de configure-ovs.sh[918]: + iface=
Mar 11 10:53:59 master-03.kur-test.ars.de configure-ovs.sh[918]: + counter=0
Mar 11 10:53:59 master-03.kur-test.ars.de configure-ovs.sh[918]: + '[' 0 -lt 12 ']'
Mar 11 10:53:59 master-03.kur-test.ars.de configure-ovs.sh[950]: ++ ip route show default
Mar 11 10:53:59 master-03.kur-test.ars.de configure-ovs.sh[951]: ++ awk '{ if ($4 == "dev") { print $5; exit } }'
Mar 11 10:53:59 master-03.kur-test.ars.de configure-ovs.sh[918]: + iface=ens192
Mar 11 10:53:59 master-03.kur-test.ars.de configure-ovs.sh[918]: + [[ -n ens192 ]]
Mar 11 10:53:59 master-03.kur-test.ars.de configure-ovs.sh[918]: + echo 'IPv4 Default gateway interface found: ens192'
Mar 11 10:53:59 master-03.kur-test.ars.de configure-ovs.sh[918]: IPv4 Default gateway interface found: ens192
Mar 11 10:53:59 master-03.kur-test.ars.de configure-ovs.sh[918]: + break
Mar 11 10:53:59 master-03.kur-test.ars.de configure-ovs.sh[918]: + '[' ens192 = br-ex ']'
Mar 11 10:53:59 master-03.kur-test.ars.de configure-ovs.sh[918]: + '[' -z ens192 ']'
Mar 11 10:53:59 master-03.kur-test.ars.de configure-ovs.sh[918]: + iface_mac=00:50:56:a5:ec:1d
Mar 11 10:53:59 master-03.kur-test.ars.de configure-ovs.sh[918]: + echo 'MAC address found for iface: ens192: 00:50:56:a5:ec:1d'
Mar 11 10:53:59 master-03.kur-test.ars.de configure-ovs.sh[918]: MAC address found for iface: ens192: 00:50:56:a5:ec:1d
Mar 11 10:53:59 master-03.kur-test.ars.de configure-ovs.sh[954]: ++ ip link show ens192
Mar 11 10:53:59 master-03.kur-test.ars.de configure-ovs.sh[955]: ++ awk '{print $5; exit}'
Mar 11 10:53:59 master-03.kur-test.ars.de configure-ovs.sh[918]: + iface_mtu=1500
Mar 11 10:53:59 master-03.kur-test.ars.de configure-ovs.sh[918]: + [[ -z 1500 ]]
Mar 11 10:53:59 master-03.kur-test.ars.de configure-ovs.sh[918]: + echo 'MTU found for iface: ens192: 1500'
Mar 11 10:53:59 master-03.kur-test.ars.de configure-ovs.sh[918]: MTU found for iface: ens192: 1500
Mar 11 10:53:59 master-03.kur-test.ars.de configure-ovs.sh[957]: ++ nmcli --fields UUID,DEVICE conn show --active
Mar 11 10:53:59 master-03.kur-test.ars.de configure-ovs.sh[958]: ++ awk '/\sens192\s*$/ {print $1}'
Mar 11 10:53:59 master-03.kur-test.ars.de configure-ovs.sh[918]: + old_conn=445bc266-9f75-422d-bb47-f094f65c5d8d
Mar 11 10:53:59 master-03.kur-test.ars.de configure-ovs.sh[918]: + extra_brex_args=
Mar 11 10:53:59 master-03.kur-test.ars.de configure-ovs.sh[962]: ++ nmcli --get-values ipv4.dhcp-client-id conn show 445bc266-9f75-422d-bb47-f094f65c5d8d
Mar 11 10:53:59 master-03.kur-test.ars.de configure-ovs.sh[918]: + dhcp_client_id=
Mar 11 10:53:59 master-03.kur-test.ars.de configure-ovs.sh[918]: + '[' -n '' ']'
Mar 11 10:53:59 master-03.kur-test.ars.de configure-ovs.sh[966]: ++ nmcli --get-values ipv6.dhcp-duid conn show 445bc266-9f75-422d-bb47-f094f65c5d8d
Mar 11 10:54:00 master-03.kur-test.ars.de configure-ovs.sh[918]: + dhcp6_client_id=
Mar 11 10:54:00 master-03.kur-test.ars.de configure-ovs.sh[918]: + '[' -n '' ']'
Mar 11 10:54:00 master-03.kur-test.ars.de configure-ovs.sh[918]: + nmcli connection show br-ex
Mar 11 10:54:00 master-03.kur-test.ars.de configure-ovs.sh[918]: + nmcli c add type ovs-bridge con-name br-ex conn.interface br-ex 802-3-ethernet.mtu 1500 802-3-eth>
Mar 11 10:54:00 master-03.kur-test.ars.de configure-ovs.sh[974]: Connection 'br-ex' (1f37631e-796d-46d1-9314-9ddce069b91f) successfully added.
Mar 11 10:54:00 master-03.kur-test.ars.de configure-ovs.sh[918]: + nmcli connection show ovs-port-phys0
Mar 11 10:54:00 master-03.kur-test.ars.de configure-ovs.sh[918]: + nmcli c add type ovs-port conn.interface ens192 master br-ex con-name ovs-port-phys0
...skipping...
Mar 11 10:54:25 master-03.kur-test.ars.de configure-ovs.sh[918]: + counter=5
Mar 11 10:54:25 master-03.kur-test.ars.de configure-ovs.sh[918]: + '[' 5 -lt 5 ']'
Mar 11 10:54:25 master-03.kur-test.ars.de configure-ovs.sh[918]: + echo 'WARN: OVS did not succesfully activate NM connection. Attempting to bring up connections'
Mar 11 10:54:25 master-03.kur-test.ars.de configure-ovs.sh[918]: WARN: OVS did not succesfully activate NM connection. Attempting to bring up connections
Mar 11 10:54:25 master-03.kur-test.ars.de configure-ovs.sh[918]: + counter=0
Mar 11 10:54:25 master-03.kur-test.ars.de configure-ovs.sh[918]: + '[' 0 -lt 5 ']'
Mar 11 10:54:25 master-03.kur-test.ars.de configure-ovs.sh[918]: + nmcli conn up ovs-if-br-ex
Mar 11 10:54:25 master-03.kur-test.ars.de configure-ovs.sh[1185]: Error: unknown connection 'ovs-if-br-ex'.
Mar 11 10:54:25 master-03.kur-test.ars.de configure-ovs.sh[918]: + sleep 5
Mar 11 10:54:30 master-03.kur-test.ars.de configure-ovs.sh[918]: + counter=1
Mar 11 10:54:30 master-03.kur-test.ars.de configure-ovs.sh[918]: + '[' 1 -lt 5 ']'
Mar 11 10:54:30 master-03.kur-test.ars.de configure-ovs.sh[918]: + nmcli conn up ovs-if-br-ex
Mar 11 10:54:30 master-03.kur-test.ars.de configure-ovs.sh[1193]: Error: unknown connection 'ovs-if-br-ex'.
Mar 11 10:54:30 master-03.kur-test.ars.de configure-ovs.sh[918]: + sleep 5
Mar 11 10:54:35 master-03.kur-test.ars.de configure-ovs.sh[918]: + counter=2
Mar 11 10:54:35 master-03.kur-test.ars.de configure-ovs.sh[918]: + '[' 2 -lt 5 ']'
Mar 11 10:54:35 master-03.kur-test.ars.de configure-ovs.sh[918]: + nmcli conn up ovs-if-br-ex
Mar 11 10:54:35 master-03.kur-test.ars.de configure-ovs.sh[1200]: Error: unknown connection 'ovs-if-br-ex'.
Mar 11 10:54:35 master-03.kur-test.ars.de configure-ovs.sh[918]: + sleep 5
Mar 11 10:54:40 master-03.kur-test.ars.de configure-ovs.sh[918]: + counter=3
Mar 11 10:54:40 master-03.kur-test.ars.de configure-ovs.sh[918]: + '[' 3 -lt 5 ']'
Mar 11 10:54:40 master-03.kur-test.ars.de configure-ovs.sh[918]: + nmcli conn up ovs-if-br-ex
Mar 11 10:54:40 master-03.kur-test.ars.de configure-ovs.sh[1205]: Error: unknown connection 'ovs-if-br-ex'.
Mar 11 10:54:40 master-03.kur-test.ars.de configure-ovs.sh[918]: + sleep 5
Mar 11 10:54:45 master-03.kur-test.ars.de configure-ovs.sh[918]: + counter=4
Mar 11 10:54:45 master-03.kur-test.ars.de configure-ovs.sh[918]: + '[' 4 -lt 5 ']'
Mar 11 10:54:45 master-03.kur-test.ars.de configure-ovs.sh[918]: + nmcli conn up ovs-if-br-ex
Mar 11 10:54:45 master-03.kur-test.ars.de configure-ovs.sh[1210]: Error: unknown connection 'ovs-if-br-ex'.
Mar 11 10:54:45 master-03.kur-test.ars.de configure-ovs.sh[918]: + sleep 5
Mar 11 10:54:50 master-03.kur-test.ars.de configure-ovs.sh[918]: + counter=5
Mar 11 10:54:50 master-03.kur-test.ars.de configure-ovs.sh[918]: + '[' 5 -lt 5 ']'
Mar 11 10:54:50 master-03.kur-test.ars.de configure-ovs.sh[918]: + echo 'ERROR: Failed to activate ovs-if-br-ex NM connection'
Mar 11 10:54:50 master-03.kur-test.ars.de configure-ovs.sh[918]: ERROR: Failed to activate ovs-if-br-ex NM connection
Mar 11 10:54:50 master-03.kur-test.ars.de configure-ovs.sh[918]: + set +e
Mar 11 10:54:50 master-03.kur-test.ars.de configure-ovs.sh[918]: + nmcli conn down ovs-if-br-ex
Mar 11 10:54:50 master-03.kur-test.ars.de configure-ovs.sh[1217]: Error: 'ovs-if-br-ex' is not an active connection.
Mar 11 10:54:50 master-03.kur-test.ars.de configure-ovs.sh[1217]: Error: no active connection provided.
Mar 11 10:54:50 master-03.kur-test.ars.de configure-ovs.sh[918]: + nmcli conn down ovs-if-phys0
Mar 11 10:54:50 master-03.kur-test.ars.de configure-ovs.sh[1221]: Connection 'ovs-if-phys0' successfully deactivated (D-Bus active path: /org/freedesktop/NetworkMan>
Mar 11 10:54:50 master-03.kur-test.ars.de configure-ovs.sh[918]: + nmcli conn up 445bc266-9f75-422d-bb47-f094f65c5d8d
Mar 11 10:54:50 master-03.kur-test.ars.de configure-ovs.sh[1236]: Connection successfully activated (D-Bus active path: /org/freedesktop/NetworkManager/ActiveConnec>
Mar 11 10:54:50 master-03.kur-test.ars.de configure-ovs.sh[918]: + exit 1
Mar 11 10:54:50 master-03.kur-test.ars.de systemd[1]: ovs-configuration.service: Main process exited, code=exited, status=1/FAILURE
Mar 11 10:54:50 master-03.kur-test.ars.de systemd[1]: ovs-configuration.service: Failed with result 'exit-code'.
Mar 11 10:54:50 master-03.kur-test.ars.de systemd[1]: Failed to start Configures OVS with proper host networking configuration.
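To summarize the trace: configure-ovs.sh finds the default-gateway interface (ens192) and the UUID of its active connection, creates br-ex and the ovs-port profiles, but then loops trying to activate an ovs-if-br-ex connection that apparently was never created, and finally rolls back to the original connection and exits 1. The connection-lookup step can be reproduced offline against canned nmcli output (sample data only, adapted from the log; the script itself uses GNU awk's \s, written here as [[:space:]] for portability):

```shell
#!/bin/sh
# Re-run the configure-ovs.sh connection lookup against canned
# "nmcli --fields UUID,DEVICE conn show --active" output.
nmcli_out='UUID                                  DEVICE
445bc266-9f75-422d-bb47-f094f65c5d8d  ens192'
old_conn=$(printf '%s\n' "$nmcli_out" | awk '/[[:space:]]ens192[[:space:]]*$/ {print $1}')
echo "old_conn=$old_conn"
```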

kai-uwe-rommel commented 3 years ago

So it says: unknown connection 'ovs-if-br-ex'. When I look into /etc/NetworkManager/system-connections, I see these files there:

-rw-------. 1 root root 345 Mar 11 10:54 br-ex.nmconnection
-rw-------. 1 root root 435 Mar 11 10:46 default_connection.nmconnection
-rw-------. 1 root root 282 Mar 11 10:54 ovs-if-phys0.nmconnection
-rw-------. 1 root root 168 Mar 11 10:54 ovs-port-br-ex.nmconnection
-rw-------. 1 root root 169 Mar 11 10:54 ovs-port-phys0.nmconnection

So a bug due to a name mismatch? A missing file?
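Diffing the profile names configure-ovs.sh later tries to activate against the files in the listing above makes the gap explicit (names copied from the listing; ovs-if-br-ex is the one the script loops on):

```shell
#!/bin/sh
# Compare expected OVS connection profiles against those on disk
# (names taken from the directory listing above).
t=$(mktemp -d)
printf '%s\n' br-ex ovs-if-br-ex ovs-if-phys0 ovs-port-br-ex ovs-port-phys0 | sort > "$t/expected"
printf '%s\n' br-ex default_connection ovs-if-phys0 ovs-port-br-ex ovs-port-phys0 | sort > "$t/present"
# comm -23 prints lines present only in the first (expected) file:
missing=$(comm -23 "$t/expected" "$t/present")
echo "missing: $missing"   # → missing: ovs-if-br-ex
```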

kai-uwe-rommel commented 3 years ago

When I look on the master where I happen to have IP connectivity, I see:

[root@master-03 system-connections]# nmcli conn                             
NAME              UUID                                  TYPE        DEVICE  
Wired Connection  445bc266-9f75-422d-bb47-f094f65c5d8d  ethernet    ens192  
br-ex             1f37631e-796d-46d1-9314-9ddce069b91f  ovs-bridge  br-ex   
ovs-port-br-ex    21cf18e8-f9de-479d-b68c-27fb54f7e581  ovs-port    br-ex   
ovs-port-phys0    680aedf8-2265-42ac-95b7-2e909fdd48dd  ovs-port    ens192  
ovs-if-phys0      a3248f34-01fb-4d9c-98fd-2066cc9a1919  ethernet    --      

On one of the other masters, where I don't have that, I see the following (screenshot attached):

So there is the difference: here I do not have ens192 as the device of the "Wired Connection", for whatever reason. The reason might be the different UUIDs and their alphabetical order. Where I have IP connectivity, the UUID of Wired Connection sorts lower; on the others, that of ovs-if-phys0 sorts lower. So pure luck.

But that seems to be wrong either way?
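A quick illustration of that hypothesis (pure speculation about the ordering behavior, using the two UUIDs from the working master above): sorting them lexically puts the Wired Connection's UUID first.

```shell
#!/bin/sh
# Sort the two competing connection UUIDs (from the nmcli output above)
# to see which one orders first lexically.
first=$(printf '%s\n' \
  '445bc266-9f75-422d-bb47-f094f65c5d8d' \
  'a3248f34-01fb-4d9c-98fd-2066cc9a1919' | sort | head -n 1)
echo "$first"   # → 445bc266-9f75-422d-bb47-f094f65c5d8d (Wired Connection)
```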

kai-uwe-rommel commented 3 years ago

This is how it looks on a master node of a cluster with OKD 4.6 2021-02-14 that was installed with DHCP:

[core@master-01 ~]$ nmcli conn
NAME              UUID                                  TYPE           DEVICE
ovs-if-br-ex      4ac8cd31-699b-445b-b28b-cb4ea0ca6295  ovs-interface  br-ex
br-ex             a73cb2f7-1459-473b-be0b-accf4bb2a0b5  ovs-bridge     br-ex
ovs-if-phys0      640ae954-bb3a-4af9-ad87-c3503429a8ed  ethernet       ens192
ovs-port-br-ex    b5be2d0d-cafd-47c3-a320-e4d25ccb9392  ovs-port       br-ex
ovs-port-phys0    9ce62b13-f4a2-4696-855d-53414d2bb42e  ovs-port       ens192
Wired Connection  7c2b48f2-3587-4e5e-8f29-3aee250340d5  ethernet       --

This one actually has the ovs-if-br-ex interface that the 4.7 cluster's master node complains about in the journalctl log.

kai-uwe-rommel commented 3 years ago

I have just deployed a cluster with 4.7 2021-02-25 successfully (except for the https://github.com/openshift/okd/issues/536 issue). My static IP assignment worked fine. So it is definitely broken in the 4.7 2021-03-07 release; nothing is wrong with my environment.

I have not tried the 03-06 release again. That the 03-07 release came out only one day later suggests to me that the 03-06 release is somehow poisoned and should be avoided?

kai-uwe-rommel commented 3 years ago

I have since upgraded the 4.7 2021-02-25 cluster to 2021-03-07, and static IP assignment (via kernel args) still works. So the 2021-03-07 release of OKD 4.7 can work correctly with static IP assignment. But the initial installation process of OKD 4.7 2021-03-07 seems to have a bug, in the sense that it fails as described above when the node VMs are started with static IP assignment via kernel args.

kai-uwe-rommel commented 3 years ago

Here is a sample journalctl log from one of the nodes, from the rpm-ostree rebase before the reboot. It shows error messages that may or may not be related to this interface issue: journalctl-no-ip.txt

kai-uwe-rommel commented 3 years ago

This is how these error messages look. I'm not convinced they are related to the IP assignment / missing interface problem, but just for the record:

Mar 16 19:33:52 master-01.kur-test2.ars.de rpm-ostree(openvswitch.prein)[5540]: (2021-03-16 19:33:52:015549): [sss_cache] [ldb] (0x0020): Unable to open tdb '/var/lib/sss/db/config.ldb': No such file or directory
Mar 16 19:33:52 master-01.kur-test2.ars.de rpm-ostree(openvswitch.prein)[5540]: (2021-03-16 19:33:52:016155): [sss_cache] [ldb] (0x0020): Failed to connect to '/var/lib/sss/db/config.ldb' with backend 'tdb': Unable to open tdb '/var/lib/sss/db/config.ldb': No such file or directory
Mar 16 19:33:52 master-01.kur-test2.ars.de rpm-ostree(openvswitch.prein)[5540]: (2021-03-16 19:33:52:016719): [sss_cache] [confdb_init] (0x0010): Unable to open config database [/var/lib/sss/db/config.ldb]
Mar 16 19:33:52 master-01.kur-test2.ars.de rpm-ostree(openvswitch.prein)[5540]: (2021-03-16 19:33:52:016999): [sss_cache] [init_domains] (0x0020): Could not initialize connection to the confdb
Mar 16 19:33:52 master-01.kur-test2.ars.de rpm-ostree(openvswitch.prein)[5540]: Could not open available domains
Mar 16 19:33:52 master-01.kur-test2.ars.de rpm-ostree(openvswitch.prein)[5540]: (2021-03-16 19:33:52:017055): [sss_cache] [init_context] (0x0040): Initialization of sysdb connections failed
Mar 16 19:33:52 master-01.kur-test2.ars.de rpm-ostree(openvswitch.prein)[5540]: (2021-03-16 19:33:52:017202): [sss_cache] [main] (0x0020): Error initializing context for the application
Mar 16 19:33:52 master-01.kur-test2.ars.de rpm-ostree(openvswitch.prein)[5536]: groupadd: sss_cache exited with status 5
Mar 16 19:33:52 master-01.kur-test2.ars.de rpm-ostree(openvswitch.prein)[5536]: groupadd: Failed to flush the sssd cache.
Mar 16 19:33:52 master-01.kur-test2.ars.de rpm-ostree(openvswitch.prein)[5542]: (2021-03-16 19:33:52:064778): [sss_cache] [ldb] (0x0020): Unable to open tdb '/var/lib/sss/db/config.ldb': No such file or directory
Mar 16 19:33:52 master-01.kur-test2.ars.de rpm-ostree(openvswitch.prein)[5542]: (2021-03-16 19:33:52:065274): [sss_cache] [ldb] (0x0020): Failed to connect to '/var/lib/sss/db/config.ldb' with backend 'tdb': Unable to open tdb '/var/lib/sss/db/config.ldb': No such file or directory
Mar 16 19:33:52 master-01.kur-test2.ars.de rpm-ostree(openvswitch.prein)[5542]: (2021-03-16 19:33:52:066013): [sss_cache] [confdb_init] (0x0010): Unable to open config database [/var/lib/sss/db/config.ldb]
Mar 16 19:33:52 master-01.kur-test2.ars.de rpm-ostree(openvswitch.prein)[5542]: (2021-03-16 19:33:52:066266): [sss_cache] [init_domains] (0x0020): Could not initialize connection to the confdb
Mar 16 19:33:52 master-01.kur-test2.ars.de rpm-ostree(openvswitch.prein)[5542]: Could not open available domains
Mar 16 19:33:52 master-01.kur-test2.ars.de rpm-ostree(openvswitch.prein)[5542]: (2021-03-16 19:33:52:066345): [sss_cache] [init_context] (0x0040): Initialization of sysdb connections failed
Mar 16 19:33:52 master-01.kur-test2.ars.de rpm-ostree(openvswitch.prein)[5542]: (2021-03-16 19:33:52:066424): [sss_cache] [main] (0x0020): Error initializing context for the application
Mar 16 19:33:52 master-01.kur-test2.ars.de rpm-ostree(openvswitch.prein)[5536]: groupadd: sss_cache exited with status 5
Mar 16 19:33:52 master-01.kur-test2.ars.de rpm-ostree(openvswitch.prein)[5536]: groupadd: Failed to flush the sssd cache.
Mar 16 19:33:52 master-01.kur-test2.ars.de rpm-ostree(openvswitch.prein)[5547]: (2021-03-16 19:33:52:152882): [sss_cache] [ldb] (0x0020): Unable to open tdb '/var/lib/sss/db/config.ldb': No such file or directory
Mar 16 19:33:52 master-01.kur-test2.ars.de rpm-ostree(openvswitch.prein)[5547]: (2021-03-16 19:33:52:153295): [sss_cache] [ldb] (0x0020): Failed to connect to '/var/lib/sss/db/config.ldb' with backend 'tdb': Unable to open tdb '/var/lib/sss/db/config.ldb': No such file or directory
Mar 16 19:33:52 master-01.kur-test2.ars.de rpm-ostree(openvswitch.prein)[5547]: (2021-03-16 19:33:52:153811): [sss_cache] [confdb_init] (0x0010): Unable to open config database [/var/lib/sss/db/config.ldb]
Mar 16 19:33:52 master-01.kur-test2.ars.de rpm-ostree(openvswitch.prein)[5547]: (2021-03-16 19:33:52:154118): [sss_cache] [init_domains] (0x0020): Could not initialize connection to the confdb
Mar 16 19:33:52 master-01.kur-test2.ars.de rpm-ostree(openvswitch.prein)[5547]: Could not open available domains
Mar 16 19:33:52 master-01.kur-test2.ars.de rpm-ostree(openvswitch.prein)[5547]: (2021-03-16 19:33:52:154198): [sss_cache] [init_context] (0x0040): Initialization of sysdb connections failed
Mar 16 19:33:52 master-01.kur-test2.ars.de rpm-ostree(openvswitch.prein)[5547]: (2021-03-16 19:33:52:154261): [sss_cache] [main] (0x0020): Error initializing context for the application
Mar 16 19:33:52 master-01.kur-test2.ars.de rpm-ostree(openvswitch.prein)[5544]: useradd: sss_cache exited with status 5
Mar 16 19:33:52 master-01.kur-test2.ars.de rpm-ostree(openvswitch.prein)[5544]: useradd: Failed to flush the sssd cache.
Mar 16 19:33:52 master-01.kur-test2.ars.de rpm-ostree(openvswitch.prein)[5551]: (2021-03-16 19:33:52:194511): [sss_cache] [ldb] (0x0020): Unable to open tdb '/var/lib/sss/db/config.ldb': No such file or directory
Mar 16 19:33:52 master-01.kur-test2.ars.de rpm-ostree(openvswitch.prein)[5551]: (2021-03-16 19:33:52:194848): [sss_cache] [ldb] (0x0020): Failed to connect to '/var/lib/sss/db/config.ldb' with backend 'tdb': Unable to open tdb '/var/lib/sss/db/config.ldb': No such file or directory
Mar 16 19:33:52 master-01.kur-test2.ars.de rpm-ostree(openvswitch.prein)[5551]: (2021-03-16 19:33:52:194957): [sss_cache] [confdb_init] (0x0010): Unable to open config database [/var/lib/sss/db/config.ldb]
Mar 16 19:33:52 master-01.kur-test2.ars.de rpm-ostree(openvswitch.prein)[5551]: (2021-03-16 19:33:52:195815): [sss_cache] [init_domains] (0x0020): Could not initialize connection to the confdb
Mar 16 19:33:52 master-01.kur-test2.ars.de rpm-ostree(openvswitch.prein)[5551]: Could not open available domains
Mar 16 19:33:52 master-01.kur-test2.ars.de rpm-ostree(openvswitch.prein)[5551]: (2021-03-16 19:33:52:195882): [sss_cache] [init_context] (0x0040): Initialization of sysdb connections failed
Mar 16 19:33:52 master-01.kur-test2.ars.de rpm-ostree(openvswitch.prein)[5551]: (2021-03-16 19:33:52:195934): [sss_cache] [main] (0x0020): Error initializing context for the application
Mar 16 19:33:52 master-01.kur-test2.ars.de rpm-ostree(openvswitch.prein)[5544]: useradd: sss_cache exited with status 5
Mar 16 19:33:52 master-01.kur-test2.ars.de rpm-ostree(openvswitch.prein)[5544]: useradd: Failed to flush the sssd cache.
Mar 16 19:33:52 master-01.kur-test2.ars.de rpm-ostree(openvswitch.prein)[5557]: (2021-03-16 19:33:52:284207): [sss_cache] [ldb] (0x0020): Unable to open tdb '/var/lib/sss/db/config.ldb': No such file or directory
Mar 16 19:33:52 master-01.kur-test2.ars.de rpm-ostree(openvswitch.prein)[5557]: (2021-03-16 19:33:52:284630): [sss_cache] [ldb] (0x0020): Failed to connect to '/var/lib/sss/db/config.ldb' with backend 'tdb': Unable to open tdb '/var/lib/sss/db/config.ldb': No such file or directory
Mar 16 19:33:52 master-01.kur-test2.ars.de rpm-ostree(openvswitch.prein)[5557]: (2021-03-16 19:33:52:285396): [sss_cache] [confdb_init] (0x0010): Unable to open config database [/var/lib/sss/db/config.ldb]
Mar 16 19:33:52 master-01.kur-test2.ars.de rpm-ostree(openvswitch.prein)[5557]: (2021-03-16 19:33:52:285926): [sss_cache] [init_domains] (0x0020): Could not initialize connection to the confdb
Mar 16 19:33:52 master-01.kur-test2.ars.de rpm-ostree(openvswitch.prein)[5557]: Could not open available domains
Mar 16 19:33:52 master-01.kur-test2.ars.de rpm-ostree(openvswitch.prein)[5557]: (2021-03-16 19:33:52:285999): [sss_cache] [init_context] (0x0040): Initialization of sysdb connections failed
Mar 16 19:33:52 master-01.kur-test2.ars.de rpm-ostree(openvswitch.prein)[5557]: (2021-03-16 19:33:52:286069): [sss_cache] [main] (0x0020): Error initializing context for the application
Mar 16 19:33:52 master-01.kur-test2.ars.de rpm-ostree(openvswitch.prein)[5554]: groupadd: sss_cache exited with status 5
Mar 16 19:33:52 master-01.kur-test2.ars.de rpm-ostree(openvswitch.prein)[5554]: groupadd: Failed to flush the sssd cache.
Mar 16 19:33:52 master-01.kur-test2.ars.de rpm-ostree(openvswitch.prein)[5559]: (2021-03-16 19:33:52:327728): [sss_cache] [ldb] (0x0020): Unable to open tdb '/var/lib/sss/db/config.ldb': No such file or directory
Mar 16 19:33:52 master-01.kur-test2.ars.de rpm-ostree(openvswitch.prein)[5559]: (2021-03-16 19:33:52:328147): [sss_cache] [ldb] (0x0020): Failed to connect to '/var/lib/sss/db/config.ldb' with backend 'tdb': Unable to open tdb '/var/lib/sss/db/config.ldb': No such file or directory
Mar 16 19:33:52 master-01.kur-test2.ars.de rpm-ostree(openvswitch.prein)[5559]: (2021-03-16 19:33:52:328675): [sss_cache] [confdb_init] (0x0010): Unable to open config database [/var/lib/sss/db/config.ldb]
Mar 16 19:33:52 master-01.kur-test2.ars.de rpm-ostree(openvswitch.prein)[5559]: (2021-03-16 19:33:52:328928): [sss_cache] [init_domains] (0x0020): Could not initialize connection to the confdb
Mar 16 19:33:52 master-01.kur-test2.ars.de rpm-ostree(openvswitch.prein)[5559]: Could not open available domains
Mar 16 19:33:52 master-01.kur-test2.ars.de rpm-ostree(openvswitch.prein)[5559]: (2021-03-16 19:33:52:329049): [sss_cache] [init_context] (0x0040): Initialization of sysdb connections failed
Mar 16 19:33:52 master-01.kur-test2.ars.de rpm-ostree(openvswitch.prein)[5559]: (2021-03-16 19:33:52:329123): [sss_cache] [main] (0x0020): Error initializing context for the application
Mar 16 19:33:52 master-01.kur-test2.ars.de rpm-ostree(openvswitch.prein)[5554]: groupadd: sss_cache exited with status 5
Mar 16 19:33:52 master-01.kur-test2.ars.de rpm-ostree(openvswitch.prein)[5554]: groupadd: Failed to flush the sssd cache.
Mar 16 19:33:52 master-01.kur-test2.ars.de rpm-ostree(openvswitch.prein)[5563]: (2021-03-16 19:33:52:400409): [sss_cache] [ldb] (0x0020): Unable to open tdb '/var/lib/sss/db/config.ldb': No such file or directory
Mar 16 19:33:52 master-01.kur-test2.ars.de rpm-ostree(openvswitch.prein)[5563]: (2021-03-16 19:33:52:400836): [sss_cache] [ldb] (0x0020): Failed to connect to '/var/lib/sss/db/config.ldb' with backend 'tdb': Unable to open tdb '/var/lib/sss/db/config.ldb': No such file or directory
Mar 16 19:33:52 master-01.kur-test2.ars.de rpm-ostree(openvswitch.prein)[5563]: (2021-03-16 19:33:52:401392): [sss_cache] [confdb_init] (0x0010): Unable to open config database [/var/lib/sss/db/config.ldb]
Mar 16 19:33:52 master-01.kur-test2.ars.de rpm-ostree(openvswitch.prein)[5563]: (2021-03-16 19:33:52:401685): [sss_cache] [init_domains] (0x0020): Could not initialize connection to the confdb
Mar 16 19:33:52 master-01.kur-test2.ars.de rpm-ostree(openvswitch.prein)[5563]: Could not open available domains
Mar 16 19:33:52 master-01.kur-test2.ars.de rpm-ostree(openvswitch.prein)[5563]: (2021-03-16 19:33:52:401738): [sss_cache] [init_context] (0x0040): Initialization of sysdb connections failed
Mar 16 19:33:52 master-01.kur-test2.ars.de rpm-ostree(openvswitch.prein)[5563]: (2021-03-16 19:33:52:401955): [sss_cache] [main] (0x0020): Error initializing context for the application
Mar 16 19:33:52 master-01.kur-test2.ars.de rpm-ostree(openvswitch.prein)[5560]: usermod: sss_cache exited with status 5
Mar 16 19:33:52 master-01.kur-test2.ars.de rpm-ostree(openvswitch.prein)[5560]: usermod: Failed to flush the sssd cache.
Mar 16 19:33:52 master-01.kur-test2.ars.de rpm-ostree(openvswitch.prein)[5566]: (2021-03-16 19:33:52:443040): [sss_cache] [ldb] (0x0020): Unable to open tdb '/var/lib/sss/db/config.ldb': No such file or directory
Mar 16 19:33:52 master-01.kur-test2.ars.de rpm-ostree(openvswitch.prein)[5566]: (2021-03-16 19:33:52:443404): [sss_cache] [ldb] (0x0020): Failed to connect to '/var/lib/sss/db/config.ldb' with backend 'tdb': Unable to open tdb '/var/lib/sss/db/config.ldb': No such file or directory
Mar 16 19:33:52 master-01.kur-test2.ars.de rpm-ostree(openvswitch.prein)[5566]: (2021-03-16 19:33:52:443955): [sss_cache] [confdb_init] (0x0010): Unable to open config database [/var/lib/sss/db/config.ldb]
Mar 16 19:33:52 master-01.kur-test2.ars.de rpm-ostree(openvswitch.prein)[5566]: (2021-03-16 19:33:52:444184): [sss_cache] [init_domains] (0x0020): Could not initialize connection to the confdb
Mar 16 19:33:52 master-01.kur-test2.ars.de rpm-ostree(openvswitch.prein)[5566]: Could not open available domains
Mar 16 19:33:52 master-01.kur-test2.ars.de rpm-ostree(openvswitch.prein)[5566]: (2021-03-16 19:33:52:444270): [sss_cache] [init_context] (0x0040): Initialization of sysdb connections failed
Mar 16 19:33:52 master-01.kur-test2.ars.de rpm-ostree(openvswitch.prein)[5566]: (2021-03-16 19:33:52:444418): [sss_cache] [main] (0x0020): Error initializing context for the application
Mar 16 19:33:52 master-01.kur-test2.ars.de rpm-ostree(openvswitch.prein)[5560]: usermod: sss_cache exited with status 5
Mar 16 19:33:52 master-01.kur-test2.ars.de rpm-ostree(openvswitch.prein)[5560]: usermod: Failed to flush the sssd cache.
devzeronull commented 3 years ago

Hi kai-uwe-rommel, thank you very much for describing your problems and experiences. I am currently very busy, so I have not found the time to answer appropriately, but I have been reading everything so far :) Using VMware + Afterburn sounds like an interesting approach; I will take a closer look at it - thanks!

I can also confirm that with the current stable 4.7.0-2021-03-07 release, static configuration via kernel params works without problems, even in an initial installation using that release and the latest CoreOS - I tested that yesterday...

Best regards

kai-uwe-rommel commented 3 years ago

@devzeronull, how exactly do you set the kernel arguments? Previously (before switching to Afterburn), I modified the VM images and wrote the IP config into the ignition.firstboot file (e.g. with `set ignition_network_kcmdline=...`). Is that also what you do (your previous message sounds like that)? Or yet another approach? Do you then also add a .nmconnection file to /etc/NetworkManager/system-connections?

If you can tell me how exactly you do it, I would like to test that out if it makes a difference for my setup process. Although it would really be a pity to drop Afterburn, which is really a very elegant solution (and has worked fine so far).
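For reference, a static-IP NetworkManager keyfile of the kind mentioned above usually looks like the following minimal sketch (interface name, addresses, and DNS servers are hypothetical placeholders):

```ini
# /etc/NetworkManager/system-connections/ens192.nmconnection
# Placeholder values for illustration only.
[connection]
id=ens192
type=ethernet
interface-name=ens192

[ipv4]
method=manual
addresses=192.168.1.10/24
gateway=192.168.1.1
dns=192.168.1.2;192.168.1.3

[ipv6]
method=disabled
```

Note that NetworkManager ignores keyfiles that are not owned by root with mode 0600, so the file permissions set via Ignition matter as well.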

kai-uwe-rommel commented 3 years ago

I tried falling back on configuring static IPs with modifying the ignition.firstboot file instead of using Afterburn, but the result is the same - failure. Something really weird is going on here. I would really like to know how our setups differ.

devzeronull commented 3 years ago

If you can tell me how exactly you do it, I would like to test that out if it makes a difference for my setup process. Although it would really be a pity to drop Afterburn, which is really a very elegant solution (and has worked fine so far).

Hi, I attach the following directly to the CoreOS bootloader to parametrize the network configuration of Dracut and the CoreOS installer: `rd.neednet=1 ip=NODE_IP::GATEWAY_IP:NETMASK:NODE_FQDN_HOSTNAME:ens192:none nameserver=DNS_SERVER_1 nameserver=DNS_SERVER_2`. You can do this by hand or in the isolinux.cfg of the boot image, and generate boot ISOs using tools like genisoimage on Linux.
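As a concrete illustration of the isolinux.cfg route, the network arguments are added to the existing `append` line of the live ISO's boot entry; everything below is a hypothetical fragment with placeholder addresses (`...` stands for the distribution's original kernel arguments, which stay unchanged):

```
# fragment of isolinux.cfg on the boot ISO -- placeholder values only;
# the APPEND line must remain a single line
label linux
  kernel /images/vmlinuz
  append initrd=/images/initramfs.img ... rd.neednet=1 ip=192.168.1.10::192.168.1.1:255.255.255.0:bootstrap.example.com:ens192:none nameserver=192.168.1.2 nameserver=192.168.1.3
```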

Steps to generate a customized ISO for, e.g., a bootstrap node are:

Using this method you can then easily automate the provisioning process with a working static IP setup, if you generate the ISO on-the-fly and boot it using PXE/TFTP.
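The per-node part of such on-the-fly generation can be scripted. A minimal sketch (all variable names and addresses are hypothetical placeholders) that assembles the dracut kernel-argument string for one node:

```shell
#!/bin/sh
# Assemble the dracut static-IP kernel arguments for one node.
# All values below are illustrative placeholders; in an automated
# setup they would come from an inventory file or environment.
NODE_IP=192.168.1.10
GATEWAY_IP=192.168.1.1
NETMASK=255.255.255.0
NODE_FQDN=bootstrap.example.com
IFACE=ens192
DNS1=192.168.1.2
DNS2=192.168.1.3

# ip=<client>::<gateway>:<netmask>:<hostname>:<interface>:none -> static config
KARGS="rd.neednet=1 ip=${NODE_IP}::${GATEWAY_IP}:${NETMASK}:${NODE_FQDN}:${IFACE}:none nameserver=${DNS1} nameserver=${DNS2}"
echo "$KARGS"
```

As an aside, newer versions of coreos-installer that ship the `iso kargs` subcommand can embed such a string directly with `coreos-installer iso kargs modify --append "$KARGS" node.iso`, which avoids repacking the ISO by hand.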

In addition to that you could of course combine it with individual Ignition configs and machine sets...

kai-uwe-rommel commented 3 years ago

I understand your approach. I think it is more suitable for bare-metal installations. It would of course also work for vSphere installations, but it is more complicated and sacrifices the benefits of vSphere. I will see whether I find time to try it out.

However, I am by now firmly convinced that the problem is not in my environment or process. I have this failure ONLY with the 4.7 2021-03-07 release. EVERY OTHER release, even including the 4.7 2021-03-06 release (!), works fine with EVERYTHING ELSE UNCHANGED, i.e. in the same environment and with the same installation process. Only 4.7 2021-03-07 fails, so some new problem was introduced with it.

I could of course simply give up, wait, and see whether the next release works correctly in this respect again. But that may or may not happen. I think it is better to report the problem here so that someone from the team can pick it up and fix it, in case it is not just a glitch that disappears with the next release. The question is, will someone from the team a) believe me and b) invest the time to fix it?

kai-uwe-rommel commented 3 years ago

I can report that the rpm-ostree error messages I posted above yesterday do not seem to be related to the static IP problem. These messages also appear when installing the 03-06 release instead of the 03-07 release, but with the 03-06 release the static IP assignment works fine.

devzeronull commented 3 years ago

Yes, I agree that this is not the optimal provisioning process, but in the end the result is the same. Since we are running in a security-aware environment, we sadly cannot allow an "application" to control our infrastructure/hypervisor - so there is no alternative to bare metal for us :(

Anyway, I will keep following the issue you raised and am interested in the solution!

kai-uwe-rommel commented 3 years ago

@devzeronull, which version of FCOS are you using for your initial VM creation?

kai-uwe-rommel commented 3 years ago

It looks like I can reproduce the same problem in the same configuration with OCP 4.7.2 (this had worked before in OCP 4.7.0).

fortinj66 commented 3 years ago

As discussed in Slack, this is hopefully fixed for OKD in release https://origin-release.apps.ci.l2s4.p1.openshiftapps.com/releasestream/4.7.0-0.okd/release/4.7.0-0.okd-2021-03-22-172926 and later.

kai-uwe-rommel commented 3 years ago

Yes, I can confirm that I was able to install a new OKD cluster with this release and static IP configuration. Now we just need the next stable build with these fixes (and no other new bugs ...). :-)