openshift / installer

Install an OpenShift 4.x cluster
https://try.openshift.com
Apache License 2.0

Bare metal install of masters fails on multi NIC with teaming #4229

Closed koksj closed 3 years ago

koksj commented 3 years ago

Version

$ openshift-install version
openshift-install 4.5.0-0.okd-2020-09-18-202631
built from commit 63200c80c431b8dbaa06c0cc13282d819bd7e5f8
release image quay.io/openshift/okd@sha256:5fd1fe9707a9a4f53c8ccafad0cf44824a3a0b51e197f3fbc98d0884a9ddcf4f

Platform:

Baremetal

UPI (semi-manual installation on customised infrastructure)

What happened?

Bare metal install of masters fails on multi-NIC HP DL380 servers booting from PXE. The masters display the following message on the console:

ignition[1133]: GET error: Get https://api-int.rdp.centilliard.io:22623/config/master: dial tcp: lookup api-int.rdp.centilliard.io on [::1]53: read udp [::1]:37007->[::1]:53: read: connection refused
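
The error line names both the host being resolved and the resolver that was queried. The following is a minimal Python sketch for extracting those two fields from such a line. The function name `parse_lookup_error` and the regular expression are illustrative assumptions and are not part of Ignition or the installer. A loopback resolver may indicate that no nameserver was configured when Ignition ran, although the log line alone does not prove this.

```python
import ipaddress
import re

# Sample copied verbatim from the console message in this report.
SAMPLE = (
    "ignition[1133]: GET error: Get "
    "https://api-int.rdp.centilliard.io:22623/config/master: dial tcp: "
    "lookup api-int.rdp.centilliard.io on [::1]53: read udp "
    "[::1]:37007->[::1]:53: read: connection refused"
)

# Assumed shape of the message: "lookup <host> on <resolver>: ..."
LOOKUP_RE = re.compile(r"lookup (?P<host>\S+) on (?P<resolver>\S+): ")


def parse_lookup_error(line):
    """Return the looked-up host, the resolver, and whether it is loopback."""
    match = LOOKUP_RE.search(line)
    if match is None:
        return None
    resolver = match.group("resolver")
    # IPv6 resolvers are bracketed; IPv4 resolvers look like "a.b.c.d:port".
    bracketed = re.match(r"\[([^\]]+)\]", resolver)
    address = bracketed.group(1) if bracketed else resolver.rsplit(":", 1)[0]
    return {
        "host": match.group("host"),
        "resolver": resolver,
        "loopback": ipaddress.ip_address(address).is_loopback,
    }


print(parse_lookup_error(SAMPLE))
```

In this report the resolver is the loopback address rather than 192.168.2.253, the value passed via `nameserver=` on the kernel command line.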

What you expected to happen?

Successful installation of the 3 masters.

How to reproduce it (as minimally and precisely as possible)?

PXE boot file:

label 10
  menu label ^10) FCOS Install Bootstrap
  kernel fcos/vmlinuz
  append team=team0:eno1,eno2:mode=active-backup ip=192.168.2.139::192.168.2.254:255.255.255.0:bootstrap.centilliard.io:team0:none nameserver=192.168.2.253 initrd=fcos/initrd.img,fcos/rootfs.img console=tty0 console=ttyS0 coreos.inst=yes coreos.inst.install_dev=/dev/sda coreos.inst.image_url=http://192.168.2.253/fcos/fedora-coreos-32.20200824.3.0-metal.x86_64.raw.xz coreos.inst.ignition_url=http://192.168.2.253/bootstrap.ign

label 11
  menu label ^11) FCOS Install Master01
  kernel fcos/vmlinuz
  append team=team0:enp2s0f0,enp2s0f1,enp3s0f0,enp3s0f1:mode=active-backup ip=192.168.2.140::192.168.2.254:255.255.255.0:master01.centilliard.io:team0:none nameserver=192.168.2.253 initrd=fcos/initrd.img,fcos/rootfs.img console=tty0 console=ttyS0 coreos.inst=yes coreos.inst.install_dev=/dev/sda coreos.inst.image_url=http://192.168.2.253/fcos/fedora-coreos-32.20200824.3.0-metal.x86_64.raw.xz coreos.inst.ignition_url=http://192.168.2.253/master.ign
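
Because the `team=` and `ip=` arguments must refer to the same device name, it can help to check the append line mechanically before booting. The following is a minimal Python sketch of such a check. The helper names (`parse_append`, `split_ip`, `split_team`, `check`) are hypothetical. The field order for `ip=` follows the static form documented for dracut (client, peer, gateway, netmask, hostname, interface, autoconf). The sketch handles IPv4 values only. It verifies internal consistency of the command line as written and does not establish whether the initramfs accepts the team options.

```python
# Field order of the static form of dracut's ip= argument (IPv4 only here).
IP_FIELDS = ("client", "peer", "gateway", "netmask",
             "hostname", "interface", "autoconf")

# Network-related arguments copied from the Master01 entry in this report.
APPEND = (
    "team=team0:enp2s0f0,enp2s0f1,enp3s0f0,enp3s0f1:mode=active-backup "
    "ip=192.168.2.140::192.168.2.254:255.255.255.0:"
    "master01.centilliard.io:team0:none "
    "nameserver=192.168.2.253 console=tty0 console=ttyS0"
)


def parse_append(append):
    """Map each key to the list of values it was given (keys may repeat)."""
    args = {}
    for token in append.split():
        key, sep, value = token.partition("=")
        args.setdefault(key, []).append(value if sep else None)
    return args


def split_ip(value):
    """Split an IPv4 ip= value into named fields."""
    return dict(zip(IP_FIELDS, value.split(":")))


def split_team(value):
    """Split team=<master>:<members>[:<rest>] without interpreting <rest>."""
    master, _, rest = value.partition(":")
    members, _, options = rest.partition(":")
    return {
        "master": master,
        "members": members.split(",") if members else [],
        "options": options,
    }


def check(append):
    """Return a list of human-readable inconsistencies (empty if none)."""
    args = parse_append(append)
    problems = []
    team = split_team(args["team"][0]) if "team" in args else None
    for value in args.get("ip", []):
        ip = split_ip(value)
        if team and ip.get("interface") != team["master"]:
            problems.append(
                "ip= uses %r but team master is %r"
                % (ip.get("interface"), team["master"])
            )
    if "nameserver" not in args:
        problems.append("no nameserver= argument")
    return problems


print(check(APPEND))
```

For the Master01 line above, the `ip=` interface and the team master are both `team0`, so this particular check reports nothing. The command line is therefore internally consistent, and the failure lies elsewhere.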

Anything else we need to know?

Installation using a single NIC per server, with no teaming specified, succeeds. The working configuration is shown below:

PXE BOOT file:

label 10
  menu label ^10) FCOS Install Bootstrap
  kernel fcos/vmlinuz
  append ip=192.168.2.139::192.168.2.254:255.255.255.0:bootstrap.centilliard.io:eno1:none nameserver=192.168.2.253 initrd=fcos/initrd.img,fcos/rootfs.img console=tty0 console=ttyS0 coreos.inst=yes coreos.inst.install_dev=/dev/sda coreos.inst.image_url=http://192.168.2.253/fcos/fedora-coreos-32.20200824.3.0-metal.x86_64.raw.xz coreos.inst.ignition_url=http://192.168.2.253/bootstrap.ign

label 11
  menu label ^11) FCOS Install Master01
  kernel fcos/vmlinuz
  append ip=192.168.2.140::192.168.2.254:255.255.255.0:master01.centilliard.io:enp2s0f0:none nameserver=192.168.2.253 initrd=fcos/initrd.img,fcos/rootfs.img console=tty0 console=ttyS0 coreos.inst=yes coreos.inst.install_dev=/dev/sda coreos.inst.image_url=http://192.168.2.253/fcos/fedora-coreos-32.20200824.3.0-metal.x86_64.raw.xz coreos.inst.ignition_url=http://192.168.2.253/master.ign

koksj commented 3 years ago

I have since come across the blog post "Advanced Network customizations for OpenShift Install". Please see: https://www.openshift.com/blog/advanced-network-customizations-for-openshift-install

I tried customising the bootstrap node's network with the attached files (the .txt extension is only there so that they could be uploaded). Each file is written to /etc/sysconfig/network-scripts/, but the bootstrap node does not load the configuration on reboot. Instead, it obtains two DHCP addresses from the network, one for each NIC. See the output of "ip address" below:

1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: eno1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
    link/ether 00:1d:09:68:48:0c brd ff:ff:ff:ff:ff:ff
    altname enp3s0
    inet 192.168.2.102/24 brd 192.168.2.255 scope global dynamic noprefixroute eno1
       valid_lft 41874sec preferred_lft 41874sec
    inet6 fe80::53c0:e5c5:29a7:6e5/64 scope link noprefixroute
       valid_lft forever preferred_lft forever
3: eno2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
    link/ether 00:1d:09:68:48:0e brd ff:ff:ff:ff:ff:ff
    altname enp7s0
    inet 192.168.2.37/24 brd 192.168.2.255 scope global dynamic noprefixroute eno2
       valid_lft 41875sec preferred_lft 41875sec
    inet6 fe80::9caf:6ed4:42f9:cb34/64 scope link noprefixroute
       valid_lft forever preferred_lft forever
4: docker0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN group default
    link/ether 02:42:de:30:f3:e7 brd ff:ff:ff:ff:ff:ff
    inet 172.17.0.1/16 brd 172.17.255.255 scope global docker0
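
The key observations in this output are that no team0 device exists and that each physical NIC holds a dynamic (DHCP) address. The following is a minimal Python sketch that summarises `ip address` text in that way. The function name `summarize` is hypothetical, the embedded sample is an abridged copy of the output above, and the parser assumes the usual one-record-per-line layout of the `ip` tool.

```python
import re

# Abridged copy of the output in this report, one record per line.
SAMPLE = """\
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    inet 127.0.0.1/8 scope host lo
    inet6 ::1/128 scope host
2: eno1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
    inet 192.168.2.102/24 brd 192.168.2.255 scope global dynamic noprefixroute eno1
3: eno2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
    inet 192.168.2.37/24 brd 192.168.2.255 scope global dynamic noprefixroute eno2
4: docker0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN group default
    inet 172.17.0.1/16 brd 172.17.255.255 scope global docker0
"""


def summarize(text):
    """Map interface name -> list of (IPv4 address, is_dynamic) tuples."""
    result = {}
    current = None
    for line in text.splitlines():
        # Interface header lines start at column 0 with "<index>: <name>:".
        header = re.match(r"\d+: ([^:@\s]+)", line)
        if header:
            current = header.group(1)
            result[current] = []
        elif current and line.strip().startswith("inet "):
            fields = line.split()
            result[current].append((fields[1], "dynamic" in fields))
    return result


summary = summarize(SAMPLE)
print("team0 present:", "team0" in summary)
for name, addresses in summary.items():
    print(name, addresses)
```

A static address on team0 would show up as a `team0` key whose address is not flagged dynamic. Here the key is absent, which matches the observation that the ifcfg files were not loaded.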

ifcfg-eno1.txt ifcfg-eno2.txt ifcfg-team0.txt

openshift-bot commented 3 years ago

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close. Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

openshift-bot commented 3 years ago

Stale issues rot after 30d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle rotten. Rotten issues close after an additional 30d of inactivity. Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale

openshift-bot commented 3 years ago

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen. Mark the issue as fresh by commenting /remove-lifecycle rotten. Exclude this issue from closing again by commenting /lifecycle frozen.

/close

openshift-ci-robot commented 3 years ago

@openshift-bot: Closing this issue.

In response to [this](https://github.com/openshift/installer/issues/4229#issuecomment-787952839):

> Rotten issues close after 30d of inactivity.
>
> Reopen the issue by commenting `/reopen`.
> Mark the issue as fresh by commenting `/remove-lifecycle rotten`.
> Exclude this issue from closing again by commenting `/lifecycle frozen`.
>
> /close

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.