mpk166 closed this issue 1 year ago.
I ran into this in my lab today.
Observations:
When creating a custom cluster network, Harvester builds the network interfaces (example: custom ClusterNetwork "app"):
# ip addr show app-bo; ip addr show app-br
48: app-bo: <BROADCAST,MULTICAST,MASTER,UP,LOWER_UP> mtu 1500 qdisc noqueue master app-br state UP group default qlen 1000
link/ether 00:25:90:49:66:7f brd ff:ff:ff:ff:ff:ff
38: app-br: <BROADCAST,MULTICAST,PROMISC,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
link/ether 00:25:90:49:66:7f brd ff:ff:ff:ff:ff:ff
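For reference, one quick way to confirm the pair was built correctly is to check that the bond reports the bridge as its master. The sample line below is copied from the output above; on a live host you would feed `ip -o link show app-bo` into the same filter instead:

```shell
# Sample line taken from the `ip addr show app-bo` output above; on a live
# host, substitute: ip -o link show app-bo
sample='48: app-bo: <BROADCAST,MULTICAST,MASTER,UP,LOWER_UP> mtu 1500 qdisc noqueue master app-br state UP group default qlen 1000'

# Pull out the master device; for a healthy pair this should be app-br
master=$(printf '%s\n' "$sample" | sed -n 's/.* master \([^ ]*\).*/\1/p')
echo "$master"   # app-br
```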
Upon initial setup of the custom ClusterNetwork, the Host Network shows as Ready.
However, Harvester does not re-create the interfaces when the host is rebooted:
# ip addr show app-bo; ip addr show app-br
Device "app-bo" does not exist.
Device "app-br" does not exist.
After the reboot, the Host Network looks like this (screenshot omitted):
The other host in the lab still shows the app-bo and app-br interfaces, but now has the same "invalid vlanconfig" error as the first node (Ready with an exclamation point).
As long as one host is up, the bridge is recreated on the rebooted node when a VM is migrated to that host.
If all nodes are down (or it is a single-node Harvester cluster), Harvester sets the bridge back up when a VM requests it, but does not automatically set up the bond as would be expected. Manually re-adding the bond at the OS level does not fix the connection issues.
Workaround: manually adding the bridge and bond ifcfg files back to /etc/sysconfig/network temporarily fixes the issue (until the next outage, since the ifcfg files are removed when the host boots). The VM(s) become accessible again.
# cp ~rancher/ifcfg-app-b* /etc/sysconfig/network/
# ifup app-br
app-br up
# ip addr show app-bo; ip addr show app-br
44: app-bo: <BROADCAST,MULTICAST,MASTER,UP,LOWER_UP> mtu 1500 qdisc noqueue master app-br state UP group default qlen 1000
link/ether 00:25:90:49:66:7f brd ff:ff:ff:ff:ff:ff
38: app-br: <BROADCAST,MULTICAST,PROMISC,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
link/ether 00:25:90:49:66:7f brd ff:ff:ff:ff:ff:ff
Once that is done, the hosts become aware of the networks; however, they still show an error for the Network.
IFCFG files for reference:
- ifcfg-app-bo
STARTMODE='onboot'
BONDING_MASTER='yes'
BOOTPROTO='none'
POST_UP_SCRIPT="wicked:setup_bond.sh"
BONDING_SLAVE_0='enp2s0f1'
BONDING_MODULE_OPTS='miimon=100 mode=active-backup '
DHCLIENT_SET_DEFAULT_ROUTE='no'
- ifcfg-app-br
STARTMODE='onboot'
BOOTPROTO='static'
BRIDGE='yes'
BRIDGE_STP='off'
BRIDGE_FORWARDDELAY='0'
BRIDGE_PORTS='app-bo'
PRE_UP_SCRIPT="wicked:setup_bridge.sh"
POST_UP_SCRIPT="wicked:setup_bridge.sh"
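Since wicked ifcfg files are plain shell-style KEY='value' assignments, one quick sanity check before running `ifup` is to source them in a subshell and echo the fields that matter. This is a minimal, self-contained sketch (the config is written to a temp file here; on a live node you would source /etc/sysconfig/network/ifcfg-app-bo directly):

```shell
# ifcfg files are shell-compatible KEY='value' pairs, so they can be
# sourced for a quick sanity check. Contents copied from ifcfg-app-bo above.
cfg=$(mktemp)
cat > "$cfg" <<'EOF'
STARTMODE='onboot'
BONDING_MASTER='yes'
BOOTPROTO='none'
BONDING_SLAVE_0='enp2s0f1'
BONDING_MODULE_OPTS='miimon=100 mode=active-backup '
EOF
. "$cfg"
echo "slave=$BONDING_SLAVE_0"   # slave=enp2s0f1
rm -f "$cfg"
```

After the interface is brought up, `wicked ifstatus app-br` reports whether wicked considers it configured.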
I would expect both the bridge and bond to be recreated when a VM requests them. The "invalid vlanconfig" error is also concerning. I cannot rule out that it is just something in my setup, since this is my first Harvester install, so any insight would be greatly appreciated.
Support Bundle uploaded for reference: supportbundle_b5129a5e-863d-4b04-800c-f92b773c7f29_2023-01-23T22-29-57Z.zip
cc @yaocw2020 please take a look when you are free, thanks.
FYI:
I was able to make the interfaces semi-persistent by adding the appropriate sections for the bond, bridge, and interface to /oem/99_custom.yaml. I also adjusted the wicked setup_bond.sh and setup_bridge.sh scripts to account for the custom network.
The hosts still have the "invalid vlanconfig" error. Attaching a support bundle based on this new config: supportbundle_b5129a5e-863d-4b04-800c-f92b773c7f29_2023-01-24T20-05-05Z.zip
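For anyone trying the same /oem/99_custom.yaml approach: that file is processed by the yip/cos-init engine, which can write files during a boot stage. The fragment below is only a sketch of the shape such an addition might take; the stage name, permissions, and the idea of embedding the ifcfg contents verbatim are my assumptions rather than an excerpt from a working config, so validate it against the existing stages in your 99_custom.yaml before rebooting.

```yaml
# SKETCH ONLY (assumed yip stage layout): merge into the existing
# stages in /oem/99_custom.yaml rather than replacing them.
stages:
  initramfs:
    - name: "Persist custom ClusterNetwork ifcfg files"
      files:
        - path: /etc/sysconfig/network/ifcfg-app-bo
          permissions: 0600
          owner: 0
          group: 0
          content: |
            STARTMODE='onboot'
            BONDING_MASTER='yes'
            BOOTPROTO='none'
            BONDING_SLAVE_0='enp2s0f1'
```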
From the first support bundle (https://github.com/harvester/harvester/issues/3337#issuecomment-1401100012, logs\harvester-system\harvester-network-controller-zc6v7), there are continuous error messages, which lead to the observed phenomenon.
2023-01-23T19:40:27.784236593Z I0123 19:40:27.784020 1 controller.go:75] vlan config appnet has been changed, spec: {Description: ClusterNetwork:app NodeSelector:map[] Uplink:{NICs:[enp2s0f1] LinkAttrs:0xc002f48f30 BondOptions:0xc002f5c7c8}}
2023-01-23T19:40:27.784299170Z I0123 19:40:27.784187 1 controller.go:133] matchedNodes: [mcharvester01 mcharvester02], h.nodeName: mcharvester02
2023-01-23T19:40:27.784449746Z time="2023-01-23T19:40:27Z" level=info msg="Starting k8s.cni.cncf.io/v1, Kind=NetworkAttachmentDefinition controller"
2023-01-23T19:40:27.784807121Z I0123 19:40:27.784679 1 controller.go:275] add nad:homenet with vid:0 to the list
2023-01-23T19:40:27.984449790Z time="2023-01-23T19:40:27Z" level=error msg="error syncing 'appnet': handler harvester-network-vlanconfig-controller: set up VLAN failed, vlanconfig: appnet, node: mcharvester02, error: failed to get local area from nad default/homenet, error: invalid vlanconfig , requeuing"
2023-01-23T19:40:28.083325673Z I0123 19:40:28.083082 1 controller.go:75] vlan config appnet has been changed, spec: {Description: ClusterNetwork:app NodeSelector:map[] Uplink:{NICs:[enp2s0f1] LinkAttrs:0xc002f48f30 BondOptions:0xc002f5c7c8}}
...
2023-01-23T22:29:42.832624467Z I0123 22:29:42.832231 1 controller.go:75] vlan config appnet has been changed, spec: {Description: ClusterNetwork:app NodeSelector:map[] Uplink:{NICs:[enp2s0f1] LinkAttrs:0xc002f48f30 BondOptions:0xc002f5c7c8}}
2023-01-23T22:29:42.832707084Z I0123 22:29:42.832329 1 controller.go:133] matchedNodes: [mcharvester01 mcharvester02], h.nodeName: mcharvester02
2023-01-23T22:29:42.832732504Z I0123 22:29:42.832396 1 controller.go:275] add nad:homenet with vid:0 to the list
2023-01-23T22:29:42.846403899Z time="2023-01-23T22:29:42Z" level=error msg="error syncing 'appnet': handler harvester-network-vlanconfig-controller: set up VLAN failed, vlanconfig: appnet, node: mcharvester02, error: failed to get local area from nad default/homenet, error: invalid vlanconfig , requeuing"
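The failing lookup is against nad default/homenet, which the controller registered with vid:0 (an untagged network). On a live cluster the VLAN id can be read from the NAD's CNI config string with `kubectl get net-attach-def -n default homenet -o jsonpath='{.spec.config}'`. The sample config string below is an assumption modeled on the bridge CNI plugin, shown only to illustrate the check:

```shell
# ASSUMED example of a bridge-CNI config string for an untagged network;
# a tagged network would carry a "vlan":<id> field in this JSON.
config='{"cniVersion":"0.3.1","type":"bridge","bridge":"app-br","promiscMode":true,"ipam":{}}'

# Extract the vlan field if present; default to 0 (untagged) when absent
vlan=$(printf '%s' "$config" | grep -o '"vlan":[0-9]*' | cut -d: -f2)
echo "vlan=${vlan:-0}"   # vlan=0 for this untagged sample
```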
The kernel log of mcharvester02 has a bond on enp2s0f0, but no bond on enp2s0f1:
Jan 23 19:39:02 mcharvester02 kernel: bridge: filtering via arp/ip/ip6tables is no longer available by default. Update your scripts to load br_netfilter if you need this.
Jan 23 19:39:02 mcharvester02 kernel: Ethernet Channel Bonding Driver: v3.7.1 (April 27, 2011)
Jan 23 19:39:02 mcharvester02 kernel: mgmt-bo: (slave enp2s0f0): Enslaving as a backup interface with a down link
Jan 23 19:39:02 mcharvester02 kernel: mgmt-br: port 1(mgmt-bo) entered blocking state
Jan 23 19:39:02 mcharvester02 kernel: mgmt-br: port 1(mgmt-bo) entered disabled state
Jan 23 19:39:05 mcharvester02 kernel: igb 0000:02:00.0 enp2s0f0: igb: enp2s0f0 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX/TX
Jan 23 19:39:05 mcharvester02 kernel: mgmt-bo: (slave enp2s0f0): link status definitely up, 1000 Mbps full duplex
Jan 23 19:39:05 mcharvester02 kernel: mgmt-bo: (slave enp2s0f0): making interface the new active one
Jan 23 19:39:05 mcharvester02 kernel: mgmt-bo: active interface up!
Jan 23 19:39:05 mcharvester02 kernel: mgmt-br: port 1(mgmt-bo) entered blocking state
Jan 23 19:39:05 mcharvester02 kernel: mgmt-br: port 1(mgmt-bo) entered forwarding state
Jan 23 19:39:05 mcharvester02 kernel: NET: Registered protocol family 17
Jan 23 19:39:08 mcharvester02 kernel: Bridge firewalling registered
Jan 23 19:39:33 mcharvester02 kernel: bpfilter: Loaded bpfilter_umh pid 2598
By contrast, mcharvester01 has app-bo in its log, but only at Jan 18 22:24; the later reboot has no such entry:
Jan 18 22:24:01 mcharvester01 kernel: app-bo: (slave enp2s0f1): Enslaving as a backup interface with a down link
Jan 18 22:24:04 mcharvester01 kernel: igb 0000:02:00.1 enp2s0f1: igb: enp2s0f1 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX/TX
Jan 18 22:24:04 mcharvester01 kernel: app-bo: (slave enp2s0f1): link status definitely up, 1000 Mbps full duplex
Jan 18 22:24:04 mcharvester01 kernel: app-bo: (slave enp2s0f1): making interface the new active one
Jan 18 22:24:04 mcharvester01 kernel: app-bo: active interface up!
Jan 18 22:25:10 mcharvester01 kernel: scsi host8: iSCSI Initiator over TCP/IP
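To spot the asymmetry between nodes quickly, the enslave lines can be reduced to a bond-to-slave mapping. On a live host the input would come from `journalctl -k`; the sample lines here are taken from the kernel logs above:

```shell
# Sample kernel-log lines from both nodes (see logs above); on a live host:
#   journalctl -k | grep Enslaving
log='mgmt-bo: (slave enp2s0f0): Enslaving as a backup interface with a down link
app-bo: (slave enp2s0f1): Enslaving as a backup interface with a down link'

# Reduce each line to "bond <- slave"; a missing app-bo line after a reboot
# is exactly the symptom described in this issue.
printf '%s\n' "$log" | sed -n 's/^\([^:]*\): (slave \([^)]*\)).*/\1 <- \2/p'
```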
The wicked log shows the node reboot history and the setup of the management network only:
-- Reboot --
Jan 22 00:27:28 mcharvester01 systemd[1]: Starting wicked managed network interfaces...
Jan 22 00:27:34 mcharvester01 wicked[1878]: lo up
Jan 22 00:27:34 mcharvester01 wicked[1878]: enp2s0f0 enslaved
Jan 22 00:27:34 mcharvester01 wicked[1878]: mgmt-br up
Jan 22 00:27:34 mcharvester01 wicked[1878]: mgmt-bo enslaved
Jan 22 00:27:34 mcharvester01 systemd[1]: Finished wicked managed network interfaces.
Jan 22 00:27:57 mcharvester01 systemd[1]: Stopping wicked managed network interfaces...
Jan 22 00:27:58 mcharvester01 wicked[2452]: enp2s0f0 device-ready
Jan 22 00:27:58 mcharvester01 systemd[1]: wicked.service: Succeeded.
Jan 22 00:27:58 mcharvester01 systemd[1]: Stopped wicked managed network interfaces.
-- Reboot --
Jan 23 18:14:25 mcharvester01 systemd[1]: Starting wicked managed network interfaces...
Jan 23 18:14:31 mcharvester01 wicked[1910]: lo up
Jan 23 18:14:31 mcharvester01 wicked[1910]: enp2s0f0 enslaved
Jan 23 18:14:31 mcharvester01 wicked[1910]: mgmt-br up
Jan 23 18:14:31 mcharvester01 wicked[1910]: mgmt-bo enslaved
Jan 23 18:14:31 mcharvester01 systemd[1]: Finished wicked managed network interfaces.
-- Reboot --
Jan 22 00:27:23 mcharvester02 systemd[1]: Starting wicked managed network interfaces...
Jan 22 00:27:29 mcharvester02 wicked[1851]: lo up
Jan 22 00:27:29 mcharvester02 wicked[1851]: enp2s0f0 enslaved
Jan 22 00:27:29 mcharvester02 wicked[1851]: mgmt-br up
Jan 22 00:27:29 mcharvester02 wicked[1851]: mgmt-bo enslaved
Jan 22 00:27:29 mcharvester02 systemd[1]: Finished wicked managed network interfaces.
Jan 22 00:28:00 mcharvester02 systemd[1]: Stopping wicked managed network interfaces...
Jan 22 00:28:01 mcharvester02 wicked[2381]: enp2s0f0 device-ready
Jan 22 00:28:01 mcharvester02 systemd[1]: wicked.service: Succeeded.
Jan 22 00:28:01 mcharvester02 systemd[1]: Stopped wicked managed network interfaces.
-- Reboot --
Jan 23 18:14:25 mcharvester02 systemd[1]: Starting wicked managed network interfaces...
Jan 23 18:14:30 mcharvester02 wicked[1876]: lo up
Jan 23 18:14:30 mcharvester02 wicked[1876]: enp2s0f0 enslaved
Jan 23 18:14:30 mcharvester02 wicked[1876]: mgmt-br up
Jan 23 18:14:30 mcharvester02 wicked[1876]: mgmt-bo enslaved
Jan 23 18:14:30 mcharvester02 systemd[1]: Finished wicked managed network interfaces.
-- Reboot --
Jan 23 19:39:01 mcharvester02 systemd[1]: Starting wicked managed network interfaces...
Jan 23 19:39:08 mcharvester02 wicked[1849]: lo up
Jan 23 19:39:08 mcharvester02 wicked[1849]: enp2s0f0 enslaved
Jan 23 19:39:08 mcharvester02 wicked[1849]: mgmt-br up
Jan 23 19:39:08 mcharvester02 wicked[1849]: mgmt-bo enslaved
Jan 23 19:39:08 mcharvester02 systemd[1]: Finished wicked managed network interfaces.
Added the backport-needed/1.1.2 label; backport issue: #3398.
@mpk166 @jmmckenz Thanks for reporting this issue, the root cause is identified and the fix is ongoing.
* [ ] If labeled: require/HEP Has the Harvester Enhancement Proposal PR submitted?
  The HEP PR is at:
* [x] Where is the reproduce steps/test steps documented?
  The reproduce steps/test steps are at: https://github.com/harvester/harvester/issues/3337#issuecomment-1420296855
* [x] Is there a workaround for the issue? If so, where is it documented?
  The workaround is at: https://github.com/harvester/harvester/issues/3337#issuecomment-1420317073
* [x] Has the backend code been merged (harvester, harvester-installer, etc.) (including backport-needed/*)?
  The PR is at: https://github.com/harvester/network-controller-harvester/pull/68
* [ ] Does the PR include deployment change (YAML/Chart)? If so, where are the PRs for both YAML file and Chart?
  The PR for the YAML change is at:
  The PR for the chart change is at:
* [ ] If labeled: area/ui Has the UI issue filed or ready to be merged?
  The UI issue/PR is at:
* [ ] If labeled: require/doc, require/knowledge-base Has the necessary document PR submitted or merged?
  The documentation/KB PR is at:
* [ ] If NOT labeled: not-require/test-plan Has the e2e test plan been merged? Have QAs agreed on the automation test case? If only test case skeleton w/o implementation, have you created an implementation issue?
  The automation skeleton PR is at:
  The automation test case PR is at:
* [ ] If the fix introduces code for backward compatibility, has a separate issue been filed with the label release/obsolete-compatibility?
  The compatibility issue is filed at:
Automation e2e test issue: harvester/tests#708
For testing:
Workaround: delete all untagged networks if possible before rebooting nodes, or use VLAN 1 to replace the untagged network.

Verified fixed on master-213d4af7-head (02/14). Closing this issue.
After rebooting the node machine, the untagged network on the VM keeps working as expected.

Test steps on master-213d4af7-head (02/14):
1. Create an untagged network under the mgmt cluster network.
2. Start a VM with the untagged network.
3. Reboot the Harvester node.
4. Check that the VM network can be connected.
Describe the bug: the VM can't connect to the outside network after the host reboots; no IPv4 address is assigned to the NIC on the VM.