mpk166 closed this issue 1 year ago.
I ran into this in my lab today.
Observations:
When creating a custom cluster network, Harvester builds the network interfaces (example: custom ClusterNetwork "app"):
# ip addr show app-bo; ip addr show app-br
48: app-bo: <BROADCAST,MULTICAST,MASTER,UP,LOWER_UP> mtu 1500 qdisc noqueue master app-br state UP group default qlen 1000
link/ether 00:25:90:49:66:7f brd ff:ff:ff:ff:ff:ff
38: app-br: <BROADCAST,MULTICAST,PROMISC,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
link/ether 00:25:90:49:66:7f brd ff:ff:ff:ff:ff:ff
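For reference, one quick way to confirm the pair was built correctly is to check that the bond reports the bridge as its master. The sample line below is copied from the output above; on a live host you would feed `ip -o link show app-bo` into the same filter instead:

```shell
# Sample line taken from the `ip addr show app-bo` output above; on a live
# host, substitute: ip -o link show app-bo
sample='48: app-bo: <BROADCAST,MULTICAST,MASTER,UP,LOWER_UP> mtu 1500 qdisc noqueue master app-br state UP group default qlen 1000'

# Pull out the master device; for a healthy pair this should be app-br
master=$(printf '%s\n' "$sample" | sed -n 's/.* master \([^ ]*\).*/\1/p')
echo "$master"   # app-br
```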
Upon initial setup of the custom ClusterNetwork, the Host Network shows as Ready.
However, Harvester does not re-create the interfaces when the host is rebooted:
# ip addr show app-bo; ip addr show app-br
Device "app-bo" does not exist.
Device "app-br" does not exist.
After the reboot, the Host Network looks like this (screenshot omitted):
The other host in the lab still shows the app-bo and app-br interfaces, but now has the same "invalid vlanconfig" error as the first node (Ready with an exclamation point).
As long as one host is up, the bridge is recreated on the rebooted node when a VM is migrated to that host.
If all nodes are down (or it is a single-node Harvester cluster), Harvester sets the bridge back up when a VM requests it, but does not automatically set up the bond as would be expected. Manually re-adding the bond at the OS level does not fix the connection issues.
Workaround: manually adding the bridge and bond ifcfg files back to /etc/sysconfig/network temporarily fixes the issue (until the next outage, since the ifcfg files are removed when the host boots). The VM(s) become accessible again.
# cp ~rancher/ifcfg-app-b* /etc/sysconfig/network/
# ifup app-br
app-br up
# ip addr show app-bo; ip addr show app-br
44: app-bo: <BROADCAST,MULTICAST,MASTER,UP,LOWER_UP> mtu 1500 qdisc noqueue master app-br state UP group default qlen 1000
link/ether 00:25:90:49:66:7f brd ff:ff:ff:ff:ff:ff
38: app-br: <BROADCAST,MULTICAST,PROMISC,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
link/ether 00:25:90:49:66:7f brd ff:ff:ff:ff:ff:ff
Once that is done, the hosts become aware of the networks; however, they still show an error for the Network.
IFCFG files for reference:
- ifcfg-app-bo
STARTMODE='onboot'
BONDING_MASTER='yes'
BOOTPROTO='none'
POST_UP_SCRIPT="wicked:setup_bond.sh"
BONDING_SLAVE_0='enp2s0f1'
BONDING_MODULE_OPTS='miimon=100 mode=active-backup '
DHCLIENT_SET_DEFAULT_ROUTE='no'
- ifcfg-app-br
STARTMODE='onboot'
BOOTPROTO='static'
BRIDGE='yes'
BRIDGE_STP='off'
BRIDGE_FORWARDDELAY='0'
BRIDGE_PORTS='app-bo'
PRE_UP_SCRIPT="wicked:setup_bridge.sh"
POST_UP_SCRIPT="wicked:setup_bridge.sh"
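Since wicked ifcfg files are plain shell-style KEY='value' assignments, one quick sanity check before running `ifup` is to source them in a subshell and echo the fields that matter. This is a minimal, self-contained sketch (the config is written to a temp file here; on a live node you would source /etc/sysconfig/network/ifcfg-app-bo directly):

```shell
# ifcfg files are shell-compatible KEY='value' pairs, so they can be
# sourced for a quick sanity check. Contents copied from ifcfg-app-bo above.
cfg=$(mktemp)
cat > "$cfg" <<'EOF'
STARTMODE='onboot'
BONDING_MASTER='yes'
BOOTPROTO='none'
BONDING_SLAVE_0='enp2s0f1'
BONDING_MODULE_OPTS='miimon=100 mode=active-backup '
EOF
. "$cfg"
echo "slave=$BONDING_SLAVE_0"   # slave=enp2s0f1
rm -f "$cfg"
```

After the interface is brought up, `wicked ifstatus app-br` reports whether wicked considers it configured.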
I would expect both the bridge and bond to be recreated when a VM requests them. The "invalid vlanconfig" error is also concerning. I cannot rule out that it is just something in my setup, since this is my first Harvester install, so any insight would be greatly appreciated.
Support Bundle uploaded for reference: supportbundle_b5129a5e-863d-4b04-800c-f92b773c7f29_2023-01-23T22-29-57Z.zip
cc @yaocw2020 please take a look when you are free, thanks.
FYI:
I was able to make the interfaces semi-persistent by adding the appropriate sections for the bond, bridge, and interface to /oem/99_custom.yaml. I also adjusted the wicked setup_bond.sh and setup_bridge.sh scripts to account for the custom network.
The hosts still have the "invalid vlanconfig" error. Attaching a support bundle based on this new config: supportbundle_b5129a5e-863d-4b04-800c-f92b773c7f29_2023-01-24T20-05-05Z.zip
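For anyone trying the same /oem/99_custom.yaml approach: that file is processed by the yip/cos-init engine, which can write files during a boot stage. The fragment below is only a sketch of the shape such an addition might take; the stage name, permissions, and the idea of embedding the ifcfg contents verbatim are my assumptions rather than an excerpt from a working config, so validate it against the existing stages in your 99_custom.yaml before rebooting.

```yaml
# SKETCH ONLY (assumed yip stage layout): merge into the existing
# stages in /oem/99_custom.yaml rather than replacing them.
stages:
  initramfs:
    - name: "Persist custom ClusterNetwork ifcfg files"
      files:
        - path: /etc/sysconfig/network/ifcfg-app-bo
          permissions: 0600
          owner: 0
          group: 0
          content: |
            STARTMODE='onboot'
            BONDING_MASTER='yes'
            BOOTPROTO='none'
            BONDING_SLAVE_0='enp2s0f1'
```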
From the first support bundle (https://github.com/harvester/harvester/issues/3337#issuecomment-1401100012, logs\harvester-system\harvester-network-controller-zc6v7), there are continuous error messages, which lead to the observed phenomenon.
2023-01-23T19:40:27.784236593Z I0123 19:40:27.784020 1 controller.go:75] vlan config appnet has been changed, spec: {Description: ClusterNetwork:app NodeSelector:map[] Uplink:{NICs:[enp2s0f1] LinkAttrs:0xc002f48f30 BondOptions:0xc002f5c7c8}}
2023-01-23T19:40:27.784299170Z I0123 19:40:27.784187 1 controller.go:133] matchedNodes: [mcharvester01 mcharvester02], h.nodeName: mcharvester02
2023-01-23T19:40:27.784449746Z time="2023-01-23T19:40:27Z" level=info msg="Starting k8s.cni.cncf.io/v1, Kind=NetworkAttachmentDefinition controller"
2023-01-23T19:40:27.784807121Z I0123 19:40:27.784679 1 controller.go:275] add nad:homenet with vid:0 to the list
2023-01-23T19:40:27.984449790Z time="2023-01-23T19:40:27Z" level=error msg="error syncing 'appnet': handler harvester-network-vlanconfig-controller: set up VLAN failed, vlanconfig: appnet, node: mcharvester02, error: failed to get local area from nad default/homenet, error: invalid vlanconfig , requeuing"
2023-01-23T19:40:28.083325673Z I0123 19:40:28.083082 1 controller.go:75] vlan config appnet has been changed, spec: {Description: ClusterNetwork:app NodeSelector:map[] Uplink:{NICs:[enp2s0f1] LinkAttrs:0xc002f48f30 BondOptions:0xc002f5c7c8}}
...
2023-01-23T22:29:42.832624467Z I0123 22:29:42.832231 1 controller.go:75] vlan config appnet has been changed, spec: {Description: ClusterNetwork:app NodeSelector:map[] Uplink:{NICs:[enp2s0f1] LinkAttrs:0xc002f48f30 BondOptions:0xc002f5c7c8}}
2023-01-23T22:29:42.832707084Z I0123 22:29:42.832329 1 controller.go:133] matchedNodes: [mcharvester01 mcharvester02], h.nodeName: mcharvester02
2023-01-23T22:29:42.832732504Z I0123 22:29:42.832396 1 controller.go:275] add nad:homenet with vid:0 to the list
2023-01-23T22:29:42.846403899Z time="2023-01-23T22:29:42Z" level=error msg="error syncing 'appnet': handler harvester-network-vlanconfig-controller: set up VLAN failed, vlanconfig: appnet, node: mcharvester02, error: failed to get local area from nad default/homenet, error: invalid vlanconfig , requeuing"
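The failing lookup is against nad default/homenet, which the controller registered with vid:0 (an untagged network). On a live cluster the VLAN id can be read from the NAD's CNI config string with `kubectl get net-attach-def -n default homenet -o jsonpath='{.spec.config}'`. The sample config string below is an assumption modeled on the bridge CNI plugin, shown only to illustrate the check:

```shell
# ASSUMED example of a bridge-CNI config string for an untagged network;
# a tagged network would carry a "vlan":<id> field in this JSON.
config='{"cniVersion":"0.3.1","type":"bridge","bridge":"app-br","promiscMode":true,"ipam":{}}'

# Extract the vlan field if present; default to 0 (untagged) when absent
vlan=$(printf '%s' "$config" | grep -o '"vlan":[0-9]*' | cut -d: -f2)
echo "vlan=${vlan:-0}"   # vlan=0 for this untagged sample
```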
The kernel log of mcharvester02 has a bond on enp2s0f0, but no bond on enp2s0f1:
Jan 23 19:39:02 mcharvester02 kernel: bridge: filtering via arp/ip/ip6tables is no longer available by default. Update your scripts to load br_netfilter if you need this.
Jan 23 19:39:02 mcharvester02 kernel: Ethernet Channel Bonding Driver: v3.7.1 (April 27, 2011)
Jan 23 19:39:02 mcharvester02 kernel: mgmt-bo: (slave enp2s0f0): Enslaving as a backup interface with a down link
Jan 23 19:39:02 mcharvester02 kernel: mgmt-br: port 1(mgmt-bo) entered blocking state
Jan 23 19:39:02 mcharvester02 kernel: mgmt-br: port 1(mgmt-bo) entered disabled state
Jan 23 19:39:05 mcharvester02 kernel: igb 0000:02:00.0 enp2s0f0: igb: enp2s0f0 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX/TX
Jan 23 19:39:05 mcharvester02 kernel: mgmt-bo: (slave enp2s0f0): link status definitely up, 1000 Mbps full duplex
Jan 23 19:39:05 mcharvester02 kernel: mgmt-bo: (slave enp2s0f0): making interface the new active one
Jan 23 19:39:05 mcharvester02 kernel: mgmt-bo: active interface up!
Jan 23 19:39:05 mcharvester02 kernel: mgmt-br: port 1(mgmt-bo) entered blocking state
Jan 23 19:39:05 mcharvester02 kernel: mgmt-br: port 1(mgmt-bo) entered forwarding state
Jan 23 19:39:05 mcharvester02 kernel: NET: Registered protocol family 17
Jan 23 19:39:08 mcharvester02 kernel: Bridge firewalling registered
Jan 23 19:39:33 mcharvester02 kernel: bpfilter: Loaded bpfilter_umh pid 2598
By contrast, mcharvester01 has app-bo in its log, but only at Jan 18 22:24; the later reboot has no such entry:
Jan 18 22:24:01 mcharvester01 kernel: app-bo: (slave enp2s0f1): Enslaving as a backup interface with a down link
Jan 18 22:24:04 mcharvester01 kernel: igb 0000:02:00.1 enp2s0f1: igb: enp2s0f1 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX/TX
Jan 18 22:24:04 mcharvester01 kernel: app-bo: (slave enp2s0f1): link status definitely up, 1000 Mbps full duplex
Jan 18 22:24:04 mcharvester01 kernel: app-bo: (slave enp2s0f1): making interface the new active one
Jan 18 22:24:04 mcharvester01 kernel: app-bo: active interface up!
Jan 18 22:25:10 mcharvester01 kernel: scsi host8: iSCSI Initiator over TCP/IP
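To spot the asymmetry between nodes quickly, the enslave lines can be reduced to a bond-to-slave mapping. On a live host the input would come from `journalctl -k`; the sample lines here are taken from the kernel logs above:

```shell
# Sample kernel-log lines from both nodes (see logs above); on a live host:
#   journalctl -k | grep Enslaving
log='mgmt-bo: (slave enp2s0f0): Enslaving as a backup interface with a down link
app-bo: (slave enp2s0f1): Enslaving as a backup interface with a down link'

# Reduce each line to "bond <- slave"; a missing app-bo line after a reboot
# is exactly the symptom described in this issue.
printf '%s\n' "$log" | sed -n 's/^\([^:]*\): (slave \([^)]*\)).*/\1 <- \2/p'
```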
The wicked log shows the node reboot history and the setup of the management network only:
-- Reboot --
Jan 22 00:27:28 mcharvester01 systemd[1]: Starting wicked managed network interfaces...
Jan 22 00:27:34 mcharvester01 wicked[1878]: lo up
Jan 22 00:27:34 mcharvester01 wicked[1878]: enp2s0f0 enslaved
Jan 22 00:27:34 mcharvester01 wicked[1878]: mgmt-br up
Jan 22 00:27:34 mcharvester01 wicked[1878]: mgmt-bo enslaved
Jan 22 00:27:34 mcharvester01 systemd[1]: Finished wicked managed network interfaces.
Jan 22 00:27:57 mcharvester01 systemd[1]: Stopping wicked managed network interfaces...
Jan 22 00:27:58 mcharvester01 wicked[2452]: enp2s0f0 device-ready
Jan 22 00:27:58 mcharvester01 systemd[1]: wicked.service: Succeeded.
Jan 22 00:27:58 mcharvester01 systemd[1]: Stopped wicked managed network interfaces.
-- Reboot --
Jan 23 18:14:25 mcharvester01 systemd[1]: Starting wicked managed network interfaces...
Jan 23 18:14:31 mcharvester01 wicked[1910]: lo up
Jan 23 18:14:31 mcharvester01 wicked[1910]: enp2s0f0 enslaved
Jan 23 18:14:31 mcharvester01 wicked[1910]: mgmt-br up
Jan 23 18:14:31 mcharvester01 wicked[1910]: mgmt-bo enslaved
Jan 23 18:14:31 mcharvester01 systemd[1]: Finished wicked managed network interfaces.
-- Reboot --
Jan 22 00:27:23 mcharvester02 systemd[1]: Starting wicked managed network interfaces...
Jan 22 00:27:29 mcharvester02 wicked[1851]: lo up
Jan 22 00:27:29 mcharvester02 wicked[1851]: enp2s0f0 enslaved
Jan 22 00:27:29 mcharvester02 wicked[1851]: mgmt-br up
Jan 22 00:27:29 mcharvester02 wicked[1851]: mgmt-bo enslaved
Jan 22 00:27:29 mcharvester02 systemd[1]: Finished wicked managed network interfaces.
Jan 22 00:28:00 mcharvester02 systemd[1]: Stopping wicked managed network interfaces...
Jan 22 00:28:01 mcharvester02 wicked[2381]: enp2s0f0 device-ready
Jan 22 00:28:01 mcharvester02 systemd[1]: wicked.service: Succeeded.
Jan 22 00:28:01 mcharvester02 systemd[1]: Stopped wicked managed network interfaces.
-- Reboot --
Jan 23 18:14:25 mcharvester02 systemd[1]: Starting wicked managed network interfaces...
Jan 23 18:14:30 mcharvester02 wicked[1876]: lo up
Jan 23 18:14:30 mcharvester02 wicked[1876]: enp2s0f0 enslaved
Jan 23 18:14:30 mcharvester02 wicked[1876]: mgmt-br up
Jan 23 18:14:30 mcharvester02 wicked[1876]: mgmt-bo enslaved
Jan 23 18:14:30 mcharvester02 systemd[1]: Finished wicked managed network interfaces.
-- Reboot --
Jan 23 19:39:01 mcharvester02 systemd[1]: Starting wicked managed network interfaces...
Jan 23 19:39:08 mcharvester02 wicked[1849]: lo up
Jan 23 19:39:08 mcharvester02 wicked[1849]: enp2s0f0 enslaved
Jan 23 19:39:08 mcharvester02 wicked[1849]: mgmt-br up
Jan 23 19:39:08 mcharvester02 wicked[1849]: mgmt-bo enslaved
Jan 23 19:39:08 mcharvester02 systemd[1]: Finished wicked managed network interfaces.
Added the backport-needed/1.1.2 label; backport issue: #3398.
@mpk166 @jmmckenz Thanks for reporting this issue, the root cause is identified and the fix is ongoing.
* [ ] If labeled: require/HEP Has the Harvester Enhancement Proposal PR submitted?
  The HEP PR is at:
* [x] Where is the reproduce steps/test steps documented?
  The reproduce steps/test steps are at: https://github.com/harvester/harvester/issues/3337#issuecomment-1420296855
* [x] Is there a workaround for the issue? If so, where is it documented?
  The workaround is at: https://github.com/harvester/harvester/issues/3337#issuecomment-1420317073
* [x] Has the backend code been merged (harvester, harvester-installer, etc.) (including backport-needed/*)?
  The PR is at: https://github.com/harvester/network-controller-harvester/pull/68
* [ ] Does the PR include deployment change (YAML/Chart)? If so, where are the PRs for both YAML file and Chart?
  The PR for the YAML change is at:
  The PR for the chart change is at:
* [ ] If labeled: area/ui Has the UI issue filed or ready to be merged?
  The UI issue/PR is at:
* [ ] If labeled: require/doc, require/knowledge-base Has the necessary document PR submitted or merged?
  The documentation/KB PR is at:
* [ ] If NOT labeled: not-require/test-plan Has the e2e test plan been merged? Have QAs agreed on the automation test case? If only test case skeleton w/o implementation, have you created an implementation issue?
  The automation skeleton PR is at:
  The automation test case PR is at:
* [ ] If the fix introduces code for backward compatibility, has a separate issue been filed with the label release/obsolete-compatibility?
  The compatibility issue is filed at:
Automation e2e test issue: harvester/tests#708
For testing:
Workaround: delete all untagged networks if possible before rebooting nodes, or use VLAN 1 to replace the untagged network.

Verified fixed on master-213d4af7-head (02/14). Closing this issue.
After rebooting the node machine, the untagged network on the VM keeps working as expected.

Test steps on master-213d4af7-head (02/14):
1. Create an untagged network under the mgmt cluster network.
2. Start a VM with the untagged network.
3. Reboot the Harvester node.
4. Check that the VM network can be connected.
Describe the bug: the VM can't connect to the outside network after the host reboots; no IPv4 address is assigned to the NIC on the VM.