harvester / harvester

Open source hyperconverged infrastructure (HCI) software
https://harvesterhci.io/
Apache License 2.0

[BUG] Custom network/config is not persistent #3337

Closed mpk166 closed 1 year ago

mpk166 commented 1 year ago

Describe the bug: The VM can't connect to the outside network after the host reboots. No IPv4 address is assigned to the NIC on the VM.

To Reproduce:

  1. Create a new VM from an Ubuntu installation or cloud-init image
  2. Restart the host
  3. Start the VM and try to connect to the outside network

Configuration:

Environment:

Workaround tested:

jmmckenz commented 1 year ago

I ran into this in my lab today.

Observations:

Screenshot from 2023-01-18 16-35-12

IFCFG files for reference:

BONDING_SLAVE_0='enp2s0f1'

BONDING_MODULE_OPTS='miimon=100 mode=active-backup '

DHCLIENT_SET_DEFAULT_ROUTE='no'

   - ifcfg-app-br

STARTMODE='onboot'
BOOTPROTO='static'
BRIDGE='yes'
BRIDGE_STP='off'
BRIDGE_FORWARDDELAY='0'
BRIDGE_PORTS='app-bo'
PRE_UP_SCRIPT="wicked:setup_bridge.sh"
POST_UP_SCRIPT="wicked:setup_bridge.sh"



I would expect both the bridge and the bond to be recreated when a VM requests them. The "invalid vlanconfig" error is also concerning. I cannot rule out that it is just something in my setup, since this is my first Harvester install, so any insight would be greatly appreciated.
jmmckenz commented 1 year ago

Support Bundle uploaded for reference: supportbundle_b5129a5e-863d-4b04-800c-f92b773c7f29_2023-01-23T22-29-57Z.zip

w13915984028 commented 1 year ago

cc @yaocw2020 please take a look when you are free, thanks.

jmmckenz commented 1 year ago

FYI:

I was able to make the interfaces semi-persistent by adding the appropriate sections for the bond, bridge, and interface to /oem/99_custom.yaml. I also adjusted the wicked setup_bond.sh and setup_bridge.sh scripts to account for the CustomNetwork.

The hosts still have the "invalid vlanconfig" error. Attaching support bundle based on this new config. supportbundle_b5129a5e-863d-4b04-800c-f92b773c7f29_2023-01-24T20-05-05Z.zip

w13915984028 commented 1 year ago

In the first support bundle https://github.com/harvester/harvester/issues/3337#issuecomment-1401100012 (logs\harvester-system\harvester-network-controller-zc6v7), the same error message repeats continuously; this error is what leads to the observed behavior.

2023-01-23T19:40:27.784236593Z I0123 19:40:27.784020       1 controller.go:75] vlan config appnet has been changed, spec: {Description: ClusterNetwork:app NodeSelector:map[] Uplink:{NICs:[enp2s0f1] LinkAttrs:0xc002f48f30 BondOptions:0xc002f5c7c8}}
2023-01-23T19:40:27.784299170Z I0123 19:40:27.784187       1 controller.go:133] matchedNodes: [mcharvester01 mcharvester02], h.nodeName: mcharvester02
2023-01-23T19:40:27.784449746Z time="2023-01-23T19:40:27Z" level=info msg="Starting k8s.cni.cncf.io/v1, Kind=NetworkAttachmentDefinition controller"
2023-01-23T19:40:27.784807121Z I0123 19:40:27.784679       1 controller.go:275] add nad:homenet with vid:0 to the list
2023-01-23T19:40:27.984449790Z time="2023-01-23T19:40:27Z" level=error msg="error syncing 'appnet': handler harvester-network-vlanconfig-controller: set up VLAN failed, vlanconfig: appnet, node: mcharvester02, error: failed to get local area from nad default/homenet, error: invalid vlanconfig , requeuing"
2023-01-23T19:40:28.083325673Z I0123 19:40:28.083082       1 controller.go:75] vlan config appnet has been changed, spec: {Description: ClusterNetwork:app NodeSelector:map[] Uplink:{NICs:[enp2s0f1] LinkAttrs:0xc002f48f30 BondOptions:0xc002f5c7c8}}
...

2023-01-23T22:29:42.832624467Z I0123 22:29:42.832231       1 controller.go:75] vlan config appnet has been changed, spec: {Description: ClusterNetwork:app NodeSelector:map[] Uplink:{NICs:[enp2s0f1] LinkAttrs:0xc002f48f30 BondOptions:0xc002f5c7c8}}
2023-01-23T22:29:42.832707084Z I0123 22:29:42.832329       1 controller.go:133] matchedNodes: [mcharvester01 mcharvester02], h.nodeName: mcharvester02
2023-01-23T22:29:42.832732504Z I0123 22:29:42.832396       1 controller.go:275] add nad:homenet with vid:0 to the list
2023-01-23T22:29:42.846403899Z time="2023-01-23T22:29:42Z" level=error msg="error syncing 'appnet': handler harvester-network-vlanconfig-controller: set up VLAN failed, vlanconfig: appnet, node: mcharvester02, error: failed to get local area from nad default/homenet, error: invalid vlanconfig , requeuing"
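
To gauge how often this error repeats in a support bundle, the controller log can be scanned for it. The following is a minimal sketch, not part of Harvester: the function name `count_vlanconfig_errors` and the regular expression are assumptions derived only from the error text quoted above.

```python
import re
from collections import Counter

# Assumption: the error text has the same shape as the lines quoted above.
ERROR_RE = re.compile(
    r"vlanconfig: (?P<vlanconfig>\S+), node: (?P<node>\S+), "
    r"error: failed to get local area from nad (?P<nad>\S+),"
)


def count_vlanconfig_errors(lines):
    """Count 'set up VLAN failed' errors per (vlanconfig, node, nad)."""
    counts = Counter()
    for line in lines:
        match = ERROR_RE.search(line)
        if match:
            key = (match["vlanconfig"], match["node"], match["nad"])
            counts[key] += 1
    return counts


sample = [
    'time="2023-01-23T19:40:27Z" level=error msg="error syncing \'appnet\': '
    "handler harvester-network-vlanconfig-controller: set up VLAN failed, "
    "vlanconfig: appnet, node: mcharvester02, error: failed to get local area "
    'from nad default/homenet, error: invalid vlanconfig , requeuing"',
    "I0123 19:40:28.083082       1 controller.go:75] vlan config appnet has been changed",
]

for key, count in count_vlanconfig_errors(sample).items():
    print(key, count)
```

Grouping by vlanconfig, node, and NAD makes it easy to see whether a single untagged network (here `default/homenet`) accounts for every failure.
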
w13915984028 commented 1 year ago

The kernel log of mcharvester02 shows a bond being created on enp2s0f0 (mgmt-bo), but no bond on enp2s0f1:

Jan 23 19:39:02 mcharvester02 kernel: bridge: filtering via arp/ip/ip6tables is no longer available by default. Update your scripts to load br_netfilter if you need this.
Jan 23 19:39:02 mcharvester02 kernel: Ethernet Channel Bonding Driver: v3.7.1 (April 27, 2011)
Jan 23 19:39:02 mcharvester02 kernel: mgmt-bo: (slave enp2s0f0): Enslaving as a backup interface with a down link
Jan 23 19:39:02 mcharvester02 kernel: mgmt-br: port 1(mgmt-bo) entered blocking state
Jan 23 19:39:02 mcharvester02 kernel: mgmt-br: port 1(mgmt-bo) entered disabled state
Jan 23 19:39:05 mcharvester02 kernel: igb 0000:02:00.0 enp2s0f0: igb: enp2s0f0 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX/TX
Jan 23 19:39:05 mcharvester02 kernel: mgmt-bo: (slave enp2s0f0): link status definitely up, 1000 Mbps full duplex
Jan 23 19:39:05 mcharvester02 kernel: mgmt-bo: (slave enp2s0f0): making interface the new active one
Jan 23 19:39:05 mcharvester02 kernel: mgmt-bo: active interface up!
Jan 23 19:39:05 mcharvester02 kernel: mgmt-br: port 1(mgmt-bo) entered blocking state
Jan 23 19:39:05 mcharvester02 kernel: mgmt-br: port 1(mgmt-bo) entered forwarding state
Jan 23 19:39:05 mcharvester02 kernel: NET: Registered protocol family 17
Jan 23 19:39:08 mcharvester02 kernel: Bridge firewalling registered
Jan 23 19:39:33 mcharvester02 kernel: bpfilter: Loaded bpfilter_umh pid 2598

By contrast, mcharvester01 does have app-bo log entries, but only from Jan 18 22:24; after the later reboot there are no such entries either:

Jan 18 22:24:01 mcharvester01 kernel: app-bo: (slave enp2s0f1): Enslaving as a backup interface with a down link
Jan 18 22:24:04 mcharvester01 kernel: igb 0000:02:00.1 enp2s0f1: igb: enp2s0f1 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX/TX
Jan 18 22:24:04 mcharvester01 kernel: app-bo: (slave enp2s0f1): link status definitely up, 1000 Mbps full duplex
Jan 18 22:24:04 mcharvester01 kernel: app-bo: (slave enp2s0f1): making interface the new active one
Jan 18 22:24:04 mcharvester01 kernel: app-bo: active interface up!
Jan 18 22:25:10 mcharvester01 kernel: scsi host8: iSCSI Initiator over TCP/IP
w13915984028 commented 1 year ago

The wicked log shows the node reboot history and the setup of the management network:

-- Reboot --
Jan 22 00:27:28 mcharvester01 systemd[1]: Starting wicked managed network interfaces...
Jan 22 00:27:34 mcharvester01 wicked[1878]: lo              up
Jan 22 00:27:34 mcharvester01 wicked[1878]: enp2s0f0        enslaved
Jan 22 00:27:34 mcharvester01 wicked[1878]: mgmt-br         up
Jan 22 00:27:34 mcharvester01 wicked[1878]: mgmt-bo         enslaved
Jan 22 00:27:34 mcharvester01 systemd[1]: Finished wicked managed network interfaces.
Jan 22 00:27:57 mcharvester01 systemd[1]: Stopping wicked managed network interfaces...
Jan 22 00:27:58 mcharvester01 wicked[2452]: enp2s0f0        device-ready
Jan 22 00:27:58 mcharvester01 systemd[1]: wicked.service: Succeeded.
Jan 22 00:27:58 mcharvester01 systemd[1]: Stopped wicked managed network interfaces.
-- Reboot --
Jan 23 18:14:25 mcharvester01 systemd[1]: Starting wicked managed network interfaces...
Jan 23 18:14:31 mcharvester01 wicked[1910]: lo              up
Jan 23 18:14:31 mcharvester01 wicked[1910]: enp2s0f0        enslaved
Jan 23 18:14:31 mcharvester01 wicked[1910]: mgmt-br         up
Jan 23 18:14:31 mcharvester01 wicked[1910]: mgmt-bo         enslaved
Jan 23 18:14:31 mcharvester01 systemd[1]: Finished wicked managed network interfaces.
-- Reboot --
Jan 22 00:27:23 mcharvester02 systemd[1]: Starting wicked managed network interfaces...
Jan 22 00:27:29 mcharvester02 wicked[1851]: lo              up
Jan 22 00:27:29 mcharvester02 wicked[1851]: enp2s0f0        enslaved
Jan 22 00:27:29 mcharvester02 wicked[1851]: mgmt-br         up
Jan 22 00:27:29 mcharvester02 wicked[1851]: mgmt-bo         enslaved
Jan 22 00:27:29 mcharvester02 systemd[1]: Finished wicked managed network interfaces.
Jan 22 00:28:00 mcharvester02 systemd[1]: Stopping wicked managed network interfaces...
Jan 22 00:28:01 mcharvester02 wicked[2381]: enp2s0f0        device-ready
Jan 22 00:28:01 mcharvester02 systemd[1]: wicked.service: Succeeded.
Jan 22 00:28:01 mcharvester02 systemd[1]: Stopped wicked managed network interfaces.
-- Reboot --
Jan 23 18:14:25 mcharvester02 systemd[1]: Starting wicked managed network interfaces...
Jan 23 18:14:30 mcharvester02 wicked[1876]: lo              up
Jan 23 18:14:30 mcharvester02 wicked[1876]: enp2s0f0        enslaved
Jan 23 18:14:30 mcharvester02 wicked[1876]: mgmt-br         up
Jan 23 18:14:30 mcharvester02 wicked[1876]: mgmt-bo         enslaved
Jan 23 18:14:30 mcharvester02 systemd[1]: Finished wicked managed network interfaces.
-- Reboot --
Jan 23 19:39:01 mcharvester02 systemd[1]: Starting wicked managed network interfaces...
Jan 23 19:39:08 mcharvester02 wicked[1849]: lo              up
Jan 23 19:39:08 mcharvester02 wicked[1849]: enp2s0f0        enslaved
Jan 23 19:39:08 mcharvester02 wicked[1849]: mgmt-br         up
Jan 23 19:39:08 mcharvester02 wicked[1849]: mgmt-bo         enslaved
Jan 23 19:39:08 mcharvester02 systemd[1]: Finished wicked managed network interfaces.
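
The same check can be done mechanically: list which interfaces wicked reported on each start and flag the runs where a given interface never appears. This is a rough sketch under stated assumptions: the helper names are hypothetical, the line format is taken from the excerpt above, and grouping by wicked PID is only an approximation of "one boot". It shows only what wicked itself brought up, not what the network controller creates later.

```python
import re
from collections import defaultdict

# Matches wicked status lines such as:
#   "Jan 23 19:39:08 mcharvester02 wicked[1849]: mgmt-br         up"
WICKED_RE = re.compile(
    r"^\w{3}\s+\d+ [\d:]+ (?P<host>\S+) wicked\[(?P<pid>\d+)\]: "
    r"(?P<iface>\S+)\s+(?P<state>\S+)$"
)


def interfaces_by_run(lines):
    """Map (host, wicked PID) to {interface: last reported state}."""
    runs = defaultdict(dict)
    for line in lines:
        match = WICKED_RE.match(line.strip())
        if match:
            runs[(match["host"], match["pid"])][match["iface"]] = match["state"]
    return dict(runs)


def runs_missing(lines, iface):
    """Return the (host, PID) runs in which wicked never reported iface."""
    return sorted(
        key for key, seen in interfaces_by_run(lines).items() if iface not in seen
    )


sample = [
    "Jan 23 19:39:08 mcharvester02 wicked[1849]: lo              up",
    "Jan 23 19:39:08 mcharvester02 wicked[1849]: enp2s0f0        enslaved",
    "Jan 23 19:39:08 mcharvester02 wicked[1849]: mgmt-br         up",
    "Jan 23 19:39:08 mcharvester02 wicked[1849]: mgmt-bo         enslaved",
]

print(runs_missing(sample, "app-bo"))
```
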
harvesterhci-io-github-bot commented 1 year ago

added backport-needed/1.1.2 issue: #3398.

w13915984028 commented 1 year ago

@mpk166 @jmmckenz Thanks for reporting this issue. The root cause has been identified and a fix is in progress.

harvesterhci-io-github-bot commented 1 year ago

Pre Ready-For-Testing Checklist

* [ ] If labeled: require/HEP Has the Harvester Enhancement Proposal PR submitted? The HEP PR is at:

* [ ] If labeled: area/ui Has the UI issue filed or ready to be merged? The UI issue/PR is at:

* [ ] If labeled: require/doc, require/knowledge-base Has the necessary document PR submitted or merged? The documentation/KB PR is at:

* [ ] If NOT labeled: not-require/test-plan Has the e2e test plan been merged? Have QAs agreed on the automation test case? If only test case skeleton w/o implementation, have you created an implementation issue? - The automation skeleton PR is at: - The automation test case PR is at:

* [ ] If the fix introduces the code for backward compatibility Has a separate issue been filed with the label release/obsolete-compatibility? The compatibility issue is filed at:

harvesterhci-io-github-bot commented 1 year ago

Automation e2e test issue: harvester/tests#708

yaocw2020 commented 1 year ago

For testing:

yaocw2020 commented 1 year ago

Workaround: If possible, delete all untagged networks before rebooting nodes, or use a VLAN 1 network in place of the untagged network.
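
To find the networks this workaround applies to, the NetworkAttachmentDefinition objects can be inspected for an untagged VLAN ID. The sketch below makes several assumptions: the input has the shape of `kubectl get net-attach-def -A -o json` output, each `spec.config` is a JSON string with a `vlan` field, and a missing or zero `vlan` means untagged (matching the `vid:0` reported for `homenet` in the controller log). The function name and the `vlan100` entry are hypothetical.

```python
import json


def untagged_nads(nad_list):
    """Return 'namespace/name' for every NAD whose VLAN ID is 0 or absent."""
    result = []
    for item in nad_list.get("items", []):
        # spec.config holds the CNI configuration as a JSON-encoded string.
        config = json.loads(item["spec"]["config"])
        if config.get("vlan", 0) == 0:
            meta = item["metadata"]
            result.append(f'{meta["namespace"]}/{meta["name"]}')
    return result


# Illustrative input only; 'vlan100' is a made-up tagged network.
sample = {
    "items": [
        {
            "metadata": {"namespace": "default", "name": "homenet"},
            "spec": {"config": json.dumps({"type": "bridge", "bridge": "app-br", "vlan": 0})},
        },
        {
            "metadata": {"namespace": "default", "name": "vlan100"},
            "spec": {"config": json.dumps({"type": "bridge", "bridge": "app-br", "vlan": 100})},
        },
    ]
}

print(untagged_nads(sample))
```

Any network this reports is a candidate for deletion or replacement before a node reboot on affected versions.
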

TachunLin commented 1 year ago

Verified fixed on master-213d4af7-head (02/14). Closing this issue.

Result

After rebooting the node machine, the untagged network on the VM keeps working as expected.

Test Information

Verify Steps

  1. Create an untagged network under the mgmt cluster network.

  2. Start a VM attached to the untagged network.

  3. Reboot the Harvester node.

  4. Check that the VM can still connect to the network.