harvester / harvester

Open source hyperconverged infrastructure (HCI) software
https://harvesterhci.io/
Apache License 2.0

[BUG] Network Devices Renamed on Reboot and Link Failures with Intel E810-XXVDA2 Network Card on Supermicro SYS-740GP-TNRT #7021

Open Maxine-N opened 4 days ago

Maxine-N commented 4 days ago

Describe the bug
After installing Harvester, the network devices in our system change their names after every reboot, which disrupts network bonding configurations. Additionally, some network devices remain down despite being physically connected. We believe the issue may be related to the Intel E810-XXVDA2 network card installed in our system.

This problem is preventing the node from joining the Harvester cluster. The rancher-system-agent logs show repeated warnings like the following:

Nov 18 16:38:46 harvester02-tamedai01 systemd[1]: Started Rancher System Agent.
Nov 18 16:38:46 harvester02-tamedai01 rancher-system-agent[7793]: time="2024-11-18T16:38:46Z" level=info msg="Rancher System Agent version v0.3.6 (41c07d0) is starting"
Nov 18 16:38:46 harvester02-tamedai01 rancher-system-agent[7793]: time="2024-11-18T16:38:46Z" level=info msg="Using directory /var/lib/rancher/agent/work for work"
Nov 18 16:38:46 harvester02-tamedai01 rancher-system-agent[7793]: time="2024-11-18T16:38:46Z" level=info msg="Starting remote watch of plans"
Nov 18 16:38:46 harvester02-tamedai01 rancher-system-agent[7793]: time="2024-11-18T16:38:46Z" level=info msg="Starting /v1, Kind=Secret controller"
Nov 18 16:39:46 harvester02-tamedai01 rancher-system-agent[7793]: W1118 16:39:46.573537  7793 reflector.go:456] ... watch of *v1.Secret ended with: an error on the server ("unable to decode an event from the watch stream: stream error: stream ID 17; INTERNAL_ERROR; received from peer") has prevented the request from succeeding

We have already updated the network devices to the latest firmware, but this did not resolve the problem.

To Reproduce
Steps to reproduce the behavior:

  1. Install Harvester on Supermicro SuperServer SYS-740GP-TNRT hardware with an Intel E810-XXVDA2 network card.
  2. Reboot the server.
  3. Observe changes in network device names and connectivity status.
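
To make step 3 easier to document, the name changes can be captured in a comparable form. The following is only a minimal sketch, not a Harvester tool: it assumes the standard Linux sysfs layout under /sys/class/net, where a physical NIC has a `device` link pointing at its PCI device, and all function names (`read_attr`, `snapshot`, `save_snapshot`, `load_snapshot`, `diff_snapshots`) are hypothetical. It keys each NIC by PCI address, which should stay fixed while the kernel-assigned name moves around.

```python
import json
import os


def read_attr(iface_dir, name):
    """Return a sysfs attribute as a stripped string, or None if unreadable."""
    try:
        with open(os.path.join(iface_dir, name)) as f:
            return f.read().strip()
    except OSError:
        return None


def snapshot(sys_net="/sys/class/net"):
    """Map each physical NIC's PCI address to its current name, MAC, and link state."""
    result = {}
    for iface in sorted(os.listdir(sys_net)):
        iface_dir = os.path.join(sys_net, iface)
        device_link = os.path.join(iface_dir, "device")
        # Assumption: virtual interfaces (bonds, bridges, lo) have no "device"
        # link in sysfs, so they are skipped here.
        if not os.path.exists(device_link):
            continue
        pci_address = os.path.basename(os.path.realpath(device_link))
        result[pci_address] = {
            "name": iface,
            "mac": read_attr(iface_dir, "address"),
            "operstate": read_attr(iface_dir, "operstate"),
        }
    return result


def save_snapshot(path, snap):
    """Write a snapshot to a JSON file so it survives the reboot."""
    with open(path, "w") as f:
        json.dump(snap, f, indent=2, sort_keys=True)


def load_snapshot(path):
    """Read back a snapshot written by save_snapshot."""
    with open(path) as f:
        return json.load(f)


def diff_snapshots(before, after):
    """Return (pci_address, old_entry, new_entry) for every NIC whose entry changed."""
    changes = []
    for pci_address in sorted(set(before) | set(after)):
        old = before.get(pci_address)
        new = after.get(pci_address)
        if old != new:
            changes.append((pci_address, old, new))
    return changes
```

One possible use: call `save_snapshot("before.json", snapshot())` before rebooting, then after the reboot compare with `diff_snapshots(load_snapshot("before.json"), snapshot())`. The file must be written to a location that persists across reboots. Any entry where `name` differs for the same PCI address shows a rename, and an `operstate` of `down` on a cabled port shows the link failure.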

Expected behavior
Network devices should retain consistent naming across reboots, all physically connected network devices should be active, and the node should successfully join the cluster.

Support bundle
Support bundle could not be generated, as the node is unable to join the cluster.

Environment

Additional context
The server has the following key hardware specifications:

samueldewever commented 4 days ago

@Maxine-N

We had the same issue with the same NIC, although not on every reboot. Ticket ref: https://github.com/harvester/harvester/issues/6808

Yesterday we did lab testing with RC5, and the problem seems to be resolved for us (or at least it is no longer reproducible). So it should be fine with the official release of 1.4.0.

If you want, you can test RC5 to confirm, since you seem to hit the issue on every reboot.

staedter commented 4 days ago

Any idea when 1.4.0 will be released?

samueldewever commented 4 days ago

> Any idea when 1.4.0 will be released?

Release: 11/27, so in a bit more than a week ;-)