opnsense / core

OPNsense GUI, API and systems backend
https://opnsense.org/
BSD 2-Clause "Simplified" License

LACP Port Errors #3904

Closed tomatotoast closed 4 years ago

tomatotoast commented 4 years ago

Important notices

Before you add a new report, we ask you kindly to acknowledge the following:

[X] I have read the contributing guidelines at https://github.com/opnsense/core/blob/master/CONTRIBUTING.md

[X] I have searched the existing issues and I'm convinced that mine is new.

Describe the bug

Creating and using a LACP LAGG interface causes errors on the LAG. After rebooting the machine, the LAG is not usable; the physical connections need to be replugged.

To Reproduce

Create the LAGG and check the error counter (no physical link attached):

root@OPNsense:~ # netstat -i
Name    Mtu Network       Address              Ipkts Ierrs Idrop    Opkts Oerrs  Coll
lagg0  1500 <Link#11>     00:1b:21:a7:5b:f2        0     0     0        0     5     0
lagg0     - fe80::%lagg0/ fe80::21b:21ff:fe        0     -     -        2     -     -

Then I moved a VLAN interface to the lagg, sent some traffic over it, and plugged and unplugged the physical links one after another.

root@OPNsense:~ # netstat -i
Name    Mtu Network       Address              Ipkts Ierrs Idrop    Opkts Oerrs  Coll
lagg0  1500 <Link#11>     00:1b:21:a7:5b:f2     9843     0     0     4232    34     0
lagg0     - fe80::%lagg0/ fe80::21b:21ff:fe        0     -     -        2     -     -

Then I rebooted the switch and the OPNsense machine and tested again. After the reboot I noticed that the link did not work. I had to unplug both physical cables and replug them.

This is what the error counters looked like after sending some traffic through again:

root@OPNsense:~ # netstat -i
Name    Mtu Network       Address              Ipkts Ierrs Idrop    Opkts Oerrs  Coll
lagg0  1500 <Link#11>     00:1b:21:a7:5b:f2    12385     0     0     7016    82     0
lagg0     - fe80::%lagg0/ fe80::21b:21ff:fe        0     -     -        2     -     -

root@OPNsense:~ # netstat -i
Name    Mtu Network       Address              Ipkts Ierrs Idrop    Opkts Oerrs  Coll
lagg0  1500 <Link#11>     00:1b:21:a7:5b:f2    13326     0     0     8134   135     0
lagg0     - fe80::%lagg0/ fe80::21b:21ff:fe        0     -     -        2     -     -

With or without VLAN hardware filtering the same thing happens.
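The growth of the Oerrs counter between the snapshots above can be quantified with a small script. The following is a minimal sketch, not part of the original report: the helper names `link_counters` and `counter_delta` are hypothetical, the parser assumes the 10-column `netstat -i` layout shown in this thread, and the grouping of snapshots by boot assumes that the counters reset at reboot and that the snapshots were taken in the order posted.

```python
def link_counters(netstat_row):
    """Parse one link-layer row of `netstat -i` into a dict of integer counters.

    Assumes the 10-column layout seen in this thread:
    Name Mtu Network Address Ipkts Ierrs Idrop Opkts Oerrs Coll
    """
    fields = netstat_row.split()
    # Only the <Link#N> row carries numeric counters in every column;
    # address-family rows (fe80::..., IPv4) contain "-" placeholders.
    if len(fields) != 10 or not fields[2].startswith("<Link#"):
        raise ValueError(f"not a link-layer row: {netstat_row!r}")
    names = ("Ipkts", "Ierrs", "Idrop", "Opkts", "Oerrs", "Coll")
    return dict(zip(names, (int(value) for value in fields[4:])))


def counter_delta(before_row, after_row):
    """Return (new output packets, new output errors) between two snapshots."""
    before, after = link_counters(before_row), link_counters(after_row)
    return (after["Opkts"] - before["Opkts"], after["Oerrs"] - before["Oerrs"])


# Link-layer rows copied from the report. Grouping by boot is an assumption.
BEFORE_REBOOT = (
    "lagg0 1500 <Link#11> 00:1b:21:a7:5b:f2 0 0 0 0 5 0",
    "lagg0 1500 <Link#11> 00:1b:21:a7:5b:f2 9843 0 0 4232 34 0",
)
AFTER_REBOOT = (
    "lagg0 1500 <Link#11> 00:1b:21:a7:5b:f2 12385 0 0 7016 82 0",
    "lagg0 1500 <Link#11> 00:1b:21:a7:5b:f2 13326 0 0 8134 135 0",
)

if __name__ == "__main__":
    for label, rows in (("before reboot", BEFORE_REBOOT), ("after reboot", AFTER_REBOOT)):
        packets, errors = counter_delta(*rows)
        print(f"{label}: +{packets} Opkts, +{errors} Oerrs")
```

Comparing only snapshots from the same boot avoids mixing counters that may have been reset in between.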

Expected behavior

A LACP link should be usable without any errors.

Screenshots

Creating LAGG: https://ibb.co/HqGd9s9

Configure vlan to lagg parent if: https://ibb.co/1JCj1jR

Interface settings: https://ibb.co/VWqg2hW

lacp switch configuration 1/2: https://ibb.co/Ptd7fJQ

lacp switch configuration 2/2: https://ibb.co/G0WwL6L

switch interface stats before connecting physical lacp links to opnsense: https://ibb.co/2hj65vR

switch interface stats after connecting physical lacp links to opnsense: https://ibb.co/dKZcyRN

Relevant log files If applicable, information from log files supporting your claim.

Additional context

No errors appear on the switch. This error does not appear when using LACP on Ubuntu Server 19.10 (kernel 5.3), so I guess it is not a hardware-related issue. It occurs even with completely different hardware (Sophos XG 105 rev2).

Environment

Versions:
OPNsense 20.1-amd64
FreeBSD 11.2-RELEASE-p16-HBSD
OpenSSL 1.1.1d 10 Sep 2019

Intel Xeon E3-1220v6
Intel i340-T4 Gigabit NIC

Switch:

Device Information
Device Type: DGS-1210-26 Gigabit Ethernet Switch
Boot Version: 1.00.010
Firmware Version: 6.12.B006
Hardware Version: F1

NIC:

root@OPNsense:~ # pciconf -l -BbceVv
igb2@pci0:1:0:2: class=0x020000 card=0x12a18086 chip=0x150e8086 rev=0x01 hdr=0x00
    vendor     = 'Intel Corporation'
    device     = '82580 Gigabit Network Connection'
    class      = network
    subclass   = ethernet
    bar   [10] = type Memory, range 32, base 0xde180000, size 524288, enabled
    bar   [1c] = type Memory, range 32, base 0xde304000, size 16384, enabled
    cap 01[40] = powerspec 3 supports D0 D3 current D0
    cap 05[50] = MSI supports 1 message, 64 bit, vector masks
    cap 11[70] = MSI-X supports 10 messages, enabled
                 Table in map 0x1c[0x0], PBA in map 0x1c[0x2000]
    cap 10[a0] = PCI-Express 2 endpoint max data 256(512) FLR NS
                 link x4(x4) speed 5.0(5.0) ASPM disabled(L0s/L1)
    ecap 0001[100] = AER 1 0 fatal 0 non-fatal 1 corrected
    ecap 0003[140] = Serial 1 001b21ffffa75bf0
    ecap 0017[1a0] = TPH Requester 1
    PCI-e errors = Correctable Error Detected
                   Unsupported Request Detected
       Corrected = Advisory Non-Fatal Error
igb3@pci0:1:0:3: class=0x020000 card=0x12a18086 chip=0x150e8086 rev=0x01 hdr=0x00
    vendor     = 'Intel Corporation'
    device     = '82580 Gigabit Network Connection'
    class      = network
    subclass   = ethernet
    bar   [10] = type Memory, range 32, base 0xde100000, size 524288, enabled
    bar   [1c] = type Memory, range 32, base 0xde300000, size 16384, enabled
    cap 01[40] = powerspec 3 supports D0 D3 current D0
    cap 05[50] = MSI supports 1 message, 64 bit, vector masks
    cap 11[70] = MSI-X supports 10 messages, enabled
                 Table in map 0x1c[0x0], PBA in map 0x1c[0x2000]
    cap 10[a0] = PCI-Express 2 endpoint max data 256(512) FLR NS
                 link x4(x4) speed 5.0(5.0) ASPM disabled(L0s/L1)
    ecap 0001[100] = AER 1 0 fatal 0 non-fatal 1 corrected
    ecap 0003[140] = Serial 1 001b21ffffa75bf0
    ecap 0017[1a0] = TPH Requester 1
    PCI-e errors = Correctable Error Detected
                   Unsupported Request Detected
       Corrected = Advisory Non-Fatal Error

mimugmail commented 4 years ago

In screenshot 1, both NICs have the same MAC address, which looks weird.

mimugmail commented 4 years ago

Does this also occur with lacp and no vlans? Maybe a native vlan mismatch in the switch?

tomatotoast commented 4 years ago

It appears even if no switch is attached, so no VLANs are involved.

(no physical links) still errors
root@OPNsense:~ # netstat -i
Name    Mtu Network       Address              Ipkts Ierrs Idrop    Opkts Oerrs  Coll
lagg0  1500 <Link#11>     00:1b:21:a7:5b:f2        0     0     0        0     5     0
lagg0     - fe80::%lagg0/ fe80::21b:21ff:fe        0     -     -        2     -     -

By default, both NICs have different MAC addresses:
igb2: flags=8c02<BROADCAST,OACTIVE,SIMPLEX,MULTICAST> metric 0 mtu 1500
        options=6403bb<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,TSO4,TSO6,VLAN_HWTSO,RXCSUM_IPV6,TXCSUM_IPV6>
        ether 00:1b:21:a7:5b:f2
        hwaddr 00:1b:21:a7:5b:f2
        nd6 options=21<PERFORMNUD,AUTO_LINKLOCAL>
        media: Ethernet autoselect
        status: no carrier
igb3: flags=8c02<BROADCAST,OACTIVE,SIMPLEX,MULTICAST> metric 0 mtu 1500
        options=6403bb<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,TSO4,TSO6,VLAN_HWTSO,RXCSUM_IPV6,TXCSUM_IPV6>
        ether 00:1b:21:a7:5b:f3
        hwaddr 00:1b:21:a7:5b:f3
        nd6 options=21<PERFORMNUD,AUTO_LINKLOCAL>
        media: Ethernet autoselect
        status: no carrier

tomatotoast commented 4 years ago

While creating the lagg, both interfaces have different MAC addresses. Once I click edit after the lagg has been created, both interfaces share the same MAC address.

mimugmail commented 4 years ago

I don't think this (Oerrs) has anything to do with your problem. Can you please check LACP to the switch without VLANs and see whether you have a link after reboot?

tomatotoast commented 4 years ago

I have created a lagg interface and added an IP to it. Then I set up a DHCP server and range for this LAG link. On the switch I created an LACP group with an untagged VLAN on it, an access port so to speak. I added a VM in ESXi to this VLAN.

The VM did not receive an IP address. Rebooting the OPNsense machine did not change anything.

testlacp interface (opt6, lagg0)
Status: up
MAC address: 00:1b:21:a7:5b:f2 - Intel Corporate
MTU: 1500
IPv4 address: 192.168.160.1 / 24
IPv6 Link Local: fe80::21b:21ff:fea7:5bf2 / 64
Media: Ethernet autoselect
LAGG Protocol: lacp lagghash l2,l3,l4
LAGG Ports: igb2 igb3
In/out packets: 0 / 1 (0 bytes / 116 bytes)
In/out packets (pass): 0 / 1 (0 bytes / 116 bytes)
In/out packets (block): 0 / 0 (0 bytes / 0 bytes)
In/out errors: 0/4
Collisions: 0

mimugmail commented 4 years ago

So, you don't have carrier problems when rebooting without VLANs, correct?

tomatotoast commented 4 years ago

The link is up, but it seems that no data can be transmitted (with or without VLANs). The client (VM) does not receive an IP from the DHCP server. The packet counter on the interface shows 1 out.

As it stands, the problem exists regardless of VLANs on the LAGG interface. A reboot does not help either. A client connected to the network does not receive an IP from the DHCP server configured on the LACP interface.

mimugmail commented 4 years ago

Can you check with ping and the arp cache on client and server if mac address is learned? Also on the switch.

Which hypervisor is running the VM? Usually VMware uses its own HA and not LACP.

tomatotoast commented 4 years ago

I have no idea what went wrong yesterday. My ESXi does not use a multilink configuration. Before testing I rebooted everything, and pinging in both directions now works. But the issue with the rising error counter still persists. There are no errors on the switch ports.

The MAC address table from the switch: https://ibb.co/bB9zPdz

ifconfig OPNsense:

igb2: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
        options=400b8<VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,VLAN_HWTSO>
        ether 00:1b:21:a7:5b:f2
        hwaddr 00:1b:21:a7:5b:f2
        nd6 options=21<PERFORMNUD,AUTO_LINKLOCAL>
        media: Ethernet autoselect
        status: no carrier
igb3: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
        options=400b8<VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,VLAN_HWTSO>
        ether 00:1b:21:a7:5b:f2
        hwaddr 00:1b:21:a7:5b:f3
        nd6 options=21<PERFORMNUD,AUTO_LINKLOCAL>
        media: Ethernet autoselect
        status: no carrier
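In the ifconfig output above, igb3 reports an `ether` address that differs from its `hwaddr`, which is consistent with a lagg member port taking over the MAC address of the aggregate while keeping its burned-in address visible. The following is a minimal sketch for spotting such ports in captured ifconfig output. The function name `overridden_macs` and the regular expressions are illustrative assumptions; they only cover simple interface names such as `igb2` and the `ether`/`hwaddr` lines as printed in this thread.

```python
import re

# Interface header such as "igb3: flags=...". Only simple names are matched.
IFACE_RE = re.compile(r"(?:^|\s)([a-z]+[0-9]+): flags=")
MAC = r"((?:[0-9a-f]{2}:){5}[0-9a-f]{2})"
ETHER_RE = re.compile(r"\bether " + MAC)
HWADDR_RE = re.compile(r"\bhwaddr " + MAC)


def overridden_macs(ifconfig_output):
    """Map interface name -> (ether, hwaddr) for ports whose two addresses differ."""
    headers = list(IFACE_RE.finditer(ifconfig_output))
    result = {}
    for index, header in enumerate(headers):
        # The block of one interface ends where the next header starts.
        if index + 1 < len(headers):
            end = headers[index + 1].start()
        else:
            end = len(ifconfig_output)
        block = ifconfig_output[header.end():end]
        ether = ETHER_RE.search(block)
        hwaddr = HWADDR_RE.search(block)
        if ether and hwaddr and ether.group(1) != hwaddr.group(1):
            result[header.group(1)] = (ether.group(1), hwaddr.group(1))
    return result


# Shortened copy of the output posted in this comment.
SAMPLE = """\
igb2: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
        ether 00:1b:21:a7:5b:f2
        hwaddr 00:1b:21:a7:5b:f2
igb3: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
        ether 00:1b:21:a7:5b:f2
        hwaddr 00:1b:21:a7:5b:f3
"""

if __name__ == "__main__":
    for name, (ether, hwaddr) in overridden_macs(SAMPLE).items():
        print(f"{name}: active {ether}, hardware {hwaddr}")
```

Because the parser works on the whole text rather than line by line, it also handles output that was pasted as a single line.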

ARP table on OPNsense:
root@OPNsense:~ # arp -a | grep 160
? (192.168.160.1) at 00:1b:21:a7:5b:f2 on lagg0 permanent [ethernet]
? (192.168.160.100) at 00:0c:29:07:58:14 on lagg0 expires in 1177 seconds [ethernet]

ARP table on the client:
user@test:~$ arp -a
? (192.168.160.1) at 00:1b:21:a7:5b:f2 [ether] on ens160

Port errors:
root@OPNsense:~ # netstat -i
Name    Mtu Network       Address              Ipkts Ierrs Idrop    Opkts Oerrs  Coll
lagg0  1500 <Link#16>     00:1b:21:a7:5b:f2      407     0     0     1222    12     0
lagg0     - fe80::%lagg0/ fe80::21b:21ff:fe        0     -     -        2     -     -
lagg0     - 192.168.160.0 192.168.160.1          182     -     -       51     -     -

mimugmail commented 4 years ago

You have a learned mac address on lagg0 but both interfaces don't have a link?

tomatotoast commented 4 years ago

Read again: I have no idea what went wrong yesterday. My ESXi does not use a multilink configuration. Before testing I rebooted everything, and pinging in both directions now works. But the issue with the rising error counter still persists. There are no errors on the switch ports.

tomatotoast commented 4 years ago

This error happens even on a freshly installed system. Errors appear even without any cables attached.

mimugmail commented 4 years ago

Look, when you have a switch tagging packets on port 1 with VLAN 5 and the other side expects VLAN 4, you'll see interface errors. When you see the errors even without cables, it's just an error because it can't send a specific kind of traffic, for example when no LACP neighborship is formed.

tomatotoast commented 4 years ago

Please answer these four questions:

1. Why do both physical interfaces share the same MAC address once the lagg is formed? I have not seen this behavior when running Ubuntu Server 19.10, for example.

2. Why does the error counter go up even when the same VLANs are configured on both sides and both interfaces are connected to the LACP port group on the switch? The devices should easily exchange LACPDUs.

3. Why does the interface error counter stay at a solid zero when running a LAG on Ubuntu Server 19.10 (LAG set up with netplan), with or without physical links up?

4. If it is expected behavior that the counter goes up due to unanswered LACPDUs, where is it documented?

tomatotoast commented 4 years ago

When the NICs are disconnected I get this output:

Name    Mtu Network       Address              Ipkts Ierrs Idrop    Opkts Oerrs  Coll
lagg0  1500 <Link#19>     a8:5e:45:3d:ed:34        0     0     0        0     5     0
lagg0     - fe80::%lagg0/ fe80::aa5e:45ff:f        0     -     -        2     -     -

2 output packets and 5 output errors. This is so strange.

Another FreeBSD user also experienced this issue: https://forums.freebsd.org/threads/lagg-4-interface-output-errors.46022/

@mimugmail Since you are running an LACP setup: what does the error counter on your lagg interface look like?

Does it look the same? If so, it might be a FreeBSD issue.

mimugmail commented 4 years ago

My counters look similar. But they are in production, no way to unplug a cable or move vlans around.

lagg0  1500 <Link#13>     ac:1f:6b:6c:95:2a 17981726252     0     0 7394162286    17     0
lagg0     - fe80::%lagg0/ fe80::ae1f:6bff:f        0     -     -        1     -     -
lagg1  1500 <Link#14>     ac:1f:6b:6c:9b:34 7277851588     0     0 17991805018   108     0
lagg1     - fe80::%lagg1/ fe80::ae1f:6bff:f        0     -     -        2     -     -
lagg1  1500 <Link#15>     ac:1f:6b:6c:9b:34        5     0     0  1006453     7     0

tomatotoast commented 4 years ago

This issue is not caused by OPNsense. It can be closed.

tomatotoast commented 4 years ago

The issue is still persistent after a fresh 20.7 installation. Maybe related to: #4235 https://github.com/opnsense/core/issues/4235

mimugmail commented 4 years ago

Didn't you state it's not related to OPNsense?

tomatotoast commented 4 years ago

Another user also found this problem. Back in 2019: https://forum.opnsense.org/index.php?topic=15005.0

mimugmail commented 4 years ago

The forum post states that he can reach full throughput. Errors can come from anything: unknown packets, wrong checksums when running IPS with offloading, and so on. If you don't encounter performance drops, just ignore them or report them to FreeBSD directly.
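One rough way to put the counters from this thread in proportion is to express output errors relative to output packets. The sketch below is only an illustration: the helper name `oerrs_per_million` is hypothetical, and the thread does not establish whether Opkts already includes the failed transmissions, so the figures are an approximation rather than a threshold for what counts as healthy.

```python
def oerrs_per_million(opkts, oerrs):
    """Output errors per million output packets, or None if nothing was sent."""
    if opkts == 0:
        # With no cable attached the report shows Opkts 0 and Oerrs 5,
        # so a ratio cannot be formed.
        return None
    return oerrs * 1_000_000 / opkts


# (Opkts, Oerrs) pairs copied from the netstat output posted in this thread.
COUNTERS = {
    "reporter lagg0, last snapshot": (8134, 135),
    "production lagg0": (7394162286, 17),
    "production lagg1": (17991805018, 108),
    "reporter lagg0, no cables": (0, 5),
}

if __name__ == "__main__":
    for label, (opkts, oerrs) in COUNTERS.items():
        rate = oerrs_per_million(opkts, oerrs)
        if rate is None:
            print(f"{label}: {oerrs} errors, no packets sent")
        else:
            print(f"{label}: {rate:.4f} errors per million packets")
```

On these numbers the reporter's test system shows roughly 1.7% errors relative to packets sent, while the production counters are far below one error per million packets.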

AdSchellevis commented 4 years ago

This issue has been automatically timed-out (after 180 days of inactivity).

For more information about the policies for this repository, please read https://github.com/opnsense/core/blob/master/CONTRIBUTING.md for further details.

If someone wants to step up and work on this issue, just let us know, so we can reopen the issue and assign an owner to it.