opnsense / core

OPNsense GUI, API and systems backend
https://opnsense.org/
BSD 2-Clause "Simplified" License

ix0 no carrier #2591

Closed abplfab closed 5 years ago

abplfab commented 6 years ago

After upgrading to opnsense 18.7 the ix NIC (attached with a DAC to a switch) reports "media: no carrier". Setting the media to fixed 10Gbase-Twinax doesn't help...

fichtner commented 6 years ago

the joys of intel driver updates :(

run this and reboot

# opnsense-update -kr 18.1.11 -n "18.1\/dummy"

fichtner commented 6 years ago

ps booting from kernel.old should also work via boot menu

abplfab commented 6 years ago

Thanks

fichtner commented 6 years ago

if this indeed works we need to take this to FreeBSD soon if 11.2 has the same defect

abplfab commented 6 years ago

Doesn't help :(

fichtner commented 6 years ago

makes no sense at all ?!

what's the output of:

# uname -a

abplfab commented 6 years ago

FreeBSD asterix2.lan.neratec.com 11.1-RELEASE-p11 FreeBSD 11.1-RELEASE-p11 21b4c8ea1d5(stable/18.7) amd64

abplfab commented 6 years ago

Did the opnsense-update -kr 18.1.11 -n "18.1\/dummy" again, now:

root@asterix2:~ # uname -a
FreeBSD asterix2.lan.neratec.com 11.1-RELEASE-p11 FreeBSD 11.1-RELEASE-p11 116e406d37f(stable/18.1) amd64

but still:

ix0: flags=8943<UP,BROADCAST,RUNNING,PROMISC,SIMPLEX,MULTICAST> metric 0 mtu 1500
        options=e507bb<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,TSO4,TSO6,LRO,VLAN_HWFILTER,VLAN_HWTSO,RXCSUM_IPV6,TXCSUM_IPV6>
        ether 0c:c4:7a:97:ee:ee
        hwaddr 0c:c4:7a:97:ee:ee
        inet6 fe80::ec4:7aff:fe97:eeee%ix0 prefixlen 64 scopeid 0x1
        inet6 2a02:aa08:e000:902::253 prefixlen 64
        inet6 2a02:aa08:e000:902::10 prefixlen 64 vhid 18
        inet 192.168.11.253 netmask 0xffffff00 broadcast 192.168.11.255
        inet 192.168.11.10 netmask 0xffffff00 broadcast 192.168.11.255 vhid 11
        nd6 options=21<PERFORMNUD,AUTO_LINKLOCAL>
        media: Ethernet autoselect
        status: no carrier
        carp: INIT vhid 11 advbase 1 advskew 100
        carp: INIT vhid 18 advbase 1 advskew 100
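
When comparing nodes, the fields that matter in output like the above are `media:` and `status:`. The following is a minimal Python sketch for pulling them out per interface, assuming `ifconfig` output in the format quoted in this thread. The function name `link_summary` and the shortened sample are illustrative only and not part of any OPNsense tooling.

```python
import re

def link_summary(ifconfig_text):
    """Map each interface name to its 'media' and 'status' values."""
    summary = {}
    current = None
    for line in ifconfig_text.splitlines():
        # An interface header starts in column 0, e.g. "ix0: flags=8943<...>"
        header = re.match(r"^(\S+): flags=", line)
        if header:
            current = header.group(1)
            summary[current] = {}
            continue
        # Indented "media:" and "status:" lines belong to the current interface
        field = re.match(r"^\s+(media|status): (.*)$", line)
        if field and current is not None:
            summary[current][field.group(1)] = field.group(2).strip()
    return summary

# Shortened sample taken from the output reported above
SAMPLE = """\
ix0: flags=8943<UP,BROADCAST,RUNNING,PROMISC,SIMPLEX,MULTICAST> metric 0 mtu 1500
        ether 0c:c4:7a:97:ee:ee
        media: Ethernet autoselect
        status: no carrier
        carp: INIT vhid 11 advbase 1 advskew 100
"""

print(link_summary(SAMPLE))
```

On a live system the text could come from `subprocess.run(["ifconfig"], capture_output=True, text=True).stdout`; an interface whose status is `no carrier` is the symptom discussed here.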

fichtner commented 6 years ago

ok, kernels are correctly replaced. I'm unsure how any other change would relate to the reported issue of the driver reporting no carrier anymore.
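
The check behind "kernels are correctly replaced" is the `hash(branch)` token in the `uname -a` output: `stable/18.7` before the revert, `stable/18.1` after. A small Python sketch of that comparison, assuming the token keeps the shape seen in the two outputs quoted in this thread (the helper name `kernel_branch` is made up for illustration):

```python
import re

def kernel_branch(uname_output):
    """Return (commit, branch) from the 'hash(branch)' token, or None if absent."""
    match = re.search(r"\b([0-9a-f]{7,40})\(([^)]+)\)", uname_output)
    return (match.group(1), match.group(2)) if match else None

# The two outputs reported in this thread
BEFORE = ("FreeBSD asterix2.lan.neratec.com 11.1-RELEASE-p11 "
          "FreeBSD 11.1-RELEASE-p11 21b4c8ea1d5(stable/18.7) amd64")
AFTER = ("FreeBSD asterix2.lan.neratec.com 11.1-RELEASE-p11 "
         "FreeBSD 11.1-RELEASE-p11 116e406d37f(stable/18.1) amd64")

print(kernel_branch(BEFORE))
print(kernel_branch(AFTER))
```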

abplfab commented 6 years ago

Would it make sense to try install new with 18.7 and import the config? This system is the slave of a carp cluster, so currently no hurry...

fichtner commented 6 years ago

going back to 18.1.13 would make more sense than trying again with 18.7 (18.1.6 config import + install and update back to 18.1.13). but the chances of ix0 finding its carrier are not more than 50%. it could be hardware related.

abplfab commented 6 years ago
fichtner commented 6 years ago

Could it be that FreeBSD 11.2 ships with a new binary firmware blob that bricked your NIC? That's the only theory I have besides it's fully the hardware's fault.

abplfab commented 6 years ago

No idea, the server has a second NIC (ix1), will try this one and report...

abplfab commented 6 years ago

So with ix NICs better not update to 18.7.

System Supermicro SYS-5018D-FN8T

Mainboard X10SDV-TP8F

Firmware versions BIOS 1.3, IPMI/BMC: 3.68 Redfish Version : 1.0.1

NIC ix0@pci0:4:0:0: class=0x020000 card=0x15ac15d9 chip=0x15ac8086 rev=0x00 hdr=0x00
    vendor     = 'Intel Corporation'
    device     = 'Ethernet Connection X552 10 GbE SFP+'
    class      = network
    subclass   = ethernet

CPU Intel(R) Xeon(R) CPU D-1518 @ 2.20GHz (8 cores)
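
To compare NICs across the systems in this thread, the packed `chip=` and `card=` words from `pciconf` can be split into 16-bit IDs. This Python sketch assumes the usual FreeBSD layout, with the vendor ID in the low 16 bits and the device ID in the high 16 bits; `decode_pci_ids` is a hypothetical helper name.

```python
import re

def decode_pci_ids(pciconf_line):
    """Split pciconf's packed 'chip' and 'card' words into 16-bit IDs."""
    ids = {}
    for key, value in re.findall(r"\b(chip|card)=0x([0-9a-fA-F]{8})\b", pciconf_line):
        word = int(value, 16)
        # Assumption: low half is the (sub)vendor ID, high half the (sub)device ID
        ids[key] = {"vendor": word & 0xFFFF, "device": word >> 16}
    return ids

# Header line reported above for the X552 port
LINE = "ix0@pci0:4:0:0: class=0x020000 card=0x15ac15d9 chip=0x15ac8086 rev=0x00 hdr=0x00"

for key, pair in decode_pci_ids(LINE).items():
    print(key, hex(pair["vendor"]), hex(pair["device"]))
```

Here the chip vendor decodes to 0x8086, which matches the `vendor = 'Intel Corporation'` line, and the device ID 0x15ac is what distinguishes this X552 from other `ix` devices.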

mimugmail commented 6 years ago

@fichtner I have two identical boxes like the one above, configured but not in production. Ping me the usual way when you need access.

abplfab commented 6 years ago

FYI: a server reset or power off/on by ipmi/bmc doesn't "fix" the NIC hang. Had to remove power from the box.

Tsuroerusu commented 6 years ago

I am seeing something that could well be related to this issue. After upgrading my company setup (A HA setup of two firewalls) to 18.7, the result was that none of my VLAN interfaces on ix1 activate properly, however the non-VLAN ix0 works fine. On the main page, the VLAN interfaces are marked with red saying "Ethernet autoselect", and under System --> Interfaces --> Overview their status says "no carrier". All of my igb* interfaces come up without any issue. I have tried changing the VLAN configuration a bit hoping that a re-write of the configuration would solve it, but to no avail.

Update: I just tried booting kernel.old, as was suggested earlier in this thread, and just doing that has resolved the problem on both of my nodes! I did not have to do any actual power cycling, just booting into kernel.old did it.

fichtner commented 6 years ago

@Tsuroerusu Can you try the 18.1.11 kernel as well?

# opnsense-update -kr 18.1.11 -n "18.1\/dummy"

(reboot)

@abplfab said you need to remove the power, otherwise the carrier will not come up on a quick reboot.

fichtner commented 6 years ago

Relevant driver update was reverted and will be gone from 18.7.1. It's unclear if the issue exists in FreeBSD 11.2 but we'll find out soon (@mimugmail could you check this with your test system).

mimugmail commented 6 years ago

@fichtner Update is in progress .. need to figure out what exactly happens with and without VLANs.

fichtner commented 6 years ago

@mimugmail thanks a lot!

abplfab commented 6 years ago

Just started a live CD of FreeBSD 11.2 -> ix0 status: no carrier

fichtner commented 6 years ago

@abplfab thanks for confirming. this is really bad :(

mimugmail commented 6 years ago

Wow .. really no one using FreeBSD 11.2 with VLANs in production? Not even a user on the pfSense beta, that really hard-tested (hail to the Netgate team) environment???

mimugmail commented 6 years ago

@fichtner I cannot confirm that I'm affected:

18.7 without VLANs:

ix1: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
        options=c400b8<VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,VLAN_HWTSO,TXCSUM_IPV6>
        ether ac:1f:6b:65:a5:a5
        hwaddr ac:1f:6b:65:a5:a5
        inet 10.12.12.1 netmask 0xffffff00 broadcast 10.12.12.255
        inet6 fe80::ae1f:6bff:fe65:a5a5%ix1 prefixlen 64 scopeid 0x2
        nd6 options=21<PERFORMNUD,AUTO_LINKLOCAL>
        media: Ethernet autoselect (10Gbase-SR <full-duplex,rxpause,txpause>)
        status: active

18.7 with VLANs:

ix1: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
        options=c500b8<VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,VLAN_HWFILTER,VLAN_HWTSO,TXCSUM_IPV6>
        ether ac:1f:6b:65:a5:a5
        hwaddr ac:1f:6b:65:a5:a5
        inet6 fe80::ae1f:6bff:fe65:a5a5%ix1 prefixlen 64 scopeid 0x2
        nd6 options=21<PERFORMNUD,AUTO_LINKLOCAL>
        media: Ethernet autoselect (10Gbase-SR <full-duplex,rxpause,txpause>)
        status: active

ix1_vlan111: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
        options=400000<TXCSUM_IPV6>
        ether ac:1f:6b:65:a5:a5
        inet6 fe80::ae1f:6bff:fe65:a5a5%ix1_vlan111 prefixlen 64 scopeid 0x11
        inet 10.12.12.1 netmask 0xffffff00 broadcast 10.12.12.255
        nd6 options=21<PERFORMNUD,AUTO_LINKLOCAL>
        media: Ethernet autoselect (10Gbase-SR <full-duplex,rxpause,txpause>)
        status: active
        vlan: 111 vlanpcp: 0 parent interface: ix1
        groups: vlan
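
The two `ix1` dumps above differ only in their `options=` line. A short Python sketch that diffs the capability names and the hex masks, using the two lines exactly as posted; `option_flags` is an illustrative helper, not an existing tool.

```python
import re

def option_flags(options_line):
    """Return (mask, names) parsed from an 'options=<hex><A,B,C>' line."""
    match = re.search(r"options=([0-9a-f]+)<([^>]*)>", options_line)
    if not match:
        return 0, set()
    names = {name for name in match.group(2).split(",") if name}
    return int(match.group(1), 16), names

WITHOUT_VLANS = "options=c400b8<VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,VLAN_HWTSO,TXCSUM_IPV6>"
WITH_VLANS = "options=c500b8<VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,VLAN_HWFILTER,VLAN_HWTSO,TXCSUM_IPV6>"

mask_a, names_a = option_flags(WITHOUT_VLANS)
mask_b, names_b = option_flags(WITH_VLANS)

print("added:", sorted(names_b - names_a))
print("removed:", sorted(names_a - names_b))
print("changed bits:", hex(mask_a ^ mask_b))
```

The only capability that appears once a VLAN is configured is `VLAN_HWFILTER`, which makes it a natural candidate to toggle when narrowing the problem down.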

https://www.thomas-krenn.com/en/products/application/hardware-it-security/opnsense-firewalls/ri1102d.html

ix0@pci0:4:0:0: class=0x020000 card=0x15ac15d9 chip=0x15ac8086 rev=0x00 hdr=0x00
    vendor     = 'Intel Corporation'
    device     = 'Ethernet Connection X552 10 GbE SFP+'
    class      = network
    subclass   = ethernet
ix1@pci0:4:0:1: class=0x020000 card=0x15ac15d9 chip=0x15ac8086 rev=0x00 hdr=0x00
    vendor     = 'Intel Corporation'
    device     = 'Ethernet Connection X552 10 GbE SFP+'
    class      = network
    subclass   = ethernet

ix0: <Intel(R) PRO/10GbE PCI-Express Network Driver, Version - 3.2.12-k> mem 0xfbc00000-0xfbdfffff,0xfbe04000-0xfbe07fff irq 11 at device 0.0 on pci5
ix0: Using MSI-X interrupts with 9 vectors
ix0: Ethernet address: ac:1f:6b:65:a5:a4
ix0: netmap queues/slots: TX 8/2048, RX 8/2048
ix1: <Intel(R) PRO/10GbE PCI-Express Network Driver, Version - 3.2.12-k> mem 0xfba00000-0xfbbfffff,0xfbe00000-0xfbe03fff irq 10 at device 0.1 on pci5
ix1: Using MSI-X interrupts with 9 vectors
ix1: Ethernet address: ac:1f:6b:65:a5:a5
ix1: netmap queues/slots: TX 8/2048, RX 8/2048
ix1: link state changed to UP

Tsuroerusu commented 6 years ago

@fichtner Unfortunately, I only have ix in my production firewalls, so I cannot experiment much at all with my OPNsense setup. However, I do have an identical ix-based card in a server that is not yet fully in production, and I can try running FreeBSD 11.2 live on that when I get back home in a few hours.

abplfab commented 6 years ago

I can use my slave carp firewall if any testing is needed. @mimugmail looks like you use SFP(+) modules, here I use DAC connected to the switch directly.

Tried to disable TSO / LRO -> no change with 18.7
Tried to disable VLAN hardware filtering -> no change with 18.7
And yes, I use VLAN on ix0

mimugmail commented 6 years ago

I replaced with Twinax ... same result:

ix1: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
        options=c500b8<VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,VLAN_HWFILTER,VLAN_HWTSO,TXCSUM_IPV6>
        ether ac:1f:6b:65:a5:a5
        hwaddr ac:1f:6b:65:a5:a5
        inet6 fe80::ae1f:6bff:fe65:a5a5%ix1 prefixlen 64 scopeid 0x2
        nd6 options=21<PERFORMNUD,AUTO_LINKLOCAL>
        media: Ethernet autoselect (10Gbase-Twinax <full-duplex,rxpause,txpause>)
        status: active

ix1_vlan111: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
        options=400000<TXCSUM_IPV6>
        ether ac:1f:6b:65:a5:a5
        inet6 fe80::ae1f:6bff:fe65:a5a5%ix1_vlan111 prefixlen 64 scopeid 0x11
        inet 10.12.12.1 netmask 0xffffff00 broadcast 10.12.12.255
        nd6 options=21<PERFORMNUD,AUTO_LINKLOCAL>
        media: Ethernet autoselect (10Gbase-Twinax <full-duplex,rxpause,txpause>)
        status: active
        vlan: 111 vlanpcp: 0 parent interface: ix1
        groups: vlan

ruffy91 commented 6 years ago

This: https://sourceforge.net/p/e1000/mailman/message/35263903/ points to a HW/FW problem (similar problem with linux kernel 4.7/4.8). @abplfab Have you tried to loopback the twinax cable to the NIC to make sure it is not a compatibility problem with the switch?

abplfab commented 6 years ago

@ruffy91 yep. Looks exactly like that. Loop ix0 - ix1 I get link. Switch is a Netgear XS716T :(

ruffy91 commented 6 years ago

There is a FW for the switch which describes this problem: https://kb.netgear.com/000038683/XS708T-XS716T-Firmware-Version-6-6-1-7

abplfab commented 6 years ago

Yep, but I'm running firmware 6.6.3.3

ruffy91 commented 6 years ago

There are also problems (maybe unrelated, but they show that the NIC FW plays a role in compatibility) with other mainboards from Supermicro: https://tinkertry.com/how-to-work-around-intermittent-intel-x557-network-outages-on-12-core-xeon-d and https://forums.freebsd.org/threads/driver-for-intel-pci-e-10-gigabit-nic-specifically-x552-x557-at.57536/ I would suggest opening a case with Supermicro to ask if there are known problems or a newer firmware for the Intel NIC.

We have Huawei R1288H v5 which get regular Intel NIC FW updates (Intel X722), unfortunately SuperMicro does not seem to release the NIC FW to customers as you can see in the linked blogpost.

Maybe you can find and tell the firmware-version in kernel/driver logs as further data points for comparison with other platforms which do not have the problem.

It seems to me that 10GbE is still in its infancy, despite 40/100GbE now becoming mainstream for the hyperscalers.

abplfab commented 6 years ago

Opened a case with Supermicro. Let's see.

abplfab commented 6 years ago

Got new firmware for the NICs, but doesn't help :(. Firmware attached. "To flash, please boot from a DOS bootable USB stick and run the "7TP8F.BAT" batch file in the attached package. The rest is automatic." sdvtp2c.zip

mimugmail commented 6 years ago

@abplfab Just to summarize:

OPNsense 18.1.13 works with SFP+ to Switch
OPNsense 18.7 doesn't work with SFP+ to Switch
Downgrade to 18.1.13, power off and it works again to Switch
OPNsense 18.7 from Port1 to Port2 on Dual NIC works
FreeBSD 11.2 live CD doesn't work

What about FreeBSD 12 and 11.1 (to really rule out that it's the live CD itself)?

@Tsuroerusu Has identical problem, gets fixed with booting old kernel. Have you already tried the loop? What are your hardware specs? What about live CD?

I'm waiting for my lab to come back to test with ixl instead of ix, but for me everything works. I always test with a second machine (no switch), with both Direct Attach and GBIC plus fiber-optic cable.

ruffy91 commented 6 years ago

That was super fast. Unfortunately I am out of ideas. @fichtner I can only say that we use Intel X722 and 18.7 is the first usable release with the new drivers. We now have working CARP and everything is smooth. Is this revert only for the ix/ixgbe driver or also for ixl?

fichtner commented 6 years ago

@ruffy91 ixl is a separate driver backport we are keeping if you say it works fine now :)

abplfab commented 6 years ago

@mimugmail correct.

FreeBSD 11.1 live CD: works
FreeBSD 12.0-CURRENT live CD: doesn't work
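
Taken together, the live CD results reported so far for this X552 + DAC + Netgear setup bracket the change between FreeBSD 11.1 and 11.2. A small Python sketch of that reasoning, with the observations entered by hand from this thread (`regression_window` is a hypothetical helper):

```python
# Live CD results reported in this thread: (release, link came up?)
OBSERVATIONS = [
    ((11, 1), True),   # FreeBSD 11.1 live CD: works
    ((11, 2), False),  # FreeBSD 11.2 live CD: no carrier
    ((12, 0), False),  # FreeBSD 12.0-CURRENT live CD: doesn't work
]

def regression_window(observations):
    """Return (last_good, first_bad), or None if results do not bracket one change."""
    last_good = None
    first_bad = None
    for version, works in sorted(observations):
        if works:
            if first_bad is not None:
                return None  # a newer release works again: not a clean bracket
            last_good = version
        elif first_bad is None:
            first_bad = version
    if last_good is None or first_bad is None:
        return None
    return last_good, first_bad

print(regression_window(OBSERVATIONS))
```

This only narrows down where to look; it says nothing about whether the driver, the NIC firmware or the switch is at fault.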

mimugmail commented 6 years ago

So, with a X710 and ixl it works too:

18.7 no VLANs:
ixl0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
        options=6402b8<VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,TSO6,VLAN_HWTSO,RXCSUM_IPV6,TXCSUM_IPV6>
        ether 3c:fd:fe:9e:e7:48
        hwaddr 3c:fd:fe:9e:e7:48
        inet6 fe80::3efd:feff:fe9e:e748%ixl0 prefixlen 64 scopeid 0x1
        inet 10.55.1.1 netmask 0xffffff00 broadcast 10.55.1.255
        nd6 options=21<PERFORMNUD,AUTO_LINKLOCAL>
        media: Ethernet autoselect (10Gbase-Twinax <full-duplex>)
        status: active

18.7 with VLANs:
ixl0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
        options=6002a8<VLAN_MTU,JUMBO_MTU,VLAN_HWCSUM,TSO6,RXCSUM_IPV6,TXCSUM_IPV6>
        ether 3c:fd:fe:9e:e7:48
        hwaddr 3c:fd:fe:9e:e7:48
        inet6 fe80::3efd:feff:fe9e:e748%ixl0 prefixlen 64 scopeid 0x1
        nd6 options=21<PERFORMNUD,AUTO_LINKLOCAL>
        media: Ethernet autoselect (10Gbase-Twinax <full-duplex>)
        status: active

ixl0_vlan111: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
        ether 3c:fd:fe:9e:e7:48
        inet6 fe80::3efd:feff:fe9e:e748%ixl0_vlan111 prefixlen 64 scopeid 0xa
        inet 10.55.2.1 netmask 0xffffff00 broadcast 10.55.2.255
        nd6 options=21<PERFORMNUD,AUTO_LINKLOCAL>
        media: Ethernet autoselect (10Gbase-Twinax <full-duplex>)
        status: active
        vlan: 111 vlanpcp: 0 parent interface: ixl0
        groups: vlan

ixl0@pci0:1:0:0:        class=0x020000 card=0x00088086 chip=0x15728086 rev=0x01 hdr=0x00
    vendor     = 'Intel Corporation'
    device     = 'Ethernet Controller X710 for 10GbE SFP+'
    class      = network
    subclass   = ethernet

ixl0: <Intel(R) Ethernet Connection 700 Series PF Driver, Version - 1.9.9-k> mem 0xdd000000-0xdd7fffff,0xdd808000-0xdd80ffff irq 16 at device 0.0 on pci1
ixl0: using 1024 tx descriptors and 1024 rx descriptors
ixl0: fw 5.0.40043 api 1.5 nvm 5.05 etid 80002892 oem 1.262.0
ixl0: PF-ID[0]: VFs 64, MSIX 129, VF MSIX 5, QPs 768, I2C
ixl0: Using MSIX interrupts with 9 vectors
ixl0: Allocating 8 queues for PF LAN VSI; 8 queues active
ixl0: Ethernet address: 3c:fd:fe:9e:e7:48
ixl0: PCI Express Bus: Speed 8.0GT/s Width x8
ixl0: SR-IOV ready
ixl0: netmap queues/slots: TX 8/1024, RX 8/1024
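
Following the earlier suggestion to collect firmware versions from the driver logs as data points, the `fw ... nvm ...` line that ixl prints above can be extracted like this. It is a Python sketch that assumes the exact line format shown in this dump; the ix output quoted elsewhere in this thread does not include such a line, so this covers ixl only. `firmware_info` is an illustrative name.

```python
import re

# Format assumed from the ixl0 line quoted above
FW_LINE = re.compile(
    r"^(?P<ifname>\w+): fw (?P<fw>\S+) api (?P<api>\S+) "
    r"nvm (?P<nvm>\S+) etid (?P<etid>\S+) oem (?P<oem>\S+)"
)

def firmware_info(dmesg_text):
    """Map interface name to the firmware fields found in dmesg output."""
    found = {}
    for line in dmesg_text.splitlines():
        match = FW_LINE.match(line)
        if match:
            info = match.groupdict()
            found[info.pop("ifname")] = info
    return found

SAMPLE = """\
ixl0: using 1024 tx descriptors and 1024 rx descriptors
ixl0: fw 5.0.40043 api 1.5 nvm 5.05 etid 80002892 oem 1.262.0
ixl0: Using MSIX interrupts with 9 vectors
"""

print(firmware_info(SAMPLE))
```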
Tsuroerusu commented 6 years ago

@mimugmail My problem is actually a little different, it is a half-way house of what the rest of you seem to be dealing with.

For me the summary would be:

  1. OPNsense 18.1.13 works across the board, no problems.

  2. OPNsense 18.7 seems to work when the ix NICs are utilized without VLANs. To elaborate a bit: I use a Supermicro AOC-STGN-i2S card (Intel 82599), which provides two ix NICs via SFP+ ports for each of my two firewall nodes. ix0 on each node is the WAN port (and as such has no VLANs or other special stuff), and is connected to a Juniper EX3300 switch via DAC cables. ix1 on each node is connected, also via DAC cables, to a D-Link DGS-3420-52T switch. ix1 is configured only with VLANs, 5 in total, and has no non-VLAN configuration. When I boot into the 18.7 kernel, ix0 works fine and I can access the Internet without any problems, but ix1 is completely dead saying "no carrier". Before upgrading, this setup had been working swimmingly, no incompatibilities at all to speak of for over a year. So my problem seems to be exclusively with VLANs in regard to 18.7 and ix NICs.

  3. I have not tried downgrading, because it only took a boot into kernel.old to make things work again, and it was not necessary to power cycle, just a simple reboot was enough.

  4. I have another AOC-STGN-i2S card in one of my AMD-based servers (which admittedly is quite a different beast than my firewalls, which are Atom C2000-based), and when booting the FreeBSD 11.2 install media in live mode, this is what happens:
     a) The first thing I notice is that, even when unconfigured, the switch shows green lights on those ports.
     b) I can successfully configure both ix0 and ix1 as non-VLAN interfaces.
     c) I can successfully configure both ix0 and ix1 as VLAN interfaces.

  5. I have not tried loop-backing the NICs, because ix0 to the Juniper switch works, and in my server both NICs to the D-Link switch work as well. As far as I can tell, I am not suffering from hardware incompatibilities.

From my perspective, vanilla FreeBSD 11.2 seems to work fine for my AOC-STGN-i2S card. So my first thought is that the backport of the driver has a problem for some reason, but I am no kernel hacker at all, so I cannot really tell.

mimugmail commented 6 years ago

Next 18.7 system with ix0, X520 NIC, also working fine. I'm trying to search for a 10G switch to reproduce, but atm it seems to be a very specific problem :(

root@OPNsense:~ # ifconfig ix0
ix0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
        options=c000a8<VLAN_MTU,JUMBO_MTU,VLAN_HWCSUM,TXCSUM_IPV6>
        ether 90:e2:ba:39:1f:10
        hwaddr 90:e2:ba:39:1f:10
        inet6 fe80::92e2:baff:fe39:1f10%ix0 prefixlen 64 scopeid 0x1
        nd6 options=21<PERFORMNUD,AUTO_LINKLOCAL>
        media: Ethernet autoselect (10Gbase-Twinax <full-duplex,rxpause,txpause>)
        status: active
root@OPNsense:~ # ifconfig ix0_vlan111
ix0_vlan111: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
        ether 90:e2:ba:39:1f:10
        inet6 fe80::92e2:baff:fe39:1f10%ix0_vlan111 prefixlen 64 scopeid 0x9
        inet 10.55.2.2 netmask 0xffffff00 broadcast 10.55.2.255
        nd6 options=21<PERFORMNUD,AUTO_LINKLOCAL>
        media: Ethernet autoselect (10Gbase-Twinax <full-duplex,rxpause,txpause>)
        status: active
        vlan: 111 vlanpcp: 0 parent interface: ix0
        groups: vlan
root@OPNsense:~ # dmesg | grep ix0
ix0: <Intel(R) PRO/10GbE PCI-Express Network Driver, Version - 3.2.12-k> port 0xe020-0xe03f mem 0xdde80000-0xddefffff,0xddf04000-0xddf07fff irq 16 at device 0.0 on pci1
ix0: Using MSI-X interrupts with 9 vectors
ix0: Ethernet address: 90:e2:ba:39:1f:10
ix0: PCI Express Bus: Speed 5.0GT/s Width x4
ix0: netmap queues/slots: TX 8/2048, RX 8/2048
vlan0: changing name to 'ix0_vlan111'
ix0: link state changed to UP
ix0_vlan111: link state changed to UP

ix0@pci0:1:0:0: class=0x020000 card=0x00038086 chip=0x10fb8086 rev=0x01 hdr=0x00
    vendor     = 'Intel Corporation'
    device     = '82599ES 10-Gigabit SFI/SFP+ Network Connection'
    class      = network
    subclass   = ethernet

EDIT: This is a Fujitsu RX1300 system ...
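
When comparing a working box like this one against an affected one, the `link state changed to` lines in `dmesg` are a quick signal. This Python sketch keeps the last reported state per interface, assuming the message format seen in the output above; `last_link_state` is a made-up helper name.

```python
import re

LINK_EVENT = re.compile(r"^(\S+): link state changed to (UP|DOWN)$")

def last_link_state(dmesg_text):
    """Return the most recent link state reported for each interface."""
    state = {}
    for line in dmesg_text.splitlines():
        match = LINK_EVENT.match(line.strip())
        if match:
            state[match.group(1)] = match.group(2)
    return state

# Lines taken from the dmesg output quoted above
SAMPLE = """\
ix0: netmap queues/slots: TX 8/2048, RX 8/2048
vlan0: changing name to 'ix0_vlan111'
ix0: link state changed to UP
ix0_vlan111: link state changed to UP
"""

print(last_link_state(SAMPLE))
```

An interface that never gets a carrier will have no entry at all, which is itself the useful observation.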

abplfab commented 6 years ago

Contacted Netgear, they are checking with their team...

fichtner commented 6 years ago

This may be related: https://github.com/opnsense/core/commit/4ba0fa679

Try to use Interfaces: Settings: VLAN Hardware Filtering: "Leave default".

abplfab commented 6 years ago

Doesn't help :(

fichtner commented 6 years ago

Ok, bummer. It was a long shot :(

abplfab commented 6 years ago

Tried today with an M5300-52G Netgear Switch (11.0.0.31, B1.0.0.5). Same result: no link.

Tsuroerusu commented 6 years ago

Was this issue resolved or somewhat alleviated in 18.7.1? I ask because I could not find anything about it in the release notes.

fichtner commented 6 years ago

No change in 18.7.1. kernel.old has rotated (it is now the 18.7 kernel because there is a new 18.7.1 kernel), but the manual revert to 18.1.11 should still work, minus the set verification (-i):

# opnsense-update -ikr 18.1.11 -n "18.1\/dummy"