Closed abplfab closed 5 years ago
the joys of intel driver updates :(
run this and reboot
# opnsense-update -kr 18.1.11 -n "18.1\/dummy"
ps booting from kernel.old should also work via boot menu
Thanks
if this indeed works we need to take this to FreeBSD soon if 11.2 has the same defect
Doesn't help :(
makes no sense at all ?!
what's the output of:
# uname -a
FreeBSD asterix2.lan.neratec.com 11.1-RELEASE-p11 FreeBSD 11.1-RELEASE-p11 21b4c8ea1d5(stable/18.7) amd64
Did the opnsense-update -kr 18.1.11 -n "18.1\/dummy"
again, now:
root@asterix2:~ # uname -a
FreeBSD asterix2.lan.neratec.com 11.1-RELEASE-p11 FreeBSD 11.1-RELEASE-p11 116e406d37f(stable/18.1) amd64
but still:
ix0: flags=8943<UP,BROADCAST,RUNNING,PROMISC,SIMPLEX,MULTICAST> metric 0 mtu 1500
options=e507bb<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,TSO4,TSO6,LRO,VLAN_HWFILTER,VLAN_HWTSO,RXCSUM_IPV6,TXCSUM_IPV6>
ether 0c:c4:7a:97:ee:ee
hwaddr 0c:c4:7a:97:ee:ee
inet6 fe80::ec4:7aff:fe97:eeee%ix0 prefixlen 64 scopeid 0x1
inet6 2a02:aa08:e000:902::253 prefixlen 64
inet6 2a02:aa08:e000:902::10 prefixlen 64 vhid 18
inet 192.168.11.253 netmask 0xffffff00 broadcast 192.168.11.255
inet 192.168.11.10 netmask 0xffffff00 broadcast 192.168.11.255 vhid 11
nd6 options=21<PERFORMNUD,AUTO_LINKLOCAL>
media: Ethernet autoselect
status: no carrier
carp: INIT vhid 11 advbase 1 advskew 100
carp: INIT vhid 18 advbase 1 advskew 100
ok, kernels are correctly replaced. I'm unsure how any other change would relate to the reported issue of the driver reporting no carrier anymore.
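The kernel check above can be scripted. This is only a minimal sketch: `kernel_branch` is a hypothetical helper name, and it assumes the branch always appears in `uname -a` output as `(stable/<version>)`, as it does in the two outputs quoted in this thread.

```shell
# kernel_branch: read `uname -a` output on stdin and print the OPNsense
# kernel branch, e.g. "stable/18.1". Hypothetical helper; it assumes the
# "(stable/<version>)" pattern seen in the outputs quoted above.
kernel_branch() {
  sed -n 's/.*(\(stable\/[0-9.]*\)).*/\1/p'
}

# Usage on the firewall after the reboot:
#   uname -a | kernel_branch
```

If this prints stable/18.1 after running the opnsense-update command and rebooting, the old kernel is the one that booted.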
Would it make sense to try install new with 18.7 and import the config? This system is the slave of a carp cluster, so currently no hurry...
going back to 18.1.13 would make more sense than trying again with 18.7 (18.1.6 config import + install and update back to 18.1.13), but the chance of ix0 finding its carrier is no better than 50%. it could be hardware related.
Could it be that FreeBSD 11.2 ships with a new binary firmware blob that bricked your NIC? That's the only theory I have besides it's fully the hardware's fault.
No idea, the server has a second NIC (ix1), will try this one and report...
So with ix NICs better not update to 18.7.
System Supermicro SYS-5018D-FN8T
Mainboard X10SDV-TP8F
Firmware versions BIOS 1.3, IPMI/BMC: 3.68 Redfish Version : 1.0.1
NIC ix0@pci0:4:0:0: class=0x020000 card=0x15ac15d9 chip=0x15ac8086 rev=0x00 hdr=0x00
vendor = 'Intel Corporation'
device = 'Ethernet Connection X552 10 GbE SFP+'
class = network
subclass = ethernet
CPU Intel(R) Xeon(R) CPU D-1518 @ 2.20GHz (8 cores)
@fichtner I have two identical boxes like the one above configured but not in production, ping me the usual way when you need access.
FYI: a server reset or power off/on by ipmi/bmc doesn't "fix" the NIC hang. Had to remove power from the box.
I am seeing something that could well be related to this issue. After upgrading my company setup (A HA setup of two firewalls) to 18.7, the result was that none of my VLAN interfaces on ix1 activate properly, however the non-VLAN ix0 works fine. On the main page, the VLAN interfaces are marked with red saying "Ethernet autoselect", and under System --> Interfaces --> Overview their status says "no carrier". All of my igb* interfaces come up without any issue. I have tried changing the VLAN configuration a bit hoping that a re-write of the configuration would solve it, but to no avail.
Update: I just tried booting kernel.old, as was suggested earlier in this thread, and just doing that has resolved the problem on both of my nodes! I did not have to do any actual power cycling, just booting into kernel.old did it.
@Tsuroerusu Can you try the 18.1.11 kernel as well?
# opnsense-update -kr 18.1.11 -n "18.1\/dummy"
(reboot)
@abplfab said you need to remove the power, otherwise the carrier will not come up on a quick reboot.
Relevant driver update was reverted and will be gone from 18.7.1. It's unclear if the issue exists in FreeBSD 11.2 but we'll find out soon (@mimugmail could you check this with your test system).
@fichtner Update is in progress .. need to figure out what exactly happens with and without VLANs.
@mimugmail thanks a lot!
Just started a live CD of FreeBSD 11.2 -> ix0 status: no carrier
@abplfab thanks for confirming. this is really bad :(
Wow .. really no one using FreeBSD 11.2 with VLANs in production? Not even a user on the pfSense beta, that real hard-tested (hail to the Netgate team) environment???
@fichtner I cannot confirm that I'm affected:
18.7 without VLANs:
ix1: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
options=c400b8<VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,VLAN_HWTSO,TXCSUM_IPV6>
ether ac:1f:6b:65:a5:a5
hwaddr ac:1f:6b:65:a5:a5
inet 10.12.12.1 netmask 0xffffff00 broadcast 10.12.12.255
inet6 fe80::ae1f:6bff:fe65:a5a5%ix1 prefixlen 64 scopeid 0x2
nd6 options=21<PERFORMNUD,AUTO_LINKLOCAL>
media: Ethernet autoselect (10Gbase-SR <full-duplex,rxpause,txpause>)
status: active
18.7 with VLANs:
ix1: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
options=c500b8<VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,VLAN_HWFILTER,VLAN_HWTSO,TXCSUM_IPV6>
ether ac:1f:6b:65:a5:a5
hwaddr ac:1f:6b:65:a5:a5
inet6 fe80::ae1f:6bff:fe65:a5a5%ix1 prefixlen 64 scopeid 0x2
nd6 options=21<PERFORMNUD,AUTO_LINKLOCAL>
media: Ethernet autoselect (10Gbase-SR <full-duplex,rxpause,txpause>)
status: active
ix1_vlan111: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
options=400000<TXCSUM_IPV6>
ether ac:1f:6b:65:a5:a5
inet6 fe80::ae1f:6bff:fe65:a5a5%ix1_vlan111 prefixlen 64 scopeid 0x11
inet 10.12.12.1 netmask 0xffffff00 broadcast 10.12.12.255
nd6 options=21<PERFORMNUD,AUTO_LINKLOCAL>
media: Ethernet autoselect (10Gbase-SR <full-duplex,rxpause,txpause>)
status: active
vlan: 111 vlanpcp: 0 parent interface: ix1
groups: vlan
https://www.thomas-krenn.com/en/products/application/hardware-it-security/opnsense-firewalls/ri1102d.html
ix0@pci0:4:0:0: class=0x020000 card=0x15ac15d9 chip=0x15ac8086 rev=0x00 hdr=0x00
vendor = 'Intel Corporation'
device = 'Ethernet Connection X552 10 GbE SFP+'
class = network
subclass = ethernet
ix1@pci0:4:0:1: class=0x020000 card=0x15ac15d9 chip=0x15ac8086 rev=0x00 hdr=0x00
vendor = 'Intel Corporation'
device = 'Ethernet Connection X552 10 GbE SFP+'
class = network
subclass = ethernet
ix0: <Intel(R) PRO/10GbE PCI-Express Network Driver, Version - 3.2.12-k> mem 0xfbc00000-0xfbdfffff,0xfbe04000-0xfbe07fff irq 11 at device 0.0 on pci5
ix0: Using MSI-X interrupts with 9 vectors
ix0: Ethernet address: ac:1f:6b:65:a5:a4
ix0: netmap queues/slots: TX 8/2048, RX 8/2048
ix1: <Intel(R) PRO/10GbE PCI-Express Network Driver, Version - 3.2.12-k> mem 0xfba00000-0xfbbfffff,0xfbe00000-0xfbe03fff irq 10 at device 0.1 on pci5
ix1: Using MSI-X interrupts with 9 vectors
ix1: Ethernet address: ac:1f:6b:65:a5:a5
ix1: netmap queues/slots: TX 8/2048, RX 8/2048
ix1: link state changed to UP
@fichtner Unfortunately, I only have ix in my production firewalls, so I cannot experiment much at all with my OPNsense setup. However, I do have an identical ix-based card in a server that is not yet fully in production, and I can try running FreeBSD 11.2 live on that when I get back home in a few hours.
I can use my slave carp firewall if any testing is needed. @mimugmail looks like you use SFP(+) modules, here I use DAC connected to the switch directly.
Tried to disable TSO / LRO -> no change with 18.7
Tried to disable VLAN hardware filtering -> no change with 18.7
And yes, I use VLAN on ix0
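For anyone repeating that test from a shell instead of the GUI, the following is a rough sketch. `offload_off` and the `RUN` variable are hypothetical names, and it assumes the driver accepts the standard FreeBSD ifconfig(8) flags `-tso`, `-lro` and `-vlanhwfilter`. The poster may well have used Interfaces: Settings in the GUI instead. The change does not survive a reboot.

```shell
# offload_off: print (or, with RUN=1, execute) the ifconfig call that turns
# off TSO, LRO and VLAN hardware filtering on one interface.
# Hypothetical helper; executing it requires root on the firewall.
offload_off() {
  iface="$1"
  set -- ifconfig "$iface" -tso -lro -vlanhwfilter
  if [ "${RUN:-0}" = "1" ]; then
    "$@"          # really apply the change
  else
    echo "$*"     # dry run: only show the command
  fi
}

# Usage:
#   offload_off ix0          # show the command
#   RUN=1 offload_off ix0    # apply it, then check `ifconfig ix0`
```

The dry-run default is deliberate: on a CARP node it is safer to see the exact command before touching a live interface.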
I replaced with Twinax ... same result:
ix1: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
options=c500b8<VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,VLAN_HWFILTER,VLAN_HWTSO,TXCSUM_IPV6>
ether ac:1f:6b:65:a5:a5
hwaddr ac:1f:6b:65:a5:a5
inet6 fe80::ae1f:6bff:fe65:a5a5%ix1 prefixlen 64 scopeid 0x2
nd6 options=21<PERFORMNUD,AUTO_LINKLOCAL>
media: Ethernet autoselect (10Gbase-Twinax <full-duplex,rxpause,txpause>)
status: active
ix1_vlan111: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
options=400000<TXCSUM_IPV6>
ether ac:1f:6b:65:a5:a5
inet6 fe80::ae1f:6bff:fe65:a5a5%ix1_vlan111 prefixlen 64 scopeid 0x11
inet 10.12.12.1 netmask 0xffffff00 broadcast 10.12.12.255
nd6 options=21<PERFORMNUD,AUTO_LINKLOCAL>
media: Ethernet autoselect (10Gbase-Twinax <full-duplex,rxpause,txpause>)
status: active
vlan: 111 vlanpcp: 0 parent interface: ix1
groups: vlan
This: https://sourceforge.net/p/e1000/mailman/message/35263903/ points to a HW/FW problem (similar problem with linux kernel 4.7/4.8). @abplfab Have you tried to loopback the twinax cable to the NIC to make sure it is not a compatibility problem with the switch?
@ruffy91 yep. Looks exactly like that. Loop ix0 - ix1 I get link. Switch is a Netgear XS716T :(
There is a FW for the switch which describes this problem: https://kb.netgear.com/000038683/XS708T-XS716T-Firmware-Version-6-6-1-7
Yep, but I'm running firmware 6.6.3.3
There are also problems (maybe unrelated, but it shows that the NIC FW plays a role in compatibility) with other mainboards from Supermicro: https://tinkertry.com/how-to-work-around-intermittent-intel-x557-network-outages-on-12-core-xeon-d and https://forums.freebsd.org/threads/driver-for-intel-pci-e-10-gigabit-nic-specifically-x552-x557-at.57536/ I would suggest opening a case with Supermicro to ask if there are known problems or a newer firmware for the Intel NIC.
We have Huawei R1288H v5 which get regular Intel NIC FW updates (Intel X722), unfortunately SuperMicro does not seem to release the NIC FW to customers as you can see in the linked blogpost.
Maybe you can find and tell the firmware-version in kernel/driver logs as further data points for comparison with other platforms which do not have the problem.
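A small sketch for collecting those data points from the boot log. `driver_versions` is a hypothetical helper name, and it assumes the probe line format `<name>: <... Version - x.y.z>` that the ix and ixl drivers print in the dmesg excerpts in this thread. In those excerpts only ixl prints a firmware line (`fw ... nvm ...`), so for ix this yields the driver version only.

```shell
# driver_versions: read dmesg output on stdin and print "<iface> <version>"
# for every driver probe line that carries a "Version - ..." string.
# Hypothetical helper built around the line format quoted in this thread.
driver_versions() {
  sed -n 's/^\([a-z]*[0-9]*\): <.*Version - \([^>]*\)>.*/\1 \2/p'
}

# Usage on the firewall:
#   dmesg | driver_versions
#   dmesg | grep -E '^ixl?[0-9]+: fw '    # firmware line, if the driver prints one
```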
It seems to me that 10GbE is still in its infancy, despite 40/100GbE now becoming mainstream for the hyperscalers
Opened a case with supermicro. Let's see.
Got new firmware for the NICs, but doesn't help :(. Firmware attached. "To flash, please boot from a DOS bootable USB stick and run the "7TP8F.BAT" batch file in the attached package. The rest is automatic." sdvtp2c.zip
@abplfab Just to summarize:
OPNsense 18.1.13 works with SFP+ to Switch
OPNsense 18.7 doesn't work with SFP+ to Switch
Downgrade to 18.1.13, power off and it works again to Switch
OPNsense 18.7 from Port1 to Port2 on Dual NIC works
FreeBSD 11.2 live CD doesn't work
What about FreeBSD 12 and 11.1 (to really rule out that it's the live CD itself)?
@Tsuroerusu Has identical problem, gets fixed with booting old kernel. Have you already tried the loop? What are your hardware specs? What about live CD?
I'm waiting for my lab to come back and test with ixl instead of ix, but for me everything works. I always test with a second machine (no switch), with both Direct Attach and GBIC with fiber-optic cable.
That was super fast. Unfortunately I am out of ideas. @fichtner I can only say that we use Intel X722 and 18.7 is the first usable release with the new drivers. We now have working CARP and everything is smooth. Is this revert only for the ix/ixgbe driver or also for ixl?
@ruffy91 ixl is a separate driver backport we are keeping if you say it works fine now :)
@mimugmail correct.
FreeBSD 11.1 live CD: works
FreeBSD 12.0-CURRENT live CD: doesn't work
So, with a X710 and ixl it works too:
18.7 no VLANs:
ixl0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
options=6402b8<VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,TSO6,VLAN_HWTSO,RXCSUM_IPV6,TXCSUM_IPV6>
ether 3c:fd:fe:9e:e7:48
hwaddr 3c:fd:fe:9e:e7:48
inet6 fe80::3efd:feff:fe9e:e748%ixl0 prefixlen 64 scopeid 0x1
inet 10.55.1.1 netmask 0xffffff00 broadcast 10.55.1.255
nd6 options=21<PERFORMNUD,AUTO_LINKLOCAL>
media: Ethernet autoselect (10Gbase-Twinax <full-duplex>)
status: active
18.7 with VLANs:
ixl0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
options=6002a8<VLAN_MTU,JUMBO_MTU,VLAN_HWCSUM,TSO6,RXCSUM_IPV6,TXCSUM_IPV6>
ether 3c:fd:fe:9e:e7:48
hwaddr 3c:fd:fe:9e:e7:48
inet6 fe80::3efd:feff:fe9e:e748%ixl0 prefixlen 64 scopeid 0x1
nd6 options=21<PERFORMNUD,AUTO_LINKLOCAL>
media: Ethernet autoselect (10Gbase-Twinax <full-duplex>)
status: active
ixl0_vlan111: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
ether 3c:fd:fe:9e:e7:48
inet6 fe80::3efd:feff:fe9e:e748%ixl0_vlan111 prefixlen 64 scopeid 0xa
inet 10.55.2.1 netmask 0xffffff00 broadcast 10.55.2.255
nd6 options=21<PERFORMNUD,AUTO_LINKLOCAL>
media: Ethernet autoselect (10Gbase-Twinax <full-duplex>)
status: active
vlan: 111 vlanpcp: 0 parent interface: ixl0
groups: vlan
ixl0@pci0:1:0:0: class=0x020000 card=0x00088086 chip=0x15728086 rev=0x01 hdr=0x00
vendor = 'Intel Corporation'
device = 'Ethernet Controller X710 for 10GbE SFP+'
class = network
subclass = ethernet
ixl0: <Intel(R) Ethernet Connection 700 Series PF Driver, Version - 1.9.9-k> mem 0xdd000000-0xdd7fffff,0xdd808000-0xdd80ffff irq 16 at device 0.0 on pci1
ixl0: using 1024 tx descriptors and 1024 rx descriptors
ixl0: fw 5.0.40043 api 1.5 nvm 5.05 etid 80002892 oem 1.262.0
ixl0: PF-ID[0]: VFs 64, MSIX 129, VF MSIX 5, QPs 768, I2C
ixl0: Using MSIX interrupts with 9 vectors
ixl0: Allocating 8 queues for PF LAN VSI; 8 queues active
ixl0: Ethernet address: 3c:fd:fe:9e:e7:48
ixl0: PCI Express Bus: Speed 8.0GT/s Width x8
ixl0: SR-IOV ready
ixl0: netmap queues/slots: TX 8/1024, RX 8/1024
@mimugmail My problem is actually a little different, it is a half-way house of what the rest of you seem to be dealing with.
For me the summary would be:
OPNsense 18.1.13 works across the board, no problems.
OPNsense 18.7 seems to work when the ix NICs are utilized without VLANs. To elaborate a bit: I use a Supermicro AOC-STGN-i2S card (Intel 82599), which provides two ix NICs via SFP+ ports for each of my two firewall nodes. ix0 on each node is the WAN port (and as such has no VLANs or other special stuff), and is connected to a Juniper EX3300 switch via DAC cables. ix1 on each node is connected, also via DAC cables, to a D-Link DGS-3420-52T switch. ix1 is configured only with VLANs, 5 in total, and has no non-VLAN configuration. When I boot into the 18.7 kernel, ix0 works fine and I can access the Internet without any problems, but ix1 is completely dead, saying "no carrier". Before upgrading, this setup had been working swimmingly, with no incompatibilities at all to speak of for over a year. So my problem seems to be exclusively with VLANs in regard to 18.7 and ix NICs.
I have not tried downgrading, because it only took a boot into kernel.old to make things work again, and it was not necessary to power cycle, just a simple reboot was enough.
I have another AOC-STGN-i2S card in one of my AMD-based servers (which admittedly is quite a different beast than my firewalls, which are Atom C2000-based), and when booting the FreeBSD 11.2 install media in live mode, this is what happens:
a) The first thing I notice is that, even when unconfigured, the switch shows green lights on those ports.
b) I can successfully configure both ix0 and ix1 as non-VLAN interfaces.
c) I can successfully configure both ix0 and ix1 as VLAN interfaces.
I have not tried loop-backing the NICs, because ix0 to the Juniper switch works, and in my server both NICs to the D-Link switch work as well. As far as I can tell, I am not suffering from hardware incompatibilities.
From my perspective, vanilla FreeBSD 11.2 seems to work fine for my AOC-STGN-i2S card. So my first thought is that the backport of the driver has a problem for some reason, but I am no kernel hacker at all, so I cannot really tell.
Next 18.7 system with ix0, X520 NIC, also working fine. I'm trying to search for a 10G switch to reproduce, but atm it seems to be a very specific problem :(
root@OPNsense:~ # ifconfig ix0
ix0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
options=c000a8<VLAN_MTU,JUMBO_MTU,VLAN_HWCSUM,TXCSUM_IPV6>
ether 90:e2:ba:39:1f:10
hwaddr 90:e2:ba:39:1f:10
inet6 fe80::92e2:baff:fe39:1f10%ix0 prefixlen 64 scopeid 0x1
nd6 options=21<PERFORMNUD,AUTO_LINKLOCAL>
media: Ethernet autoselect (10Gbase-Twinax <full-duplex,rxpause,txpause>)
status: active
root@OPNsense:~ # ifconfig ix0_vlan111
ix0_vlan111: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
ether 90:e2:ba:39:1f:10
inet6 fe80::92e2:baff:fe39:1f10%ix0_vlan111 prefixlen 64 scopeid 0x9
inet 10.55.2.2 netmask 0xffffff00 broadcast 10.55.2.255
nd6 options=21<PERFORMNUD,AUTO_LINKLOCAL>
media: Ethernet autoselect (10Gbase-Twinax <full-duplex,rxpause,txpause>)
status: active
vlan: 111 vlanpcp: 0 parent interface: ix0
groups: vlan
root@OPNsense:~ # dmesg | grep ix0
ix0: <Intel(R) PRO/10GbE PCI-Express Network Driver, Version - 3.2.12-k> port 0xe020-0xe03f mem 0xdde80000-0xddefffff,0xddf04000-0xddf07fff irq 16 at device 0.0 on pci1
ix0: Using MSI-X interrupts with 9 vectors
ix0: Ethernet address: 90:e2:ba:39:1f:10
ix0: PCI Express Bus: Speed 5.0GT/s Width x4
ix0: netmap queues/slots: TX 8/2048, RX 8/2048
vlan0: changing name to 'ix0_vlan111'
ix0: link state changed to UP
ix0_vlan111: link state changed to UP
ix0@pci0:1:0:0: class=0x020000 card=0x00038086 chip=0x10fb8086 rev=0x01 hdr=0x00
vendor = 'Intel Corporation'
device = '82599ES 10-Gigabit SFI/SFP+ Network Connection'
class = network
subclass = ethernet
EDIT: This is a Fujitsu RX1300 system ...
Contacted Netgear, they check with their team...
This may be related: https://github.com/opnsense/core/commit/4ba0fa679
Try to use Interfaces: Settings: VLAN Hardware Filtering: "Leave default".
Doesn't help :(
Ok, bummer. It was a long shot :(
Tried today with an M5300-52G Netgear Switch (11.0.0.31, B1.0.0.5). Same result: no link.
Was this issue resolved or somewhat alleviated in 18.7.1? I ask because I could not find anything about it in the release notes.
No change in 18.7.1. kernel.old is phased out (now is the 18.7 kernel because there is a new 18.7.1 kernel), but the manual revert to 18.1.11 should still work minus the set verification (i):
# opnsense-update -ikr 18.1.11 -n "18.1\/dummy"
After upgrading to opnsense 18.7 the ix NIC (attached with a DAC to a switch) reports "media: no carrier". Setting the media to fixed 10Gbase-Twinax doesn't help...
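To compare link state before and after such a change, the status lines can be pulled out of ifconfig output. This is a minimal sketch: `link_status` is a hypothetical helper name, and the `ifconfig ix0 media ...` calls in the comments are only the presumed command-line equivalent of fixing the media type, since the report does not say how it was set.

```shell
# link_status: read ifconfig output on stdin and print "<iface>: <status>"
# for every interface that reports a status line.
# Hypothetical helper; it relies on the "name: flags=" and "status: ..."
# layout shown in the ifconfig output quoted in this thread.
link_status() {
  awk '
    /^[A-Za-z0-9_]+: flags=/ { iface = $1; sub(/:$/, "", iface) }
    $1 == "status:"          { sub(/^[ \t]*status:[ \t]*/, ""); print iface ": " $0 }
  '
}

# Usage on the firewall:
#   ifconfig | link_status
#   ifconfig ix0 media 10Gbase-Twinax   # presumed way to force the media type
#   ifconfig ix0 media autoselect       # back to the default
```

On the affected box this would print "ix0: no carrier" for the output quoted earlier in the thread.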