Closed JeremyAshton-shopify closed 8 years ago
What did you do to produce this? Please provide your configuration/steps you followed
$ cat /etc/pf_ring/pf_ring.conf
enable_tx_capture=1
$ cat /etc/pf_ring/hugepages.conf
node=0 hugepagenumber=32768
$ cat /etc/pf_ring/zc/i40e/i40e.conf
RSS=1,1
description "zbalance_ipc packet capture fanout - <isp4>"
start on started pf_ring
stop on stopping pf_ring
respawn
respawn limit 1 10 # At most once per 10 seconds
pre-start script
/sbin/ethtool -A <isp4> autoneg off rx off tx off &> /dev/null || true
/sbin/ethtool -G <isp4> rx 4096 tx 4096 &> /dev/null || true
/sbin/ethtool -K <isp4> tx off rx off sg off tso off gso off gro off rxvlan off txvlan off ntuple off rxhash off &> /dev/null || true
/sbin/ethtool -L <isp4> combined 1 &> /dev/null || true
/sbin/ip link set <isp4> up arp off multicast off mtu 9000 promisc on &> /dev/null || true
end script
script
/usr/bin/chpst -u root:root /usr/local/bin/zbalance_ipc -i zc:<isp4> -c 40 -n 2 -m 2 -g 0 -p
end script
Now it is generating the following:
Apr 14 19:07:31 <redacted> kernel: [80301.916281] [PF_RING-ZC] i40e_down: detach <isp4>
Apr 14 19:07:31 <redacted> kernel: [80301.916285] [PF_RING-ZC] i40e_down: detaching <isp4> while in use
Apr 14 19:07:31 <redacted> kernel: [80301.918585] i40e 0000:02:00.1 <isp4>: adding 0c:c4:7a:bb:f7:65 vid=0
Apr 14 19:07:31 <redacted> kernel: [80301.920341] [PF_RING-ZC] i40e_up_complete: attach <isp4> [pf start=3279945728 len=8388608][cache_line_size=64][MSIX enabled]
Apr 14 19:07:32 <redacted> kernel: [80303.533973] NMI watchdog: BUG: soft lockup - CPU#1 stuck for 22s! [migration/1:15]
Apr 14 19:07:32 <redacted> kernel: [80303.542315] Modules linked in: 8021q garp mrp stp llc ipmi_devintf i40e_zc(OE) configfs pf_ring(OE) bonding ipmi_ssif intel_rapl iosf_mbi x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm crct10dif_pclmul crc32_pclmul aesni_intel aes_x86_64 lrw gf128mul glue_helper ablk_helper cryptd ast ttm drm_kms_helper sb_edac joydev edac_core drm i2c_algo_bit syscopyarea sysfillrect sysimgblt lpc_ich mei_me ioatdma shpchp mei wmi ipmi_si 8250_fintek ipmi_msghandler acpi_power_meter mac_hid lp parport xfs libcrc32c ses enclosure raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor ixgbe hid_generic dca vxlan raid6_pq ip6_udp_tunnel udp_tunnel usbhid raid1 ahci ptp hid libahci raid0 megaraid_sas mdio pps_core multipath linear [last unloaded: i40e]
Apr 14 19:07:32 <redacted> kernel: [80303.542361] CPU: 1 PID: 15 Comm: migration/1 Tainted: G OEL 3.19.0-58-generic #64~14.04.1-Ubuntu
Apr 14 19:07:32 <redacted> kernel: [80303.542362] Hardware name: Supermicro SSG-6048R-E1CR36H/X10DRH-iT, BIOS 1.1 05/15/2015
Apr 14 19:07:32 <redacted> kernel: [80303.542364] task: ffff883f665089d0 ti: ffff883f66514000 task.ti: ffff883f66514000
Apr 14 19:07:32 <redacted> kernel: [80303.542365] RIP: 0010:[<ffffffff8110edcc>] [<ffffffff8110edcc>] multi_cpu_stop+0x5c/0xf0
Apr 14 19:07:32 <redacted> kernel: [80303.542369] RSP: 0000:ffff883f66517d68 EFLAGS: 00000246
Apr 14 19:07:32 <redacted> kernel: [80303.542370] RAX: ffffffff81815720 RBX: ffff882f4aa044b0 RCX: 000000000000250a
Apr 14 19:07:32 <redacted> kernel: [80303.542371] RDX: 0000000000000001 RSI: 0000000000000000 RDI: ffff887f64957ab0
Apr 14 19:07:32 <redacted> kernel: [80303.542371] RBP: ffff883f66517d88 R08: ffff883f66514000 R09: 0000000000000001
Apr 14 19:07:32 <redacted> kernel: [80303.542372] R10: 00000000000027f8 R11: 0000000000000002 R12: ffffffff8109f390
Apr 14 19:07:32 <redacted> kernel: [80303.542373] R13: ffff883f66517ce8 R14: ffff883f7f913e80 R15: ffff883f7f913e80
Apr 14 19:07:32 <redacted> kernel: [80303.542374] FS: 0000000000000000(0000) GS:ffff883f7f880000(0000) knlGS:0000000000000000
Apr 14 19:07:32 <redacted> kernel: [80303.542375] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Apr 14 19:07:32 <redacted> kernel: [80303.542376] CR2: 00007f44b0a03f77 CR3: 0000000001c16000 CR4: 00000000001407e0
Apr 14 19:07:32 <redacted> kernel: [80303.542377] Stack:
Apr 14 19:07:32 <redacted> kernel: [80303.542378] ffff887f64957ad8 ffff883f7f88e9e0 ffff883f7f88e9e8 ffff887f64957ab0
Apr 14 19:07:32 <redacted> kernel: [80303.542380] ffff883f66517e58 ffffffff8110eff2 ffff883f66508fa8 ffffffff8181bac0
Apr 14 19:07:32 <redacted> kernel: [80303.542381] ffff883f7f893e80 ffff883f7f893e80 0000000000000000 ffff883f636dbb80
Apr 14 19:07:32 <redacted> kernel: [80303.542383] Call Trace:
Apr 14 19:07:32 <redacted> kernel: [80303.542386] [<ffffffff8110eff2>] cpu_stopper_thread+0x82/0x170
Apr 14 19:07:32 <redacted> kernel: [80303.542390] [<ffffffff817b9e4b>] ? __schedule+0x37b/0x850
Apr 14 19:07:32 <redacted> kernel: [80303.542394] [<ffffffff81098c13>] smpboot_thread_fn+0x143/0x1b0
Apr 14 19:07:32 <redacted> kernel: [80303.542396] [<ffffffff81098ad0>] ? SyS_setgroups+0x180/0x180
Apr 14 19:07:32 <redacted> kernel: [80303.542399] [<ffffffff81094db2>] kthread+0xd2/0xf0
Apr 14 19:07:32 <redacted> kernel: [80303.542402] [<ffffffff81094ce0>] ? kthread_create_on_node+0x1c0/0x1c0
Apr 14 19:07:32 <redacted> kernel: [80303.542405] [<ffffffff817be6d8>] ret_from_fork+0x58/0x90
Apr 14 19:07:32 <redacted> kernel: [80303.542407] [<ffffffff81094ce0>] ? kthread_create_on_node+0x1c0/0x1c0
Apr 14 19:07:32 <redacted> kernel: [80303.542408] Code: 89 db 48 0f a3 18 19 db 85 db 41 0f 95 c5 31 f6 31 d2 eb 1c 0f 1f 00 83 fb 03 75 05 45 84 ed 75 66 f0 41 ff 4c 24 24 74 26 89 da <83> fa 04 74 3d f3 90 41 8b 5c 24 20 39 d3 74 f0 83 fb 02 75 d7
Not sure what is causing last trace, however I made some changes to the pf_ring kernel module to avoid playing with promisc when already configured via ip/ifconfig, to make sure this is not caused by the card reset when setting promisc on this card model. Please update (reloading pf_ring.ko), keep setting the promisc in your pre-start script, and let us know. Thank you.
This does not appear to have any effect.
Apr 18 19:22:19 <redacted> kernel: [ 597.821564] [PF_RING-ZC] i40e_up_complete: attach <isp4> [pf start=3279945728 len=8388608][cache_line_size=64][MSIX enabled]
Apr 18 19:22:30 <redacted> kernel: [ 608.897774] i40e 0000:04:00.1: ARQ: Update LLDP MIB event received
Apr 18 19:22:30 <redacted> kernel: [ 608.897780] i40e 0000:04:00.1: LLDP event mib bridge type 0x0
Apr 18 19:22:30 <redacted> kernel: [ 608.897782] i40e 0000:04:00.1: LLDP event mib type remote
Apr 18 19:22:30 <redacted> kernel: [ 608.941081] i40e 0000:04:00.1: ARQ: Update LLDP MIB event received
Apr 18 19:22:30 <redacted> kernel: [ 608.941085] i40e 0000:04:00.1: LLDP event mib bridge type 0x0
Apr 18 19:22:30 <redacted> kernel: [ 608.941086] i40e 0000:04:00.1: LLDP event mib type local
Apr 18 19:22:30 <redacted> kernel: [ 608.946128] i40e 0000:04:00.1: ETS UP2TC changed.
Apr 18 19:22:30 <redacted> kernel: [ 608.946132] i40e 0000:04:00.1: ETS TC BW Table changed.
Apr 18 19:22:30 <redacted> kernel: [ 608.946136] i40e 0000:04:00.1: ETS TSA Table changed.
Apr 18 19:22:30 <redacted> kernel: [ 608.946137] i40e 0000:04:00.1: PFC config change detected.
Apr 18 19:22:30 <redacted> kernel: [ 608.946139] i40e 0000:04:00.1: dcb need_reconfig=1
Apr 18 19:22:30 <redacted> kernel: [ 608.948503] [PF_RING-ZC] i40e_down: detach <isp4>
Apr 18 19:22:30 <redacted> kernel: [ 608.948504] [PF_RING-ZC] i40e_down: detaching <isp4> while in use
Apr 18 19:22:30 <redacted> kernel: [ 608.950680] i40e 0000:04:00.1: VEB bw config failed, err I40E_ERR_ADMIN_QUEUE_ERROR aq_err I40E_AQ_RC_EACCES
Apr 18 19:22:30 <redacted> kernel: [ 608.950682] i40e 0000:04:00.1: Failed configuring TC for VEB seid=290
Apr 18 19:22:30 <redacted> kernel: [ 608.951749] i40e 0000:04:00.1: AQ command Config VSI BW allocation per TC failed = 14
Apr 18 19:22:30 <redacted> kernel: [ 608.951752] i40e 0000:04:00.1: Failed configuring TC map 255 for VSI 526
Apr 18 19:22:30 <redacted> kernel: [ 608.951754] i40e 0000:04:00.1: Failed configuring TC for VSI seid=526
Apr 18 19:22:30 <redacted> kernel: [ 608.954025] i40e 0000:04:00.1: VSI seid 0 Tx ring 1 disable timeout
Apr 18 19:22:30 <redacted> kernel: [ 608.954029] i40e 0000:04:00.1: ARQ: Update LLDP MIB event received
Apr 18 19:22:30 <redacted> kernel: [ 608.954094] i40e 0000:04:00.1: LLDP event mib bridge type 0x4
Apr 18 19:22:30 <redacted> kernel: [ 608.954098] i40e 0000:04:00.1: ARQ: Update LLDP MIB event received
Apr 18 19:22:30 <redacted> kernel: [ 608.954100] i40e 0000:04:00.1: LLDP event mib bridge type 0x0
Apr 18 19:22:30 <redacted> kernel: [ 608.954101] i40e 0000:04:00.1: LLDP event mib type local
Apr 18 19:22:30 <redacted> kernel: [ 608.960153] i40e 0000:04:00.1: No change detected in DCBX configuration.
Apr 18 19:22:31 <redacted> kernel: [ 609.641835] i40e 0000:04:00.1: PFR requested
Apr 18 19:22:31 <redacted> kernel: [ 609.641840] i40e 0000:04:00.1: Tearing down internal switch for reset
Apr 18 19:22:31 <redacted> kernel: [ 609.651602] i40e 0000:04:00.1: Rebuilding internal switch
Apr 18 19:22:31 <redacted> kernel: [ 609.684028] i40e 0000:04:00.1: DCBX offload is supported for this PF.
Apr 18 19:22:31 <redacted> kernel: [ 609.684031] i40e 0000:04:00.1: FCoE capability is disabled
Apr 18 19:22:31 <redacted> kernel: [ 609.802483] i40e 0000:04:00.1: attempting to rebuild PF VSI
Apr 18 19:22:31 <redacted> kernel: [ 609.806590] i40e 0000:04:00.1: AQ command Config VSI BW allocation per TC failed = 14
Apr 18 19:22:31 <redacted> kernel: [ 609.806592] i40e 0000:04:00.1: Failed configuring TC map 255 for VSI 526
Apr 18 19:22:31 <redacted> kernel: [ 609.806596] i40e 0000:04:00.1: failed to configure TCs for main VSI tc_map 0x000000ff, err I40E_ERR_INVALID_QP_ID aq_err I40E_AQ_RC_EINVAL
Questions:
With the stock intel drivers I seem to keep receiving this message:
Apr 19 16:53:47 <redacted> kernel: [ 759.321988] i40e 0000:02:00.1: veb bw config failed, aq_err=14
Apr 19 16:53:47 <redacted> kernel: [ 759.321995] i40e 0000:02:00.1: Failed configuring TC for VEB seid=289
Apr 19 16:53:47 <redacted> kernel: [ 759.324087] i40e 0000:02:00.1: AQ command Config VSI BW allocation per TC failed = 14
Apr 19 16:53:47 <redacted> kernel: [ 759.324093] i40e 0000:02:00.1: Failed configuring TC map 255 for VSI 526
Apr 19 16:53:47 <redacted> kernel: [ 759.324098] i40e 0000:02:00.1: Failed configuring TC for VSI seid=526
Apr 19 16:53:47 <redacted> kernel: [ 759.326154] i40e 0000:02:00.1: AQ command Config VSI BW allocation per TC failed = 14
Apr 19 16:53:47 <redacted> kernel: [ 759.326159] i40e 0000:02:00.1: Failed configuring TC map 1 for VSI 0
Apr 19 16:53:47 <redacted> kernel: [ 759.326163] i40e 0000:02:00.1: Failed configuring TC for VSI seid=0
Apr 19 16:53:47 <redacted> kernel: [ 759.328397] i40e 0000:02:00.1: i40e_vsi_wait_txq_disabled: VSI seid 0 Tx ring 64 disable timeout
I think it is time for me to abandon the i40e based chipset. There seems to be too much broken upstream.
still looks like DCB/LLDP is enabled to me.
It is disabled on the switch. And, lldpd has been removed from the node.
Once every hour I see a LLDP broadcast but have no idea where it is coming from. The message originates from the node in question... but no idea which process.
It turns out there is an embedded lldp daemon (agent) inside the NIC firmware. It apparently can be disabled via echo "lldp stop" > /sys/kernel/debug/i40e/<bus_address>/command
.
Please note we upgraded, not sure it works better for you, worth a try.
Closing this issue as the attach/detach issue should be solved.
Now that we have implemented a work-around for https://github.com/ntop/PF_RING/issues/81 I am investigating the following errors:
dkms:
I wonder if someone might offer insight as to next steps.