Multithread netmap with i40e driver is unstable when application is uncontrollably closed

jmtilli commented 6 years ago

I have written a netmap application that interfaces to an i40e card using multiple queues. Packets are sent from multiple threads. The main thread calls pthread_join() for all of the other threads.

When I increase the thread count to a large value (such as 10), and run the application, and then close the application using Ctrl-C, the entire machine very often crashes. Here's one example:

Dec 18 17:45:22 node-kudu1 kernel: [ 3527.298390] BUG: unable to handle kernel NULL pointer dereference at 0000000000000100
Dec 18 17:45:22 node-kudu1 kernel: [ 3527.298432] IP: i40e_vsi_configure+0x62a/0x7b0 [i40e]
Dec 18 17:45:22 node-kudu1 kernel: [ 3527.298448] PGD 0 
Dec 18 17:45:22 node-kudu1 kernel: [ 3527.298449] 
Dec 18 17:45:22 node-kudu1 kernel: [ 3527.298464] Oops: 0000 [#1] SMP
Dec 18 17:45:22 node-kudu1 kernel: [ 3527.298477] Modules linked in: i40e(OE) netmap(OE) openvswitch nf_conntrack_ipv6 nf_nat_ipv6 nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_defrag_ipv6 nf_nat nf_conntrack binfmt_misc intel_rapl sb_edac edac_core x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm ipmi_ssif dcdbas irqbypass mei_me joydev input_leds intel_cstate intel_rapl_perf shpchp mei ioatdma mac_hid acpi_power_meter dca ipmi_si ipmi_devintf ipmi_msghandler lpc_ich acpi_pad ib_iser rdma_cm iw_cm ib_cm ib_core configfs iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi autofs4 btrfs raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq raid1 raid0 multipath linear mgag200 i2c_algo_bit crct10dif_pclmul crc32_pclmul ttm ghash_clmulni_intel pcbc hid_generic bnx2x drm_kms_helper aesni_intel
Dec 18 17:45:22 node-kudu1 kernel: [ 3527.298700]  syscopyarea sysfillrect sysimgblt aes_x86_64 fb_sys_fops crypto_simd usbhid glue_helper ahci ptp hid megaraid_sas mdio drm pps_core libahci cryptd libcrc32c wmi [last unloaded: i40e]
Dec 18 17:45:22 node-kudu1 kernel: [ 3527.298758] CPU: 4 PID: 18744 Comm: netmapsend Tainted: G           OE   4.11.0-14-generic #20~16.04.1-Ubuntu
Dec 18 17:45:22 node-kudu1 kernel: [ 3527.298787] Hardware name: Dell Inc. PowerEdge R720/0X3D66, BIOS 2.5.4 01/22/2016
Dec 18 17:45:22 node-kudu1 kernel: [ 3527.298809] task: ffff90f519eb4380 task.stack: ffffa5584c780000
Dec 18 17:45:22 node-kudu1 kernel: [ 3527.298831] RIP: 0010:i40e_vsi_configure+0x62a/0x7b0 [i40e]
Dec 18 17:45:22 node-kudu1 kernel: [ 3527.298849] RSP: 0018:ffffa5584c783ac8 EFLAGS: 00010286
Dec 18 17:45:22 node-kudu1 kernel: [ 3527.298866] RAX: 0000000000000100 RBX: ffff90fcc37abb80 RCX: 0000000000000000
Dec 18 17:45:22 node-kudu1 kernel: [ 3527.298887] RDX: 0000000000000000 RSI: 00000000000001ff RDI: ffff90fd14d1a000
Dec 18 17:45:22 node-kudu1 kernel: [ 3527.298908] RBP: ffffa5584c783b40 R08: ffffa5584f2f1000 R09: 0000000000000015
Dec 18 17:45:22 node-kudu1 kernel: [ 3527.298929] R10: 0000000000000001 R11: 0000000000000019 R12: ffff90fd159a1900
Dec 18 17:45:22 node-kudu1 kernel: [ 3527.298949] R13: ffff90f514274800 R14: ffff90fd1aca9000 R15: ffff90fd1aca9000
Dec 18 17:45:22 node-kudu1 kernel: [ 3527.298971] FS:  00007fb72ca16700(0000) GS:ffff90f51f880000(0000) knlGS:0000000000000000
Dec 18 17:45:22 node-kudu1 kernel: [ 3527.298994] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Dec 18 17:45:22 node-kudu1 kernel: [ 3527.299011] CR2: 0000000000000100 CR3: 00000009d4e09000 CR4: 00000000001406e0
Dec 18 17:45:22 node-kudu1 kernel: [ 3527.299032] Call Trace:
Dec 18 17:45:22 node-kudu1 kernel: [ 3527.299048]  i40e_up+0x12/0x30 [i40e]
Dec 18 17:45:22 node-kudu1 kernel: [ 3527.299065]  i40e_netmap_reg+0x72/0x180 [i40e]
Dec 18 17:45:22 node-kudu1 kernel: [ 3527.299085]  netmap_hw_reg+0x34/0x90 [netmap]
Dec 18 17:45:22 node-kudu1 kernel: [ 3527.299103]  netmap_do_unregif+0x90/0x240 [netmap]
Dec 18 17:45:22 node-kudu1 kernel: [ 3527.299121]  netmap_priv_delete+0x37/0x80 [netmap]
Dec 18 17:45:22 node-kudu1 kernel: [ 3527.299138]  netmap_dtor+0x21/0x30 [netmap]
Dec 18 17:45:22 node-kudu1 kernel: [ 3527.299154]  linux_netmap_release+0x1a/0x30 [netmap]
Dec 18 17:45:22 node-kudu1 kernel: [ 3527.299174]  __fput+0xe7/0x220
Dec 18 17:45:22 node-kudu1 kernel: [ 3527.299186]  ____fput+0xe/0x10
Dec 18 17:45:22 node-kudu1 kernel: [ 3527.299200]  task_work_run+0x7e/0xa0
Dec 18 17:45:22 node-kudu1 kernel: [ 3527.299214]  do_exit+0x2e2/0xb60
Dec 18 17:45:22 node-kudu1 kernel: [ 3527.299228]  ? __unqueue_futex+0x2f/0x60
Dec 18 17:45:22 node-kudu1 kernel: [ 3527.299242]  do_group_exit+0x43/0xb0
Dec 18 17:45:22 node-kudu1 kernel: [ 3527.299256]  get_signal+0x293/0x620
Dec 18 17:45:22 node-kudu1 kernel: [ 3527.299270]  do_signal+0x37/0x750
Dec 18 17:45:22 node-kudu1 kernel: [ 3527.299283]  ? do_futex+0x210/0x530
Dec 18 17:45:22 node-kudu1 kernel: [ 3527.299957]  exit_to_usermode_loop+0x76/0xb0
Dec 18 17:45:22 node-kudu1 kernel: [ 3527.300622]  syscall_return_slowpath+0x59/0x60
Dec 18 17:45:22 node-kudu1 kernel: [ 3527.301291]  entry_SYSCALL_64_fastpath+0xab/0xad
Dec 18 17:45:22 node-kudu1 kernel: [ 3527.301984] RIP: 0033:0x7fb72c5ed98d
Dec 18 17:45:22 node-kudu1 kernel: [ 3527.302651] RSP: 002b:00007ffd548cd4a0 EFLAGS: 00000246 ORIG_RAX: 00000000000000ca
Dec 18 17:45:22 node-kudu1 kernel: [ 3527.303301] RAX: fffffffffffffe00 RBX: 00007fb65fad0700 RCX: 00007fb72c5ed98d
Dec 18 17:45:22 node-kudu1 kernel: [ 3527.303954] RDX: 000000000000493a RSI: 0000000000000000 RDI: 00007fb65fad09d0
Dec 18 17:45:22 node-kudu1 kernel: [ 3527.304616] RBP: 0000000000000000 R08: 00000000000000ca R09: 00007fb65fad09d0
Dec 18 17:45:22 node-kudu1 kernel: [ 3527.305277] R10: 0000000000000000 R11: 0000000000000246 R12: 00007ffd548cd4b0
Dec 18 17:45:22 node-kudu1 kernel: [ 3527.305952] R13: 0000000000000000 R14: 00007fb72ca16700 R15: 00007fb65fad0d28
Dec 18 17:45:22 node-kudu1 kernel: [ 3527.306592] Code: 03 53 2c 78 38 41 89 d0 41 29 f8 39 d7 41 0f 4e d0 48 63 d2 48 63 f9 4d 8b 85 18 01 00 00 48 c1 e2 04 48 c1 e7 04 49 03 7c 24 08 <8b> 14 10 41 3b 95 20 01 00 00 73 ad 49 8b 14 d0 eb aa 01 fa eb 
Dec 18 17:45:22 node-kudu1 kernel: [ 3527.307862] RIP: i40e_vsi_configure+0x62a/0x7b0 [i40e] RSP: ffffa5584c783ac8
Dec 18 17:45:22 node-kudu1 kernel: [ 3527.308469] CR2: 0000000000000100
Dec 18 17:45:22 node-kudu1 kernel: [ 3527.309057] ---[ end trace 9a66ddaf2940e6c7 ]---
Dec 18 17:45:22 node-kudu1 kernel: [ 3527.311787] Fixing recursive fault but reboot is needed!

As a workaround of this, I have added mechanism to the program to close controllably so that SIGINT signal is handled and the threads sending packets then close, and afterwards the main thread calls nm_close() for all descriptors and then closes itself. This controlled close mechanism seems to solve the issues but is extra work to implement.

The issues seem to happen both by using netmap directly and also by using netmap via OpenDataPlane. I have observed the issue on a variety of different packet processing applications, using threaded architectures in a very different way. I believe the fact that packets are actively sent makes the crash more likely, so if you interrupt an idle netmap application that is just a filter for packets, the likelihood of crash may be lower.

VALE doesn't seem to suffer from the problems but the i40e driver does.

The more threads there are, the larger is the likelihood of the machine crashing.

The problem seems to happen both with an old version of netmap (3f4d599106cdbe684f49a3da7df04739670736e8 from June 2016, compiled with Linux 3.16 source code on Debian 8 Jessie) and with a new version of netmap (acd7ac033c4e2eb7c66f556ebe1800d7046e3020 from December 2017, compiled with i40e 2.3.6 downloaded automatically). So this isn't just a recent problem, it has been there for a very long amount of time.

vmaffione commented 6 years ago

We could at least figure out which pointer is NULL, adding some D() statements on unchecked pointers. I think it must be something in these functions https://github.com/luigirizzo/netmap/blob/master/LINUX/i40e_netmap_linux.h#L125-L195

jmtilli commented 6 years ago

Hi, just wondering... There's this code:

static int
i40e_netmap_reg(struct netmap_adapter *na, int onoff)
{
        struct ifnet *ifp = na->ifp;
        struct i40e_netdev_priv *np = netdev_priv(ifp);
        struct i40e_vsi  *vsi = np->vsi;
        struct i40e_pf   *pf = (struct i40e_pf *)vsi->back;
        bool was_running;

        while (test_and_set_bit(__I40E_CONFIG_BUSY, NM_I40E_STATE(pf)))
                        usleep_range(1000, 2000);

Should the vsi pointer be assigned only after the usleep has been completed?

I'm suspect vsi being NULL is the cause of this issue although I haven't verified the cause with debug prints. I'll have to contact the system administrator to provide me remote reboot access to the server having the i40e card as it's currently crashed again!

vmaffione commented 6 years ago

The NULL pointer is probably in this line https://github.com/luigirizzo/netmap/blob/master/LINUX/i40e_netmap_linux.h#L185 since ring->desc is probably NULL.

vmaffione commented 6 years ago

vsi can't be assigned after the test and set loop, because pf is needed to run the loop (and pf is obtained from vsi.

jmtilli commented 6 years ago

Did you notice the NULL pointer dereference happened in the function i40e_vsi_configure? This function is not part of the netmap modifications, it's part of the regular driver. So somehow netmap is feeding the regular driver a NULL pointer. But yes, on closer inspection you're right: vsi can't be NULL.

Anyway, since you know more about this than I do, could you propose some patch to the i40e driver and/or the netmap modifications to it? I have access to the i40e machine, and I can try to reproduce the issue with the added debug prints.

vmaffione commented 6 years ago

I've seen the stack trace, that's why I'm saying the bug is probably in https://github.com/luigirizzo/netmap/blob/master/LINUX/i40e_netmap_linux.h#L125-L195 i40e_vsi_configure calls those functions, and so it is actually patched by netmap. Could you check ring->desc as suggested above?

jmtilli commented 6 years ago

Hi, I checked now na for being NULL, ring->desc for being NULL and rx for being NULL. Something is NULL in the inlined functions called from i40e_vsi_configure, but it was none of these. Here's the diff:

diff --git a/LINUX/i40e_netmap_linux.h b/LINUX/i40e_netmap_linux.h
index b65ee4c..a5a43aa 100644
--- a/LINUX/i40e_netmap_linux.h
+++ b/LINUX/i40e_netmap_linux.h
@@ -170,6 +170,9 @@ i40e_netmap_configure_rx_ring(struct i40e_ring *ring)
        }

        na = NA(ring->netdev);
+       if (na == NULL) {
+               D("na is NULL");
+       }
        ring_nr = ring->queue_index;
        kring = &na->rx_rings[ring_nr];

@@ -182,7 +185,13 @@ i40e_netmap_configure_rx_ring(struct i40e_ring *ring)
        for (i = 0; i < na->num_rx_desc; i++) {
                int si = netmap_idx_n2k(kring, i);
                uint64_t paddr;
+               if (ring->desc == NULL) {
+                       D("ring->desc is NULL");
+               }
                union i40e_rx_desc *rx = I40E_RX_DESC(ring, i);
+               if (rx == NULL) {
+                       D("rx is NULL");
+               }
                PNMB(na, slot + si, &paddr);

                rx->read.pkt_addr = htole64(paddr);

Any other ideas?

jmtilli commented 6 years ago

I investigated the user space part of the problem. I open 6 NIC rings for downlink (dl) and 6 VALE port rings for uplink (ul). Here's the output from my user space application:

closing ul 0
closing dl 0
closed dl 0
closing ul 1
closing dl 1

Note how there's no "closed dl 1". ("closed ul 0" and "closed ul 1" are omitted because it would be just before the "closing dl 0" or "closing dl 1" line).

So, i40e seems to hang when the second nm_close call is called for the 6 NIC rings. The first nm_close does not break anything yet.

jmtilli commented 6 years ago

Also, my modifications isolate the problem to i40e_netmap_configure_rx_ring. I added lots of debug print lines to this function, and it seems not to be inlined anymore. Here's the backtrace:

[  262.181151] BUG: unable to handle kernel NULL pointer dereference at 0000000000000100
[  262.181443] IP: i40e_netmap_configure_rx_ring+0x247/0x350 [i40e]
[  262.181728] PGD 100bde7067 
[  262.181729] PUD 1017593067 
[  262.182010] PMD 0 

[  262.182850] Oops: 0000 [#1] SMP
[  262.183135] Modules linked in: i40e(OE) netmap(OE) openvswitch nf_conntrack_ipv6 nf_nat_ipv6 nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_defrag_ipv6 nf_nat nf_conntrack binfmt_misc ipmi_ssif intel_rapl sb_edac edac_core x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm dcdbas irqbypass intel_cstate input_leds joydev intel_rapl_perf mei_me mei ioatdma shpchp lpc_ich dca mac_hid ipmi_si ipmi_devintf ipmi_msghandler acpi_pad acpi_power_meter ib_iser rdma_cm iw_cm ib_cm ib_core configfs iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi autofs4 btrfs raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq raid1 raid0 multipath linear mgag200 i2c_algo_bit ttm crct10dif_pclmul crc32_pclmul drm_kms_helper ghash_clmulni_intel pcbc syscopyarea hid_generic sysfillrect
[  262.185932]  bnx2x aesni_intel sysimgblt fb_sys_fops usbhid aes_x86_64 crypto_simd ahci glue_helper ptp drm pps_core cryptd hid libahci megaraid_sas mdio libcrc32c wmi [last unloaded: i40e]
[  262.186905] CPU: 16 PID: 6137 Comm: netmapproxy1 Tainted: G           OE   4.11.0-14-generic #20~16.04.1-Ubuntu
[  262.187423] Hardware name: Dell Inc. PowerEdge R720/0X3D66, BIOS 2.5.4 01/22/2016
[  262.187953] task: ffff9fedcc858000 task.stack: ffffb46448acc000
[  262.188531] RIP: 0010:i40e_netmap_configure_rx_ring+0x247/0x350 [i40e]
[  262.189076] RSP: 0018:ffffb46448acfc90 EFLAGS: 00010246
[  262.189623] RAX: 0000000000000000 RBX: ffff9ff5d8e33300 RCX: 0000000000000100
[  262.190185] RDX: ffffb46449d32000 RSI: 0000000000000096 RDI: ffff9feddfa0dc20
[  262.190757] RBP: ffffb46448acfce8 R08: 0000000000000000 R09: 0000000000000827
[  262.191330] R10: ffff9ff5d4ff2000 R11: 0000000000000001 R12: 0000000000000000
[  262.191903] R13: ffff9ff5ceac7000 R14: ffff9ff5d444ac40 R15: 0000000000000000
[  262.192539] FS:  00007f4ddc3d2700(0000) GS:ffff9feddfa00000(0000) knlGS:0000000000000000
[  262.193132] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  262.193726] CR2: 0000000000000100 CR3: 000000100e740000 CR4: 00000000001406e0
[  262.194330] Call Trace:
[  262.194941]  i40e_vsi_configure+0x43e/0x6b0 [i40e]
[  262.195563]  i40e_up+0x12/0x30 [i40e]
[  262.196195]  i40e_netmap_reg+0x72/0x180 [i40e]
[  262.196839]  netmap_hw_reg+0x34/0x90 [netmap]
[  262.197465]  netmap_do_unregif+0x90/0x240 [netmap]
[  262.198096]  netmap_priv_delete+0x37/0x80 [netmap]
[  262.198725]  netmap_dtor+0x21/0x30 [netmap]
[  262.199353]  linux_netmap_release+0x1a/0x30 [netmap]
[  262.199991]  __fput+0xe7/0x220
[  262.200687]  ____fput+0xe/0x10
[  262.201323]  task_work_run+0x7e/0xa0
[  262.201960]  exit_to_usermode_loop+0xaa/0xb0
[  262.202601]  syscall_return_slowpath+0x59/0x60
[  262.203250]  entry_SYSCALL_64_fastpath+0xab/0xad
[  262.203896] RIP: 0033:0x7f4ddbfb257d
[  262.204576] RSP: 002b:00007ffc030e7ed0 EFLAGS: 00000293 ORIG_RAX: 0000000000000003
[  262.205279] RAX: 0000000000000000 RBX: 0000000001de9970 RCX: 00007f4ddbfb257d
[  262.205955] RDX: 00007f4ddbf9e780 RSI: 0000000014721000 RDI: 0000000000000005
[  262.206625] RBP: 00007ffc030e8190 R08: 0000000000000000 R09: 000000000000000d
[  262.207285] R10: 0000000000000000 R11: 0000000000000293 R12: 0000000000000001
[  262.207939] R13: 0000000000000000 R14: 0000000000000006 R15: 000560db85e3ad75
[  262.208656] Code: a4 7d c0 99 f7 f9 b9 ca 00 00 00 89 d6 8b 55 c8 e8 46 2c 21 fd 4c 8b 55 a8 48 8b 4d b8 49 63 c4 49 8b 95 18 01 00 00 48 c1 e0 04 <8b> 04 01 41 3b 85 20 01 00 00 0f 82 e7 00 00 00 48 8b 02 49 89 
[  262.210063] RIP: i40e_netmap_configure_rx_ring+0x247/0x350 [i40e] RSP: ffffb46448acfc90
[  262.210776] CR2: 0000000000000100
[  262.211479] ---[ end trace 98a79ee17c5ec45a ]---

I also added some debug prints to the function, and the last debug print is this: [ 262.180880] 859.878290 [ 192] i40e_netmap_configure_rx_ring 4 configure_rx_ring "5 configure_rx_ring" is not debug-printed at all even though it's there. This isolates where the problem is. It is in this for loop:

        for (i = 0; i < na->num_rx_desc; i++) {
                int si = netmap_idx_n2k(kring, i);
                uint64_t paddr;
                if (ring->desc == NULL) {
                        D("ring->desc is NULL");
                }
                union i40e_rx_desc *rx = I40E_RX_DESC(ring, i);
                if (rx == NULL) {
                        D("rx is NULL");
                }
                PNMB(na, slot + si, &paddr);

                rx->read.pkt_addr = htole64(paddr);
                rx->read.hdr_addr = 0;
        }

...but it's not rx or ring->desc. What could it be then?

jmtilli commented 6 years ago

More investigation with debug prints. Here's the latest print log:

[  203.210561] 475.582302 [ 198] i40e_netmap_configure_rx_ring ring->desc is ffff8a2748352000
[  203.211090] 475.582831 [ 203] i40e_netmap_configure_rx_ring rx is ffff8a2748352000
[  203.211619] 475.583360 [ 207] i40e_netmap_configure_rx_ring na->na_lut.lut is ffff993689141000
[  203.212151] 475.583892 [ 211] i40e_netmap_configure_rx_ring na->na_lut.plut is ffff993689bf1000
[  203.212699] 475.584440 [ 215] i40e_netmap_configure_rx_ring calling PNMB
[  203.213237] BUG: unable to handle kernel NULL pointer dereference at 0000000000000100
[  203.213800] IP: i40e_netmap_configure_rx_ring+0x4b6/0x520 [i40e]

So none of the critical pointers seems to be NULL, but something crashes inside the PNMB call because it did not print "called PNMB" which I added after the PNMB call. It again crashed after the second downlink descriptor was closed.

jmtilli commented 6 years ago

The "slot" pointer seems to be the culprit. It's not NULL but close enough to a NULL to cause a NULL pointer dereference:

[  181.883742] 138.346855 [ 190] i40e_netmap_configure_rx_ring slot is 0000000000000100

markwebster commented 6 years ago

Same as #408

giuseppelettieri commented 6 years ago

may you please try the following patch?

diff --git a/sys/dev/netmap/netmap.c b/sys/dev/netmap/netmap.c
index 1681862..aa3d49c 100644
--- a/sys/dev/netmap/netmap.c
+++ b/sys/dev/netmap/netmap.c
@@ -3249,14 +3249,15 @@ netmap_reset(struct netmap_adapter *na, enum txrx tx, u_int n,

                kring = na->tx_rings + n;

-               if (kring->nr_mode == NKR_NETMAP_ON)
-                       return kring->ring->slot;
-
                if (kring->nr_pending_mode == NKR_NETMAP_OFF) {
                        kring->nr_mode = NKR_NETMAP_OFF;
                        return NULL;
                }

+               if (kring->nr_mode == NKR_NETMAP_ON)
+                       return kring->ring->slot;
+
+
                // XXX check whether we should use hwcur or rcur
                new_hwofs = kring->nr_hwcur - new_cur;
        } else {
@@ -3264,15 +3265,15 @@ netmap_reset(struct netmap_adapter *na, enum txrx tx, u_int n,
                        return NULL;
                kring = na->rx_rings + n;

-               if (kring->nr_mode == NKR_NETMAP_ON)
-                       return kring->ring->slot;
-
-
                if (kring->nr_pending_mode == NKR_NETMAP_OFF) {
                        kring->nr_mode = NKR_NETMAP_OFF;
                        return NULL;
                }

+               if (kring->nr_mode == NKR_NETMAP_ON)
+                       return kring->ring->slot;
+
+
                new_hwofs = kring->nr_hwtail - new_cur;
        }
        lim = kring->nkr_num_slots - 1;

jmtilli commented 6 years ago

Hi, I reverted commit 05ad1b244a7c327aa0ebf930cb715f2d61006f89, stopping the occurrence of the bug that reproduced every time I closed my application with ^C. I'll try your patch next.

giuseppelettieri commented 6 years ago

Yes, that commit is the source of the silly bug, but something else will break if you just remove it.

jmtilli commented 6 years ago

Yes, I tested now your patch. It works fine! I can now repeatedly start and stop my application. Thanks for the great and quick support provided!

giuseppelettieri commented 6 years ago

thanks to you for your patience.

@markwebster, may you check that the patch also works for you, and also hopefully does not introduce the crash you were seeing after the commit reversion?

markwebster commented 6 years ago

That's definitely fixed the crash! I'll run some more stress-tests and let you know if anything strange happens.

Thanks!

EDIT: After a few runs of pkt-gen, the kernel crashed hard. I'll paste the trace once the box has rebooted. EDIT2: system logs wre corrupted. I need a few minutes to reproduce the crash

jmtilli commented 6 years ago

Great! Should the patch be committed then? I assume this affects quite many users, as multi-queue is a standard feature on today's networked computer systems. At least the OpenDataPlane folks are supporting multiple queues now and netmap is one supported packet I/O mechanism there. And I can't saturate a 40Gbps NIC in my packet processing application without using three threads at average packet size of 1000 bytes, so multi-queue definitely is the way to go in modern applications.

markwebster commented 6 years ago

I can now reliably reproduce this:

[ 1192.108484] CPU: 8 PID: 60 Comm: ksoftirqd/8 Tainted: G          IO    4.9.0-3.sbc-amd64 #1 Debian 4.9.30-2+sbc1
[ 1192.108743] Hardware name: Dell Inc. PowerEdge R610/0F0XJ6, BIOS 6.4.0 07/23/2013
[ 1192.109001] task: ffff93c2228624c0 task.stack: ffffaa6483364000
[ 1192.109143] RIP: 0010:[<ffffffffa18854b2>]  [<ffffffffa18854b2>] __memcpy+0x12/0x20
[ 1192.109421] RSP: 0018:ffffaa6483367cc8  EFLAGS: 00010246
[ 1192.109562] RAX: ffff93c21fc7e240 RBX: ffffaa64881030e0 RCX: 0000000000000008
[ 1192.109706] RDX: 0000000000000000 RSI: 0000bebc00000000 RDI: ffff93c21fc7e240
[ 1192.109849] RBP: ffffaa6483367d30 R08: 0000000000000012 R09: 0000000000000000
[ 1192.109992] R10: ffff93bf2115b900 R11: 000000000000003c R12: ffff93c2221e7200
[ 1192.110136] R13: ffff93c2206d2070 R14: ffff93c221126100 R15: 0000bebc00000000
[ 1192.110280] FS:  0000000000000000(0000) GS:ffff93c22fd00000(0000) knlGS:0000000000000000
[ 1192.110538] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1192.110680] CR2: 00007ffff890e1e8 CR3: 0000000615407000 CR4: 00000000000006e0
[ 1192.110820] Stack:
[ 1192.110953]  ffffffffc10cd323 0000000000000008 ffff93c2221e7040 000003840000000a
[ 1192.111277]  ffff93c2221e7000 0000000f0000003c 000faa6400000040 0000000000000100
[ 1192.111600]  0000000000000000 0000000000000040 ffff93c2221e7200 0000000000000001
[ 1192.111926] Call Trace:
[ 1192.112073]  [<ffffffffc10cd323>] ? ixgbe_clean_rx_irq+0x7a3/0xa20 [ixgbe_netmap]
[ 1192.112333]  [<ffffffffc10ce354>] ixgbe_poll+0xf4/0x690 [ixgbe_netmap]
[ 1192.112476]  [<ffffffffa1a1275e>] net_rx_action+0x1fe/0x340
[ 1192.112612]  [<ffffffffa1aaed6e>] __do_softirq+0xfe/0x280
[ 1192.112748]  [<ffffffffa165accc>] run_ksoftirqd+0x1c/0x30
[ 1192.112882]  [<ffffffffa16760da>] smpboot_thread_fn+0x10a/0x160
[ 1192.113018]  [<ffffffffa1675fd0>] ? sort_range+0x30/0x30
[ 1192.113154]  [<ffffffffa1672a3a>] kthread+0xca/0xe0
[ 1192.113292]  [<ffffffffa1672970>] ? kthread_park+0x60/0x60
[ 1192.113432]  [<ffffffffa1aac9d2>] ret_from_fork+0x22/0x30
[ 1192.113570] Code: 09 d0 48 31 c3 e9 78 ff ff ff 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 66 66 90 66 90 48 89 f8 48 89 d1 48 c1 e9 03 83 e2 07 <f3> 48 a5 89 d1 f3 a4 c3 66 0f 1f 44 00 00 48 89 f8 48 89 d1 f3
[ 1192.115080] RIP  [<ffffffffa18854b2>] __memcpy+0x12/0x20
[ 1192.115243]  RSP <ffffaa6483367cc8>
[ 1192.115389] ---[ end trace 20c6e988819f6c6f ]---

giuseppelettieri commented 6 years ago

Thanks, may you please check with a driver < 5.3.3? I think that this may be an unrelated bug, possibly related to #397 instead.

markwebster commented 6 years ago

Using ixgbe 5.1.3, I wasn't able to reach usual performance levels, and I got this crash after a few runs of pkt-gen:

[  625.406157] BUG: unable to handle kernel NULL pointer dereference at 000000000000001c
[  625.406456] IP: [<ffffffffc1316315>] ixgbe_clean_rx_ring+0x135/0x260 [ixgbe_netmap]
[  625.406750] PGD 0

[  625.406937] Oops: 0002 [#1] SMP
[  625.407100] Modules linked in: nfsd auth_rpcgss nfs_acl nfs lockd grace fscache sunrpc loop ipmi_ssif ipmi_devintf ixgbe_netmap(O) coretemp crct10dif_pclmul crc32_pclmul ghash_clmulni_intel iTCO_wdt iTCO_vendor_support gpio_ich netmap(O) aesni_intel aes_x86_64 lrw gf128mul glue_helper ablk_helper cryptd evdev dcdbas intel_cstate intel_uncore ptp sg pps_core lpc_ich serio_raw pcspkr ipmi_si mfd_core ipmi_msghandler acpi_power_meter ioatdma i7core_edac edac_core dca acpi_cpufreq ext4 crc16 jbd2 crc32c_generic mbcache sd_mod hid_generic usbmouse usbkbd usbhid hid crc32c_intel mptsas scsi_transport_sas mptscsih uhci_hcd ehci_pci ehci_hcd scsi_mod mptbase usbcore usb_common bnx2
[  625.409536] CPU: 11 PID: 11503 Comm: pkt-gen Tainted: G          IO    4.9.0-3.sbc-amd64 #1 Debian 4.9.30-2+sbc1
[  625.409803] Hardware name: Dell Inc. PowerEdge R610/0F0XJ6, BIOS 6.4.0 07/23/2013
[  625.410064] task: ffff88bd5cc40100 task.stack: ffffa01f079d4000
[  625.410207] RIP: 0010:[<ffffffffc1316315>]  [<ffffffffc1316315>] ixgbe_clean_rx_ring+0x135/0x260 [ixgbe_netmap]
[  625.410495] RSP: 0018:ffffa01f079d7d20  EFLAGS: 00010287
[  625.410638] RAX: ffffffff9bea1300 RBX: ffffa01f081e9380 RCX: 0000000000000002
[  625.410784] RDX: 0000000000001000 RSI: 0000000000000000 RDI: ffff88c062a280a0
[  625.410931] RBP: ffffa01f079d7d50 R08: 0000000000000000 R09: 0000000000000000
[  625.411077] R10: 0000000000000000 R11: ffff88bd6fd586a8 R12: ffff88c05f68ee00
[  625.411222] R13: 000000000000001c R14: 0000000000001000 R15: 0000000000000000
[  625.411394] FS:  00007fb4860f0e80(0000) GS:ffff88bd6fd40000(0000) knlGS:0000000000000000
[  625.411669] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  625.411811] CR2: 000000000000001c CR3: 00000003205ba000 CR4: 00000000000006e0
[  625.411956] Stack:
[  625.412091]  ffffa01f079d7d30 ffff88c060f50840 0000000000000001 ffff88c060f50ca0
[  625.412422]  0000000000000003 000000000000000c ffffa01f079d7d90 ffffffffc131c78e
[  625.412750]  ffff88c060f51500 ffff88c060f50840 0000000000000000 ffff88c060f50000
[  625.413078] Call Trace:
[  625.413221]  [<ffffffffc131c78e>] ixgbe_down+0x33e/0x450 [ixgbe_netmap]
[  625.413369]  [<ffffffffc131cc8d>] ixgbe_netmap_reg+0x1cd/0x260 [ixgbe_netmap]
[  625.413521]  [<ffffffffc117faa4>] netmap_hw_reg+0x34/0x90 [netmap]
[  625.413668]  [<ffffffffc1180bb1>] netmap_do_unregif+0x81/0x220 [netmap]
[  625.413815]  [<ffffffffc11837f7>] netmap_priv_delete+0x37/0x60 [netmap]
[  625.413963]  [<ffffffffc1183841>] netmap_dtor+0x21/0x30 [netmap]
[  625.414109]  [<ffffffffc118508a>] linux_netmap_release+0x1a/0x30 [netmap]
[  625.414258]  [<ffffffff9bda322f>] __fput+0xdf/0x1e0
[  625.414400]  [<ffffffff9bda339e>] ____fput+0xe/0x10
[  625.414543]  [<ffffffff9bc70ed0>] task_work_run+0x80/0xa0
[  625.414687]  [<ffffffff9bc02266>] exit_to_usermode_loop+0x86/0x90
[  625.414831]  [<ffffffff9bc029f3>] syscall_return_slowpath+0x53/0x60
[  625.414976]  [<ffffffff9c0ac7e3>] system_call_fast_compare_end+0x8b/0x8d
[  625.415120] Code: 48 8b 87 e8 01 00 00 48 85 c0 0f 84 02 01 00 00 48 8b 40 28 48 85 c0 74 0a 45 31 c0 b9 02 00 00 00 ff d0 4c 8b 7b 10 0f b7 73 1c <f0> 41 29 77 1c 0f 94 c0 66 66 66 66 90 84 c0 75 67 41 83 c5 01
[  625.416635] RIP  [<ffffffffc1316315>] ixgbe_clean_rx_ring+0x135/0x260 [ixgbe_netmap]
[  625.416911]  RSP <ffffa01f079d7d20>
[  625.417050] CR2: 000000000000001c
[  625.417194] ---[ end trace cb02d22733488d70 ]---

jmtilli commented 6 years ago

I'm starting to suspect the remaining issues may be related to the ixgbe driver. I didn't use pkt-gen; I used my own custom applications. Could @markwebster explain how to reproduce the ixgbe crash with pkt-gen, i.e. what kind of command line arguments are needed, so that I could try to reproduce the issues on my i40e setup?

All I have is one 24-CPU server with dual port i40e card, with a direct attach cable between the two ports of the card. No ixgbe card there at all.

markwebster commented 6 years ago

Ok, my test was a 2-way pkt-gen run:

Host-A: netmap dda974f882b439eee732672db747bfd677a12ac9 + ixgbe 4.5.4 (12 physical cores)

RX: pkt-gen -ieth4 -frx [-s* -S*] [-d* -D*] -w3 -p6
TX: pkt-gen -ieth4 -ftx [-s* -S*] [-d* -D*] -w3 -l60 -b64 -R1500000 -p6

Host-B: netmap latest + ixgbe-5.1.3 (12 physical cores)

RX: pkt-gen -ieth4 -frx [-s* -S*] [-d* -D*] -w3 -p6
TX: pkt-gen -ieth4 -ftx [-s* -S*] [-d* -D*] -w3 -l60 -b64 -R1500000 -p6

The crash happens (on Host-B) after a few runs.

jmtilli commented 6 years ago

Ok, I currently don't have two hosts that are connected to each other via i40e NICs. The second host I have doesn't have any cables now connected. After bit less than one month, we should be able to get a second 40 gigabit direct attach QSFP cable, at which point we are planning to introduce a loop topology to the two servers (first port on server 1 connected to first port on server 2, second port on server 1 connected to second port on server 2).

I could, however, try running the four pkt-gen applications on the same server, two on the first Ethernet port and two on the second Ethernet port. I would use a single netmap and i40e driver version then, no possibility to run two different netmap versions on the same system. I'm not sure if I have the time today for this experiment, but pretty soon I might have...

However, do note that you are running two netmap applications on the same network interface. Is this a supported configuration at all? Perhaps some netmap developer could explain whether you can run pkt-gen tx and pkt-gen rx at the same time using the same interface. At least I do see the critical NETMAP_NO_TX_POLL flag in the pkt-gen codebase.

What if you run only pkt-gen TX on server 1 and pkt-gen RX on server 2, or vice versa? I mean running only one pkt-gen instance on a given network interface. Can you reproduce the crash then?

markwebster commented 6 years ago

That's my standard pkt-gen test, which works fine with older netmap / ixgbe versions, but I don't know if it is actually intended to work(?)

I can reproduce the crash if I run the test one way (host-a TX / host-b RX). I found that the crash only happens if I stop pkt-gen on host-b while host-a is still TX. Which sounds like a race condition. There probably needs to be a mutex/guard to stop any RX packet processing during netmap shutdown.

markwebster commented 6 years ago

The same thing happens if I run the test using TX=port0, and RX=port1, on the same host. Fatal exception in interrupt (ixgbe_check_tx_hang)

jmtilli commented 6 years ago

I wonder what would happen if you modified pkt-gen to instead of this:

        for (i = 0; i < g->nthreads; i++) {
                struct timespec t_tic, t_toc;
                /*
                 * Join active threads, unregister interfaces and close
                 * file descriptors.
                 */
                if (targs[i].used)
                        pthread_join(targs[i].thread, NULL); /* blocking */
                if (g->dev_type == DEV_NETMAP) {
                        nm_close(targs[i].nmd);
                        targs[i].nmd = NULL;
                } else {
                        close(targs[i].fd);
                }
...

Do this:

        for (i = 0; i < g->nthreads; i++) {
                /*
                 * Join active threads, unregister interfaces and close
                 * file descriptors.
                 */
                if (targs[i].used)
                        pthread_join(targs[i].thread, NULL); /* blocking */
        }
        for (i = 0; i < g->nthreads; i++) {
                struct timespec t_tic, t_toc;
                if (g->dev_type == DEV_NETMAP) {
                        nm_close(targs[i].nmd);
                        targs[i].nmd = NULL;
                } else {
                        close(targs[i].fd);
                }
...

At least my modification to pkt-gen compiles, not sure how well it would work. In all applications I have created, I always first close every thread and then close every netmap descriptor. So instead of this:

close thread 0
close descriptor 0
close thread 1
close descriptor 1

I do this:

close thread 0
close thread 1
close descriptor 0
close descriptor 1

For my own applications and for the i40e driver, I can repeatedly open and close my packet processing application even on an interface that has continuous traffic. So I cannot reproduce the backtrace on my applications (and using i40e). The pkt-gen source code looks quite suspicious to me, perhaps the problem would be in pkt-gen instead of being in netmap kernel module? Although optimally netmap shouldn't crash the kernel when the userspace misbehaves.

markwebster commented 6 years ago

@jmtilli - thanks for the suggestion, but I still get the same kernel crash. The call stack is something like:

Call Trace: (RIP = __memcpy in IRQ)
ixgbe_check_tx_hang+0x5f/0xb0/0xc0 [ixgbe_netmap]
+ ixgbe_poll+0xf8/0x680 [ixgbe_netmap]
+ net_rx_action

In addition, I get a different crash if I start pkt-gen RX while simply pinging the interface at 50Hz, which is also a crash in memcpy in a slightly different place:

[ 2105.132468] general protection fault: 0000 [#1] SMP
[ 2105.134428] task: ffff945f22e8e400 task.stack: ffffb613431cc000
[ 2105.134561] RIP: 0010:[<ffffffff824854b2>]  [<ffffffff824854b2>] __memcpy+0x12/0x20
[ 2105.134808] RSP: 0018:ffff945f2fd03de0  EFLAGS: 00010246
[ 2105.134931] RAX: ffff945f1b3249c0 RBX: ffff945c22216e00 RCX: 000000000000000d
[ 2105.135057] RDX: 0000000000000000 RSI: 4924d05900000000 RDI: ffff945f1b3249c0
[ 2105.135182] RBP: ffff945f2fd03e48 R08: 0000000000000062 R09: 0000000000000000
[ 2105.135307] R10: 0000000000000043 R11: 0000000000000000 R12: ffffb6134835a000
[ 2105.135432] R13: ffff945c1e326000 R14: 0000000000000005 R15: ffff945f218cc100
[ 2105.135559] FS:  0000000000000000(0000) GS:ffff945f2fd00000(0000) knlGS:0000000000000000
[ 2105.135798] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 2105.135922] CR2: 0000562ed567e208 CR3: 0000000203007000 CR4: 00000000000006e0
[ 2105.136048] Stack:
[ 2105.136166]  ffffffffc1011bfd ffff945f2fd18640 ffff945c22216c40 000001ea00000000
[ 2105.136412]  ffff945c22216c00 0000000000000000 0000004000000005 0000000000000100
[ 2105.136658]  0000000000000040 ffff945c22216e00 0000000000000000 ffff945c22216c00
[ 2105.136906] Call Trace:
[ 2105.137025]  <IRQ>
[ 2105.137042]  [<ffffffffc1011bfd>] ? ixgbe_clean_rx_irq+0x57d/0x930 [ixgbe_netmap]
[ 2105.137289]  [<ffffffffc1012d58>] ixgbe_poll+0xf8/0x680 [ixgbe_netmap]
[ 2105.137416]  [<ffffffff8261275e>] net_rx_action+0x1fe/0x340
[ 2105.137541]  [<ffffffff826aed6e>] __do_softirq+0xfe/0x280
[ 2105.137666]  [<ffffffff8225ae10>] irq_exit+0xb0/0xc0
[ 2105.137789]  [<ffffffff826aeac4>] do_IRQ+0x54/0xd0
[ 2105.137918]  [<ffffffff826ad0fc>] common_interrupt+0x7c/0x7c
[ 2105.138041]  <EOI>
[ 2105.138048]  [<ffffffff825e4fa6>] ? cpuidle_enter_state+0x126/0x2b0
[ 2105.138287]  [<ffffffff825e5167>] cpuidle_enter+0x17/0x20
[ 2105.138411]  [<ffffffff822933d7>] cpu_startup_entry+0x1a7/0x200
[ 2105.138537]  [<ffffffff8222fdac>] start_secondary+0x13c/0x170

giuseppelettieri commented 6 years ago

please try with the current master. It contains the fix from yesterday, plus the fix for another bug that may be cause of the other crashes.

giuseppelettieri commented 6 years ago

By the way, this only affects ixgbe. So the above message is only for @markwebster, I think.

markwebster commented 6 years ago

Thanks @giuseppelettieri that seems quite fixed now.

(@jmtilli sorry your thread got a bit hi-jacked)

giuseppelettieri commented 6 years ago

Good, thanks for testing! @jmtilli, can we close this?

jmtilli commented 6 years ago

@giuseppelettieri I pulled the newest master and tested it. I can repeatedly open and close my multi-queue application. I'll click "Close and comment". Thanks for the fix!

luigirizzo / netmap

Multithread netmap with i40e driver is unstable when application is uncontrollably closed #412