larseggert closed this issue 6 years ago
Is this still reproducible?
Don't think so. Will reopen if I come across this again.
This also happens under 4.14, but only with the kernel drivers (specifically, I can trigger this with igb.) With the external drivers, it doesn't seem to happen.
This has disappeared now.
I am now getting RCU stalls under 4.15 with 1G igb interfaces:
[ 7792.867002] Sending NMI from CPU 5 to CPUs 3:
[ 7792.868004] NMI backtrace for cpu 3
[ 7792.868005] CPU: 3 PID: 12621 Comm: warpping Tainted: G O 4.15.0.muclab+ #6
[ 7792.868006] Hardware name: FUJITSU PRIMERGY RX300 S8/D2939-B1, BIOS V4.6.5.4 R1.14.0 for D2939-B1x 10/13/2014
[ 7792.868006] RIP: 0010:io_serial_in+0xf/0x20
[ 7792.868006] RSP: 0018:ffff88103f183cd0 EFLAGS: 00000002
[ 7792.868007] RAX: 0005ea2728bbca00 RBX: ffffffff82353580 RCX: 0000000000000000
[ 7792.868007] RDX: 00000000000003fd RSI: 0000000000000005 RDI: ffffffff82353580
[ 7792.868008] RBP: 00000000000026ef R08: 0000000000000000 R09: 0000000000000004
[ 7792.868008] R10: 0000000000000000 R11: ffffffff8231a3cd R12: 0000000000000020
[ 7792.868008] R13: ffffffff8231a3f4 R14: 0000000000000034 R15: 0000000000000000
[ 7792.868009] FS: 00007f717c74f540(0000) GS:ffff88103f180000(0000) knlGS:0000000000000000
[ 7792.868009] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 7792.868009] CR2: 00007f71018d0008 CR3: 0000001035b7e004 CR4: 00000000001606e0
[ 7792.868009] Call Trace:
[ 7792.868010] <IRQ>
[ 7792.868010] wait_for_xmitr+0x2b/0x80
[ 7792.868010] ? wait_for_xmitr+0x80/0x80
[ 7792.868010] serial8250_console_putchar+0x11/0x20
[ 7792.868011] uart_console_write+0x3c/0x50
[ 7792.868011] serial8250_console_write+0xcf/0x230
[ 7792.868011] ? msg_print_text+0x75/0xf0
[ 7792.868011] console_unlock+0x356/0x4c0
[ 7792.868012] vprintk_emit+0x292/0x2e0
[ 7792.868012] printk+0x3e/0x46
[ 7792.868012] ? sched_slice.isra.13+0x4c/0x90
[ 7792.868012] rcu_check_callbacks+0x6a0/0x900
[ 7792.868012] ? tick_init_highres+0x10/0x10
[ 7792.868013] update_process_times+0x23/0x50
[ 7792.868013] tick_sched_timer+0x3f/0x150
[ 7792.868013] __hrtimer_run_queues+0xc7/0x200
[ 7792.868013] hrtimer_interrupt+0xa1/0x1e0
[ 7792.868014] smp_apic_timer_interrupt+0x51/0x110
[ 7792.868014] apic_timer_interrupt+0x98/0xa0
[ 7792.868014] </IRQ>
[ 7792.868015] RIP: 0010:igb_clean_tx_ring+0xaf/0x180 [igb_netmap]
[ 7792.868015] RSP: 0018:ffffc90008a2fda0 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff11
[ 7792.868015] RAX: 00000000000000c7 RBX: ffffc90006dbf550 RCX: 0000000000000100
[ 7792.868016] RDX: 0000000000000000 RSI: ffffffff81eb3280 RDI: ffff8810375540a0
[ 7792.868016] RBP: ffff88100adf0c70 R08: ffff8810375540a0 R09: 0000000000000000
[ 7792.868016] R10: 0000000000003b0b R11: 0000000000000c00 R12: 00000000000000c7
[ 7792.868017] R13: ffff8810330fe940 R14: 00000000000000c8 R15: 0000000000000000
[ 7792.868017] igb_down+0x1b5/0x230 [igb_netmap]
[ 7792.868017] igb_netmap_reg+0x119/0x1a0 [igb_netmap]
[ 7792.868018] netmap_hw_reg+0x2a/0x70 [netmap]
[ 7792.868018] netmap_do_unregif+0x72/0x200 [netmap]
[ 7792.868018] netmap_priv_delete+0x2b/0x50 [netmap]
[ 7792.868018] netmap_dtor+0x18/0x30 [netmap]
[ 7792.868019] linux_netmap_release+0x11/0x20 [netmap]
[ 7792.868019] __fput+0x95/0x1d0
[ 7792.868019] task_work_run+0x7f/0xa0
[ 7792.868019] exit_to_usermode_loop+0x6e/0x70
[ 7792.868020] syscall_return_slowpath+0x9b/0xb0
[ 7792.868020] entry_SYSCALL_64_fastpath+0x85/0x87
[ 7792.868020] RIP: 0033:0x7f717c261844
[ 7792.868020] RSP: 002b:00007ffcf0ec7cd8 EFLAGS: 00000246 ORIG_RAX: 0000000000000003
[ 7792.868021] RAX: 0000000000000000 RBX: 0000000000000002 RCX: 00007f717c261844
[ 7792.868021] RDX: 0000000000000000 RSI: 000000007a88c000 RDI: 0000000000000003
[ 7792.868022] RBP: 000055b5b01ef260 R08: 0000000000000000 R09: 00007f71018d1000
[ 7792.868022] R10: 000000000076c000 R11: 0000000000000246 R12: 000000000007a11f
[ 7792.868022] R13: 000055b5b01ef300 R14: 000055b5b01ef318 R15: 0000000000000000
[ 7792.868022] Code: 00 00 00 d3 e6 48 63 f6 48 03 77 10 8b 06 c3 0f 1f 00 66 2e 0f 1f 84 00 00 00 00 00 0f b6 8f a1 00 00 00 8b 57 08 d3 e6 01 f2 ec <0f> b6 c0 c3 0f 1f 00 66 2e 0f 1f 84 00 00 00 00 00 0f b6 8f a1
[ 7792.890990] NMI backtrace for cpu 3
[ 7792.890991] CPU: 3 PID: 12621 Comm: warpping Tainted: G O 4.15.0.muclab+ #6
[ 7792.890992] Hardware name: FUJITSU PRIMERGY RX300 S8/D2939-B1, BIOS V4.6.5.4 R1.14.0 for D2939-B1x 10/13/2014
[ 7792.890992] Call Trace:
[ 7792.890994] <IRQ>
[ 7792.890996] dump_stack+0x5c/0x7e
[ 7792.890999] nmi_cpu_backtrace+0xbf/0xd0
[ 7792.891002] ? lapic_can_unplug_cpu+0xa0/0xa0
[ 7792.891004] nmi_trigger_cpumask_backtrace+0x8a/0xc0
[ 7792.891006] rcu_dump_cpu_stacks+0x90/0xc6
[ 7792.891008] rcu_check_callbacks+0x6b0/0x900
[ 7792.891009] ? tick_init_highres+0x10/0x10
[ 7792.891011] update_process_times+0x23/0x50
[ 7792.891012] tick_sched_timer+0x3f/0x150
[ 7792.891013] __hrtimer_run_queues+0xc7/0x200
[ 7792.891014] hrtimer_interrupt+0xa1/0x1e0
[ 7792.891016] smp_apic_timer_interrupt+0x51/0x110
[ 7792.891017] apic_timer_interrupt+0x98/0xa0
[ 7792.891018] </IRQ>
[ 7792.891020] RIP: 0010:igb_clean_tx_ring+0xaf/0x180 [igb_netmap]
[ 7792.891021] RSP: 0018:ffffc90008a2fda0 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff11
[ 7792.891022] RAX: 00000000000000c7 RBX: ffffc90006dbf550 RCX: 0000000000000100
[ 7792.891022] RDX: 0000000000000000 RSI: ffffffff81eb3280 RDI: ffff8810375540a0
[ 7792.891023] RBP: ffff88100adf0c70 R08: ffff8810375540a0 R09: 0000000000000000
[ 7792.891023] R10: 0000000000003b0b R11: 0000000000000c00 R12: 00000000000000c7
[ 7792.891024] R13: ffff8810330fe940 R14: 00000000000000c8 R15: 0000000000000000
[ 7792.891027] igb_down+0x1b5/0x230 [igb_netmap]
[ 7792.891030] igb_netmap_reg+0x119/0x1a0 [igb_netmap]
[ 7792.891032] netmap_hw_reg+0x2a/0x70 [netmap]
[ 7792.891034] netmap_do_unregif+0x72/0x200 [netmap]
[ 7792.891037] netmap_priv_delete+0x2b/0x50 [netmap]
[ 7792.891039] netmap_dtor+0x18/0x30 [netmap]
[ 7792.891041] linux_netmap_release+0x11/0x20 [netmap]
[ 7792.891042] __fput+0x95/0x1d0
[ 7792.891043] task_work_run+0x7f/0xa0
[ 7792.891044] exit_to_usermode_loop+0x6e/0x70
[ 7792.891046] syscall_return_slowpath+0x9b/0xb0
[ 7792.891047] entry_SYSCALL_64_fastpath+0x85/0x87
[ 7792.891047] RIP: 0033:0x7f717c261844
[ 7792.891048] RSP: 002b:00007ffcf0ec7cd8 EFLAGS: 00000246 ORIG_RAX: 0000000000000003
[ 7792.891049] RAX: 0000000000000000 RBX: 0000000000000002 RCX: 00007f717c261844
[ 7792.891049] RDX: 0000000000000000 RSI: 000000007a88c000 RDI: 0000000000000003
[ 7792.891050] RBP: 000055b5b01ef260 R08: 0000000000000000 R09: 00007f71018d1000
[ 7792.891050] R10: 000000000076c000 R11: 0000000000000246 R12: 000000000007a11f
[ 7792.891051] R13: 000055b5b01ef300 R14: 000055b5b01ef318 R15: 0000000000000000
[ 7855.906326] INFO: rcu_sched self-detected stall on CPU
[ 7855.912058] 3-....: (8402 ticks this GP) idle=7fe/140000000000001/0 softirq=151934/151934 fqs=3659
[ 7855.922247]
[ 7855.922248] INFO: rcu_sched detected stalls on CPUs/tasks:
[ 7855.922250] (t=8407 jiffies g=5272 c=5271 q=925)
[ 7855.922251] NMI backtrace for cpu 3
[ 7855.922252] CPU: 3 PID: 12621 Comm: warpping Tainted: G O 4.15.0.muclab+ #6
[ 7855.922252] Hardware name: FUJITSU PRIMERGY RX300 S8/D2939-B1, BIOS V4.6.5.4 R1.14.0 for D2939-B1x 10/13/2014
[ 7855.922253] Call Trace:
[ 7855.922253] <IRQ>
[ 7855.922255] dump_stack+0x5c/0x7e
[ 7855.922256] nmi_cpu_backtrace+0xbf/0xd0
[ 7855.922258] ? lapic_can_unplug_cpu+0xa0/0xa0
[ 7855.922260] nmi_trigger_cpumask_backtrace+0x8a/0xc0
[ 7855.922261] rcu_dump_cpu_stacks+0x90/0xc6
[ 7855.922263] rcu_check_callbacks+0x6b0/0x900
[ 7855.922264] ? tick_init_highres+0x10/0x10
[ 7855.922265] update_process_times+0x23/0x50
[ 7855.922267] tick_sched_timer+0x3f/0x150
[ 7855.922268] __hrtimer_run_queues+0xc7/0x200
[ 7855.922269] hrtimer_interrupt+0xa1/0x1e0
[ 7855.922270] smp_apic_timer_interrupt+0x51/0x110
[ 7855.922271] apic_timer_interrupt+0x98/0xa0
[ 7855.922272] </IRQ>
[ 7855.922274] RIP: 0010:igb_clean_tx_ring+0xc1/0x180 [igb_netmap]
[ 7855.922275] RSP: 0018:ffffc90008a2fda0 EFLAGS: 00000282 ORIG_RAX: ffffffffffffff11
[ 7855.922275] RAX: 0000000000000024 RBX: ffffc90006dbd6c0 RCX: 0000000000000100
[ 7855.922276] RDX: 0000000000000000 RSI: ffffffff81eb3280 RDI: ffff8810375540a0
[ 7855.922276] RBP: ffff88100adf0240 R08: ffff8810375540a0 R09: 0000000000000000
[ 7855.922277] R10: 0000000000003b0b R11: 0000000000000c00 R12: 0000000000000023
[ 7855.922277] R13: ffff8810330fe940 R14: 0000000000000024 R15: 0000000000000000
[ 7855.922280] igb_down+0x1b5/0x230 [igb_netmap]
[ 7855.922283] igb_netmap_reg+0x119/0x1a0 [igb_netmap]
[ 7855.922285] netmap_hw_reg+0x2a/0x70 [netmap]
[ 7855.922287] netmap_do_unregif+0x72/0x200 [netmap]
[ 7855.922289] netmap_priv_delete+0x2b/0x50 [netmap]
[ 7855.922291] netmap_dtor+0x18/0x30 [netmap]
[ 7855.922294] linux_netmap_release+0x11/0x20 [netmap]
[ 7855.922295] __fput+0x95/0x1d0
[ 7855.922296] task_work_run+0x7f/0xa0
[ 7855.922297] exit_to_usermode_loop+0x6e/0x70
[ 7855.922298] syscall_return_slowpath+0x9b/0xb0
[ 7855.922299] entry_SYSCALL_64_fastpath+0x85/0x87
Could you please try the following patch?
diff --git a/LINUX/if_igb_netmap.h b/LINUX/if_igb_netmap.h
index 58926a5..b8cb562 100644
--- a/LINUX/if_igb_netmap.h
+++ b/LINUX/if_igb_netmap.h
@@ -178,7 +178,6 @@ igb_netmap_txsync(struct netmap_kring *kring, int flags)
wmb(); /* synchronize writes to the NIC ring */
/* (re)start the tx unit up to slot nic_i (excluded) */
- txr->next_to_use = nic_i;
writel(nic_i, txr->tail);
mmiowb(); // XXX why do we need this ?
}
@@ -255,6 +254,7 @@ igb_netmap_rxsync(struct netmap_kring *kring, int flags)
}
if (n) { /* update the state variables */
rxr->next_to_clean = nic_i;
+ rxr->next_to_alloc = nic_i;
kring->nr_hwtail = nm_i;
}
kring->nr_kflags &= ~NKR_PENDINTR;
@@ -291,7 +291,6 @@ igb_netmap_rxsync(struct netmap_kring *kring, int flags)
* so move nic_i back by one unit
*/
nic_i = nm_prev(nic_i, lim);
- rxr->next_to_use = nic_i;
writel(nic_i, rxr->tail);
}
@@ -375,7 +374,6 @@ igb_netmap_configure_rx_ring(struct igb_ring *rxr)
wmb(); /* Force memory writes to complete */
ND("%s rxr%d.tail %d", na->name, reg_idx, i);
- rxr->next_to_use = i;
writel(i, rxr->tail);
return 1; // success
}
The patch above seems to fix this issue.
Merged, thanks.
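The thread does not spell out why the stall ends up in igb_clean_tx_ring. As a rough illustration only, here is a toy model in C of one way this kind of hang can arise: a cleanup loop that walks from next_to_clean to next_to_use and relies on per-slot bookkeeping can spin forever if next_to_use was advanced by a path that never filled in that bookkeeping. This is not the igb or netmap code. The names toy_tx_buffer, toy_clean_ring, NO_EOP and RING_SIZE, and the loop shape itself, are assumptions made for this sketch, and the step limit exists only so the model terminates.

```c
#define RING_SIZE 8
#define NO_EOP (-1) /* hypothetical marker: slot never filled in by the normal transmit path */

/* Hypothetical per-slot bookkeeping: index of the last descriptor of the packet. */
struct toy_tx_buffer {
    int eop;
};

/*
 * Walk from next_to_clean to next_to_use, skipping to the last descriptor
 * of each packet. Returns the number of steps taken, or -1 once max_steps
 * is exceeded (the point where an unbounded loop would keep spinning).
 */
static int toy_clean_ring(const struct toy_tx_buffer *buf, int next_to_clean,
                          int next_to_use, int max_steps)
{
    int i = next_to_clean;
    int steps = 0;

    while (i != next_to_use) {
        int eop = buf[i].eop;

        /* Advance to the end of the current packet. */
        while (i != eop) {
            i = (i + 1) % RING_SIZE;
            if (++steps > max_steps)
                return -1;
        }

        /* Step past the last descriptor of the packet. */
        i = (i + 1) % RING_SIZE;
        if (++steps > max_steps)
            return -1;
    }
    return steps;
}
```

In this model, with slots 0 to 3 each holding a one-descriptor packet (eop equal to the slot index) and next_to_use at 4, the walk finishes after visiting those four slots. If next_to_use is instead set to 6 while slots 4 and 5 still hold NO_EOP, the inner loop never finds its end marker and the function hits the step limit. Whether this matches the actual failure here is not confirmed in the thread.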
Under Linux 4.11 with --no-ext-drivers (since the external drivers don't compile), I get RCU stalls in netmap mode. This wasn't the case with 4.8.