kohler / click

The Click modular router: fast modular packet processing and analysis
Other
740 stars 321 forks source link

Current router thread refuse to die #362

Open tbarbette opened 7 years ago

tbarbette commented 7 years ago

Hi all, I tried multiple kernel versions (3.2, 3.16 and 4.4) , multiple configuration options, but I always get this error when unloading the Click module. The whole router is unstable and ends up crashing when loaded with enough packets. capture d ecran de 2017-07-26 23-21-34

Any idea?

tbarbette commented 7 years ago

I think this is the backtrace of the part refusing to die :

[ 695.433491] INFO: rcu_sched detected stalls on CPUs/tasks: { 0} (detected by 12, t=99783 jiffies, g=3456, c=3455, q=210526) [ 695.444916] sending NMI to all CPUs: [ 695.448537] NMI backtrace for cpu 0 [ 695.452081] CPU: 0 PID: 2370 Comm: kclick Tainted: G W O 3.16.0-4-amd64 #1 Debian 3.16.43-2+deb8u2 [ 695.461886] Hardware name: PRIMINFO X99-A/X99-A, BIOS 2101 11/26/2015 [ 695.468381] task: ffff88081dfd0b60 ti: ffff88081d860000 task.ti: ffff88081d860000 [ 695.475927] RIP: 0010:[] [] _raw_spin_lock_irqsave+0x1f/0x50 [ 695.484928] RSP: 0018:ffff88083fc03d28 EFLAGS: 00000006 [ 695.490296] RAX: 0000000000000046 RBX: ffff88081c75af00 RCX: 0000000000002823 [ 695.497490] RDX: 0000000028242824 RSI: 0000000000000046 RDI: ffff88081c75afa8 [ 695.504681] RBP: ffff88081c75afa8 R08: 00000000000001ff R09: 0000000000000000 [ 695.511869] R10: 0000000000000000 R11: 0000000000000006 R12: ffff88081ba860b0 [ 695.519060] R13: 0000000000000046 R14: 0000000000000000 R15: 0000000000000001 [ 695.526251] FS: 0000000000000000(0000) GS:ffff88083fc00000(0000) knlGS:0000000000000000 [ 695.534405] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 695.540204] CR2: 0000555555834148 CR3: 0000000001813000 CR4: 00000000001407f0 [ 695.547396] Stack: [ 695.549462] ffffffffa099c5bf ffff88081ba86000 0000000000000202 ffff88081a4fc200 [ 695.557154] ffff88081a4fc200 0000000000000000 0000000000000001 ffffffffa09f278c [ 695.564846] ffff88081ba86000 0000000000000246 ffff88081a4fc200 ffff88081afbb000 [ 695.572539] Call Trace: [ 695.575042] [ 695.577012] [] ? _ZN4Task11add_pendingEb+0x3f/0x160 [click] [ 695.584716] [] ? _ZN10FromDevice7got_skbEP7sk_buff.part.20+0x21c/0x240 [click] [ 695.593713] [] ? click_fromdevice_rx_handler+0x91/0x120 [click] [ 695.601349] [] ? __netif_receive_skb_core+0x1c4/0x770 [ 695.608106] [] ? read_tsc+0x5/0x20 [ 695.613210] [] ? netif_receive_skb_internal+0x1f/0x80 [ 695.619969] [] ? napi_gro_receive+0xd0/0x100 [ 695.625962] [] ? i40e_napi_poll+0x4ed/0x1030 [i40e] [ 695.632546] [] ? net_rx_action+0x129/0x250 [ 695.638348] [] ? __do_softirq+0xf1/0x2d0 [ 695.643975] [] ? irq_exit+0x95/0xa0 [ 695.649172] [] ? do_IRQ+0x52/0xe0 [ 695.654194] [] ? common_interrupt+0x6d/0x6d [ 695.660077] [ 695.662045] [] ? _ZN4Task17complete_scheduleEP12RouterThread+0x68/0xe0 [click] [ 695.671363] [] ? _ZN4Task17complete_scheduleEP12RouterThread+0x5/0xe0 [click] [ 695.680264] [] ? _ZN12RouterThread6driverEv+0x347/0x480 [click] [ 695.687967] [] ? _ZL11click_schedPv+0x100/0x2d0 [click] [ 695.694965] [] ? _Z19click_cleanup_schedv+0x160/0x160 [click] [ 695.702431] [] ? kthread+0xbd/0xe0 [ 695.707542] [] ? kthread_create_on_node+0x180/0x180 [ 695.714126] [] ? ret_from_fork+0x58/0x90 [ 695.719756] [] ? kthread_create_on_node+0x180/0x180

Any idea ? Maybe @peterhurley could give me a direction?

This is on 3.16 but I had similar issue on 3.2 and 4.4. I recompiled in single-thread mode, without any other compile options.

pallas commented 7 years ago

My guess would be that either there is a broken data structure --- maybe due to releasing the lock too early in Task::complete_schedule --- and you're getting into an infinite loop in Task::add_pending or that there was a schedule-while-atomic in some other piece of code that might hold the lock in Task::add_pending.

pallas commented 7 years ago

Have you tried turning on lock checking in the kernel?

tbarbette commented 7 years ago

Thanks for the idea, I got just a little more info :

In the same stack I've got this line on top of it, seems like a bad RCU [ 1524.679257] [] _raw_spin_unlock_irqrestore+0x2c/0x50 [ 1524.686283] [] _ZN4Task11add_pendingEb+0x29/0x16c [click] [ 1524.693460] [] _ZN4Task17complete_scheduleEP12RouterThread+0x7c/0xd2 [click]

And a newcommer : [ 1507.725237] INFO: rcu_sched detected stalls on CPUs/tasks: [ 1507.730771] 0-...: (3 GPs behind) idle=73b/140000000000001/0 softirq=44151/44151 fqs=5248 [ 1507.739187] (detected by 12, t=5255 jiffies, g=41736, c=41735, q=11605) [ 1507.745951] Task dump for CPU 0: [ 1507.749190] kclick R running task 0 24596 2 0x00000008 [ 1507.756311] 0000000000000001 ffff88081f20df40 ffff8805f1d37ca0 ffffffff810ff4de [ 1507.763869] 0000000000000046 03e5a5b51ce63457 ffff8805f1d37cf0 0000000000000282 [ 1507.771402] ffff8805f1d37cd0 ffffffff810ff5b2 ffffffff810ff515 ffff8805f00bcc00 [ 1507.778916] Call Trace: [ 1507.781392] [] ? try_to_del_timer_sync+0x5e/0x90 [ 1507.787683] [] ? del_timer_sync+0xa2/0xc0 [ 1507.793371] [] ? del_timer_sync+0x5/0xc0 [ 1507.799227] [] ? _ZN12RouterThread6driverEv+0x10b/0x50a [click] [ 1507.806997] [] ? _ZN4Task15process_pendingEP12RouterThread+0x27/0xec [click] [ 1507.815935] [] ? _ZN12RouterThread6driverEv+0x124/0x50a [click] [ 1507.823771] [] ? _Z19click_cleanup_schedv+0x182/0x182 [click] [ 1507.831441] [] ? _ZL11click_schedPv+0x123/0x21a [click] [ 1507.838360] [] ? kthread+0x10c/0x130 [ 1507.843619] [] ? kthread_create_on_node+0x230/0x230 [ 1507.850176] [] ? ret_from_fork+0x3f/0x70 [ 1507.855802] [] ? kthread_create_on_node+0x230/0x230