google / kernel-sanitizers

Linux Kernel Sanitizers, fast bug-detectors for the Linux kernel
https://google.github.io/kernel-sanitizers/

Frequent lockups #185

Closed: xairy closed this issue 8 years ago

xairy commented 9 years ago
[  295.076187] INFO: rcu_sched detected stalls on CPUs/tasks:
[  295.076197]  (detected by 1, t=21003 jiffies, g=7740, c=7739, q=376)
[  295.076197] All QSes seen, last rcu_sched kthread activity 20540 (4294962375-4294941835), jiffies_till_next_fqs=3, root ->qsmask 0x0
[  295.076197] trinity-c14     R  running task    12888  2942   2606 0x0000000c
[  295.076197]  0000000000000000 0000000000003258 ffff8800b8768080 ffff88063fc83d08
[  295.076197]  ffffffff810c8f0a 00000000ffff9c8b 00000000ffffecc7 0000000000000000
[  295.076197]  ffffffff82440e80 ffff88063fc83dd8 ffffffff8110904d 0000000000000000
[  295.076197] Call Trace:
[  295.076197]  <IRQ>  [<ffffffff810c8f0a>] sched_show_task+0xfa/0x160
[  295.076197]  [<ffffffff8110904d>] rcu_check_callbacks+0xc8d/0xcb0
[  295.076197]  [<ffffffff81090508>] ? __raise_softirq_irqoff+0x28/0x90
[  295.076197]  [<ffffffff811105fe>] update_process_times+0x3e/0x70
[  295.076197]  [<ffffffff811298be>] tick_sched_handle.isra.15+0x7e/0xa0
[  295.076197]  [<ffffffff8112a4be>] tick_sched_timer+0x5e/0xb0
[  295.076197]  [<ffffffff81111bb7>] __hrtimer_run_queues+0x237/0x460
[  295.076197]  [<ffffffff8112a460>] ? tick_sched_do_timer+0x70/0x70
[  295.076197]  [<ffffffff811123a5>] ? hrtimer_interrupt+0xe5/0x290
[  295.076197]  [<ffffffff811123be>] hrtimer_interrupt+0xfe/0x290
[  295.076197]  [<ffffffff8105da87>] local_apic_timer_interrupt+0x77/0xc0
[  295.076197]  [<ffffffff8105de9e>] smp_trace_apic_timer_interrupt+0x8e/0x1b0
[  295.076197]  [<ffffffff81eb358b>] trace_apic_timer_interrupt+0x6b/0x70
[  295.076197]  <EOI>  [<ffffffff81251fb0>] ? kt_spin_unlock+0x20/0x30
[  295.076197]  [<ffffffff8124ec92>] ? ktsan_atomic32_load+0x52/0x140
[  295.076197]  [<ffffffff81131b1a>] ? csd_lock_wait+0x2a/0x40
[  295.076197]  [<ffffffff811b4ef0>] ? perf_event_comm_output+0x360/0x360
[  295.076197]  [<ffffffff81131b1a>] csd_lock_wait+0x2a/0x40
[  295.076197]  [<ffffffff81131db9>] smp_call_function_single+0xb9/0x190
[  295.076197]  [<ffffffff811b4ef0>] ? perf_event_comm_output+0x360/0x360
[  295.076197]  [<ffffffff811b3d9e>] perf_event_read+0x19e/0x1b0
[  295.076197]  [<ffffffff811b3f2b>] ? perf_event_read_value+0x17b/0x1c0
[  295.076197]  [<ffffffff811b3ec5>] perf_event_read_value+0x115/0x1c0
[  295.076197]  [<ffffffff811b4647>] perf_read+0x347/0x510
[  295.076197]  [<ffffffff81260319>] ? do_readv_writev+0x3f9/0x420
[  295.076197]  [<ffffffff8125eaef>] ? do_loop_readv_writev+0xbf/0x120
[  295.076197]  [<ffffffff8125eb15>] do_loop_readv_writev+0xe5/0x120
[  295.076197]  [<ffffffff811b4300>] ? perf_try_init_event+0x100/0x100
[  295.076197]  [<ffffffff811b4300>] ? perf_try_init_event+0x100/0x100
[  295.076197]  [<ffffffff81260319>] do_readv_writev+0x3f9/0x420
[  295.076197]  [<ffffffff811138f4>] ? do_setitimer+0x294/0x3b0
[  295.076197]  [<ffffffff8126036f>] ? vfs_readv+0x2f/0x80
[  295.076197]  [<ffffffff8124ac57>] ? kt_access+0x177/0x460
[  295.076197]  [<ffffffff812616bb>] ? SyS_readv+0x6b/0x140
[  295.076197]  [<ffffffff8126036f>] ? vfs_readv+0x2f/0x80
[  295.076197]  [<ffffffff8126039a>] vfs_readv+0x5a/0x80
[  295.076197]  [<ffffffff812616bb>] SyS_readv+0x6b/0x140
[  295.076197]  [<ffffffff81eb26ee>] entry_SYSCALL_64_fastpath+0x12/0x71
[  295.076197] rcu_sched kthread starved for 20592 jiffies! g7740 c7739 f0x2
dvyukov commented 9 years ago

Do you see this warning:

        /*
         * Can deadlock when called with interrupts disabled.
         * We allow cpu's that are not yet online though, as no one else can
         * send smp call function interrupt to this cpu and as such deadlocks
         * can't happen.
         */
        WARN_ON_ONCE(cpu_online(this_cpu) && irqs_disabled()
                     && !oops_in_progress);

from smp_call_function_single?

xairy commented 9 years ago

No.

dvyukov commented 9 years ago

Does it dump stacks on other CPUs? There must be a second part of the deadlock; if we see both stacks, we may be able to figure out what is happening.
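
(For reference: the first stack above is spinning in csd_lock_wait(), which busy-waits for the target CPU to run the IPI handler and release the csd lock. A simplified sketch of the kernel/smp.c helper from this era, not the exact source of this tree:)

static void csd_lock_wait(struct call_single_data *csd)
{
        /* Spin until the CPU targeted by smp_call_function_single()
         * runs the IPI handler and clears CSD_FLAG_LOCK. If that CPU
         * never services the IPI (e.g. it is stuck with interrupts
         * disabled), this loop spins forever and RCU reports a stall. */
        while (smp_load_acquire(&csd->flags) & CSD_FLAG_LOCK)
                cpu_relax();
}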

xairy commented 9 years ago

No, I haven't seen stacks from other CPUs. I'll post more reports.

xairy commented 9 years ago
INFO: rcu_sched detected stalls on CPUs/tasks:
    2: (1983 ticks this GP) idle=dcd/140000000000000/0 softirq=33018/33018 fqs=944
    (detected by 1, t=21013 jiffies, g=15551, c=15550, q=1584)
Task dump for CPU 2:
trinity-c8      R  running task    12776  3877   3838 0x0000000c
 ffffffff8272ae18 0000000000000003 ffffea0007c0f41c ffff88023c92fd90
 ffffffff811d3b51 ffff88023c92fd90 00000000ffffffff ffffffff81255e49
 ffff88023c9b6a70 ffffffff812505a6 0000000000000093 ffff88024d5fb000
Call Trace:
 [<     inlined    >] atomic_dec_and_test ./arch/x86/include/asm/atomic.h:151
 [<     inlined    >] put_page_testzero include/linux/mm.h:345
 [<ffffffff811d3b51>] ? __free_pages+0x31/0x90 mm/page_alloc.c:3288
 [<ffffffff81255e49>] ? kt_sync_ensure_created+0x29/0xb0 mm/ktsan/sync.c:13
 [<ffffffff812505a6>] ? kt_spin_unlock+0x16/0x20 mm/ktsan/spinlock.c:20
 [<ffffffff8125620b>] ? kt_tab_access+0x11b/0x1b0 mm/ktsan/tab.c:129
 [<ffffffff8124f954>] ? kt_memblock_remove_sync+0x14/0x80 mm/ktsan/memblock.c:70
 [<ffffffff81255f4b>] ? kt_sync_drop_and_free+0x3b/0x60 mm/ktsan/sync.c:64
 [<ffffffff8124b4cf>] ? ktsan_sync_drop+0x7f/0xc0 mm/ktsan/ktsan.c:347 (discriminator 7)
 [<ffffffff811d3b5c>] ? __free_pages+0x3c/0x90 mm/page_alloc.c:3289
 [<ffffffff81222956>] ? __vunmap+0x96/0x160 mm/vmalloc.c:1485 (discriminator 2)
 [<ffffffff81222ac1>] ? vfree+0x41/0xb0 mm/vmalloc.c:1525
 [<     inlined    >] copy_module_from_user kernel/module.c:2675
 [<     inlined    >] SYSC_init_module kernel/module.c:3592
 [<ffffffff8113b229>] ? SyS_init_module+0x129/0x150 kernel/module.c:3579
 [<ffffffff81eb0d6e>] ? entry_SYSCALL_64_fastpath+0x12/0x71 arch/x86/entry/entry_64.S:186
INFO: rcu_sched detected stalls on CPUs/tasks:
    3: (1674 ticks this GP) idle=fa7/140000000000000/0 softirq=33477/33477 fqs=5024
    (detected by 1, t=21005 jiffies, g=15552, c=15551, q=3880)
Task dump for CPU 3:
trinity-c2      R  running task    12536  3940   3838 0x0000000c
 ffffea0002971300 ffffffff811d3b5c 00000000000068da ffff8802149aa740
 ffff8802149aa768 ffff8802149aa760 ffff88020b87be78 ffffffff81222956
 0000000000000206 ffffc900242cc000 000000001ed50000 00007f901100d000
Call Trace:
 [<ffffffff811d3b5c>] ? __free_pages+0x3c/0x90 mm/page_alloc.c:3289
 [<ffffffff81222956>] ? __vunmap+0x96/0x160 mm/vmalloc.c:1485 (discriminator 2)
 [<ffffffff81222ac1>] ? vfree+0x41/0xb0 mm/vmalloc.c:1525
 [<     inlined    >] copy_module_from_user kernel/module.c:2675
 [<     inlined    >] SYSC_init_module kernel/module.c:3592
 [<ffffffff8113b229>] ? SyS_init_module+0x129/0x150 kernel/module.c:3579
 [<ffffffff81eb0d6e>] ? entry_SYSCALL_64_fastpath+0x12/0x71 arch/x86/entry/entry_64.S:186
xairy commented 9 years ago

The lockups were most probably caused by a deadlock between kt_sync_drop and kt_memblock_free. This was fixed in #65.
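
For context, the generic shape of such a bug is lock-order inversion between two code paths. A minimal userspace illustration with hypothetical lock names, standing in for (not reproducing) the actual mm/ktsan/ code:

#include <pthread.h>

/* Hypothetical locks representing the two ktsan-internal locks. */
static pthread_mutex_t sync_tab_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t memblock_lock = PTHREAD_MUTEX_INITIALIZER;

/* Path A acquires sync_tab_lock, then memblock_lock... */
static void *path_a(void *arg)
{
        pthread_mutex_lock(&sync_tab_lock);
        pthread_mutex_lock(&memblock_lock);   /* blocks if path B holds it */
        pthread_mutex_unlock(&memblock_lock);
        pthread_mutex_unlock(&sync_tab_lock);
        return NULL;
}

/* ...path B acquires them in the opposite order. With one thread in
 * each path, both can block forever; in the kernel, every CPU that
 * later needs one of the locks wedges behind them, which shows up
 * as the RCU stalls above. */
static void *path_b(void *arg)
{
        pthread_mutex_lock(&memblock_lock);
        pthread_mutex_lock(&sync_tab_lock);   /* blocks if path A holds it */
        pthread_mutex_unlock(&sync_tab_lock);
        pthread_mutex_unlock(&memblock_lock);
        return NULL;
}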

xairy commented 8 years ago

Seeing more lockups:

INFO: rcu_sched self-detected stall on CPU
    3: (20998 ticks this GP) idle=ec9/140000000000001/0 softirq=14965/14965 fqs=6967
INFO: rcu_sched detected stalls on CPUs/tasks:
    0: (1 GPs behind) idle=d3d/140000000000000/0 softirq=15777/15778 fqs=6967
    3: (20998 ticks this GP) idle=ec9/140000000000001/0 softirq=14965/14965 fqs=6967
    (detected by 1, t=21002 jiffies, g=7900, c=7899, q=271)
Task dump for CPU 0:
trinity-c4      R  running task        8  2861   2527 0x10000008
 0000070200000200 ffff880200800000 ffff88028c750f44 00000000de1be1c8
 0000000000000296 ffff8802de1be1c8 0000000000000286 ffff88028c750f40
 ffffffff82021910 ffff8802de1be1c8 ffff8800bb905b30 0000000000000296
Call Trace:
 [<ffffffff81384c1e>] ? __ext4_journal_stop+0x5e/0x110 fs/ext4/ext4_jbd2.c:88
 [<ffffffff81554baa>] ? iov_iter_advance+0x11a/0x310 lib/iov_iter.c:512
 [<ffffffff8124e75e>] ? kt_spin_unlock+0xe/0x20 mm/ktsan/spinlock.c:20
 [<ffffffff8125221b>] ? kt_tab_access+0x11b/0x1b0 mm/ktsan/tab.c:129
 [<ffffffff8124e75e>] ? kt_spin_unlock+0xe/0x20 mm/ktsan/spinlock.c:20
 [<ffffffff81251e90>] ? kt_sync_acquire+0x90/0x160 mm/ktsan/sync.c:93
 [<ffffffff8124cb7c>] ? ktsan_atomic32_compare_exchange+0xcc/0xf0 mm/ktsan/ktsan.c:709
 [<     inline     >] ? atomic_cmpxchg ./arch/x86/include/asm/atomic.h:235
 [<     inline     >] ? virt_queued_spin_lock ./arch/x86/include/asm/qspinlock.h:49
 [<ffffffff810dd899>] ? queued_spin_lock_slowpath+0x39/0x2c0 kernel/locking/qspinlock.c:292
 [<ffffffff8124cb7c>] ? ktsan_atomic32_compare_exchange+0xcc/0xf0 mm/ktsan/ktsan.c:709
 [<     inline     >] ? __raw_spin_lock include/asm-generic/qspinlock.h:83
 [<ffffffff81ea8286>] ? _raw_spin_lock+0x66/0x70 kernel/locking/spinlock.c:151
 [<     inline     >] ? spin_lock include/linux/spinlock.h:312
 [<     inline     >] ? lock_pte_protection mm/mprotect.c:49
 [<     inline     >] ? change_pte_range mm/mprotect.c:71
 [<     inline     >] ? change_pmd_range mm/mprotect.c:180
 [<     inline     >] ? change_pud_range mm/mprotect.c:206
 [<     inline     >] ? change_protection_range mm/mprotect.c:231
 [<ffffffff81218f29>] ? change_protection+0x619/0x760 mm/mprotect.c:252
 [<ffffffff812191ed>] ? mprotect_fixup+0x17d/0x2f0 mm/mprotect.c:324
 [<     inline     >] ? SYSC_mprotect mm/mprotect.c:425
 [<ffffffff8121953b>] ? SyS_mprotect+0x1db/0x310 mm/mprotect.c:346
 [<ffffffff8101d078>] ? syscall_trace_leave+0x88/0x160 arch/x86/kernel/ptrace.c:1634
 [<ffffffff81ea89fd>] ? tracesys_phase2+0x84/0x89 arch/x86/entry/entry_64.S:268
Task dump for CPU 3:
trinity-c14     R  running task       16  2983   2527 0x10000008
 ffff8800bae1e3b0 ffff880281d58840 ffffea000a03a600 ffff8800bae1e3b0
 ffff88027ea05680 0000000000000702 ffff880281ed7c68 ffffffff8121bc6c
 0000000000000292 0000000000000292 ffff8800bae1e408 ffff8800bae1e3b0
Call Trace:
 [<ffffffff8121bc6c>] ? try_to_unmap_one+0xbc/0x4f0 mm/rmap.c:1223
 [<ffffffff8121bbb0>] ? page_remove_rmap+0x40/0x40 mm/rmap.c:1183
 [<ffffffff8121bbb0>] ? page_remove_rmap+0x40/0x40 mm/rmap.c:1183
 [<     inline     >] ? rmap_walk_file mm/rmap.c:1540
 [<ffffffff8121d247>] ? rmap_walk+0x147/0x450 mm/rmap.c:1559
 [<ffffffff8121d90b>] ? try_to_unmap+0xbb/0x110 mm/rmap.c:1388
 [<ffffffff8121bbb0>] ? page_remove_rmap+0x40/0x40 mm/rmap.c:1183
 [<ffffffff8121adf0>] ? anon_vma_ctor+0x60/0x60 mm/rmap.c:423
 [<ffffffff8121c330>] ? page_get_anon_vma+0xd0/0xd0 ./arch/x86/include/asm/atomic.h:264
 [<     inline     >] ? __unmap_and_move mm/migrate.c:882
 [<     inline     >] ? unmap_and_move mm/migrate.c:941
 [<ffffffff81248226>] ? migrate_pages+0x326/0x9f0 mm/migrate.c:1130
 [<ffffffff8123ac80>] ? alloc_pages_vma+0x220/0x220 mm/mempolicy.c:1775
 [<ffffffff8123b615>] ? do_mbind+0x5b5/0x640 mm/mempolicy.c:1217
 [<ffffffff8124e75e>] ? kt_spin_unlock+0xe/0x20 mm/ktsan/spinlock.c:20
 [<ffffffff81237e77>] ? get_nodes+0x107/0x140 mm/mempolicy.c:1249
 [<     inline     >] ? SYSC_mbind mm/mempolicy.c:1319
 [<ffffffff8123b86a>] ? SyS_mbind+0xaa/0xb0 mm/mempolicy.c:1301
 [<ffffffff8101d078>] ? syscall_trace_leave+0x88/0x160 arch/x86/kernel/ptrace.c:1634
 [<ffffffff81ea89fd>] ? tracesys_phase2+0x84/0x89 arch/x86/entry/entry_64.S:268
     (t=21114 jiffies g=7900 c=7899 q=271)
Task dump for CPU 0:
trinity-c4      R  running task        8  2861   2527 0x10000008
 0000070200000200 ffff880200800000 ffff88028c750f44 00000000de1be1c8
 0000000000000296 ffff8802de1be1c8 0000000000000286 ffff88028c750f40
 ffffffff82021910 ffff8802de1be1c8 ffff8800bb905b30 0000000000000296
Call Trace:
 [<ffffffff81384c1e>] ? __ext4_journal_stop+0x5e/0x110 fs/ext4/ext4_jbd2.c:88
 [<ffffffff81554baa>] ? iov_iter_advance+0x11a/0x310 lib/iov_iter.c:512
 [<ffffffff8124e75e>] ? kt_spin_unlock+0xe/0x20 mm/ktsan/spinlock.c:20
 [<ffffffff8125221b>] ? kt_tab_access+0x11b/0x1b0 mm/ktsan/tab.c:129
 [<ffffffff8124e75e>] ? kt_spin_unlock+0xe/0x20 mm/ktsan/spinlock.c:20
 [<ffffffff81251e90>] ? kt_sync_acquire+0x90/0x160 mm/ktsan/sync.c:93
 [<ffffffff8124cadf>] ? ktsan_atomic32_compare_exchange+0x2f/0xf0 mm/ktsan/ktsan.c:704
 [<     inline     >] ? atomic_cmpxchg ./arch/x86/include/asm/atomic.h:235
 [<     inline     >] ? virt_queued_spin_lock ./arch/x86/include/asm/qspinlock.h:49
 [<ffffffff810dd899>] ? queued_spin_lock_slowpath+0x39/0x2c0 kernel/locking/qspinlock.c:292
 [<ffffffff8124cb7c>] ? ktsan_atomic32_compare_exchange+0xcc/0xf0 mm/ktsan/ktsan.c:709
 [<     inline     >] ? __raw_spin_lock include/asm-generic/qspinlock.h:83
 [<ffffffff81ea8286>] ? _raw_spin_lock+0x66/0x70 kernel/locking/spinlock.c:151
 [<     inline     >] ? spin_lock include/linux/spinlock.h:312
 [<     inline     >] ? lock_pte_protection mm/mprotect.c:49
 [<     inline     >] ? change_pte_range mm/mprotect.c:71
 [<     inline     >] ? change_pmd_range mm/mprotect.c:180
 [<     inline     >] ? change_pud_range mm/mprotect.c:206
 [<     inline     >] ? change_protection_range mm/mprotect.c:231
 [<ffffffff81218f29>] ? change_protection+0x619/0x760 mm/mprotect.c:252
 [<ffffffff812191ed>] ? mprotect_fixup+0x17d/0x2f0 mm/mprotect.c:324
 [<     inline     >] ? SYSC_mprotect mm/mprotect.c:425
 [<ffffffff8121953b>] ? SyS_mprotect+0x1db/0x310 mm/mprotect.c:346
 [<ffffffff8101d078>] ? syscall_trace_leave+0x88/0x160 arch/x86/kernel/ptrace.c:1634
 [<ffffffff81ea89fd>] ? tracesys_phase2+0x84/0x89 arch/x86/entry/entry_64.S:268
Task dump for CPU 3:
trinity-c14     R  running task       16  2983   2527 0x10000008
 0000000000000000 0000000000000010 ffff88028dd9c680 ffff88063fd83c98
 ffffffff810c8aa0 0000000640000000 0000000000000003 0000000000000003
 ffffffff82440e80 ffff88063fd83cb8 ffffffff810cb158 ffffffff82440e80
Call Trace:
 <IRQ>  [<ffffffff810c8aa0>] sched_show_task+0xf0/0x150 kernel/sched/core.c:4837
 [<ffffffff810cb158>] dump_cpu_task+0x38/0x40 kernel/sched/core.c:8484
 [<ffffffff81e95c8c>] rcu_dump_cpu_stacks+0xc3/0x113 kernel/rcu/tree.c:1201
 [<     inline     >] print_cpu_stall kernel/rcu/tree.c:1308
 [<     inline     >] check_cpu_stall kernel/rcu/tree.c:1372
 [<     inline     >] __rcu_pending kernel/rcu/tree.c:3474
 [<     inline     >] rcu_pending kernel/rcu/tree.c:3538
 [<ffffffff81107d0f>] rcu_check_callbacks+0x5ef/0xca0 kernel/rcu/tree.c:2760
 [<ffffffff81251e90>] ? kt_sync_acquire+0x90/0x160 mm/ktsan/sync.c:93
 [<ffffffff8124e75e>] ? kt_spin_unlock+0xe/0x20 mm/ktsan/spinlock.c:20
 [<ffffffff81249f8b>] ? kt_cache_alloc+0x2b/0x40 mm/ktsan/cache.c:53
 [<ffffffff8110f8de>] update_process_times+0x3e/0x70 kernel/time/timer.c:1397
 [<ffffffff81128b1e>] tick_sched_handle.isra.15+0x7e/0xa0 kernel/time/tick-sched.c:151
 [<ffffffff8112971e>] tick_sched_timer+0x5e/0xb0 kernel/time/tick-sched.c:1081
 [<     inline     >] __run_hrtimer kernel/time/hrtimer.c:1223
 [<ffffffff81110e87>] __hrtimer_run_queues+0x237/0x450 kernel/time/hrtimer.c:1287
 [<ffffffff811296c0>] ? tick_sched_do_timer+0x70/0x70 kernel/time/tick-sched.c:132
 [<ffffffff8111167e>] hrtimer_interrupt+0xfe/0x290 kernel/time/hrtimer.c:1321
 [<ffffffff8105d9b7>] local_apic_timer_interrupt+0x77/0xc0 arch/x86/kernel/apic/apic.c:890
 [<ffffffff8105dd06>] smp_apic_timer_interrupt+0x76/0xb0 arch/x86/kernel/apic/apic.c:914
 [<ffffffff81ea978a>] apic_timer_interrupt+0x8a/0xa0 arch/x86/entry/entry_64.S:790
 <EOI>  [<     inline     >] ? arch_local_irq_restore ./arch/x86/include/asm/paravirt.h:836
 <EOI>  [<ffffffff8124b95e>] ? ktsan_atomic32_load+0x9e/0xc0 mm/ktsan/ktsan.c:591
 [<     inline     >] ? rep_nop ./arch/x86/include/asm/processor.h:567
 [<     inline     >] ? cpu_relax ./arch/x86/include/asm/processor.h:572
 [<ffffffff81130d6d>] ? csd_lock_wait+0x1d/0x40 kernel/smp.c:111
 [<ffffffff81080a70>] ? do_flush_tlb_all+0x70/0x70 arch/x86/mm/tlb.c:248
 [<ffffffff81130d7a>] ? csd_lock_wait+0x2a/0x40 kernel/smp.c:110
 [<ffffffff81131019>] ? smp_call_function_single+0xb9/0x190 kernel/smp.c:303
 [<ffffffff81080a70>] ? do_flush_tlb_all+0x70/0x70 arch/x86/mm/tlb.c:248
 [<ffffffff81131748>] ? smp_call_function_many+0x368/0x3a0 kernel/smp.c:435
 [<ffffffff81080a70>] ? do_flush_tlb_all+0x70/0x70 arch/x86/mm/tlb.c:248
 [<ffffffff81080d4b>] ? native_flush_tlb_others+0x6b/0x80 arch/x86/mm/tlb.c:153
 [<     inline     >] ? flush_tlb_others ./arch/x86/include/asm/paravirt.h:362
 [<ffffffff81081151>] ? flush_tlb_page+0x91/0xf0 arch/x86/mm/tlb.c:237
 [<ffffffff81223aec>] ? ptep_clear_flush+0x8c/0xb0 mm/pgtable-generic.c:117
 [<ffffffff8121bc6c>] ? try_to_unmap_one+0xbc/0x4f0 mm/rmap.c:1223
 [<ffffffff8121bbb0>] ? page_remove_rmap+0x40/0x40 mm/rmap.c:1183
 [<ffffffff8121bbb0>] ? page_remove_rmap+0x40/0x40 mm/rmap.c:1183
 [<     inline     >] ? rmap_walk_file mm/rmap.c:1540
 [<ffffffff8121d247>] ? rmap_walk+0x147/0x450 mm/rmap.c:1559
 [<ffffffff8121d90b>] ? try_to_unmap+0xbb/0x110 mm/rmap.c:1388
 [<ffffffff8121bbb0>] ? page_remove_rmap+0x40/0x40 mm/rmap.c:1183
 [<ffffffff8121adf0>] ? anon_vma_ctor+0x60/0x60 mm/rmap.c:423
 [<ffffffff8121c330>] ? page_get_anon_vma+0xd0/0xd0 ./arch/x86/include/asm/atomic.h:264
 [<     inline     >] ? __unmap_and_move mm/migrate.c:882
 [<     inline     >] ? unmap_and_move mm/migrate.c:941
 [<ffffffff81248226>] ? migrate_pages+0x326/0x9f0 mm/migrate.c:1130
 [<ffffffff8123ac80>] ? alloc_pages_vma+0x220/0x220 mm/mempolicy.c:1775
 [<ffffffff8123b615>] ? do_mbind+0x5b5/0x640 mm/mempolicy.c:1217
 [<ffffffff8124e75e>] ? kt_spin_unlock+0xe/0x20 mm/ktsan/spinlock.c:20
 [<ffffffff81237e77>] ? get_nodes+0x107/0x140 mm/mempolicy.c:1249
 [<     inline     >] ? SYSC_mbind mm/mempolicy.c:1319
 [<ffffffff8123b86a>] ? SyS_mbind+0xaa/0xb0 mm/mempolicy.c:1301
 [<ffffffff8101d078>] ? syscall_trace_leave+0x88/0x160 arch/x86/kernel/ptrace.c:1634
 [<ffffffff81ea89fd>] ? tracesys_phase2+0x84/0x89 arch/x86/entry/entry_64.S:268
xairy commented 8 years ago
INFO: rcu_sched detected stalls on CPUs/tasks:
    (detected by 1, t=21003 jiffies, g=12860, c=12859, q=470)
All QSes seen, last rcu_sched kthread activity 21003 (4295121529-4295100526), jiffies_till_next_fqs=3, root ->qsmask 0x0
trinity-c5      S ffff88063fc95240 12960  3846   3564 0x00000000
 0000000000000000 00000000000032a0 ffff88028df547c0 ffff88063fc83d08
 ffffffff810c8aa0 000000010002086e 0000000100025a79 0000000000000000
 ffffffff82440e80 ffff88063fc83dd8 ffffffff8110839f 0000000000000000
Call Trace:
 <IRQ>  [<ffffffff810c8aa0>] sched_show_task+0xf0/0x150 kernel/sched/core.c:4837
 [<     inline     >] print_other_cpu_stall kernel/rcu/tree.c:1272
 [<     inline     >] check_cpu_stall kernel/rcu/tree.c:1378
 [<     inline     >] __rcu_pending kernel/rcu/tree.c:3474
 [<     inline     >] rcu_pending kernel/rcu/tree.c:3538
 [<ffffffff8110839f>] rcu_check_callbacks+0xc7f/0xca0 kernel/rcu/tree.c:2760
 [<ffffffff81251e60>] ? kt_sync_acquire+0x90/0x160 mm/ktsan/sync.c:93
 [<ffffffff8110f8de>] update_process_times+0x3e/0x70 kernel/time/timer.c:1397
 [<ffffffff81128b1e>] tick_sched_handle.isra.15+0x7e/0xa0 kernel/time/tick-sched.c:151
 [<ffffffff8112971e>] tick_sched_timer+0x5e/0xb0 kernel/time/tick-sched.c:1081
 [<     inline     >] __run_hrtimer kernel/time/hrtimer.c:1223
 [<ffffffff81110e87>] __hrtimer_run_queues+0x237/0x450 kernel/time/hrtimer.c:1287
 [<ffffffff811296c0>] ? tick_sched_do_timer+0x70/0x70 kernel/time/tick-sched.c:132
 [<ffffffff8111167e>] hrtimer_interrupt+0xfe/0x290 kernel/time/hrtimer.c:1321
 [<ffffffff8105d9b7>] local_apic_timer_interrupt+0x77/0xc0 arch/x86/kernel/apic/apic.c:890
 [<ffffffff8105ddce>] smp_trace_apic_timer_interrupt+0x8e/0x1a0 arch/x86/kernel/apic/apic.c:934
 [<     inline     >] ? __raw_read_unlock include/linux/rwlock_api_smp.h:254
 [<ffffffff8108ed24>] ? do_wait+0x324/0x3f0 kernel/exit.c:1499
 [<ffffffff81ea992a>] trace_apic_timer_interrupt+0x8a/0xa0 arch/x86/entry/entry_64.S:790
 <EOI>  [<     inline     >] ? __raw_read_unlock include/linux/rwlock_api_smp.h:254
 <EOI>  [<ffffffff8108ed24>] ? do_wait+0x324/0x3f0 kernel/exit.c:1499
 [<     inline     >] ? arch_local_irq_restore ./arch/x86/include/asm/paravirt.h:836
 [<ffffffff8124bb97>] ? ktsan_mtx_post_unlock+0x97/0xc0 mm/ktsan/ktsan.c:468
 [<     inline     >] ? __raw_read_unlock include/linux/rwlock_api_smp.h:254
 [<ffffffff8108ed24>] ? do_wait+0x324/0x3f0 kernel/exit.c:1499
 [<     inline     >] ? SYSC_wait4 kernel/exit.c:1615
 [<ffffffff8108f335>] ? SyS_wait4+0xe5/0x160 kernel/exit.c:1584
 [<ffffffff8108b070>] ? task_stopped_code+0x90/0x90 kernel/exit.c:1127
 [<ffffffff81ea89d1>] ? entry_SYSCALL_64_fastpath+0x31/0x95 arch/x86/entry/entry_64.S:188
rcu_sched kthread starved for 21024 jiffies! g12860 c12859 f0x2
INFO: rcu_sched detected stalls on CPUs/tasks:
    (detected by 1, t=84008 jiffies, g=12860, c=12859, q=478)
All QSes seen, last rcu_sched kthread activity 84008 (4295184534-4295100526), jiffies_till_next_fqs=3, root ->qsmask 0x0
trinity-c5      R  running task    12960  3846   3564 0x00000000
 0000000000000000 00000000000032a0 ffff88028df547c0 ffff88063fc83d08
 ffffffff810c8aa0 000000010002086e 0000000100035096 0000000000000000
 ffffffff82440e80 ffff88063fc83dd8 ffffffff8110839f 0000000000000000
Call Trace:
 <IRQ>  [<ffffffff810c8aa0>] sched_show_task+0xf0/0x150 kernel/sched/core.c:4837
 [<     inline     >] print_other_cpu_stall kernel/rcu/tree.c:1272
 [<     inline     >] check_cpu_stall kernel/rcu/tree.c:1378
 [<     inline     >] __rcu_pending kernel/rcu/tree.c:3474
 [<     inline     >] rcu_pending kernel/rcu/tree.c:3538
 [<ffffffff8110839f>] rcu_check_callbacks+0xc7f/0xca0 kernel/rcu/tree.c:2760
 [<ffffffff81251e60>] ? kt_sync_acquire+0x90/0x160 mm/ktsan/sync.c:93
 [<ffffffff8110f8de>] update_process_times+0x3e/0x70 kernel/time/timer.c:1397
 [<ffffffff81128b1e>] tick_sched_handle.isra.15+0x7e/0xa0 kernel/time/tick-sched.c:151
 [<ffffffff8112971e>] tick_sched_timer+0x5e/0xb0 kernel/time/tick-sched.c:1081
 [<     inline     >] __run_hrtimer kernel/time/hrtimer.c:1223
 [<ffffffff81110e87>] __hrtimer_run_queues+0x237/0x450 kernel/time/hrtimer.c:1287
 [<ffffffff811296c0>] ? tick_sched_do_timer+0x70/0x70 kernel/time/tick-sched.c:132
 [<ffffffff8111167e>] hrtimer_interrupt+0xfe/0x290 kernel/time/hrtimer.c:1321
 [<ffffffff8105d9b7>] local_apic_timer_interrupt+0x77/0xc0 arch/x86/kernel/apic/apic.c:890
 [<ffffffff8105ddce>] smp_trace_apic_timer_interrupt+0x8e/0x1a0 arch/x86/kernel/apic/apic.c:934
 [<ffffffff81ea992a>] trace_apic_timer_interrupt+0x8a/0xa0 arch/x86/entry/entry_64.S:790
 <EOI>  [<ffffffff8124e72e>] ? kt_spin_unlock+0xe/0x20 mm/ktsan/spinlock.c:20
 [<     inline     >] ? arch_local_irq_restore ./arch/x86/include/asm/paravirt.h:836
 [<     inline     >] ? __raw_spin_unlock_irqrestore include/linux/spinlock_api_smp.h:179
 [<ffffffff81ea7fac>] ? _raw_spin_unlock_irqrestore+0x4c/0x60 kernel/locking/spinlock.c:191
 [<     inline     >] spin_unlock_irqrestore include/linux/spinlock.h:372
 [<ffffffff810da76f>] add_wait_queue+0x3f/0x50 kernel/sched/wait.c:30
 [<ffffffff8108eabc>] do_wait+0xbc/0x3f0 kernel/exit.c:1471
 [<     inline     >] SYSC_wait4 kernel/exit.c:1615
 [<ffffffff8108f335>] SyS_wait4+0xe5/0x160 kernel/exit.c:1584
 [<ffffffff8108b070>] ? task_stopped_code+0x90/0x90 kernel/exit.c:1127
 [<ffffffff81ea89d1>] entry_SYSCALL_64_fastpath+0x31/0x95 arch/x86/entry/entry_64.S:188
rcu_sched kthread starved for 84029 jiffies! g12860 c12859 f0x2
xairy commented 8 years ago
INFO: rcu_sched detected stalls on CPUs/tasks:
        (detected by 3, t=21002 jiffies, g=8535, c=8534, q=749)
All QSes seen, last rcu_sched kthread activity 21000 (4295905729-4295884729), jiffies_till_next_fqs=3, root ->qsmask 0x0
swapper/3       R  running task    14800     0      1 0x00000000
 0000000000000000 00000000000039d0 ffff88028e734180 ffff88063fd83d18
 ffffffff810c8aa0 00000001000dffb9 00000001000e51c1 0000000000000000
 ffffffff82440e80 ffff88063fd83de8 ffffffff8110839f 0000000000000000
Call Trace:
 <IRQ>  [<ffffffff810c8aa0>] sched_show_task+0xf0/0x150 kernel/sched/core.c:4837
 [<     inline     >] print_other_cpu_stall kernel/rcu/tree.c:1272
 [<     inline     >] check_cpu_stall kernel/rcu/tree.c:1378
 [<     inline     >] __rcu_pending kernel/rcu/tree.c:3474
 [<     inline     >] rcu_pending kernel/rcu/tree.c:3538
 [<ffffffff8110839f>] rcu_check_callbacks+0xc7f/0xca0 kernel/rcu/tree.c:2760
 [<ffffffff8110f8de>] update_process_times+0x3e/0x70 kernel/time/timer.c:1397
 [<ffffffff81128b1e>] tick_sched_handle.isra.15+0x7e/0xa0 kernel/time/tick-sched.c:151
 [<ffffffff8112971e>] tick_sched_timer+0x5e/0xb0 kernel/time/tick-sched.c:1081
 [<     inline     >] __run_hrtimer kernel/time/hrtimer.c:1223
 [<ffffffff81110e87>] __hrtimer_run_queues+0x237/0x450 kernel/time/hrtimer.c:1287
 [<ffffffff811296c0>] ? tick_sched_do_timer+0x70/0x70 kernel/time/tick-sched.c:132
 [<ffffffff8111167e>] hrtimer_interrupt+0xfe/0x290 kernel/time/hrtimer.c:1321
 [<ffffffff8105d9b7>] local_apic_timer_interrupt+0x77/0xc0 arch/x86/kernel/apic/apic.c:890
 [<ffffffff8105dd06>] smp_apic_timer_interrupt+0x76/0xb0 arch/x86/kernel/apic/apic.c:914
 [<ffffffff81eaa30a>] apic_timer_interrupt+0x8a/0xa0 arch/x86/entry/entry_64.S:790
 <EOI>  [<ffffffff8106fd36>] ? native_safe_halt+0x6/0x10 ./arch/x86/include/asm/irqflags.h:49
 [<     inline     >] arch_safe_halt ./arch/x86/include/asm/paravirt.h:111
 [<ffffffff81015c51>] default_idle+0x41/0x160 arch/x86/kernel/process.c:301
 [<ffffffff8101679f>] arch_cpu_idle+0x1f/0x30 arch/x86/kernel/process.c:292
 [<ffffffff810db233>] default_idle_call+0x43/0x60 kernel/sched/idle.c:89
 [<     inline     >] cpuidle_idle_call kernel/sched/idle.c:157
 [<     inline     >] cpu_idle_loop kernel/sched/idle.c:253
 [<ffffffff810db665>] cpu_startup_entry+0x375/0x3e0 kernel/sched/idle.c:301
 [<ffffffff8105a8ff>] start_secondary+0x22f/0x240 arch/x86/kernel/smpboot.c:261
dvyukov commented 8 years ago

I wonder why it hangs in safe_halt... did we forget to enable interrupts somewhere in the runtime?
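
(For context: native_safe_halt is essentially the one-liner below, as in arch/x86/include/asm/irqflags.h. "sti; hlt" enables interrupts immediately before halting, so a CPU parked here should wake on the next timer interrupt:)

static inline void native_safe_halt(void)
{
        /* Enable interrupts and halt; the CPU resumes at the next
         * interrupt. A permanent hang here would mean no interrupt
         * is being delivered, rather than IRQs merely being left
         * disabled by the runtime. */
        asm volatile("sti; hlt" : : : "memory");
}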

dvyukov commented 8 years ago

> Been experiencing this since kernels 4.*; rolling back to 3.19 and it's gone.

@cirrusUK can you please elaborate? What did you roll back? And what exactly is gone?

dvyukov commented 8 years ago

@cirrusUK Are you using CONFIG_KTSAN? Did you backport it to 3.19?

cirrusUK commented 8 years ago

@dvyukov Humble apologies, it would appear I posted in the wrong thread; I will close/delete the issue. I'm sorry for the noise.