ssrg-vt / popcorn-kernel

Popcorn Linux kernel for distributed thread execution
Other
156 stars 22 forks source link

x86: NPB bt-induced deadlock on remote #79

Closed bxatnarf closed 4 years ago

bxatnarf commented 5 years ago

arch: amd64 branch: merge triggering executable: bt (note: file is compressed) triggering example's input: A file called inputbt.data in same directory from which you execute bt containing the following

10       number of time steps
0.01e0   dt for class A = 0.0008d0. class B = 0.0003d0  class C = 0.0001d0
2 2 2

kernel config:

CONFIG_ARCH_SUPPORTS_POPCORN=y
CONFIG_POPCORN=y
CONFIG_POPCORN_DEBUG=y
# CONFIG_POPCORN_DEBUG_PROCESS_SERVER is not set
# CONFIG_POPCORN_DEBUG_PAGE_SERVER is not set
# CONFIG_POPCORN_DEBUG_VMA_SERVER is not set
# CONFIG_POPCORN_DEBUG_VERBOSE is not set
# CONFIG_POPCORN_CHECK_SANITY is not set
# CONFIG_POPCORN_REMOTE_INFO is not set
# CONFIG_POPCORN_STAT is not set
CONFIG_POPCORN_KMSG=y
CONFIG_POPCORN_KMSG_SOCKET=m
# CONFIG_POPCORN_KMSG_TEST is not set
# CONFIG_POPCORN_DEBUG_MSG_LAYER is not set

After executing bt a couple of times, the remote node will eventually deadlock and execution on the origin will halt as it waits for the remote node to finish.

There are no interesting message in the origin's kernel log when this happens, however the remote node's console contains:

Message from syslogd@x86 at Feb 12 14:38:28 ...
 kernel:[  589.388764] watchdog: BUG: soft lockup - CPU#1 stuck for 22s! [bt:668]

Message from syslogd@x86 at Feb 12 14:39:16 ...
 kernel:[  637.391605] watchdog: BUG: soft lockup - CPU#1 stuck for 22s! [bt:668]

Message from syslogd@x86 at Feb 12 14:39:44 ...
 kernel:[  665.389329] watchdog: BUG: soft lockup - CPU#1 stuck for 22s! [bt:668]

Message from syslogd@x86 at Feb 12 14:40:32 ...
 kernel:[  713.388493] watchdog: BUG: soft lockup - CPU#1 stuck for 22s! [bt:668]

The remote node's kernel log contains:

[  150.782570] popcorn: Ready on TCP/IP
[  177.287193] watchdog: BUG: soft lockup - CPU#0 stuck for 23s! [bt:638]
[  177.287193] Modules linked in: msg_socket
[  177.287193] irq event stamp: 51784
[  177.287193] hardirqs last  enabled at (51783): [<ffffffff810019bd>] trace_hardirqs_on_thunk+0x1a/0x1c
[  177.287193] hardirqs last disabled at (51784): [<ffffffff810019d9>] trace_hardirqs_off_thunk+0x1a/0x1c
[  177.287193] softirqs last  enabled at (51782): [<ffffffff81800341>] __do_softirq+0x341/0x44c
[  177.287193] softirqs last disabled at (51775): [<ffffffff810651d6>] irq_exit+0xa6/0xe0
[  177.287193] CPU: 0 PID: 638 Comm: bt Not tainted 4.20.0-rc7-popcorn+ #123
[  177.287193] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-20181126_142135-anatol 04/01/2014
[  177.287193] RIP: 0010:lock_is_held_type+0x62/0x80
[  177.287193] Code: 25 80 4d 01 00 41 c7 84 24 a4 07 00 00 01 00 00 00 89 ee 48 89 df e8 0d c6 ff ff 41 c7 84 24 a4 07 00 00 00 00 00 00 41 55 9d <5b> 5d 41 5c 41 5d c3 5b b8 01 00e
[  177.287193] RSP: 0018:ffffc90000603c00 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff13
[  177.287193] RAX: 0000000000000000 RBX: ffffffff81c4dea0 RCX: 0000000000000001
[  177.287193] RDX: 0000000000000000 RSI: 00000000ffffffff RDI: ffffffff81c4dea0
[  177.287193] RBP: 00000000ffffffff R08: 000000000000001c R09: 0000000000000001
[  177.287193] R10: 0000000000000000 R11: 0000000000000000 R12: ffff8881379b6180
[  177.287193] R13: 0000000000000246 R14: ffffc90000603c90 R15: ffffea00043c5138
[  177.287193] FS:  0000000000000000(0000) GS:ffff88813fa00000(0000) knlGS:0000000000000000
[  177.287193] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  177.287193] CR2: 0000000000862008 CR3: 0000000001c12000 CR4: 00000000000006f0
[  177.287193] Call Trace:
[  177.287193]  rcu_read_lock_sched_held+0x6c/0x80
[  177.287193]  free_pcp_prepare+0xff/0x130
[  177.287193]  free_unref_page_list+0x7a/0x240
[  177.287193]  release_pages+0x296/0x2e0
[  177.287193]  tlb_flush_mmu_free+0x31/0x50
[  177.287193]  unmap_page_range+0x62e/0x830
[  177.287193]  ? uprobe_clear_state+0x1f/0x80
[  177.287193]  unmap_vmas+0x3c/0x50
[  177.287193]  exit_mmap+0x91/0x160
[  177.287193]  mmput+0x41/0xf0
[  177.287193]  do_exit+0x38b/0xb50
[  177.287193]  ? remote_worker_main+0x3b8/0x590
[  177.287193]  ? handle_remote_futex_response+0x30/0x30
[  177.287193]  kthread+0xed/0x140
[  177.287193]  ? kthread_park+0x80/0x80
[  177.287193]  ret_from_fork+0x3a/0x50
[  185.415701] watchdog: BUG: soft lockup - CPU#1 stuck for 22s! [bt:649]
[  185.415701] Modules linked in: msg_socket
[  185.415701] irq event stamp: 11046
[  185.415701] hardirqs last  enabled at (11045): [<ffffffff810019bd>] trace_hardirqs_on_thunk+0x1a/0x1c
[  185.415701] hardirqs last disabled at (11046): [<ffffffff810019d9>] trace_hardirqs_off_thunk+0x1a/0x1c
[  185.415701] softirqs last  enabled at (724): [<ffffffff81800341>] __do_softirq+0x341/0x44c
[  185.415701] softirqs last disabled at (697): [<ffffffff810651d6>] irq_exit+0xa6/0xe0
[  185.415701] CPU: 1 PID: 649 Comm: bt Tainted: G             L    4.20.0-rc7-popcorn+ #123
[  185.415701] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-20181126_142135-anatol 04/01/2014
[  185.415701] RIP: 0010:lock_is_held_type+0x62/0x80
[  185.415701] Code: 25 80 4d 01 00 41 c7 84 24 a4 07 00 00 01 00 00 00 89 ee 48 89 df e8 0d c6 ff ff 41 c7 84 24 a4 07 00 00 00 00 00 00 41 55 9d <5b> 5d 41 5c 41 5d c3 5b b8 01 00e
[  185.415701] RSP: 0018:ffffc900005fbc00 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff13
[  185.415701] RAX: 0000000000000000 RBX: ffffffff81c4dea0 RCX: 0000000000000001
[  185.415701] RDX: 0000000000000000 RSI: 00000000ffffffff RDI: ffffffff81c4dea0
[  185.415701] RBP: 00000000ffffffff R08: 000000000000001c R09: 0000000000000001
[  185.415701] R10: 0000000000000000 R11: 0000000000000000 R12: ffff888137a0c3c0
[  185.415701] R13: 0000000000000246 R14: ffffc900005fbc90 R15: ffffea00043c72e8
[  185.415701] FS:  0000000000000000(0000) GS:ffff88813fc00000(0000) knlGS:0000000000000000
[  185.415701] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  185.415701] CR2: 00007ffff7dfffe8 CR3: 0000000001c12000 CR4: 00000000000006e0
[  185.415701] Call Trace:
[  185.415701]  rcu_read_lock_sched_held+0x6c/0x80
[  185.415701]  free_pcp_prepare+0xff/0x130
[  185.415701]  free_unref_page_list+0x7a/0x240
[  185.415701]  release_pages+0x296/0x2e0
[  185.415701]  tlb_flush_mmu_free+0x31/0x50
[  185.415701]  unmap_page_range+0x62e/0x830
[  185.415701]  unmap_vmas+0x3c/0x50
[  185.415701]  exit_mmap+0x91/0x160
[  185.415701]  mmput+0x41/0xf0
[  185.415701]  do_exit+0x38b/0xb50
[  185.415701]  ? remote_worker_main+0x3b8/0x590
[  185.415701]  ? handle_remote_futex_response+0x30/0x30
[  185.415701]  kthread+0xed/0x140
[  185.415701]  ? kthread_park+0x80/0x80
[  185.415701]  ret_from_fork+0x3a/0x50
[  187.023802] rcu: INFO: rcu_sched self-detected stall on CPU
[  187.023802] rcu:     0-...!: (6499 ticks this GP) idle=136/1/0x4000000000000002 softirq=2985/2985 fqs=462 
[  187.023802] rcu:      (t=6500 jiffies g=5333 q=23)
[  187.023802] rcu: rcu_sched kthread starved for 5573 jiffies! g5333 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x0 ->cpu=1
[  187.023802] rcu: RCU grace-period kthread stack dump:
[  187.023802] rcu_sched       R  running task        0    10      2 0x80000000
[  187.023802] Call Trace:
[  187.023802]  ? __schedule+0x28b/0xaf0
[  187.023802]  schedule+0x2f/0x90
[  187.023802]  schedule_timeout+0x1d4/0x500
[  187.023802]  ? collect_expired_timers+0xb0/0xb0
[  187.023802]  rcu_gp_kthread+0x468/0x8f0
[  187.023802]  ? cond_synchronize_rcu+0x20/0x20
[  187.023802]  kthread+0x120/0x140
[  187.023802]  ? kthread_park+0x80/0x80
[  187.023802]  ret_from_fork+0x3a/0x50
[  187.023802] NMI backtrace for cpu 0
[  187.023802] CPU: 0 PID: 638 Comm: bt Tainted: G             L    4.20.0-rc7-popcorn+ #123
[  187.023802] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-20181126_142135-anatol 04/01/2014
[  187.023802] Call Trace:
[  187.023802]  <IRQ>
[  187.023802]  dump_stack+0x67/0x90
[  187.023802]  nmi_cpu_backtrace.cold.0+0x13/0x50
[  187.023802]  ? irq_force_complete_move.cold.4+0x1c/0x1c
[  187.023802]  nmi_trigger_cpumask_backtrace+0xa2/0xa4
[  187.023802]  rcu_dump_cpu_stacks+0x91/0xbd
[  187.023802]  rcu_check_callbacks.cold.49+0x177/0x339
[  187.023802]  ? rcu_read_lock_sched_held+0x6c/0x80
[  187.023802]  update_process_times+0x28/0x60
[  187.023802]  tick_periodic+0x27/0xb0
[  187.023802]  tick_handle_periodic+0x20/0x60
[  187.023802]  smp_apic_timer_interrupt+0x79/0x220
[  187.023802]  apic_timer_interrupt+0xf/0x20
[  187.023802]  </IRQ>
[  187.023802] RIP: 0010:lock_is_held_type+0x62/0x80
[  187.023802] Code: 25 80 4d 01 00 41 c7 84 24 a4 07 00 00 01 00 00 00 89 ee 48 89 df e8 0d c6 ff ff 41 c7 84 24 a4 07 00 00 00 00 00 00 41 55 9d <5b> 5d 41 5c 41 5d c3 5b b8 01 00e
[  187.023802] RSP: 0018:ffffc90000603c00 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff13
[  187.023802] RAX: 0000000000000000 RBX: ffffffff81c4dea0 RCX: 0000000000000001
[  187.023802] RDX: 0000000000000000 RSI: 00000000ffffffff RDI: ffffffff81c4dea0
[  187.023802] RBP: 00000000ffffffff R08: 000000000000001c R09: 0000000000000001
[  187.023802] R10: 0000000000000000 R11: 0000000000000000 R12: ffff8881379b6180
[  187.023802] R13: 0000000000000246 R14: ffffc90000603c90 R15: ffffea00043c5138
[  187.023802]  rcu_read_lock_sched_held+0x6c/0x80
[  187.023802]  free_pcp_prepare+0xff/0x130
[  187.023802]  free_unref_page_list+0x7a/0x240
[  187.023802]  release_pages+0x296/0x2e0
[  187.023802]  tlb_flush_mmu_free+0x31/0x50
[  187.023802]  unmap_page_range+0x62e/0x830
[  187.023802]  ? uprobe_clear_state+0x1f/0x80
[  187.023802]  unmap_vmas+0x3c/0x50
[  187.023802]  exit_mmap+0x91/0x160
[  187.023802]  mmput+0x41/0xf0
[  187.023802]  do_exit+0x38b/0xb50
[  187.023802]  ? remote_worker_main+0x3b8/0x590
[  187.023802]  ? handle_remote_futex_response+0x30/0x30
[  187.023802]  kthread+0xed/0x140
[  187.023802]  ? kthread_park+0x80/0x80
[  187.023802]  ret_from_fork+0x3a/0x50
[  213.285431] watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [bt:638]
[  213.285443] Modules linked in: msg_socket
[  213.285443] irq event stamp: 69790
[  213.285443] hardirqs last  enabled at (69789): [<ffffffff810019bd>] trace_hardirqs_on_thunk+0x1a/0x1c
[  213.285443] hardirqs last disabled at (69790): [<ffffffff810019d9>] trace_hardirqs_off_thunk+0x1a/0x1c
[  213.285443] softirqs last  enabled at (51794): [<ffffffff81800341>] __do_softirq+0x341/0x44c
[  213.285443] softirqs last disabled at (51785): [<ffffffff810651d6>] irq_exit+0xa6/0xe0
[  213.285443] CPU: 0 PID: 638 Comm: bt Tainted: G             L    4.20.0-rc7-popcorn+ #123
[  213.285443] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-20181126_142135-anatol 04/01/2014
[  213.285443] RIP: 0010:lock_is_held_type+0x62/0x80
[  213.285443] Code: 25 80 4d 01 00 41 c7 84 24 a4 07 00 00 01 00 00 00 89 ee 48 89 df e8 0d c6 ff ff 41 c7 84 24 a4 07 00 00 00 00 00 00 41 55 9d <5b> 5d 41 5c 41 5d c3 5b b8 01 00e
[  213.285443] RSP: 0018:ffffc90000603c00 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff13
[  213.285443] RAX: 0000000000000000 RBX: ffffffff81c4dea0 RCX: 0000000000000001
[  213.285443] RDX: 0000000000000000 RSI: 00000000ffffffff RDI: ffffffff81c4dea0
[  213.285443] RBP: 00000000ffffffff R08: 000000000000001c R09: 0000000000000001
[  213.285443] R10: 0000000000000000 R11: 0000000000000000 R12: ffff8881379b6180
[  213.285443] R13: 0000000000000246 R14: ffffc90000603c90 R15: ffffea00043c5e20
[  213.285443] FS:  0000000000000000(0000) GS:ffff88813fa00000(0000) knlGS:0000000000000000
[  213.285443] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  213.285443] CR2: 0000000000862008 CR3: 0000000001c12000 CR4: 00000000000006f0
[  213.285443] Call Trace:
[  213.285443]  rcu_read_lock_sched_held+0x6c/0x80
[  213.285443]  free_pcp_prepare+0xff/0x130
[  213.285443]  free_unref_page_list+0x7a/0x240
[  213.285443]  release_pages+0x296/0x2e0
[  213.285443]  tlb_flush_mmu_free+0x31/0x50
[  213.285443]  unmap_page_range+0x62e/0x830
[  213.285443]  ? uprobe_clear_state+0x1f/0x80
[  213.285443]  unmap_vmas+0x3c/0x50
[  213.285443]  exit_mmap+0x91/0x160
[  213.285443]  mmput+0x41/0xf0
[  213.285443]  do_exit+0x38b/0xb50
[  213.285443]  ? remote_worker_main+0x3b8/0x590
[  213.285443]  ? handle_remote_futex_response+0x30/0x30
[  213.285443]  kthread+0xed/0x140
[  213.285443]  ? kthread_park+0x80/0x80
[  213.285443]  ret_from_fork+0x3a/0x50
[  213.413451] watchdog: BUG: soft lockup - CPU#1 stuck for 22s! [bt:649]
[  213.413451] Modules linked in: msg_socket
[  213.413451] irq event stamp: 25044
[  213.413451] hardirqs last  enabled at (25043): [<ffffffff810019bd>] trace_hardirqs_on_thunk+0x1a/0x1c
[  213.413451] hardirqs last disabled at (25044): [<ffffffff810019d9>] trace_hardirqs_off_thunk+0x1a/0x1c
[  213.413451] softirqs last  enabled at (724): [<ffffffff81800341>] __do_softirq+0x341/0x44c
[  213.413451] softirqs last disabled at (697): [<ffffffff810651d6>] irq_exit+0xa6/0xe0
[  213.413451] CPU: 1 PID: 649 Comm: bt Tainted: G             L    4.20.0-rc7-popcorn+ #123
[  213.413451] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-20181126_142135-anatol 04/01/2014
[  213.413451] RIP: 0010:debug_check_no_locks_freed+0x7a/0x90
[  213.413451] Code: 8d 50 30 48 39 d7 73 2b 4c 39 e0 73 26 48 89 fd e8 4b fe 28 00 85 c0 74 0e 8b 05 51 c3 c3 00 85 c0 0f 84 c3 4d 00 00 41 56 9d <5b> 5d 41 5c 41 5d 41 5e c3 48 830
[  213.413451] RSP: 0018:ffffc900005fbc00 EFLAGS: 00000202 ORIG_RAX: ffffffffffffff13
[  213.413451] RAX: 0000000000000000 RBX: 6db6db6db6db6db7 RCX: 0000000000000001
[  213.413451] RDX: 0000000000000001 RSI: 0000000000001000 RDI: ffff888135d76000
[  213.413451] RBP: 6db6db6db6db6db7 R08: 000000000000001c R09: 0000000000000001
[  213.413451] R10: 0000000000000000 R11: 0000000000000000 R12: 0000160000000000
[  213.413451] R13: ffff888137a0c3c0 R14: 0000000000000202 R15: ffffea00043c71d0
[  213.413451] FS:  0000000000000000(0000) GS:ffff88813fc00000(0000) knlGS:0000000000000000
[  213.413451] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  213.413451] CR2: 00007ffff7dfffe8 CR3: 0000000001c12000 CR4: 00000000000006e0
[  213.413451] Call Trace:
[  213.413451]  free_pcp_prepare+0x75/0x130
[  213.413451]  free_unref_page_list+0x7a/0x240
[  213.413451]  release_pages+0x296/0x2e0
[  213.413451]  tlb_flush_mmu_free+0x31/0x50
[  213.413451]  unmap_page_range+0x62e/0x830
[  213.413451]  unmap_vmas+0x3c/0x50
[  213.413451]  exit_mmap+0x91/0x160
[  213.413451]  mmput+0x41/0xf0
[  213.413451]  do_exit+0x38b/0xb50
[  213.413451]  ? remote_worker_main+0x3b8/0x590
[  213.413451]  ? handle_remote_futex_response+0x30/0x30
[  213.413451]  kthread+0xed/0x140
[  213.413451]  ? kthread_park+0x80/0x80
[  213.413451]  ret_from_fork+0x3a/0x50
AHatnarf commented 4 years ago

Going to close this after 1f12a34a25a122f6b0e512b0326b5b199daf215c. Thanks!