ssrg-vt / popcorn-kernel

Popcorn Linux kernel for distributed thread execution
Other
156 stars 23 forks source link

[merge branch] page fault issues during msg_layer rmmod #85

Closed bxatnarf closed 4 years ago

bxatnarf commented 5 years ago

The following error occurs when I rmmod the msg_layer even without performing process migration.

[ 1849.764725] BUG: unable to handle page fault for address: ffff8881398fb000
[ 1849.765595] #PF: supervisor write access in kernel mode
[ 1849.766159] #PF: error_code(0x000b) - reserved bit violation
[ 1849.766918] PGD 2e01067 P4D 2e01067 PUD 2e04067 PMD 1379c6063 PTE 800ffffec6704063
[ 1849.766918] Oops: 000b [#1] SMP NOPTI
[ 1849.766918] CPU: 1 PID: 644 Comm: bash Not tainted 5.2.0-rc4-popcorn+ #1
[ 1849.766918] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-20181124
[ 1849.766918] RIP: 0010:__tlb_remove_page_size+0x68/0x90
[ 1849.766918] Code: e0 5b 41 5c c3 83 7f 1c 13 74 33 31 f6 bf 00 28 00 00 e8 bb f7 00 00c
[ 1849.766918] RSP: 0018:ffffc9000064bb30 EFLAGS: 00000202
[ 1849.766918] RAX: ffff8881398fb000 RBX: ffffc9000064bc60 RCX: 000001fe00000000
[ 1849.766918] RDX: ffff888000000000 RSI: 00000000ffffffff RDI: 0000000000000246
[ 1849.766918] RBP: ffffea00044321f8 R08: 0000000000000000 R09: 0000000000000001
[ 1849.766918] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
[ 1849.766918] R13: 0000000000408000 R14: ffffc9000064bc60 R15: ffff888139a96000
[ 1849.766918] FS:  0000000000000000(0000) GS:ffff88813ba00000(0000) knlGS:000000000000000
[ 1849.766918] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1849.766918] CR2: ffff8881398fb000 CR3: 000000013999c000 CR4: 00000000000006e0
[ 1849.766918] Call Trace:
[ 1849.766918]  unmap_page_range+0x4c0/0x800
[ 1849.766918]  unmap_vmas+0x32/0x50
[ 1849.766918]  exit_mmap+0x8e/0x160
[ 1849.766918]  mmput+0x41/0xf0
[ 1849.766918]  flush_old_exec+0x31a/0x7e0
[ 1849.766918]  load_elf_binary+0x339/0x16b4
[ 1849.766918]  ? sched_clock_local+0x12/0x80
[ 1849.766918]  ? sched_clock_local+0x12/0x80
[ 1849.766918]  ? search_binary_handler.part.0+0x4b/0x220
[ 1849.766918]  search_binary_handler.part.0+0x5b/0x220
[ 1849.766918]  __do_execve_file+0x610/0xa70
[ 1849.766918]  __x64_sys_execve+0x27/0x30
[ 1849.766918]  do_syscall_64+0x69/0x440
[ 1849.766918]  ? trace_hardirqs_off_thunk+0x1a/0x1c
[ 1849.766918]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
[ 1849.766918] RIP: 0033:0x7ffff7498317
[ 1849.766918] Code: Bad RIP value.
[ 1849.766918] RSP: 002b:00007fffffffe858 EFLAGS: 00000202 ORIG_RAX: 000000000000003b
[ 1849.766918] RAX: ffffffffffffffda RBX: 0000000000731488 RCX: 00007ffff7498317
[ 1849.766918] RDX: 00000000007f6008 RSI: 0000000000731588 RDI: 0000000000731488
[ 1849.766918] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000003
[ 1849.766918] R10: 00007fffffffe620 R11: 0000000000000202 R12: 0000000000731488
[ 1849.766918] R13: 0000000000731588 R14: 00000000007f6008 R15: 00000000007313e8
[ 1849.766918] Modules linked in: msg_socket
[ 1849.766918] CR2: ffff8881398fb000
[ 1849.766918] ---[ end trace 41435ef31e4147f0 ]---
[ 1849.766918] RIP: 0010:__tlb_remove_page_size+0x68/0x90
[ 1849.766918] Code: e0 5b 41 5c c3 83 7f 1c 13 74 33 31 f6 bf 00 28 00 00 e8 bb f7 00 00c
[ 1849.766918] RSP: 0018:ffffc9000064bb30 EFLAGS: 00000202
[ 1849.766918] RAX: ffff8881398fb000 RBX: ffffc9000064bc60 RCX: 000001fe00000000
[ 1849.766918] RDX: ffff888000000000 RSI: 00000000ffffffff RDI: 0000000000000246
[ 1849.766918] RBP: ffffea00044321f8 R08: 0000000000000000 R09: 0000000000000001
[ 1849.766918] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
[ 1849.766918] R13: 0000000000408000 R14: ffffc9000064bc60 R15: ffff888139a96000
[ 1849.766918] FS:  0000000000000000(0000) GS:ffff88813ba00000(0000) knlGS:000000000000000
[ 1849.766918] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1849.766918] CR2: 00007ffff74982ed CR3: 000000013999c000 CR4: 00000000000006e0
[ 1849.766918] BUG: sleeping function called from invalid context at include/linux/percpu4
[ 1849.766918] in_atomic(): 1, irqs_disabled(): 1, pid: 644, name: bash
[ 1849.766918] INFO: lockdep is turned off.
[ 1849.766918] irq event stamp: 458
[ 1849.766918] hardirqs last  enabled at (457): [<ffffffff811d4dec>] get_page_from_freeli0
[ 1849.766918] hardirqs last disabled at (458): [<ffffffff81001a1c>] trace_hardirqs_off_tc
[ 1849.766918] softirqs last  enabled at (400): [<ffffffff8180032e>] __do_softirq+0x32e/09
[ 1849.766918] softirqs last disabled at (387): [<ffffffff81068377>] irq_exit+0x97/0xd0
[ 1849.766918] CPU: 1 PID: 644 Comm: bash Tainted: G      D           5.2.0-rc4-popcorn+ 1
[ 1849.766918] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-20181124
[ 1849.766918] Call Trace:
[ 1849.766918]  dump_stack+0x67/0x90
[ 1849.766918]  ___might_sleep.cold+0x9f/0xaf
[ 1849.766918]  exit_signals+0x1c/0x200
[ 1849.766918]  do_exit+0xb0/0xba0
[ 1849.766918]  ? trace_hardirqs_off_thunk+0x1a/0x1c
[ 1849.766918]  rewind_stack_do_exit+0x17/0x20
[ 1849.795136] note: bash[644] exited with preempt_count 1
AHatnarf commented 4 years ago

When removing the __exit annotation of exit_kmsg_sock() in socket.c I was able to reach the print statement of the successful removal of msg_sock, but was met with another oops shortly after (shown below).

This new oops is similar to one that I saw while the module was inserted, and the terminal was idle (i.e. no popcorn functionality used). I've added that trace below the first.

I'll play around a bit more and follow through with the other traces and see if they are related, or if allowing the exit function to be called from outside of the module init is the correct solution for this bug.

popcorn@x86:~$ sudo rmmod msg_socket
[   54.908815][  T614] BUG: unable to handle page fault for address: ffff888058b38000
[   54.909082][  T614] #PF: supervisor write access in kernel mode
[   54.909283][  T614] #PF: error_code(0x000b) - reserved bit violation
[   54.909460][  T614] PGD 2e01067 P4D 2e01067 PUD 2e04067 PMD 5a35d063 PTE 800fffffa74c7063
[   54.909696][  T614] Oops: 000b [#2] SMP NOPTI
[   54.909852][  T614] CPU: 1 PID: 614 Comm: bash Tainted: G      D W  O      5.2.0-rc4-popcorn+ #15
[   54.910087][  T614] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014
[   54.910360][  T614] RIP: 0010:clear_page_orig+0x12/0x40
[   54.910513][  T614] Code: 00 b8 01 00 00 00 5b c3 b9 00 02 00 00 31 c0 f3 48 ab c3 0f 1f 44 00 00 31 c0 b9 40 00 00 00 66 0f 1f 84 00 00 00 00 00 ff c9 <48> 89 07 48 89 47 08 48 89 47 10 48 89 47 18 48 89 47 20 48 89 47
[   54.911014][  T614] RSP: 0018:ffffc90000687988 EFLAGS: 00000216
[   54.911199][  T614] RAX: 0000000000000000 RBX: dead000000000100 RCX: 000000000000003f
[   54.911401][  T614] RDX: ffff88805a1124c0 RSI: 0000000001367440 RDI: ffff888058b38000
[   54.911611][  T614] RBP: ffffc90000687b00 R08: 0000000000000000 R09: 0000000001367478
[   54.911990][  T614] R10: ffff888000000000 R11: 6db6db6db6db6db7 R12: 0000000000000010
[   54.912254][  T614] R13: ffffffff81cd9f40 R14: ffffea0001367440 R15: ffffea0001367440
[   54.912254][  T614] FS:  00007ffff7fed700(0000) GS:ffff88805be00000(0000) knlGS:0000000000000000
[   54.912254][  T614] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   54.912254][  T614] CR2: ffff888058b38000 CR3: 0000000058a26000 CR4: 00000000000006e0
[   54.912254][  T614] Call Trace:
[   54.912254][  T614]  get_page_from_freelist+0x7dc/0x13d0
[   54.912254][  T614]  ? lock_acquire+0xa6/0x1a0
[   54.912254][  T614]  ? fs_reclaim_acquire.part.26+0x5/0x30
[   54.912254][  T614]  __alloc_pages_nodemask+0x178/0xfa0
[   54.912254][  T614]  ? lock_acquire+0xa6/0x1a0
[   54.912254][  T614]  pte_alloc_one+0x17/0x70
[   54.912254][  T614]  __pte_alloc+0x16/0x110
[   54.912254][  T614]  copy_page_range+0x71c/0x850
[   54.912254][  T614]  ? lock_acquire+0xa6/0x1a0
[   54.912254][  T614]  dup_mm.isra.7+0x36c/0x4d0
[   54.912254][  T614]  copy_process.part.9+0x1bc0/0x1bf0
[   54.912254][  T614]  _do_fork+0xe4/0x6f0
[   54.912254][  T614]  ? __fd_install+0xc1/0x280
[   54.912254][  T614]  ? do_pipe2+0x7c/0xb0
[   54.912254][  T614]  do_syscall_64+0x69/0x440
[   54.912254][  T614]  ? trace_hardirqs_off_thunk+0x1a/0x1c
[   54.912254][  T614]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
[   54.912254][  T614] RIP: 0033:0x7ffff7498014
[   54.912254][  T614] Code: f7 d8 64 89 04 25 d4 02 00 00 64 4c 8b 0c 25 10 00 00 00 31 d2 4d 8d 91 d0 02 00 00 31 f6 bf 11 00 20 01 b8 38 00 00 00 0f 05 <48> 3d 00 f0 ff ff 0f 87 3e 01 00 00 85 c0 41 89 c5 0f 85 45 01 00
[   54.912254][  T614] RSP: 002b:00007fffffffe7d0 EFLAGS: 00000246 ORIG_RAX: 0000000000000038
[   54.912254][  T614] RAX: ffffffffffffffda RBX: 0000000000000266 RCX: 00007ffff7498014
[   54.912254][  T614] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000001200011
[   54.912254][  T614] RBP: 00007fffffffe7f0 R08: 0000000000000000 R09: 00007ffff7fed700
[   54.912254][  T614] R10: 00007ffff7fed9d0 R11: 0000000000000246 R12: 0000000000000000
[   54.912254][  T614] R13: 00000000007e9188 R14: 0000000000000000 R15: 00000000007e9228
[   54.912254][  T614] Modules linked in: msg_socket(O)
[   54.912254][  T614] CR2: ffff888058b38000
[   54.912254][  T614] ---[ end trace 7297bc0ea8aaa9fc ]---
[   54.912254][  T614] RIP: 0010:__tlb_remove_page_size+0x62/0x90
[   54.912254][  T614] Code: 5d c3 83 7f 1c 13 74 3a 31 f6 48 89 fb bf 00 28 00 00 e8 a1 f9 00 00 48 85 c0 74 26 48 b9 00 00 00 00 fe 01 00 00 83 43 1c 01 <48> c7 00 00 00 00 00 48 89 48 08 48 8b 53 20 48 89 02 48 89 43 20
[   54.912254][  T614] RSP: 0018:ffffc9000068fc90 EFLAGS: 00000202
[   54.912254][  T614] RAX: ffff888058b30000 RBX: ffffc9000068fda0 RCX: 000001fe00000000
[   54.912254][  T614] RDX: 0000000058b30000 RSI: 0000000000000000 RDI: 0000000000000246
[   54.912254][  T614] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000001
[   54.912254][  T614] R10: 0000000000000000 R11: 0000000000000000 R12: ffffc9000068fda0
[   54.912254][  T614] R13: 000055555555b000 R14: ffff888058aeead8 R15: ffff88805a081da8
[   54.912254][  T614] FS:  00007ffff7fed700(0000) GS:ffff88805be00000(0000) knlGS:0000000000000000
[   54.912254][  T614] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   54.912254][  T614] CR2: ffff888058b38000 CR3: 0000000058a26000 CR4: 00000000000006e0
[   54.912254][  T614] BUG: sleeping function called from invalid context at ./include/linux/percpu-rwsem.h:34
[   54.912254][  T614] in_atomic(): 1, irqs_disabled(): 1, pid: 614, name: bash
[   54.912254][  T614] INFO: lockdep is turned off.
[   54.912254][  T614] irq event stamp: 49678
[   54.912254][  T614] hardirqs last  enabled at (49677): [<ffffffff815505f9>] _raw_spin_unlock_irq+0x29/0x40
[   54.912254][  T614] hardirqs last disabled at (49678): [<ffffffff81549be5>] __schedule+0xb5/0x810
[   54.912254][  T614] softirqs last  enabled at (47708): [<ffffffff818002ec>] __do_softirq+0x2ec/0x475
[   54.912254][  T614] softirqs last disabled at (47681): [<ffffffff8106856e>] irq_exit+0xbe/0xd0
[   54.912254][  T614] CPU: 1 PID: 614 Comm: bash Tainted: G      D W  O      5.2.0-rc4-popcorn+ #15
[   54.912254][  T614] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014
[   54.912254][  T614] Call Trace:
[   54.912254][  T614]  dump_stack+0x67/0x9b
[   54.912254][  T614]  ___might_sleep+0x149/0x230
[   54.912254][  T614]  exit_signals+0x30/0x240
[   54.912254][  T614]  do_exit+0xb0/0xc30
[   54.912254][  T614]  ? trace_hardirqs_off_thunk+0x1a/0x1c
[   54.912254][  T614]  rewind_stack_do_exit+0x17/0x20
[   54.924324][  T614] note: bash[614] exited with preempt_count 1
[   57.658354][  T650] BUG: unable to handle page fault for address: ffff888058a67000
[   57.659526][  T650] #PF: supervisor write access in kernel mode
[   57.659526][  T650] #PF: error_code(0x000b) - reserved bit violation
[   57.659526][  T650] PGD 2e01067 P4D 2e01067 PUD 2e04067 PMD 58a30063 PTE 800fffffa7598063
[   57.659526][  T650] Oops: 000b [#3] SMP NOPTI
[   57.659526][  T650] CPU: 1 PID: 650 Comm: bash Tainted: G      D W  O      5.2.0-rc4-popcorn+ #15
[   57.659526][  T650] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014
[   57.659526][  T650] RIP: 0010:clear_page_orig+0x12/0x40
[   57.659526][  T650] Code: 00 b8 01 00 00 00 5b c3 b9 00 02 00 00 31 c0 f3 48 ab c3 0f 1f 44 00 00 31 c0 b9 40 00 00 00 66 0f 1f 84 00 00 00 00 00 ff c9 <48> 89 07 48 89 47 08 48 89 47 10 48 89 47 18 48 89 47 20 48 89 47
[   57.659526][  T650] RSP: 0000:ffffc900006dfbf8 EFLAGS: 00000216
[   57.659526][  T650] RAX: 0000000000000000 RBX: dead000000000100 RCX: 000000000000003f
[   57.659526][  T650] RDX: ffff888058838140 RSI: 0000000001364688 RDI: ffff888058a67000
[   57.659526][  T650] RBP: ffffc900006dfd70 R08: 0000000000000000 R09: 00000000013646c0
[   57.659526][  T650] R10: ffff888000000000 R11: 6db6db6db6db6db7 R12: 0000000000000010
[   57.659526][  T650] R13: ffffffff81cd9f40 R14: ffffea0001364688 R15: ffffea0001364688
[   57.659526][  T650] FS:  00007ffff7fed700(0000) GS:ffff88805be00000(0000) knlGS:0000000000000000
[   57.659526][  T650] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   57.659526][  T650] CR2: ffff888058a67000 CR3: 000000005a3b4000 CR4: 00000000000006e0
[   57.659526][  T650] Call Trace:
[   57.659526][  T650]  get_page_from_freelist+0x7dc/0x13d0
[   57.659526][  T650]  ? lock_acquire+0xa6/0x1a0
[   57.659526][  T650]  ? fs_reclaim_acquire.part.26+0x5/0x30
[   57.659526][  T650]  __alloc_pages_nodemask+0x178/0xfa0
[   57.659526][  T650]  ? __handle_mm_fault+0x25b/0xcc0
[   57.659526][  T650]  pte_alloc_one+0x17/0x70
[   57.659526][  T650]  do_fault+0x503/0x530
[   57.659526][  T650]  __handle_mm_fault+0x3b3/0xcc0
[   57.659526][  T650]  __do_page_fault+0x2c6/0x5b0
[   57.659526][  T650]  ? page_fault+0x8/0x30
[   57.659526][  T650]  page_fault+0x1e/0x30
[   57.659526][  T650] RIP: 0033:0x7ffff7498014
[   57.659526][  T650] Code: Bad RIP value.
[   57.659526][  T650] RSP: 002b:00007fffffffbc00 EFLAGS: 00000246
[   57.659526][  T650] RAX: 0000000000000000 RBX: 0000000000000265 RCX: 00007ffff7498014
[   57.659526][  T650] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000001200011
[   57.659526][  T650] RBP: 00007fffffffbc20 R08: 0000000000000000 R09: 00007ffff7fed700
[   57.659526][  T650] R10: 00007ffff7fed9d0 R11: 0000000000000246 R12: 0000000000000000
[   57.659526][  T650] R13: 0000000000000000 R14: 000000000089d508 R15: 00007fffffffc02c
[   57.659526][  T650] Modules linked in: msg_socket(O)
[   57.659526][  T650] CR2: ffff888058a67000
[   57.659526][  T650] ---[ end trace 373066c0edf25053 ]---
[   57.659526][  T650] RIP: 0010:clear_page_orig+0x12/0x40
[   57.659526][  T650] Code: 00 b8 01 00 00 00 5b c3 b9 00 02 00 00 31 c0 f3 48 ab c3 0f 1f 44 00 00 31 c0 b9 40 00 00 00 66 0f 1f 84 00 00 00 00 00 ff c9 <48> 89 07 48 89 47 08 48 89 47 10 48 89 47 18 48 89 47 20 48 89 47
[   57.659526][  T650] RSP: 0000:ffffc900006a7bf8 EFLAGS: 00000216
[   57.659526][  T650] RAX: 0000000000000000 RBX: dead000000000100 RCX: 000000000000003f
[   57.659526][  T650] RDX: ffff8880589dc3c0 RSI: 00000000013646f8 RDI: ffff888058a69000
[   57.659526][  T650] RBP: ffffc900006a7d70 R08: 0000000000000000 R09: 0000000001364730
[   57.659526][  T650] R10: ffff888000000000 R11: 6db6db6db6db6db7 R12: 0000000000000010
[   57.659526][  T650] R13: ffffffff81cd9f40 R14: ffffea00013646f8 R15: ffffea00013646f8
[   57.659526][  T650] FS:  00007ffff7fed700(0000) GS:ffff88805be00000(0000) knlGS:0000000000000000
[   57.659526][  T650] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   57.659526][  T650] CR2: 00007ffff7497fea CR3: 000000005a3b4000 CR4: 00000000000006e0
[   57.659526][  T650] BUG: sleeping function called from invalid context at ./include/linux/percpu-rwsem.h:34
[   57.659526][  T650] in_atomic(): 1, irqs_disabled(): 1, pid: 650, name: bash
[   57.659526][  T650] INFO: lockdep is turned off.
[   57.659526][  T650] irq event stamp: 0
[   57.659526][  T650] hardirqs last  enabled at (0): [<0000000000000000>] 0x0
[   57.659526][  T650] hardirqs last disabled at (0): [<ffffffff8105f078>] copy_process.part.9+0x4d8/0x1bf0
[   57.659526][  T650] softirqs last  enabled at (0): [<ffffffff8105f078>] copy_process.part.9+0x4d8/0x1bf0
[   57.659526][  T650] softirqs last disabled at (0): [<0000000000000000>] 0x0
[   57.659526][  T650] CPU: 1 PID: 650 Comm: bash Tainted: G      D W  O      5.2.0-rc4-popcorn+ #15
[   57.659526][  T650] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014
[   57.659526][  T650] Call Trace:
[   57.659526][  T650]  dump_stack+0x67/0x9b
[   57.659526][  T650]  ___might_sleep+0x149/0x230
[   57.659526][  T650]  exit_signals+0x30/0x240
[   57.659526][  T650]  do_exit+0xb0/0xc30
[   57.659526][  T650]  rewind_stack_do_exit+0x17/0x20
[   57.679537][  T650] note: bash[650] exited with preempt_count 1
AHatnarf commented 4 years ago

After some additional testing, I think that this is a probable fix for this issue (the rmmod fault), I'll re-open if similar issues persist. Pretty sure the other oops I would see afterwards are unrelated to this specifically, rather printk and other functionality.

Will do a bit deeper investigation and see about this and random pagefaults while msg_socket is loaded.

AHatnarf commented 4 years ago

Closing for now in relation to the referenced message in #91.