AHatnarf closed 4 years ago
The same bug appeared when SSHing into a popcorn node: the connection was reset a few times as sshd was killed. This did not happen during shutdown; the system was idle. After three tries I was able to log in successfully and run mt.
[ T654] #PF: supervisor write access in kernel mode
[ T654] #PF: error_code(0x000b) - reserved bit violation
[ T654] PGD 2e01067 P4D 2e01067 PUD 2e04067 PMD 59420063 PTE 800fffffa6b12063
[ T654] Oops: 000b [#1] SMP NOPTI
[ T654] CPU: 0 PID: 654 Comm: sshd Tainted: G O 5.2.0-rc4-popcorn+ #32
[ T654] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014
[ T654] RIP: 0010:clear_page_orig+0x12/0x40
[ T654] Code: 00 b8 01 00 00 00 5b c3 b9 00 02 00 00 31 c0 f3 48 ab c3 0f 1f 44 00 00 31 c0 b9 40 00 00 00 66 0f 1f 84 00 00 00 00 00 ff c9 <48> 89 07 48 89 47 08 48 89 47 10 48 89 47 18 48 89 47 20 48 89 47
[ T654] RSP: 0018:ffffc900006cb988 EFLAGS: 00000216
[ T654] RAX: 0000000000000000 RBX: dead000000000100 RCX: 000000000000003f
[ T654] RDX: ffff8880596fc2c0 RSI: 00000000013893d8 RDI: ffff8880594ed000
[ T654] RBP: ffffc900006cbb00 R08: 0000000000000000 R09: 0000000001389410
[ T654] R10: ffff888000000000 R11: 6db6db6db6db6db7 R12: 0000000000000010
[ T654] R13: ffffffff81cd9f40 R14: ffffea00013893d8 R15: ffffea00013893d8
[ T654] FS: 00007ffff7fe5800(0000) GS:ffff88805ba00000(0000) knlGS:0000000000000000
[ T654] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ T654] CR2: ffff8880594ed000 CR3: 00000000596b6000 CR4: 00000000000006f0
[ T654] Call Trace:
[ T654] get_page_from_freelist+0x7dc/0x13d0
[ T654] ? sched_clock_local+0x12/0x80
[ T654] ? sched_clock_local+0x12/0x80
[ T654] ? sched_clock_local+0x12/0x80
[ T654] __alloc_pages_nodemask+0x178/0xfa0
[ T654] ? sched_clock_local+0x12/0x80
[ T654] ? sched_clock_local+0x12/0x80
[ T654] ? sched_clock_local+0x12/0x80
[ T654] ? sched_clock_local+0x12/0x80
[ T654] ? sched_clock_local+0x12/0x80
[ T654] pte_alloc_one+0x17/0x70
[ T654] __pte_alloc+0x16/0x110
[ T654] copy_page_range+0x71c/0x850
[ T654] ? sched_clock_local+0x12/0x80
[ T654] dup_mm.isra.7+0x36c/0x4d0
[ T654] copy_process.part.9+0x1bc0/0x1bf0
[ T654] _do_fork+0xe4/0x6f0
[ T654] ? ksys_mmap_pgoff+0xaf/0x130
[ T654] do_syscall_64+0x69/0x440
[ T654] ? trace_hardirqs_off_thunk+0x1a/0x1c
[ T654] entry_SYSCALL_64_after_hwframe+0x49/0xbe
[ T654] RIP: 0033:0x7ffff6336014
[ T654] Code: f7 d8 64 89 04 25 d4 02 00 00 64 4c 8b 0c 25 10 00 00 00 31 d2 4d 8d 91 d0 02 00 00 31 f6 bf 11 00 20 01 b8 38 00 00 00 0f 05 <48> 3d 00 f0 ff ff 0f 87 3e 01 00 00 85 c0 41 89 c5 0f 85 45 01 00
[ T654] RSP: 002b:00007fffffffe260 EFLAGS: 00000246 ORIG_RAX: 0000000000000038
[ T654] RAX: ffffffffffffffda RBX: 000000000000028e RCX: 00007ffff6336014
[ T654] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000001200011
[ T654] RBP: 00007fffffffe2c0 R08: 000000000000028e R09: 00007ffff7fe5800
[ T654] R10: 00007ffff7fe5ad0 R11: 0000000000000246 R12: 00007fffffffe260
[ T654] R13: 00007fffffffe280 R14: 000055555581c480 R15: 0000555555815e80
[ T654] Modules linked in: msg_socket(O)
[ T654] CR2: ffff8880594ed000
[ T654] ---[ end trace f72d8855a9e1315e ]---
[ T654] RIP: 0010:clear_page_orig+0x12/0x40
[ T654] Code: 00 b8 01 00 00 00 5b c3 b9 00 02 00 00 31 c0 f3 48 ab c3 0f 1f 44 00 00 31 c0 b9 40 00 00 00 66 0f 1f 84 00 00 00 00 00 ff c9 <48> 89 07 48 89 47 08 48 89 47 10 48 89 47 18 48 89 47 20 48 89 47
[ T654] RSP: 0018:ffffc900006cb988 EFLAGS: 00000216
[ T654] RAX: 0000000000000000 RBX: dead000000000100 RCX: 000000000000003f
[ T654] RDX: ffff8880596fc2c0 RSI: 00000000013893d8 RDI: ffff8880594ed000
[ T654] RBP: ffffc900006cbb00 R08: 0000000000000000 R09: 0000000001389410
[ T654] R10: ffff888000000000 R11: 6db6db6db6db6db7 R12: 0000000000000010
[ T654] R13: ffffffff81cd9f40 R14: ffffea00013893d8 R15: ffffea00013893d8
[ T654] FS: 00007ffff7fe5800(0000) GS:ffff88805ba00000(0000) knlGS:0000000000000000
[ T654] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ T654] CR2: ffff8880594ed000 CR3: 00000000596b6000 CR4: 00000000000006f0
[ T654] BUG: sleeping function called from invalid context at ./include/linux/percpu-rwsem.h:34
[ T654] in_atomic(): 1, irqs_disabled(): 1, pid: 654, name: sshd
[ T654] INFO: lockdep is turned off.
[ T654] irq event stamp: 10082
[ T654] hardirqs last enabled at (10081): [<ffffffff811d7044>] get_page_from_freelist+0xf4/0x13d0
[ T654] hardirqs last disabled at (10082): [<ffffffff8100196a>] trace_hardirqs_off_thunk+0x1a/0x1c
[ T654] softirqs last enabled at (9942): [<ffffffff818002ec>] __do_softirq+0x2ec/0x475
[ T654] softirqs last disabled at (9935): [<ffffffff8106856e>] irq_exit+0xbe/0xd0
[ T654] CPU: 0 PID: 654 Comm: sshd Tainted: G D O 5.2.0-rc4-popcorn+ #32
[ T654] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014
[ T654] Call Trace:
[ T654] dump_stack+0x67/0x9b
[ T654] ___might_sleep+0x149/0x230
[ T654] exit_signals+0x30/0x240
[ T654] do_exit+0xb0/0xc30
[ T654] ? trace_hardirqs_off_thunk+0x1a/0x1c
[ T654] rewind_stack_do_exit+0x17/0x20
[ T654] note: sshd[654] exited with preempt_count 1
[ T656] BUG: unable to handle page fault for address: ffff88805a320000
[ T656] #PF: supervisor write access in kernel mode
[ T656] #PF: error_code(0x000b) - reserved bit violation
[ T656] PGD 2e01067 P4D 2e01067 PUD 2e04067 PMD 5a17d063 PTE 800fffffa5cdf063
[ T656] Oops: 000b [#2] SMP NOPTI
[ T656] CPU: 0 PID: 656 Comm: sshd Tainted: G D W O 5.2.0-rc4-popcorn+ #32
[ T656] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014
[ T656] RIP: 0010:clear_page_orig+0x12/0x40
[ T656] Code: 00 b8 01 00 00 00 5b c3 b9 00 02 00 00 31 c0 f3 48 ab c3 0f 1f 44 00 00 31 c0 b9 40 00 00 00 66 0f 1f 84 00 00 00 00 00 ff c9 <48> 89 07 48 89 47 08 48 89 47 10 48 89 47 18 48 89 47 20 48 89 47
[ T656] RSP: 0018:ffffc900005ff9d0 EFLAGS: 00000216
[ T656] RAX: 0000000000000000 RBX: dead000000000100 RCX: 000000000000003f
[ T656] RDX: ffff888059676300 RSI: 00000000013baf00 RDI: ffff88805a320000
[ T656] RBP: ffffc900005ffb48 R08: 0000000000000000 R09: 00000000013baf38
[ T656] R10: ffff888000000000 R11: 6db6db6db6db6db7 R12: 0000000000000010
[ T656] R13: ffffffff81cd9f40 R14: ffffea00013baf00 R15: ffffea00013baf00
[ T656] FS: 00007ffff7fe5800(0000) GS:ffff88805ba00000(0000) knlGS:0000000000000000
[ T656] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ T656] CR2: ffff88805a320000 CR3: 000000005a28a000 CR4: 00000000000006f0
[ T656] Call Trace:
[ T656] get_page_from_freelist+0x7dc/0x13d0
[ T656] ? lock_acquire+0xa6/0x1a0
[ T656] ? fs_reclaim_acquire.part.26+0x5/0x30
[ T656] __alloc_pages_nodemask+0x178/0xfa0
[ T656] ? __pmd_alloc+0xa9/0x170
[ T656] pte_alloc_one+0x17/0x70
[ T656] __pte_alloc+0x16/0x110
[ T656] __handle_mm_fault+0x8f2/0xcc0
[ T656] __get_user_pages+0x215/0x790
[ T656] get_user_pages_remote+0x158/0x210
[ T656] copy_strings+0x16b/0x2e0
[ T656] ? kernel_read+0x2c/0x40
[ T656] copy_strings_kernel+0x2c/0x40
[ T656] __do_execve_file+0x6c2/0xa60
[ T656] __x64_sys_execve+0x26/0x30
[ T656] do_syscall_64+0x69/0x440
[ T656] ? trace_hardirqs_off_thunk+0x1a/0x1c
[ T656] entry_SYSCALL_64_after_hwframe+0x49/0xbe
[ T656] RIP: 0033:0x7ffff6336317
[ T656] Code: ff ff 76 df 89 c6 f7 de 64 41 89 32 eb d5 89 c6 f7 de 64 41 89 32 eb db 66 2e 0f 1f 84 00 00 00 00 00 90 b8 3b 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 02 f3 c3 48 8b 15 40 ab 2e 00 f7 d8 64 89 02
[ T656] RSP: 002b:00007fffffffe2c8 EFLAGS: 00000217 ORIG_RAX: 000000000000003b
[ T656] RAX: ffffffffffffffda RBX: 000055555581c590 RCX: 00007ffff6336317
[ T656] RDX: 000055555581e090 RSI: 0000555555823f20 RDI: 000055555581e050
[ T656] RBP: 000055555581c588 R08: 0000000000000007 R09: 0000000000000008
[ T656] R10: 00007fffffffde01 R11: 0000000000000217 R12: 0000000000000000
[ T656] R13: 0000000000000004 R14: 000055555581c480 R15: 0000555555815e80
[ T656] Modules linked in: msg_socket(O)
[ T656] CR2: ffff88805a320000
[ T656] ---[ end trace f72d8855a9e1315f ]---
[ T656] RIP: 0010:clear_page_orig+0x12/0x40
[ T656] Code: 00 b8 01 00 00 00 5b c3 b9 00 02 00 00 31 c0 f3 48 ab c3 0f 1f 44 00 00 31 c0 b9 40 00 00 00 66 0f 1f 84 00 00 00 00 00 ff c9 <48> 89 07 48 89 47 08 48 89 47 10 48 89 47 18 48 89 47 20 48 89 47
[ T656] RSP: 0018:ffffc900006cb988 EFLAGS: 00000216
[ T656] RAX: 0000000000000000 RBX: dead000000000100 RCX: 000000000000003f
[ T656] RDX: ffff8880596fc2c0 RSI: 00000000013893d8 RDI: ffff8880594ed000
[ T656] RBP: ffffc900006cbb00 R08: 0000000000000000 R09: 0000000001389410
[ T656] R10: ffff888000000000 R11: 6db6db6db6db6db7 R12: 0000000000000010
[ T656] R13: ffffffff81cd9f40 R14: ffffea00013893d8 R15: ffffea00013893d8
[ T656] FS: 00007ffff7fe5800(0000) GS:ffff88805ba00000(0000) knlGS:0000000000000000
[ T656] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ T656] CR2: ffff88805a320000 CR3: 000000005a28a000 CR4: 00000000000006f0
[ T656] BUG: sleeping function called from invalid context at ./include/linux/percpu-rwsem.h:34
[ T656] in_atomic(): 1, irqs_disabled(): 1, pid: 656, name: sshd
[ T656] INFO: lockdep is turned off.
[ T656] irq event stamp: 0
[ T656] hardirqs last enabled at (0): [<0000000000000000>] 0x0
[ T656] hardirqs last disabled at (0): [<ffffffff8105f078>] copy_process.part.9+0x4d8/0x1bf0
[ T656] softirqs last enabled at (0): [<ffffffff8105f078>] copy_process.part.9+0x4d8/0x1bf0
[ T656] softirqs last disabled at (0): [<0000000000000000>] 0x0
[ T656] CPU: 0 PID: 656 Comm: sshd Tainted: G D W O 5.2.0-rc4-popcorn+ #32
[ T656] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014
[ T656] Call Trace:
[ T656] dump_stack+0x67/0x9b
[ T656] ___might_sleep+0x149/0x230
[ T656] exit_signals+0x30/0x240
[ T656] ? __x64_sys_execve+0x26/0x30
[ T656] do_exit+0xb0/0xc30
[ T656] ? trace_hardirqs_off_thunk+0x1a/0x1c
[ T656] rewind_stack_do_exit+0x17/0x20
[ T656] note: sshd[656] exited with preempt_count 1
[ T657] BUG: unable to handle page fault for address: ffff88805a2fd000
[ T657] #PF: supervisor write access in kernel mode
[ T657] #PF: error_code(0x000b) - reserved bit violation
[ T657] PGD 2e01067 P4D 2e01067 PUD 2e04067 PMD 5a17d063 PTE 800fffffa5d02063
[ T657] Oops: 000b [#3] SMP NOPTI
[ T657] CPU: 0 PID: 657 Comm: sshd Tainted: G D W O 5.2.0-rc4-popcorn+ #32
[ T657] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014
[ T657] RIP: 0010:clear_page_orig+0x12/0x40
[ T657] Code: 00 b8 01 00 00 00 5b c3 b9 00 02 00 00 31 c0 f3 48 ab c3 0f 1f 44 00 00 31 c0 b9 40 00 00 00 66 0f 1f 84 00 00 00 00 00 ff c9 <48> 89 07 48 89 47 08 48 89 47 10 48 89 47 18 48 89 47 20 48 89 47
[ T657] RSP: 0018:ffffc900005ff9d8 EFLAGS: 00000216
[ T657] RAX: 0000000000000000 RBX: dead000000000100 RCX: 000000000000003f
[ T657] RDX: ffff888059676300 RSI: 00000000013ba758 RDI: ffff88805a2fd000
[ T657] RBP: ffffc900005ffb50 R08: 0000000000000000 R09: 00000000013ba790
[ T657] R10: ffff888000000000 R11: 6db6db6db6db6db7 R12: 0000000000000010
[ T657] R13: ffffffff81cd9f40 R14: ffffea00013ba758 R15: ffffea00013ba758
[ T657] FS: 00007ffff7fe5800(0000) GS:ffff88805ba00000(0000) knlGS:0000000000000000
[ T657] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ T657] CR2: ffff88805a2fd000 CR3: 000000005a28a000 CR4: 00000000000006f0
[ T657] Call Trace:
[ T657] get_page_from_freelist+0x7dc/0x13d0
[ T657] ? lock_acquire+0xa6/0x1a0
[ T657] ? fs_reclaim_acquire.part.26+0x5/0x30
[ T657] __alloc_pages_nodemask+0x178/0xfa0
[ T657] ? lock_acquire+0xa6/0x1a0
[ T657] ? find_get_entry+0x5/0x300
[ T657] __get_free_pages+0x11/0x50
[ T657] __pud_alloc+0x2a/0xc0
[ T657] __handle_mm_fault+0x2b7/0xcc0
[ T657] __get_user_pages+0x215/0x790
[ T657] get_user_pages_remote+0x158/0x210
[ T657] copy_strings+0x16b/0x2e0
[ T657] ? kernel_read+0x2c/0x40
[ T657] copy_strings_kernel+0x2c/0x40
[ T657] __do_execve_file+0x6c2/0xa60
[ T657] __x64_sys_execve+0x26/0x30
[ T657] do_syscall_64+0x69/0x440
[ T657] ? trace_hardirqs_off_thunk+0x1a/0x1c
[ T657] entry_SYSCALL_64_after_hwframe+0x49/0xbe
[ T657] RIP: 0033:0x7ffff6336317
[ T657] Code: ff ff 76 df 89 c6 f7 de 64 41 89 32 eb d5 89 c6 f7 de 64 41 89 32 eb db 66 2e 0f 1f 84 00 00 00 00 00 90 b8 3b 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 02 f3 c3 48 8b 15 40 ab 2e 00 f7 d8 64 89 02
[ T657] RSP: 002b:00007fffffffe2c8 EFLAGS: 00000217 ORIG_RAX: 000000000000003b
[ T657] RAX: ffffffffffffffda RBX: 000055555581c590 RCX: 00007ffff6336317
[ T657] RDX: 000055555581e090 RSI: 0000555555823f20 RDI: 000055555581e050
[ T657] RBP: 000055555581c588 R08: 0000000000000007 R09: 0000000000000008
[ T657] R10: 00007fffffffde01 R11: 0000000000000217 R12: 0000000000000000
[ T657] R13: 0000000000000004 R14: 000055555581c480 R15: 0000555555815e80
[ T657] Modules linked in: msg_socket(O)
[ T657] CR2: ffff88805a2fd000
[ T657] ---[ end trace f72d8855a9e13160 ]---
[ T657] RIP: 0010:clear_page_orig+0x12/0x40
[ T657] Code: 00 b8 01 00 00 00 5b c3 b9 00 02 00 00 31 c0 f3 48 ab c3 0f 1f 44 00 00 31 c0 b9 40 00 00 00 66 0f 1f 84 00 00 00 00 00 ff c9 <48> 89 07 48 89 47 08 48 89 47 10 48 89 47 18 48 89 47 20 48 89 47
[ T657] RSP: 0018:ffffc900006cb988 EFLAGS: 00000216
[ T657] RAX: 0000000000000000 RBX: dead000000000100 RCX: 000000000000003f
[ T657] RDX: ffff8880596fc2c0 RSI: 00000000013893d8 RDI: ffff8880594ed000
[ T657] RBP: ffffc900006cbb00 R08: 0000000000000000 R09: 0000000001389410
[ T657] R10: ffff888000000000 R11: 6db6db6db6db6db7 R12: 0000000000000010
[ T657] R13: ffffffff81cd9f40 R14: ffffea00013893d8 R15: ffffea00013893d8
[ T657] FS: 00007ffff7fe5800(0000) GS:ffff88805ba00000(0000) knlGS:0000000000000000
[ T657] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ T657] CR2: ffff88805a2fd000 CR3: 000000005a28a000 CR4: 00000000000006f0
[ T657] BUG: sleeping function called from invalid context at ./include/linux/percpu-rwsem.h:34
[ T657] in_atomic(): 1, irqs_disabled(): 1, pid: 657, name: sshd
[ T657] INFO: lockdep is turned off.
[ T657] irq event stamp: 0
[ T657] hardirqs last enabled at (0): [<0000000000000000>] 0x0
[ T657] hardirqs last disabled at (0): [<ffffffff8105f078>] copy_process.part.9+0x4d8/0x1bf0
[ T657] softirqs last enabled at (0): [<ffffffff8105f078>] copy_process.part.9+0x4d8/0x1bf0
[ T657] softirqs last disabled at (0): [<0000000000000000>] 0x0
[ T657] CPU: 0 PID: 657 Comm: sshd Tainted: G D W O 5.2.0-rc4-popcorn+ #32
[ T657] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014
[ T657] Call Trace:
[ T657] dump_stack+0x67/0x9b
[ T657] ___might_sleep+0x149/0x230
[ T657] exit_signals+0x30/0x240
[ T657] ? __x64_sys_execve+0x26/0x30
[ T657] do_exit+0xb0/0xc30
[ T657] ? trace_hardirqs_off_thunk+0x1a/0x1c
[ T657] rewind_stack_do_exit+0x17/0x20
[ T657] note: sshd[657] exited with preempt_count 1
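For anyone reading the traces above, the error_code(0x000b) value can be unpacked by hand. This is a minimal sketch, assuming the standard x86 page-fault error code layout; the flag names mirror the kernel's X86_PF_* constants, and the helper name decode_pf_error_code is made up for illustration.

```python
# Bit layout of the x86 page-fault error code (flag names follow the
# kernel's X86_PF_* constants; the helper itself is hypothetical).
PF_FLAGS = [
    (0x01, "PROT"),   # fault on a present page (0 would mean not-present)
    (0x02, "WRITE"),  # write access (0 would mean read)
    (0x04, "USER"),   # user-mode access (0 means supervisor)
    (0x08, "RSVD"),   # reserved bit set in a paging-structure entry
    (0x10, "INSTR"),  # instruction fetch
    (0x20, "PK"),     # protection-key violation
]

def decode_pf_error_code(code):
    """Return the names of the flags set in a page-fault error code."""
    return [name for bit, name in PF_FLAGS if code & bit]

print(decode_pf_error_code(0x000b))  # ['PROT', 'WRITE', 'RSVD']
```

USER is clear and WRITE and RSVD are set, which matches the "supervisor write access" and "reserved bit violation" lines in the log.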
That get_page_from_freelist error looks awfully similar to the faults I encountered with the various PTE workarounds for the L1TF patches. I believe this issue will go away with the patch I posted in Issue #55.
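The PTE values in the traces seem consistent with an L1TF-style inversion. Below is a rough check, assuming the faulting addresses lie in the kernel direct map at the default non-KASLR base 0xffff888000000000 (R10 in the dumps holds that value) and that PTE bits 12..51 carry the frame number. The helper names are hypothetical.

```python
PAGE_SHIFT = 12
PFN_MASK = (1 << 40) - 1               # PTE bits 12..51 hold the frame number
DIRECT_MAP_BASE = 0xffff888000000000   # assumption: default, non-KASLR base

def pte_pfn(pte):
    """Frame number as stored in the PTE."""
    return (pte >> PAGE_SHIFT) & PFN_MASK

def expected_pfn(vaddr):
    """Frame number a direct-map address should point at."""
    return (vaddr - DIRECT_MAP_BASE) >> PAGE_SHIFT

def is_inverted(vaddr, pte):
    """True if the stored frame number is the bitwise inverse of the expected one."""
    return pte_pfn(pte) == (~expected_pfn(vaddr) & PFN_MASK)

# (faulting address from CR2, PTE value from the "PGD ... PTE" line)
faults = [
    (0xffff8880594ed000, 0x800fffffa6b12063),
    (0xffff88805a320000, 0x800fffffa5cdf063),
    (0xffff88805a2fd000, 0x800fffffa5d02063),
]
for vaddr, pte in faults:
    print(hex(vaddr), hex(expected_pfn(vaddr)), hex(pte_pfn(pte)),
          "present" if pte & 1 else "not present",
          "inverted" if is_inverted(vaddr, pte) else "not inverted")
```

Under those assumptions, all three PTEs are marked present yet hold an inverted frame number. The inverted value sets high physical-address bits that would plausibly be reserved on this guest, which would account for the reserved bit violation.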
I can confirm that the second patch did work. Thanks! Interestingly, on some systems that support the L1TF workaround, adding boot arguments to disable the mitigations did not actually disable the workaround.
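One way to see what a node actually reports is to compare the boot parameters with the kernel's own vulnerability status. This is only a sketch: it assumes the boot arguments in question were l1tf= or mitigations= (the thread does not say which), and the helper names are made up. As far as I understand the kernel's L1TF admin guide, PTE inversion is always on for affected CPUs and l1tf=off mainly affects the hypervisor-side mitigations, which would explain the observation.

```python
from pathlib import Path

def read_or_none(path):
    """Return the stripped file contents, or None if it cannot be read."""
    try:
        return Path(path).read_text().strip()
    except OSError:
        return None

def mitigation_params(cmdline):
    """Pick l1tf= and mitigations= out of a kernel command line string."""
    wanted = ("l1tf", "mitigations")  # assumption: the arguments of interest
    params = {}
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        if sep and key in wanted:
            params[key] = value
    return params

cmdline = read_or_none("/proc/cmdline") or ""
status = read_or_none("/sys/devices/system/cpu/vulnerabilities/l1tf")
print("boot parameters:", mitigation_params(cmdline))
print("reported L1TF status:", status)
```

If the status line still mentions PTE inversion with the mitigations switched off on the command line, the boot argument did not cover that part of the workaround.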
I'll close these related issues so we can consolidate them into one issue (#84, #85, #89, #91 are fixed by reverting the patches). The Linux community probably wouldn't approve of us reverting the L1TF patches, so I'll start looking into solutions that might be more appealing to them.
After updating systemd (apt upgrade in the provided Ubuntu image), the kernel encountered a page fault. A similar page fault happened when the package was originally installed, somewhere along the line while systemd was restarting; in the future I'll see if I can get a trace for that.
This was with the latest merge branch kernel.