Closed by bxatnarf 4 years ago
If popcorn's page_server_zap_pte hook is commented out from zap_pte_range (i.e., this line), then the remote host doesn't get into deadlock. Instead, the remote host encounters a bad pte during its call to unmap_page_range, and the following stack gets dumped:
[ 88.206060] addr:(____ptrval____) vm_flags:00100173 anon_vma:(____ptrval____) mapping: (null) index:ffffdc32f
[ 88.206630] file: (null) fault: (null) mmap: (null) readpage: (null)
[ 88.206936] CPU: 0 PID: 261 Comm: stack Tainted: G B W 4.20.0-rc7-popcorn+ #96
[ 88.207170] Hardware name: linux,dummy-virt (DT)
[ 88.207309] Call trace:
[ 88.207403] dump_backtrace+0x0/0x1c8
[ 88.207527] show_stack+0x24/0x30
[ 88.207634] dump_stack+0xbc/0xf4
[ 88.207750] print_bad_pte+0x18c/0x1e0
[ 88.207886] unmap_page_range+0x224/0x968
[ 88.208016] unmap_single_vma+0x8c/0xa0
[ 88.208138] unmap_vmas+0x60/0x78
[ 88.208247] exit_mmap+0xc8/0x170
[ 88.208361] mmput+0x74/0x118
[ 88.208466] do_exit+0x3e0/0xaf0
[ 88.208578] kthread+0xf0/0x138
[ 88.208685] ret_from_fork+0x10/0x1c
According to Ho-Ren, this could also be an issue with get_normal_page/vm_normal_page, which manipulates the lru cache. Given that get_normal_page has changed quite a bit since the vanilla popcorn version, there is a real possibility here; see this git blame of vm_normal_page, which shows the changes since linux 4.4.
Going to close this after 1f12a34a25a122f6b0e512b0326b5b199daf215c. Thanks!
Arch: migration between x86-64 hosts and migration between arm64 hosts
Branch: merge, commit 4b4f4339d8513b0f1dcda540da4e9a7fc28c433e (latest as of the creation of this bug report)
Example tested: bt
You may have to run it multiple times to trigger the deadlock. In the directory from which you execute bt, you must create a file called inputbt.data that contains:
You can grab prebuilt bt binaries for arm and x86. If you want to build your own, you will first have to rebuild the migration library (popcorn-kernel-lib) with this patch.
Kernel config:
kernel log on remote (x86):
x86 backtrace on remote during deadlock:
There are no relevant messages in the kernel log on arm.
arm remote's backtrace while stuck in deadlock:
The deadlock is caused by a loop in a page->lru linked list, which keeps the kernel from ever finishing releasing pages on exit. I do not know the cause of this loop.
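For anyone trying to confirm this kind of corruption, a walk that never returns to the list head can be detected with Floyd's tortoise-and-hare technique. The following is only a minimal userspace sketch, not code from the popcorn tree: struct list_head is redeclared locally as a stand-in for the kernel type, and list_next_chain_is_broken is a hypothetical helper name, not an existing kernel function.

```c
#include <stddef.h>

/* Local stand-in for the kernel's struct list_head (assumption: same layout idea). */
struct list_head {
    struct list_head *next, *prev;
};

/*
 * Hypothetical helper: follow ->next starting at head.
 * Returns 0 if the walk comes back to head (well-formed circular list),
 * and 1 if it hits NULL or gets trapped in a loop that does not contain head.
 */
int list_next_chain_is_broken(const struct list_head *head)
{
    const struct list_head *slow = head->next;
    const struct list_head *fast = head->next;

    while (fast != head) {
        if (fast == NULL)
            return 1;
        fast = fast->next;      /* first step of the fast pointer */
        if (fast == head)
            return 0;
        if (fast == NULL)
            return 1;
        fast = fast->next;      /* second step of the fast pointer */
        slow = slow->next;      /* single step of the slow pointer */
        if (fast == slow)
            return 1;           /* pointers met before reaching head: stray loop */
    }
    return 0;
}
```

In a well-formed list the fast pointer always reaches head before it can catch up with the slow one, so the function only reports a problem when the next chain can no longer get back to head, which is the situation that would make a release loop spin forever.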