Closed bxatnarf closed 5 years ago
After enabling semaphore/lock debugging in the kernel, here is part of its log:
[ 845.357480] ============================================
[ 845.357614] WARNING: possible recursive locking detected
[ 845.357803] 4.19.0-rc5-popcorn+ #2 Not tainted
[ 845.357939] --------------------------------------------
[ 845.358071] mt/694 is trying to acquire lock:
[ 845.358209] 00000000f8a5f27c (&mm->mmap_sem){++++}, at: vm_munmap+0x5c/0xc0
[ 845.358568]
[ 845.358568] but task is already holding lock:
[ 845.358723] 00000000f8a5f27c (&mm->mmap_sem){++++}, at: vm_munmap+0xa5/0xc0
[ 845.358932]
[ 845.358932] other info that might help us debug this:
[ 845.359111] Possible unsafe locking scenario:
[ 845.359111]
[ 845.359251] CPU0
[ 845.359319] ----
[ 845.359386] lock(&mm->mmap_sem);
[ 845.359396] lock(&mm->mmap_sem);
[ 845.359396]
[ 845.359396] *** DEADLOCK ***
[ 845.359396]
[ 845.359396] May be due to missing lock nesting notation
[ 845.359396]
[ 845.359396] 1 lock held by mt/694:
[ 845.359396] #0: 00000000f8a5f27c (&mm->mmap_sem){++++}, at: vm_munmap+0xa5/0xc0
[ 845.359396]
[ 845.359396] stack backtrace:
[ 845.359396] CPU: 0 PID: 694 Comm: mt Not tainted 4.19.0-rc5-popcorn+ #2
[ 845.359396] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.0-20181104
[ 845.359396] Call Trace:
[ 845.359396] dump_stack+0x67/0x90
[ 845.359396] __lock_acquire.cold.43+0x1d1/0x2f0
[ 845.359396] ? sched_clock_local+0x12/0x80
[ 845.359396] ? wait_for_common_io.constprop.2+0x12f/0x160
[ 845.359396] lock_acquire+0xb8/0x180
[ 845.359396] ? vm_munmap+0x5c/0xc0
[ 845.359396] down_write_killable+0x4d/0xd0
[ 845.359396] ? vm_munmap+0x5c/0xc0
[ 845.359396] vm_munmap+0x5c/0xc0
[ 845.359396] vma_server_munmap_origin+0x9b/0x110
[ 845.359396] __x64_sys_munmap+0x6f/0x80
[ 845.359396] do_syscall_64+0x69/0x400
[ 845.359396] ? trace_hardirqs_off_thunk+0x1a/0x1c
[ 845.359396] entry_SYSCALL_64_after_hwframe+0x49/0xbe
[ 845.359396] RIP: 0033:0x44408b
[ 845.359396] Code: c7 c2 b8 ff ff ff f7 d8 64 89 02 48 c7 c0 ff ff ff ff e9 77 ff ff ff8
[ 845.359396] RSP: 002b:00007fffffffe1b8 EFLAGS: 00000202 ORIG_RAX: 000000000000000b
[ 845.359396] RAX: ffffffffffffffda RBX: 00007ffff7ff99c0 RCX: 000000000044408b
[ 845.359396] RDX: 0000000000000000 RSI: 0000000000801000 RDI: 00007ffff77f9000
[ 845.359396] RBP: 00007ffff77f89c0 R08: 0000000000000001 R09: 0000000000000000
[ 845.359396] R10: 0000000000000000 R11: 0000000000000202 R12: 00000000004c20f0
[ 845.359396] R13: 0000000002800000 R14: 0000000000000000 R15: 0000000000000000
Here is more information regarding the lock contention identified in the above comment:
The addresses of the instructions in question:
(gdb) print/x vm_munmap+0x5c
$6 = 0xffffffff811b63fc
(gdb) print/x vm_munmap+0xa5
$7 = 0xffffffff811b6445
(gdb)
This corresponds to the following lines of code:
> addr2line -e vmlinux 0xffffffff811b63fc
/home/user/popcorn/linux-x86/mm/mmap.c:2835
> addr2line -e vmlinux 0xffffffff811b6445
/home/user/popcorn/linux-x86/mm/mmap.c:2830
Which in my build is:
#ifdef CONFIG_POPCORN
if (distributed_process(current)) {
while (!down_write_trylock(&mm->mmap_sem))
schedule();
}
#endif
if (down_write_killable(&mm->mmap_sem))
return -EINTR;
Looks like some down_write
s got lost in the merge. (Argh.)
Looks like part of this issue arose from changes made in upstream Linux's commit https://github.com/torvalds/linux/commit/dc0ef0df7b6a90892ec41933212ac701152a254c where instances of down_write
were replaced by down_write_killable
. I am not sure if this is the correct corse of action in our case, but I'll use down_write_killable
.
This seems to be related to/the root cause of https://github.com/ssrg-vt/popcorn-kernel/issues/63 Branch: next Platforms: Two x86 instances Binary: popcorn-kernel-lib's
mt
mt
does not seem to ever exit. This may be due to incorrect usage its task struct's mm.mmap_sem. After the process hangs (in this case pid 671), this semaphore looks as follows: