kontainapp / km

Kontain Monitor
http://kontain.app
Apache License 2.0
28 stars 5 forks source link

popen test hangs on a lock forever #1599

Open sv641 opened 2 years ago

sv641 commented 2 years ago

popen test hangs occasionally with kvm running on ubuntu 20.04 guest running on windows 10 hyper-v host

no core file is generated

test command is popen_test.kmd /etc/group /tmp/f13839098 /tmp/f23839098


(gdb) bt
#0  0x00007f71463142fb in __lll_lock_wait_private ()
#1  0x00007f714632f876 in fputs ()
#2  0x00007f7146304ce3 in __km_trace (errnum=errnum@entry=0,
    function=function@entry=0x7f71463dfa00 <__FUNCTION__.9508> "km_dofork", linenumber=linenumber@entry=363,
    fmt=fmt@entry=0x7f71463df94f "child: after fork/clone") at km/km_trace.c:137
#3  0x00007f71462ea05c in km_dofork (in_child=0x0) at km/km_fork.c:363
#4  0x00007f71462d70d1 in main (argc=<optimized out>, argv=<optimized out>) at km/km_main.c:641
(gdb) info threads
  Id   Target Id            Frame
* 1    process 3839205 "km" 0x00007f71463142fb in __lll_lock_wait_private ()
(gdb) p/x km_fork_state
$1 = {mutex = {__data = {__lock = 0x1, __count = 0x0, __owner = 0x3a94e2, __nusers = 0x1, __kind = 0x0, __spins = 0x0,
      __elision = 0x0, __list = {__prev = 0x0, __next = 0x0}}, __size = {0x1, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0xe2,
      0x94, 0x3a, 0x0, 0x1, 0x0 <repeats 27 times>}, __align = 0x1}, cond = {__data = {{__wseq = 0x0, __wseq32 = {
          __low = 0x0, __high = 0x0}}, {__g1_start = 0x0, __g1_start32 = {__low = 0x0, __high = 0x0}}, __g_refs = {
        0x0, 0x0}, __g_size = {0x0, 0x0}, __g1_orig_size = 0x0, __wrefs = 0x0, __g_signals = {0x0, 0x0}}, __size = {
      0x0 <repeats 48 times>}, __align = 0x0}, is_clone = 0x1, fork_in_progress = 0x1, arg = 0x7fffdfc030,
  km_parent_pid = 0x3a94e2, regs = {rax = 0x0, rbx = 0x7fffdfc140, rcx = 0x7fffdfc090, rdx = 0x8038,
    rsi = 0x7fffdfd690, rdi = 0x255420, rsp = 0x7fffdfd658, rbp = 0x7fffdfd6e4, r8 = 0x0, r9 = 0x7fffdfde50,
    r10 = 0x0, r11 = 0x0, r12 = 0x7fffdfc098, r13 = 0x7fffdfc090, r14 = 0x7fffdfd710, r15 = 0x263795, rip = 0x25cef1,
    rflags = 0x6}, stack_top = 0x7fffdfd690, guest_thr = 0x2b0fe8, sigaltstack = {ss_sp = 0x0, ss_flags = 0x0,
    ss_size = 0x0}, sigmask = 0xffffffffffffffff, ksi_valid = 0x0, ksi = {data = {0x0 <repeats 64 times>}},
  kx_valid = 0x0, kx = {data = {0x0 <repeats 4096 times>}}}
(gdb)
paulpopelka commented 2 years ago

I looked into this problem for a while. I found that the code was calling fputs() on km_log_file even though the line numbers in the backtrace seemed to indicate we were writing on stderr (which is closed in this case). There is a lock inside of the FILE structure that is apparently held by another thread. I ran pthread_self() from gdb to find out the thread's pthread_t and compared with the lock owner field in the lock.
That would seem to be impossible since there are no other threads in the process. In addition I looked at the fork code in glibc to see what it did about reseting these FILE structure locks in the child and found that it has code to handle that. Look for the function fresetlockfiles() in file posix/fork.c

I don't know the cause of this problem.

gnode1 commented 2 years ago

Is it possible the lock was taken in the parent before fork() duplicated the memory? In other words child inherits the lock in already taken state.

I understand there isn't enough data to prove or disprove this hypothesis. We could sort of guess by the numeric value of the lock owner, maybe...

paulpopelka commented 2 years ago

Is it possible the lock was taken in the parent before fork() duplicated the memory? In other words child inherits the lock in already taken state.

I understand there isn't enough data to prove or disprove this hypothesis. We could sort of guess by the numeric value of the lock owner, maybe...

I think fork() in glibc has code to handle the problem. The function fresetlockfiles() is called to have all of the FILE structure locks marked as unlocked. And, the locks in the FILE structure are futex's. The stdio functions don't use pthread_mutex's for the lock that is causing trouble.