none/tests/pth_self_kill_15_other is failing [x86, clang and gcc]

paulfloyd commented 4 years ago

On amd64 with --trace-syscalls=yes I see

SYSCALL[14971,1](433) sys_thr_kill ( 101269, 15 )--14971-- thr_kill: sending signal 15 to tid 101269
 --> [async] ... 
SYSCALL[14971,1](433) ... [async] --> Success(0x0) --14971-- thr_kill: sent signal 15 to tid 101269

But on i386 this is

SYSCALL[93600,1](433) sys_thr_kill ( 100753, 15 )--93600-- thr_kill: sending signal 15 to tid 100753
 --> [async] ... 
--93600-- async signal handler: signal=15, tid=2, si_code=65543, exitreason VgSrc_None
--93600-- interrupted_syscall: tid=2, ip=0x380e299d, restart=False, sres.isErr=True, sres.val=4
--93600--   completed, but uncommitted: committing
--93600-- delivering signal 15 (SIGTERM):65543 to thread 2
--93600-- push_signal_frame (thread 2): signal 15
==93600==    at 0x6B10C43: _nanosleep (in /lib/libc.so.7)
==93600==    by 0x6A7EA94: sleep (in /lib/libc.so.7)
==93600==    by 0x8048857: t (pth_self_kill.c:17)
==93600==    by 0x69F188A: ??? (in /lib/libthr.so.3)
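
A note on si_code=65543 in the log above: that is 0x10007, which on FreeBSD is SI_LWP, the code attached to signals sent with thr_kill. The small C sketch below decodes it. The numeric values are hard-coded from my reading of FreeBSD's <sys/signal.h> so that it also builds elsewhere, and the function names are hypothetical, not anything in Valgrind.

```c
#include <assert.h>
#include <string.h>

/* FreeBSD si_code values for software-generated signals.
 * Assumption: copied by hand from FreeBSD's <sys/signal.h>; check them
 * against the header on the system being debugged. */
const char *freebsd_si_code_name(int code)
{
    switch (code) {
    case 0x10001: return "SI_USER";    /* sent with kill(2) */
    case 0x10002: return "SI_QUEUE";   /* sent with sigqueue(2) */
    case 0x10003: return "SI_TIMER";
    case 0x10004: return "SI_ASYNCIO";
    case 0x10005: return "SI_MESGQ";
    case 0x10006: return "SI_KERNEL";
    case 0x10007: return "SI_LWP";     /* sent with thr_kill(2) */
    default:      return "unknown";
    }
}

/* Hypothetical helper: does `code` decode to the given name? */
int si_code_is(int code, const char *name)
{
    assert(name != NULL);
    return strcmp(freebsd_si_code_name(code), name) == 0;
}
```

So si_code_is(65543, "SI_LWP") is true, which fits a SIGTERM that arrived through thr_kill rather than kill.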

On i386 in gdb if I put a breakpoint on async_signalhandler then the callstack is

#0  async_signalhandler (sigNo=15, info=0x71bdcc0, uc=0x71bda00) at m_signals.c:2505
#1  <signal handler called>
#2  vgModuleLocal_do_syscall_for_client_WRK () at m_syswrap/syscall-x86-freebsd.S:134
#3  0x3808d147 in do_syscall_for_client (syscall_mask=0x71bde1c, tst=<optimized out>, syscallno=<optimized out>) at m_syswrap/syswrap-main.c:368
#4  vgPlain_client_syscall (tid=<optimized out>, trc=<optimized out>) at m_syswrap/syswrap-main.c:2277
#5  0x38089cc1 in handle_syscall (tid=tid@entry=2, trc=77) at m_scheduler/scheduler.c:1211
#6  0x3808b372 in vgPlain_scheduler (tid=<optimized out>) at m_scheduler/scheduler.c:1529
#7  0x38097f72 in thread_wrapper (tidW=2) at m_syswrap/syswrap-freebsd.c:105
#8  run_a_thread_NORETURN (tidW=2) at m_syswrap/syswrap-freebsd.c:159

This is not easy to debug. I don't see problems when running under gdb (or lldb). Also, 32on64 works OK.

My impressions so far are

do_syscall_for_client_WRK in syscall-x86-freebsd.S should unblock signals, let the pending signal be delivered, and then block signals again.
- If there is memory corruption and the signal mask is wrong, this will change the behaviour.
- I did suspect that the test of tst->exitreason in async_signalhandler could be the cause of the problem.
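
To make the first point concrete, here is a minimal C sketch of the "install the syscall mask, make the blocking call, restore the mask" pattern. It is not Valgrind's code: the real routine is assembly and works on Valgrind's own thread state, and I assume it can be approximated this way only for illustration. sleep_with_mask and demo_signal_window are hypothetical names, and SIGUSR1 stands in for SIGTERM.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <time.h>

static volatile sig_atomic_t handled_signo = 0;

static void record_signal(int signo) { handled_signo = signo; }

/* Install `syscall_mask`, sleep, then restore the previous mask.
 * Returns 0 if the sleep completed, otherwise the errno value
 * (EINTR when a handler interrupted the sleep). */
int sleep_with_mask(const sigset_t *syscall_mask, const struct timespec *ts)
{
    sigset_t saved;
    int err = 0;

    if (sigprocmask(SIG_SETMASK, syscall_mask, &saved) != 0)
        return errno;
    if (nanosleep(ts, NULL) != 0)
        err = errno;
    sigprocmask(SIG_SETMASK, &saved, NULL);
    return err;
}

/* Keep SIGUSR1 blocked, make one pending, and show that the handler only
 * runs once the mask is opened around the blocking call.
 * Returns the signal number the handler saw, or -1 if it ran too early. */
int demo_signal_window(void)
{
    struct sigaction sa;
    sigset_t blocked, window;
    struct timespec ts = { 0, 10 * 1000 * 1000 };   /* 10 ms */
    int rc;

    handled_signo = 0;
    sa.sa_handler = record_signal;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;
    rc = sigaction(SIGUSR1, &sa, NULL);
    assert(rc == 0);

    /* Block SIGUSR1; `window` receives the old mask, with SIGUSR1 removed below. */
    sigemptyset(&blocked);
    sigaddset(&blocked, SIGUSR1);
    rc = sigprocmask(SIG_BLOCK, &blocked, &window);
    assert(rc == 0);
    sigdelset(&window, SIGUSR1);

    raise(SIGUSR1);               /* stays pending while blocked */
    if (handled_signo != 0)
        return -1;                /* handler must not have run yet */

    sleep_with_mask(&window, &ts);
    return handled_signo;
}
```

If the mask passed in were corrupted, the signal would either never get through or arrive outside this window, which is the change in behaviour I mean. A C version like this also cannot tell at which instruction the signal landed, whereas the "interrupted_syscall ... ip=" line in the log suggests Valgrind does look at that.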

paulfloyd commented 4 years ago

I can't reproduce this for a 32bit binary running on amd64 kernel.

Further, this isn't related to issue #122

paulfloyd commented 4 years ago

Debugging this a bit more, I saw the following:

pthread_kill seems to run OK
gdb handles a sigterm for tid 2, not sure what that means
we get to VG_(nuke_all_threads_except) and start looping over all of the threads
VG_(get_thread_out_of_syscall) is called for tid 2
then we jump to the asm signal delivery code, because of the sigterm?

amd64 does the same for the first four points, but for the last there is no jump.

paulfloyd commented 4 years ago

ktrace seems to provide interesting information. Running standalone I get (from thread kill onwards)

 92188 pth_self_kill CALL  thr_kill(0x189a7,SIGTERM)
 92188 pth_self_kill RET   thr_kill 0
 92188 pth_self_kill CALL  sigprocmask(SIG_SETMASK,0x2809058c,0xffbfeae8)
 92188 pth_self_kill RET   sigprocmask 0
 92188 pth_self_kill CALL  sigaction(SIGTERM,0xffbfeab8,0xffbfeaa0)
 92188 pth_self_kill RET   sigaction 0
 92188 pth_self_kill CALL  sigprocmask(SIG_SETMASK,0xffbfeae8,0)
 92188 pth_self_kill RET   sigprocmask 0
 92188 pth_self_kill CALL  mmap(0,0x20000,0x3<PROT_READ|PROT_WRITE>,0x1002<MAP_PRIVATE|MAP_ANON>,0xffffffff,0,0)
 92188 pth_self_kill RET   mmap 673558528/0x2825b000
 92188 pth_self_kill CALL  exit(0)
 92188 pth_self_kill RET   nanosleep -1 errno 4 Interrupted system call
 92188 pth_self_kill PSIG  SIGTERM SIG_DFL code=SI_LWP

For 32on64 I get (deleting a lot of stuff like sigprocmask and thr_self):

  5899 none-x86-freebsd CALL  thr_kill(0x18dbe,SIGTERM)
  5899 none-x86-freebsd RET   thr_kill 0
  5899 none-x86-freebsd PSIG  SIGTERM caught handler=0x380d9d10 mask=0x0 code=SI_LWP
  5899 none-x86-freebsd CALL  mmap(0x60ef000,0x20000,0x3<PROT_READ|PROT_WRITE>,0x1012<MAP_PRIVATE|MAP_FIXED|MAP_ANON>,0xffffffff,0,0)
  5899 none-x86-freebsd RET   mmap 101642240/0x60ef000
  5899 none-x86-freebsd CALL  thr_kill(0x18e18,SIG 128)
  5899 none-x86-freebsd RET   thr_kill 0
  5899 none-x86-freebsd RET   nanosleep -1 errno 4 Interrupted system call
  5899 none-x86-freebsd CALL  thr_self(0x4eacd9c)
  5899 none-x86-freebsd PSIG  SIG -128 caught handler=0x380d9f00 mask=0x0 code=SI_LWP
  5899 none-x86-freebsd CALL  thr_exit(0x2)
  5899 none-x86-freebsd CALL  exit(0x2)

And pure x86

 92213 none-x86-freebsd CALL  thr_kill(0x18710,SIGTERM)
 92213 none-x86-freebsd RET   thr_kill 0
 92213 none-x86-freebsd RET   nanosleep -1 errno 4 Interrupted system call
 92213 none-x86-freebsd PSIG  SIGTERM caught handler=0x38049a50 mask=0x0 code=SI_LWP
 92213 none-x86-freebsd PSIG  SIGSEGV caught handler=0x3804a440 mask=0xfffef067 code=SEGV_MAPERR
 92213 none-x86-freebsd CALL  kill(0x16835,SIGSEGV)
 92213 none-x86-freebsd RET   kill 0
 92213 none-x86-freebsd PSIG  SIGSEGV SIG_DFL code=SI_USER

If I can summarize that

Standalone

thr_kill
exit
nanosleep interrupted
sigterm

32on64

thr_kill
catch sigterm
send VGKILL
nanosleep interrupted
VGKILL
thr_exit

Pure x86

thr_kill
nanosleep interrupted
catch sigterm
catch sigsegv
kill
sigsegv
core
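
For anyone who wants to poke at the "thr_kill, nanosleep interrupted, catch sigterm" ordering outside Valgrind, here is a minimal C sketch. It is not the pth_self_kill test itself. It installs a SIGTERM handler so the process survives, whereas the standalone trace above shows SIG_DFL for the real test. All function names are hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <signal.h>
#include <stdatomic.h>
#include <time.h>

static atomic_int sleeper_result = -1;   /* -1 while the thread is still sleeping */

/* The handler only has to exist so that SIGTERM interrupts the sleep
 * instead of terminating the process. */
static void on_sigterm(int signo) { (void)signo; }

static void *sleeper(void *arg)
{
    struct timespec ts = { 5, 0 };
    int rc;

    (void)arg;
    rc = nanosleep(&ts, NULL);
    atomic_store(&sleeper_result, rc == 0 ? 0 : errno);
    return NULL;
}

/* Start a sleeping thread and send it SIGTERM with pthread_kill until its
 * sleep has ended. Returns the thread's nanosleep outcome: 0 if the sleep
 * completed, else the errno value. */
int kill_sleeping_thread(void)
{
    struct sigaction sa;
    struct timespec tick = { 0, 10 * 1000 * 1000 };   /* 10 ms */
    pthread_t tid;
    int rc;

    atomic_store(&sleeper_result, -1);
    sa.sa_handler = on_sigterm;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;
    rc = sigaction(SIGTERM, &sa, NULL);
    assert(rc == 0);
    rc = pthread_create(&tid, NULL, sleeper, NULL);
    assert(rc == 0);

    /* Retry every tick: a signal that arrives before the thread has entered
     * nanosleep does not interrupt it, but the next one will. */
    while (atomic_load(&sleeper_result) == -1) {
        nanosleep(&tick, NULL);
        if (atomic_load(&sleeper_result) == -1)
            pthread_kill(tid, SIGTERM);
    }
    pthread_join(tid, NULL);
    return atomic_load(&sleeper_result);
}
```

Build with -pthread. Running it under ktrace should give a thr_kill / PSIG / "nanosleep -1 errno 4" sequence that can be compared with the traces above, although I have not checked how closely it matches.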

paulfloyd commented 4 years ago

Some similarities with issue #136

I do see these from ktrace

8166 none-x86-freebsd RET sigtimedwait -1 errno 35 Resource temporarily unavailable

https://stackoverflow.com/questions/17012206/catching-sigchld-using-sigtimedwait-on-bsd
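
For context, errno 35 on FreeBSD is EAGAIN, which is what sigtimedwait is specified to return when the timeout expires with no matching signal pending. So with a zero timeout this line may simply mean "nothing pending". A minimal C sketch of that polling behaviour, using plain POSIX calls rather than Valgrind's wrappers (function names are hypothetical, SIGUSR2 is an arbitrary choice):

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <time.h>

/* Poll for a pending signal in `set` without blocking.
 * Returns the signal number, or -1 with errno set (EAGAIN if none pending). */
int poll_pending_signal(const sigset_t *set, siginfo_t *info)
{
    static const struct timespec zero = { 0, 0 };
    return sigtimedwait(set, info, &zero);
}

/* Poll once with nothing pending, then once after raising SIGUSR2.
 * Stores the errno of the first poll in *errno_when_empty (0 if it did not
 * fail) and returns the result of the second poll. */
int demo_poll(int *errno_when_empty)
{
    sigset_t set;
    siginfo_t info;
    int rc;

    sigemptyset(&set);
    sigaddset(&set, SIGUSR2);
    rc = sigprocmask(SIG_BLOCK, &set, NULL);   /* must be blocked to be waited for */
    assert(rc == 0);

    rc = poll_pending_signal(&set, &info);
    *errno_when_empty = (rc == -1) ? errno : 0;

    raise(SIGUSR2);                            /* now one is pending */
    return poll_pending_signal(&set, &info);
}
```

The first poll should fail with EAGAIN and the second should return SIGUSR2. That makes me less sure the EAGAIN in the trace is a problem in itself.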

Quick and dirty attempt, but it doesn't seem to fix anything

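Int VG_(sigtimedwait_zero)( const vki_sigset_t *set,
                            vki_siginfo_t *info )
{
   /* Original implementation, disabled for this experiment:
   static const struct vki_timespec zero = { 0, 0 };

   SysRes res = VG_(do_syscall3)(__NR_sigtimedwait, (UWord)set, (UWord)info,
                                   (UWord)&zero);
   return sr_isError(res) ? -1 : sr_Res(res);
   */

   /* Experiment: poll for the signal through a kqueue instead. */
   SysRes res = VG_(do_syscall0)(__NR_kqueue);
   int kq = sr_Res(res);   /* not checked for failure */
   struct kevent ke;
   struct timespec zero = { 0, 0 };

   /* Caveat: EVFILT_SIGNAL takes a signal number as its ident, whereas
      set->sig[0] is the first word of the signal mask. */
   EV_SET(&ke, set->sig[0], EVFILT_SIGNAL, EV_ADD, 0, 0, NULL);
   /* Register the filter, then poll it with a zero timeout. */
   VG_(do_syscall6)(__NR_kevent, kq, (UWord)&ke, 1, (UWord)NULL, 0, (UWord)NULL);
   res = VG_(do_syscall6)(__NR_kevent, kq, (UWord)NULL, 0, (UWord)&ke, 1, (UWord)&zero);
   VG_(do_syscall1)(__NR_close, kq);
   /* Note: info is never filled in on this path. */
   return sr_isError(res) ? -1 : sr_Res(res);
}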

paulfloyd commented 2 years ago

Also looks good with https://bugs.kde.org/show_bug.cgi?id=445032

paulfloyd / freebsd_valgrind

none/tests/pth_self_kill_15_other is failing [x86, clang and gcc] #83