none/tests/sisgsusp is hanging [i386]

nbriggs commented 4 years ago

Spent a while staring at this one. As far as I can see, sending SIGVGKILL to the suspended thread from VG_(tkill) via

   res = VG_(do_syscall2)(__NR_thr_kill, lwpid, signo);

doesn't seem to actually wake the thread in question. In PRE(sys_exit) It calls nuke_all_threads_except() and then goes into reap_threads(), where it hangs since the signaled thread doesn't wake and return.

paulfloyd commented 4 years ago

That corresponds to what I've seen under GDB.

I did try to fix sigsuspend. Initially the FreeBSD syswrap was just a bad copy and paste of the Linux rt_sigsuspend.

https://github.com/paulfloyd/freebsd_valgrind/commit/d050d0169d68950466ba941abc13f5debbaa620d

[Deleted remarks referring to obsolete 4.3 sigsuspend]

nbriggs commented 4 years ago

I do think it has something to do with the way it's removing the SIGVGKILL (128) from the signal set, because if I change the sigsusp.c source to not block the SIGVGKILL then it all works fine.

$ valgrind --tool=none --trace-signals=yes /tmp/sigsusp
==33021== Nulgrind, the minimal Valgrind tool
==33021== Copyright (C) 2002-2017, and GNU GPL'd, by Nicholas Nethercote.
==33021== Using Valgrind-3.17.0.GIT and LibVEX; rerun with -h for copyright info
==33021== Command: /tmp/sigsusp
==33021== 
--33021-- Max kernel-supported signal is 128, VG_SIGVGKILL is 128
--33021-- sync signal handler: signal=11, si_code=1, EIP=0x4413700, eip=0x4ebe5ac, from kernel
--33021-- SIGSEGV: si_code=1 faultaddr=0xfbbfd9b3 tid=1 ESP=0xfbbfd5b0 seg=0xfabff000-0xfbbfdfff
--33021--        -> extended stack base to 0xfbbfd000
--33021-- do_setmask: tid = 1 how = 1 (SIG_BLOCK), newset = 0x441FA28 (fffffffffffffffffffffffffffff107)
--33021--   oldset=0xFBBFE2B0 00000000000000000000000000000000
--33021-- do_setmask: tid = 1 how = 3 (SIG_SETMASK), newset = 0x441FA3C (00000000000000000000000000000000)
--33021-- do_setmask: tid = 1 how = 1 (SIG_BLOCK), newset = 0x441FA28 (fffffffffffffffffffffffffffff107)
--33021--   oldset=0xFBBFE260 00000000000000000000000000000000
--33021-- do_setmask: tid = 1 how = 3 (SIG_SETMASK), newset = 0x441FA3C (00000000000000000000000000000000)
--33021-- do_setmask: tid = 1 how = 1 (SIG_BLOCK), newset = 0x441FA28 (fffffffffffffffffffffffffffff107)
--33021--   oldset=0xFBBFDE50 00000000000000000000000000000000
--33021-- do_setmask: tid = 1 how = 3 (SIG_SETMASK), newset = 0x441FA3C (00000000000000000000000000000000)
--33021-- do_setmask: tid = 1 how = 1 (SIG_BLOCK), newset = 0x441FA28 (fffffffffffffffffffffffffffff107)
--33021--   oldset=0xFBBFDE50 00000000000000000000000000000000
--33021-- do_setmask: tid = 1 how = 3 (SIG_SETMASK), newset = 0x441FA3C (00000000000000000000000000000000)
--33021-- sys_sigaction: sigNo 32, new 0x4eacdc0, old 0x0, new flags 0x40
--33021-- do_setmask: tid = 1 how = 2 (SIG_UNBLOCK), newset = 0xFBBFDDE0 (00000000000000000000000080000000)
--33021-- do_setmask: tid = 1 how = 1 (SIG_BLOCK), newset = 0x5F2158C (ffffffffffffffffffffffffffffffff)
--33021--   oldset=0x60E70A0 00000000000000000000000000000000
--33021-- do_setmask: tid = 1 how = 3 (SIG_SETMASK), newset = 0x60E70A0 (00000000000000000000000000000000)

--33021-- get_thread_out_of_syscall zaps tid 2 lwp 100546
--33021-- sigvgkill for lwp 100546 tid 2
--33021-- poll_signals: not polling as thread 2 is exitreason VgSrc_ExitProcess
==33021==

vs the original

--33035-- do_setmask: tid = 1 how = 1 (SIG_BLOCK), newset = 0x5F2158C (ffffffffffffffffffffffffffffffff)
--33035--   oldset=0x60F70A0 00000000000000000000000000000000
--33035-- do_setmask: tid = 1 how = 3 (SIG_SETMASK), newset = 0x60F70A0 (00000000000000000000000000000000)
--33035-- get_thread_out_of_syscall zaps tid 2 lwp 100543

where nothing at all happens in response to the get_thread_out_of_syscall

paulfloyd commented 4 years ago

I did wonder if the fact that the argument is const might cause a problem. The libc source for sigsuspend isn't too clear:

sigsuspend(const sigset_t *set)
{

    return (((int (*)(const sigset_t *))
        __libc_interposing[INTERPOS_sigsuspend])(set));
}

I added a printf to PRE(sys_sigsuspend) then ran everything though ktrace -di. The pointers were the same:

ktrace 10737 none-x86-freebsd CALL sigsuspend(0x6900f50) printf "DEBUG: delete VG_SIGVGKILL, ptr is 6900f50

Time to do some kernel digging

paulfloyd commented 4 years ago

I have finally made a bit of progress with this issue.

In vgSysWrap_freebsd_sys_sigsuspend_before [that's PRE(sys_sigsuspend)] I noted the address of the 4th unsigned int of the sig array in the sigset_t* argument. Then I added a hardware breakpoint

watch (vki_uint32_t ) 0x6900f5c

and then continued

That breakpoint stopped here (in syswrap-main.c)

#elif defined(VGP_x86_freebsd)
   VexGuestX86State* gst = (VexGuestX86State*)gst_vanilla;
   UWord *stack = (UWord *)gst->guest_ESP;

   // stack[0] is a (fake) return address
   switch (canonical->klass) {
   case VG_FREEBSD_SYSCALL0:
      gst->guest_EAX = __NR_syscall;
      stack[1] = canonical->sysno;
      stack++;
      break;
   case VG_FREEBSD_SYSCALL198:
      gst->guest_EAX = __NR___syscall;
      stack[1] = canonical->sysno;
      stack += 2;
      break;
   default:
      gst->guest_EAX = canonical->sysno;
      break;
   }
   stack[1] = canonical->arg1;
   stack[2] = canonical->arg2;
   stack[3] = canonical->arg3;
   stack[4] = canonical->arg4;
   stack[5] = canonical->arg5;
   stack[6] = canonical->arg6; // <= stopped here
   stack[7] = canonical->arg7;
   stack[8] = canonical->arg8;

The stop is just after the change has happened, so it was the stack[5] = arg5 that overwrote the values.

Furthermore

Old value = 2147483647 New value = 4294967295

So clearly that's restoring the VGKILL bit in the mask.

Unfortunately this is with a default build of Valgrind on x86 (so -g -O2). So it's hard to see most of the variables ('value has been optimized out'). So I can't tell what the syscall is or the klass. It should be enough to go on to debug more.

nbriggs commented 4 years ago

I recompiled with "-Og -g", and did:

Breakpoint 1 (syswrap-main.c:962 if stack[5]==2147483647 && canonical->arg5 == 4294967295) pending.
[New LWP 100525 of process 12672]
[Switching to LWP 100525 of process 12672]

Thread 2 hit Breakpoint 1, putSyscallArgsIntoGuestState (canonical=0x429ade0, gst_vanilla=<optimized out>) at m_syswrap/syswrap-main.c:962
962    stack[5] = canonical->arg5;
(gdb) print canonical
$1 = (SyscallArgs *) 0x429ade0
(gdb) print *canonical
$2 = {klass = 0, sysno = 341, arg1 = 122687312, arg2 = 2147483647, arg3 = 4294967295, arg4 = 4294967295, arg5 = 4294967295, arg6 = 0, arg7 = 122687412, arg8 = 71323705}
(gdb) continue

edited: I forgot to include a stack trace:

(gdb) where
#0  putSyscallArgsIntoGuestState (canonical=0x429ade0, gst_vanilla=<optimized out>) at m_syswrap/syswrap-main.c:962
#1  0x3810940f in vgPlain_client_syscall (tid=2, trc=77) at m_syswrap/syswrap-main.c:2290
#2  0x381074f6 in handle_syscall (tid=2, trc=77) at m_scheduler/scheduler.c:1211
#3  0x38105a9b in vgPlain_scheduler (tid=2) at m_scheduler/scheduler.c:1529
#4  0x38125992 in thread_wrapper (tidW=2) at m_syswrap/syswrap-freebsd.c:107
#5  0x381132a3 in run_a_thread_NORETURN (tidW=<optimized out>) at m_syswrap/syswrap-freebsd.c:161
#6  0x3811324b in vgModuleLocal_start_thread_NORETURN (arg=<optimized out>) at m_syswrap/syswrap-freebsd.c:276

nbriggs commented 4 years ago

So it looks as though the PRE(sys_sigsuspend) isn't affecting the args that are actually going to get used by putSyscallArgsIntoGuestState(), though there are comments all over syswrap-main.c:2286 about this...

paulfloyd commented 4 years ago

So as expected this is a standard syscall (klass is 0). It's for sigsuspend (sysno 341), so not long after the wrapper.

Time to waffle a bit.

When we're in t_fn I expect that the stack looks like this

4 bytes saved frame pointer ebp
4 bytes v* argument
4 bytes return address to thread creation function
16 bytes local variable mask

For the call to sigsuspend, this will push the pointer to mask and the return address to t_fn

4 bytes saved frame pointer ebp
4 bytes v* argument
4 bytes return address to thread creation function
16 bytes local variable mask   stack[2] to stack[5]
4 bytes pointer to mask          stack[1]
4 bytes return address to f_n  stack[0]

Systematically writing 8 words onto the stack makes no sense to me. For this to work (and I haven't checked) the stack would need to look like

4 bytes saved frame pointer ebp
4 bytes v* argument
4 bytes return address to thread creation function
16 bytes local variable mask
28 bytes unused space           stack[2] to stack[8]
4 bytes pointer to mask          stack[1]
4 bytes return address to f_n  stack[0]

The way that I think that this should work is to have a big table of syscalls arg counts;

int VG_(syscall_arg_counts)[__NR_max_syscall];

VG_(syscall_arg_counts)[__NR_exit] = 1;
// hundreds more
VG_(syscall_arg_counts)[__NR_sigsuspend] = 1;
// and again hundreds more

Then in syswrap_main.c

switch (VG_(syscall_arg_counts)[canonical->sysno]) {
case 8:
   stack[8] = canonical->arg8;
case 7:
   stack[7] = canonical->arg7;
// etc
case 1:
   stack[1] = canonical->arg1;
case 0:
   // do nothing
   break;
}

paulfloyd commented 4 years ago

It seems to work with the following push. This testcase no longer hangs. I need to test this change a bit more before I close this issue. Sadly no other failures fixed. commit 205bc3b75d347424c6e98c6a8a3e9dbd908d13f1

paulfloyd commented 4 years ago

----- Original Message -----

I don't see why not changing the values in stack[n] for n>argcount would fix this. It's not as though the matching of arguments to stack entry changed. The ones that aren't used would be ignored, right?

As usual, it depends.

In the case of pass-by-value arguments, yes. But in this case it is pass-by-pointer.

int sigsuspend(const sigset_t *sigmask);

Here the pointer is on the stack and also indirectly the contents that are pointed to.

However I can't believe that this code is doing any stack smashing for syscalls with fewer than 8 arguments (which is virtually all of them). So I suspect that in general the code is redundant, i.e, stack[x] and canonical->argx are already equal. But then in the PRE we sneak in a change to stack[5], which then gets overwritten with the original value in putSyscallArgsIntoGuestState.

If I knew that this is the only syscall that is making a change in the PRE I could just write

if (canonical->sysno != __NR_sigsuspend) { stack[5] = canonical->arg5; }

(with a long winded comment).

nbriggs commented 4 years ago

Thanks for the explanation, it makes more sense now. I realized after I wrote the note you responded to that I had not fully absorbed your stack description above and the fact that the syswrap-main.c code was extracting the actual stack pointer to use as the base for stack[n].

nbriggs commented 4 years ago

I notice in syswrap-main.c:2299, where it used to assert that the args weren't modified --

         /* Gack.  More impedance matching.  Copy the possibly                                                          
            modified syscall args back into the guest state. */
         /* JRS 2009-Mar-16: if the syscall args are possibly modified,                                                 
            then this assertion is senseless:                                                                           
              vg_assert(eq_SyscallArgs(&sci->args, &sci->orig_args));                                                   
            The case that exposed it was sys_posix_spawn on Darwin,                                                     
            which heavily modifies its arguments but then lets the call                                                 
            go through anyway, with SfToBlock set, hence we end up here. */
         putSyscallArgsIntoGuestState( &sci->args, &tst->arch.vex );

paulfloyd commented 4 years ago

OK, so that confirms what I thought.

I'm not too happy with the big change that I made yesterday. In theory it should be sure, but those hundreds of arg counts are a maintenance nightmare.

So I just tried putting in asserts that all of the arg values match. This testcase failed, as expected. Also

memcheck/tests/freebsd/scalar (stderr) none/tests/ppoll_alarm (stderr) none/tests/pselect_alarm (stderr)

For ppoll, it is arg4 and for pselect it is arg6. They are also masking VGKILL, but they do it by allocating a new mask and changing the pointer. I'll need to debug those as well to ensure they don't have a problem.

nbriggs commented 4 years ago

It does feel a bit brittle. It also makes me wonder (I haven't checked) if there are issues on the read access to the stack as well -- say the current SP was early in the first allocated stack page and the VG code accessed 8 words higher which was off the end of the stack page.

It seems as though the arg counts could be programmatically generated from the syscalls.master, but that looks pretty brittle as well.

The thought occurs to me that there's no guarantee that the sigset_t * passed to sigsuspend isn't in read-only memory. Allocating a new mask and changing the pointer may be a better solution?

paulfloyd commented 4 years ago

Good points.

The best I could do for the existing fix was to add some asserts, so just for VGP_x86_freebsd

+#define PRE_REG_READ1(tr, s, t1, a1) \
+   vg_assert(check_syscall_arg_count(1, arrghs->sysno)); \
+   if (VG_(tdict).track_pre_reg_read) { \
+      PRRSN; \
+      PRA1(s,t1,a1);                            \
+   }

etc for all 8 macros and then

+int check_syscall_arg_count(int expected, int sysno)
+{
+   switch (sysno)
+   {
+   case __NR_open:
+      return expected == 2 || expected == 3;
+   case __NR_openat:
+      return expected == 3 || expected == 4;
+   case __NR__umtx_op:
+      return expected >= 2 && expected <= 5;
+   case __NR_fcntl:
+      return expected >= 2 && expected <= 4;
+   case __NR_freebsd7___semctl:
+      return expected == 3 || expected == 4;
+   case __NR_fake_sigreturn:
+      return expected == 1;
+   case __NR___semctl:
+      return expected == 3 || expected == 4;
+   case __NR_kmq_open:
+      return expected == 3 || expected == 4;
+   case __NR_quotactl:
+      return expected == 2 || expected == 4;
+   default:
+      return expected == ML_(syscall_arg_counts)[sysno];
+   }
+}

I would rather that all of the mask twiddling syscalls use the same mechanism for pointers to mask. To date I haven't made much if any effort to factor out common functionality, and now might be the moment to start with a 'mask_vgkill()' function.

paulfloyd commented 4 years ago

With a bit of debugging, ppoll and pselect seem to be doing the right thing. That is, the PRE overwrites one of the args on the stack and then putSyscallArgsIntoGuestState redundantly does exactly the same overwrite.

So the one and only problem is sigsuspend on x86.

paulfloyd commented 4 years ago

Fixed with this change commit 001a43d9a8c8e2e27056c0d5a043c5995028f33d

nbriggs commented 4 years ago

Looks like a good fix, and no ongoing maintenance headaches. Nice job.

paulfloyd commented 4 years ago

I don't like disabling failing tests, so I was still manually typing kill every regtest run. I'm glad that's over.

nbriggs commented 4 years ago

Ah, I didn't disable the test but I did rewrite the sigsusp.c to do the "sigdelset(&mask, 128);" itself after I got tired of doing the kill ;-)

paulfloyd / freebsd_valgrind

none/tests/sisgsusp is hanging [i386] #59