paulfloyd / freebsd_valgrind

Git repo used to Upstream the FreeBSD Port of Valgrind
GNU General Public License v2.0
15 stars 4 forks source link

pth_cancel_locked hangs on FreeBSD 13.2 amd64 (single CPU VM) #199

Open paulfloyd opened 1 year ago

paulfloyd commented 1 year ago

Dang!

Does it hang standalone? No.

Looks like a race? With tracing, no hang.

paulfloyd commented 1 year ago

In gdb if I interrupt when it has hanged it is here

│      139  2:   syscall                  // GDB on next line means hang is HERE
│  >   144     setc  0(%rsp)         /* stash returned carry flag */                          

in the asm syscall wrapper

#2  0x000000003816225a in do_syscall_for_client (syscallno=454, tst=0x1002018ee0, syscall_mask=0x1002a8de20) at m_syswrap/syswrap-main.c:368

454 is a umtx op

   case VKI_UMTX_OP_WAIT:
      PRINT( "sys__umtx_op ( %#" FMT_REGWORD "x, WAIT, %" FMT_REGWORD "u, %#" FMT_REGWORD "x, %#" FMT_REGWORD "x)", ARG1, ARG3, ARG4, ARG5);
      PRE_REG_READ5(long, "_umtx_op_wait",
                    long *, obj, int, op, unsigned long, val,
                    size_t, timeout_size, struct vki_timespec *, timeout);
      if (ARG1) {
         PRE_MEM_READ( "_umtx_op_wait(val)", ARG1, sizeof(long) );
         if (*(long*)ARG1 == (long)ARG3) {
            *flags |= SfMayBlock;
         }
      }

Thread 1 claims to be here

(gdb) call vgPlain_get_and_pp_StackTrace(0, 10)
==36688==    at 0x0: ???
(gdb) call vgPlain_get_and_pp_StackTrace(1, 10)
==36688==    at 0x4879D2C: ??? (in /lib/libthr.so.3)
==36688==    by 0x487E995: ??? (in /lib/libthr.so.3)
==36688==    by 0x48579F3: pthread_join_intercept (drd_pthread_intercepts.c:764)
==36688==    by 0x48579F3: pthread_join (drd_pthread_intercepts.c:774)
==36688==    by 0x201C95: main (pth_cancel_locked.c:46)

Here:

  /* Join the created thread. */
  pthread_join(tid, 0);

Makes sense, waiting for something to happen

Thread 2 no looking good

(gdb) call vgPlain_get_and_pp_StackTrace(2, 10)
==36688==    at 0x4879D2C: ??? (in /lib/libthr.so.3)
==36688==    by 0x48894BF: ??? (in /lib/libthr.so.3)
==36688==    by 0x487B8FE: ??? (in /lib/libthr.so.3)
==36688==    by 0x485BE36: pthread_cond_wait_intercept (drd_pthread_intercepts.c:1163)
==36688==    by 0x485BE36: pthread_cond_wait (drd_pthread_intercepts.c:1169)
==36688==    by 0x201D11: thread (pth_cancel_locked.c:21)
==36688==    by 0x4862648: vgDrd_thread_wrapper (drd_pthread_intercepts.c:500)
==36688==    by 0x487CA79: ??? (in /lib/libthr.so.3)

That's

static void* thread(void* arg)
{
  /* Lock s_mutex2. */
  pthread_mutex_lock(&s_mutex2);
  /* Inform the main thread that s_mutex2 has been locked, and wait for pthread_cancel(). */
  pthread_mutex_lock(&s_mutex1);
  pthread_cond_signal(&s_cond);
  pthread_cond_wait(&s_cond, &s_mutex1); // HERE
  return 0;
}

This thread is supposed to have been canceled.

Here is a trace on 13.1

SYSCALL[64079,1](455) thr_new ( 0x1ffc000120, 104 ) --> [pre-success] Success(0x0) SYSCALL[64079,2](454) sys__umtx_op ( 0x547e008, MUTEX_WAIT, 0, 0x0, 0x0) --> [async] ... 

SYSCALL[64079,1](454) sys__umtx_op ( 0x547e008, MUTEX_WAKE2, 0, 0x0, 0x0)[sync] --> Success(0x0) 
SYSCALL[64079,1](454) sys__umtx_op ( 0x4868168, CV_WAIT_UINT_PRIVATE, 0, 0x0, 0x0) --> [async] ... 
SYSCALL[64079,2](454) ... [async] --> Success(0x0) 
SYSCALL[64079,2](454) sys__umtx_op ( 0x547e008, MUTEX_WAKE2, 0, 0x0, 0x0)[sync] --> Success(0x0) 
SYSCALL[64079,2](454) sys__umtx_op ( 0x4e909c8, NWAKE_PRIVATE, 1, 0x0, 0x0)[sync] --> Success(0x0) 
SYSCALL[64079,2](454) sys__umtx_op ( 0x4841008, CV_WAIT_UINT_PRIVATE, 0, 0x0, 0x0) --> [async] ... 
SYSCALL[64079,1](454) ... [async] --> Success(0x0) 
SYSCALL[64079,1](433) sys_thr_kill ( 185043, 32 ) --> [async] ... 
SYSCALL[64079,1](433) ... [async] --> Success(0x0) 
SYSCALL[64079,1](454) sys__umtx_op ( 0x4e90700, WAIT, 185043, 0x0, 0x0) --> [async] ... 
SYSCALL[64079,2](443) sys_thr_wake ( 185043 )[sync] --> Success(0x0) 
SYSCALL[64079,2](443) sys_thr_wake ( 185043 )[sync] --> Success(0x0) 
SYSCALL[64079,2](1000) sys_sigreturn ( 0x1fdbffedb8 ) --> [pre-success] NoWriteResult 
SYSCALL[64079,2](431) sys_thr_exit ( 0x4e90700 ) --> [pre-success] Success(0x0) 
SYSCALL[64079,1](454) ... [async] --> Success(0x0) 
SYSCALL[64079,1](  4) sys_write ( 2, 0x1ffbfffb20, 15 ) --> [async] ... 
Test finished.
SYSCALL[64079,1](  4) ... [async] --> Success(0xf) 
SYSCALL[64079,1](  1) exit( 0 ) --> [pre-success] Success(0x0) 

And 13.2

SYSCALL[38323,1](455) thr_new ( 0x1ffc000890, 104 ) --> [pre-success] Success(0x0) SYSCALL[38323,2](454) sys__umtx_op ( 0x5476008, MUTEX_WAIT, 0, 0x0, 0x0) --> [async] ... 

SYSCALL[38323,1](454) sys__umtx_op ( 0x5476008, MUTEX_WAKE2, 0, 0x0, 0x0)[sync] --> Success(0x0) 
SYSCALL[38323,1](454) sys__umtx_op ( 0x48676a8, CV_WAIT_UINT_PRIVATE, 0, 0x0, 0x0) --> [async] ... 
SYSCALL[38323,2](454) ... [async] --> Success(0x0) 
SYSCALL[38323,2](454) sys__umtx_op ( 0x5476008, MUTEX_WAKE2, 0, 0x0, 0x0)[sync] --> Success(0x0) 
SYSCALL[38323,2](454) sys__umtx_op ( 0x4e889c8, NWAKE_PRIVATE, 1, 0x0, 0x0)[sync] --> Success(0x0) 
SYSCALL[38323,2](454) sys__umtx_op ( 0x4841008, CV_WAIT_UINT_PRIVATE, 0, 0x0, 0x0) --> [async] ... 
SYSCALL[38323,1](454) ... [async] --> Success(0x0) 
SYSCALL[38323,1](433) sys_thr_kill ( 105531, 32 ) --> [async] ... 
SYSCALL[38323,2](443) sys_thr_wake ( 105531 )[sync] --> Success(0x0) 
SYSCALL[38323,2](443) sys_thr_wake ( 105531 )[sync] --> Success(0x0) 
SYSCALL[38323,2](1000) sys_sigreturn ( 0x1ffaffedb8 ) --> [pre-success] NoWriteResult 
SYSCALL[38323,2](454) sys__umtx_op ( 0x4e88708, MUTEX_WAIT, 0, 0x0, 0x0) --> [async] ... 
SYSCALL[38323,1](433) ... [async] --> Success(0x0) 
SYSCALL[38323,1](454) sys__umtx_op ( 0x4e88708, MUTEX_WAKE2, 0, 0x0, 0x0)[sync] --> Success(0x0) 
SYSCALL[38323,1](454) sys__umtx_op ( 0x4e88700, WAIT, 105531, 0x0, 0x0) --> [async] ... 
SYSCALL[38323,2](454) ... [async] --> Success(0x0) 
SYSCALL[38323,2](431) sys_thr_exit ( 0x4e88700 ) --> [pre-success] Success(0x0) 
SYSCALL[38323,1](454) ... [async] --> Success(0x0) 
SYSCALL[38323,1](  4) sys_write ( 2, 0x1ffc000290, 15 ) --> [async] ... 
Test finished.
SYSCALL[38323,1](  4) ... [async] --> Success(0xf) 
SYSCALL[38323,1](  1) exit( 0 ) --> [pre-success] Success(0x0) 

(with tracing the TC passes so there must be some order difference when it hangs).

Running the two standalone under ktrace. Both commentaries starting with pthread_create.

On 13.1

Thread 2 runs straight away. I don't think that there is a syscall for locking mutex2. Or the second one. It then signals.

Thread 1 then blocks on the cond var

Thread 2 does a cond wait which seems to consist of UMTX_OP_NWAKE_PRIVATE and UMTX_OP_WAIT_UINT_PRIVATE (don't understand that)

Thread 1 cancels thread 2 (thread 1 wasn't blocked)

Thread 2 is interrupted

Thread 1 waits in join

Thread2 wakes, sigreturns and exits

Thread 1 returns from join prints message and ends

Now how does that compare with 13.2?

First difference After thread2 is created it does not run straight away thread 1 does its cond wait first

OK

And I'm seeing thread 2 cont wait + interrupt much later.

paulfloyd commented 1 year ago

This is a 1 CPU VM. What if two CPUs?

Seems to run OK,

Hmm.

paulfloyd commented 7 months ago

Still there on 13.3

paulfloyd commented 5 months ago

And 14.1