Open paulfloyd opened 1 year ago
In gdb if I interrupt when it has hanged it is here
│ 139 2: syscall // GDB on next line means hang is HERE
│ > 144 setc 0(%rsp) /* stash returned carry flag */
in the asm syscall wrapper
#2 0x000000003816225a in do_syscall_for_client (syscallno=454, tst=0x1002018ee0, syscall_mask=0x1002a8de20) at m_syswrap/syswrap-main.c:368
454 is a umtx op
case VKI_UMTX_OP_WAIT:
PRINT( "sys__umtx_op ( %#" FMT_REGWORD "x, WAIT, %" FMT_REGWORD "u, %#" FMT_REGWORD "x, %#" FMT_REGWORD "x)", ARG1, ARG3, ARG4, ARG5);
PRE_REG_READ5(long, "_umtx_op_wait",
long *, obj, int, op, unsigned long, val,
size_t, timeout_size, struct vki_timespec *, timeout);
if (ARG1) {
PRE_MEM_READ( "_umtx_op_wait(val)", ARG1, sizeof(long) );
if (*(long*)ARG1 == (long)ARG3) {
*flags |= SfMayBlock;
}
}
Thread 1 claims to be here
(gdb) call vgPlain_get_and_pp_StackTrace(0, 10)
==36688== at 0x0: ???
(gdb) call vgPlain_get_and_pp_StackTrace(1, 10)
==36688== at 0x4879D2C: ??? (in /lib/libthr.so.3)
==36688== by 0x487E995: ??? (in /lib/libthr.so.3)
==36688== by 0x48579F3: pthread_join_intercept (drd_pthread_intercepts.c:764)
==36688== by 0x48579F3: pthread_join (drd_pthread_intercepts.c:774)
==36688== by 0x201C95: main (pth_cancel_locked.c:46)
Here:
/* Join the created thread. */
pthread_join(tid, 0);
Makes sense, waiting for something to happen
Thread 2 no looking good
(gdb) call vgPlain_get_and_pp_StackTrace(2, 10)
==36688== at 0x4879D2C: ??? (in /lib/libthr.so.3)
==36688== by 0x48894BF: ??? (in /lib/libthr.so.3)
==36688== by 0x487B8FE: ??? (in /lib/libthr.so.3)
==36688== by 0x485BE36: pthread_cond_wait_intercept (drd_pthread_intercepts.c:1163)
==36688== by 0x485BE36: pthread_cond_wait (drd_pthread_intercepts.c:1169)
==36688== by 0x201D11: thread (pth_cancel_locked.c:21)
==36688== by 0x4862648: vgDrd_thread_wrapper (drd_pthread_intercepts.c:500)
==36688== by 0x487CA79: ??? (in /lib/libthr.so.3)
That's
static void* thread(void* arg)
{
/* Lock s_mutex2. */
pthread_mutex_lock(&s_mutex2);
/* Inform the main thread that s_mutex2 has been locked, and wait for pthread_cancel(). */
pthread_mutex_lock(&s_mutex1);
pthread_cond_signal(&s_cond);
pthread_cond_wait(&s_cond, &s_mutex1); // HERE
return 0;
}
This thread is supposed to have been canceled.
Here is a trace on 13.1
SYSCALL[64079,1](455) thr_new ( 0x1ffc000120, 104 ) --> [pre-success] Success(0x0) SYSCALL[64079,2](454) sys__umtx_op ( 0x547e008, MUTEX_WAIT, 0, 0x0, 0x0) --> [async] ...
SYSCALL[64079,1](454) sys__umtx_op ( 0x547e008, MUTEX_WAKE2, 0, 0x0, 0x0)[sync] --> Success(0x0)
SYSCALL[64079,1](454) sys__umtx_op ( 0x4868168, CV_WAIT_UINT_PRIVATE, 0, 0x0, 0x0) --> [async] ...
SYSCALL[64079,2](454) ... [async] --> Success(0x0)
SYSCALL[64079,2](454) sys__umtx_op ( 0x547e008, MUTEX_WAKE2, 0, 0x0, 0x0)[sync] --> Success(0x0)
SYSCALL[64079,2](454) sys__umtx_op ( 0x4e909c8, NWAKE_PRIVATE, 1, 0x0, 0x0)[sync] --> Success(0x0)
SYSCALL[64079,2](454) sys__umtx_op ( 0x4841008, CV_WAIT_UINT_PRIVATE, 0, 0x0, 0x0) --> [async] ...
SYSCALL[64079,1](454) ... [async] --> Success(0x0)
SYSCALL[64079,1](433) sys_thr_kill ( 185043, 32 ) --> [async] ...
SYSCALL[64079,1](433) ... [async] --> Success(0x0)
SYSCALL[64079,1](454) sys__umtx_op ( 0x4e90700, WAIT, 185043, 0x0, 0x0) --> [async] ...
SYSCALL[64079,2](443) sys_thr_wake ( 185043 )[sync] --> Success(0x0)
SYSCALL[64079,2](443) sys_thr_wake ( 185043 )[sync] --> Success(0x0)
SYSCALL[64079,2](1000) sys_sigreturn ( 0x1fdbffedb8 ) --> [pre-success] NoWriteResult
SYSCALL[64079,2](431) sys_thr_exit ( 0x4e90700 ) --> [pre-success] Success(0x0)
SYSCALL[64079,1](454) ... [async] --> Success(0x0)
SYSCALL[64079,1]( 4) sys_write ( 2, 0x1ffbfffb20, 15 ) --> [async] ...
Test finished.
SYSCALL[64079,1]( 4) ... [async] --> Success(0xf)
SYSCALL[64079,1]( 1) exit( 0 ) --> [pre-success] Success(0x0)
And 13.2
SYSCALL[38323,1](455) thr_new ( 0x1ffc000890, 104 ) --> [pre-success] Success(0x0) SYSCALL[38323,2](454) sys__umtx_op ( 0x5476008, MUTEX_WAIT, 0, 0x0, 0x0) --> [async] ...
SYSCALL[38323,1](454) sys__umtx_op ( 0x5476008, MUTEX_WAKE2, 0, 0x0, 0x0)[sync] --> Success(0x0)
SYSCALL[38323,1](454) sys__umtx_op ( 0x48676a8, CV_WAIT_UINT_PRIVATE, 0, 0x0, 0x0) --> [async] ...
SYSCALL[38323,2](454) ... [async] --> Success(0x0)
SYSCALL[38323,2](454) sys__umtx_op ( 0x5476008, MUTEX_WAKE2, 0, 0x0, 0x0)[sync] --> Success(0x0)
SYSCALL[38323,2](454) sys__umtx_op ( 0x4e889c8, NWAKE_PRIVATE, 1, 0x0, 0x0)[sync] --> Success(0x0)
SYSCALL[38323,2](454) sys__umtx_op ( 0x4841008, CV_WAIT_UINT_PRIVATE, 0, 0x0, 0x0) --> [async] ...
SYSCALL[38323,1](454) ... [async] --> Success(0x0)
SYSCALL[38323,1](433) sys_thr_kill ( 105531, 32 ) --> [async] ...
SYSCALL[38323,2](443) sys_thr_wake ( 105531 )[sync] --> Success(0x0)
SYSCALL[38323,2](443) sys_thr_wake ( 105531 )[sync] --> Success(0x0)
SYSCALL[38323,2](1000) sys_sigreturn ( 0x1ffaffedb8 ) --> [pre-success] NoWriteResult
SYSCALL[38323,2](454) sys__umtx_op ( 0x4e88708, MUTEX_WAIT, 0, 0x0, 0x0) --> [async] ...
SYSCALL[38323,1](433) ... [async] --> Success(0x0)
SYSCALL[38323,1](454) sys__umtx_op ( 0x4e88708, MUTEX_WAKE2, 0, 0x0, 0x0)[sync] --> Success(0x0)
SYSCALL[38323,1](454) sys__umtx_op ( 0x4e88700, WAIT, 105531, 0x0, 0x0) --> [async] ...
SYSCALL[38323,2](454) ... [async] --> Success(0x0)
SYSCALL[38323,2](431) sys_thr_exit ( 0x4e88700 ) --> [pre-success] Success(0x0)
SYSCALL[38323,1](454) ... [async] --> Success(0x0)
SYSCALL[38323,1]( 4) sys_write ( 2, 0x1ffc000290, 15 ) --> [async] ...
Test finished.
SYSCALL[38323,1]( 4) ... [async] --> Success(0xf)
SYSCALL[38323,1]( 1) exit( 0 ) --> [pre-success] Success(0x0)
(with tracing the TC passes so there must be some order difference when it hangs).
Running the two standalone under ktrace. Both commentaries starting with pthread_create.
On 13.1
Thread 2 runs straight away. I don't think that there is a syscall for locking mutex2. Or the second one. It then signals.
Thread 1 then blocks on the cond var
Thread 2 does a cond wait which seems to consist of UMTX_OP_NWAKE_PRIVATE and UMTX_OP_WAIT_UINT_PRIVATE (don't understand that)
Thread 1 cancels thread 2 (thread 1 wasn't blocked)
Thread 2 is interrupted
Thread 1 waits in join
Thread2 wakes, sigreturns and exits
Thread 1 returns from join prints message and ends
Now how does that compare with 13.2?
First difference After thread2 is created it does not run straight away thread 1 does its cond wait first
OK
And I'm seeing thread 2 cont wait + interrupt much later.
This is a 1 CPU VM. What if two CPUs?
Seems to run OK,
Hmm.
Still there on 13.3
And 14.1
Dang!
Does it hang standalone? No.
Looks like a race? With tracing, no hang.