janestreet / async_unix

Jane Street Capital's asynchronous execution library (unix)
MIT License
33 stars 21 forks source link

Possible deadlock in Async? #19

Closed mbacarella closed 4 years ago

mbacarella commented 4 years ago

I'm running into what seems like a deadlock in Async somewhere.

Begin big copy/paste of what happens when you attach gdb to the deadlocked process and run info threads and bt all of the threads.

(gdb) info threads
  Id   Target Id         Frame
* 1    Thread 0x7fd1befe8780 (LWP 26796) "my_exe" 0x00007fd1bdcf39f3 in futex_wait_cancelable (private=<optimized out>, expected=0,
    futex_word=0x29b06d8 <caml_master_lock+88>) at ../sysdeps/unix/sysv/linux/futex-internal.h:88
  2    Thread 0x7fd1b6b60700 (LWP 26799) "close" 0x00007fd1bdcf39f3 in futex_wait_cancelable (private=<optimized out>, expected=0,
    futex_word=0x29b06d8 <caml_master_lock+88>) at ../sysdeps/unix/sysv/linux/futex-internal.h:88
  3    Thread 0x7fd1b635f700 (LWP 26800) "my_exe" 0x00007fd1bd26213f in __GI___select (nfds=0, readfds=0x0, writefds=0x0, exceptfds=0x0,
    timeout=0x7fd1b635ee10) at ../sysdeps/unix/sysv/linux/select.c:41
  4    Thread 0x7fd1b5b5e700 (LWP 26801) "read" 0x00007fd1bdcf7394 in __libc_read (fd=fd@entry=0, buf=buf@entry=0x7fd1ac002ea0, nbytes=nbytes@entry=32768)
    at ../sysdeps/unix/sysv/linux/read.c:27
  5    Thread 0x7fd1b535d700 (LWP 26802) "writev" 0x00007fd1bdcf39f3 in futex_wait_cancelable (private=<optimized out>, expected=0,
    futex_word=0x29b06d8 <caml_master_lock+88>) at ../sysdeps/unix/sysv/linux/futex-internal.h:88
  6    Thread 0x7fd1b4b3b700 (LWP 26803) "writev" 0x00007fd1bdcf39f3 in futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x7fd1ac002bd8)
    at ../sysdeps/unix/sysv/linux/futex-internal.h:88
  7    Thread 0x7fd1a7fff700 (LWP 26804) "stat" 0x00007fd1bdcf39f3 in futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x7fd1ac0232bc)
    at ../sysdeps/unix/sysv/linux/futex-internal.h:88
  8    Thread 0x7fd1a77fe700 (LWP 26805) "mkstemp" 0x00007fd1bdcf39f3 in futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x7fd19c002bac)
    at ../sysdeps/unix/sysv/linux/futex-internal.h:88
  9    Thread 0x7fd1a691e700 (LWP 26806) "writev" 0x00007fd1bdcf39f3 in futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x7fd1b0002bf8)
    at ../sysdeps/unix/sysv/linux/futex-internal.h:88
  10   Thread 0x7fd1a5a10700 (LWP 26807) "openfile" 0x00007fd1bdcf39f3 in futex_wait_cancelable (private=<optimized out>, expected=0,
    futex_word=0x29b06d8 <caml_master_lock+88>) at ../sysdeps/unix/sysv/linux/futex-internal.h:88
  11   Thread 0x7fd191eff700 (LWP 26808) "fcntl_getfl" 0x00007fd1bdcf39f3 in futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x7fd19c003018)
    at ../sysdeps/unix/sysv/linux/futex-internal.h:88
  12   Thread 0x7fd1a4b66700 (LWP 26828) "writev" 0x00007fd1bdcf39f3 in futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x37e6328)
    at ../sysdeps/unix/sysv/linux/futex-internal.h:88
  13   Thread 0x7fd193fff700 (LWP 26829) "writev" 0x00007fd1bdcf39f3 in futex_wait_cancelable (private=<optimized out>, expected=0,
    futex_word=0x29b06d8 <caml_master_lock+88>) at ../sysdeps/unix/sysv/linux/futex-internal.h:88
  14   Thread 0x7fd1937fe700 (LWP 6148) "writev" 0x00007fd1bdcf39f3 in futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x37e6cc8)
    at ../sysdeps/unix/sysv/linux/futex-internal.h:88
  15   Thread 0x7fd192ffd700 (LWP 12163) "fstat" 0x00007fd1bdcf39f3 in futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x7fd194008be8)
    at ../sysdeps/unix/sysv/linux/futex-internal.h:88
  16   Thread 0x7fd1927fc700 (LWP 12164) "fcntl_getfl" 0x00007fd1bdcf39f3 in futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x7fd178000bb8)
    at ../sysdeps/unix/sysv/linux/futex-internal.h:88

Thread 16 (Thread 0x7fd1927fc700 (LWP 12164)):
#0  0x00007fd1bdcf39f3 in futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x7fd178000bb8) at ../sysdeps/unix/sysv/linux/futex-internal.h:88
#1  __pthread_cond_wait_common (abstime=0x0, mutex=0x7fd178000bf0, cond=0x7fd178000b90) at pthread_cond_wait.c:502
#2  __pthread_cond_wait (cond=0x7fd178000b90, mutex=0x7fd178000bf0) at pthread_cond_wait.c:655
#3  0x0000000001c61b71 in caml_condition_wait ()
#4  0x000000000184363c in camlSqueue__wait_not_empty_373 () at condition.ml:18
#5  0x00000000018438d1 in camlSqueue__run_511 () at squeue/src/squeue.ml:97
#6  0x0000000001b62cf1 in camlBase__Exn__protectx_306 () at src/exn.ml:57
#7  0x0000000001840fc6 in camlThread_pool__loop_7572 () at thread_pool/src/thread_pool.ml:437
#8  0x00000000018fce06 in camlCore__Core_thread__f_181 () at src/core_thread.ml:16
#9  0x00000000019038a9 in camlThread__anon_fn$5bthread$2eml$3a37$2c4$2d$2d146$5d_55 () at thread.ml:39
#10 0x0000000001c86f42 in caml_start_program ()
#11 0x0000000001c60d8c in caml_thread_start ()
#12 0x00007fd1bdced6db in start_thread (arg=0x7fd1927fc700) at pthread_create.c:463
#13 0x00007fd1bd26ca3f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 15 (Thread 0x7fd192ffd700 (LWP 12163)):
#0  0x00007fd1bdcf39f3 in futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x7fd194008be8) at ../sysdeps/unix/sysv/linux/futex-internal.h:88
#1  __pthread_cond_wait_common (abstime=0x0, mutex=0x7fd194000bb0, cond=0x7fd194008bc0) at pthread_cond_wait.c:502
#2  __pthread_cond_wait (cond=0x7fd194008bc0, mutex=0x7fd194000bb0) at pthread_cond_wait.c:655
#3  0x0000000001c61b71 in caml_condition_wait ()
#4  0x000000000184363c in camlSqueue__wait_not_empty_373 () at condition.ml:18
#5  0x00000000018438d1 in camlSqueue__run_511 () at squeue/src/squeue.ml:97
#6  0x0000000001b62cf1 in camlBase__Exn__protectx_306 () at src/exn.ml:57
#7  0x0000000001840fc6 in camlThread_pool__loop_7572 () at thread_pool/src/thread_pool.ml:437
#8  0x00000000018fce06 in camlCore__Core_thread__f_181 () at src/core_thread.ml:16
#9  0x00000000019038a9 in camlThread__anon_fn$5bthread$2eml$3a37$2c4$2d$2d146$5d_55 () at thread.ml:39
#10 0x0000000001c86f42 in caml_start_program ()
#11 0x0000000001c60d8c in caml_thread_start ()
#12 0x00007fd1bdced6db in start_thread (arg=0x7fd192ffd700) at pthread_create.c:463
#13 0x00007fd1bd26ca3f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 14 (Thread 0x7fd1937fe700 (LWP 6148)):
#0  0x00007fd1bdcf39f3 in futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x37e6cc8) at ../sysdeps/unix/sysv/linux/futex-internal.h:88
#1  __pthread_cond_wait_common (abstime=0x0, mutex=0x7fd1ac023230, cond=0x37e6ca0) at pthread_cond_wait.c:502
#2  __pthread_cond_wait (cond=0x37e6ca0, mutex=0x7fd1ac023230) at pthread_cond_wait.c:655
#3  0x0000000001c61b71 in caml_condition_wait ()
#4  0x000000000184363c in camlSqueue__wait_not_empty_373 () at condition.ml:18
#5  0x00000000018438d1 in camlSqueue__run_511 () at squeue/src/squeue.ml:97
#6  0x0000000001b62cf1 in camlBase__Exn__protectx_306 () at src/exn.ml:57
#7  0x0000000001840fc6 in camlThread_pool__loop_7572 () at thread_pool/src/thread_pool.ml:437
#8  0x00000000018fce06 in camlCore__Core_thread__f_181 () at src/core_thread.ml:16
#9  0x00000000019038a9 in camlThread__anon_fn$5bthread$2eml$3a37$2c4$2d$2d146$5d_55 () at thread.ml:39
#10 0x0000000001c86f42 in caml_start_program ()
#11 0x0000000001c60d8c in caml_thread_start ()
#12 0x00007fd1bdced6db in start_thread (arg=0x7fd1937fe700) at pthread_create.c:463
#13 0x00007fd1bd26ca3f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 13 (Thread 0x7fd193fff700 (LWP 26829)):
#0  0x00007fd1bdcf39f3 in futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x29b06d8 <caml_master_lock+88>)
    at ../sysdeps/unix/sysv/linux/futex-internal.h:88
#1  __pthread_cond_wait_common (abstime=0x0, mutex=0x29b0680 <caml_master_lock>, cond=0x29b06b0 <caml_master_lock+48>) at pthread_cond_wait.c:502
#2  __pthread_cond_wait (cond=0x29b06b0 <caml_master_lock+48>, mutex=0x29b0680 <caml_master_lock>) at pthread_cond_wait.c:655
#3  0x0000000001c60e02 in st_masterlock_acquire.constprop ()
#4  0x0000000001c60e49 in caml_thread_leave_blocking_section ()
#5  0x0000000001c6b756 in caml_leave_blocking_section () at signals.c:167
#6  0x0000000001c588cd in writev_in_blocking_section (count=1, iovecs=<optimized out>, v_iovecs=140538696229184, v_fd=<optimized out>) at bigstring_stubs.c:553
#7  bigstring_writev_stub (v_fd=<optimized out>, v_iovecs=<optimized out>, v_count=<optimized out>) at bigstring_stubs.c:562
#8  0x00000000018d94da in camlCore__Bigstring__writev_1234 () at src/bigstring.ml:205
#9  0x00000000017b53e4 in camlAsync_unix__Syscall__loop_68 () at src/syscall.ml:19
#10 0x0000000001b650d5 in camlBase__Result__try_with_499 () at src/result.ml:179
#11 0x00000000017c9432 in camlAsync_unix__In_thread__doit_113 () at src/in_thread.ml:27
#12 0x0000000001841082 in camlThread_pool__loop_7572 () at thread_pool/src/thread_pool.ml:448
#13 0x00000000018fce06 in camlCore__Core_thread__f_181 () at src/core_thread.ml:16
#14 0x00000000019038a9 in camlThread__anon_fn$5bthread$2eml$3a37$2c4$2d$2d146$5d_55 () at thread.ml:39
#15 0x0000000001c86f42 in caml_start_program ()
#16 0x0000000001c60d8c in caml_thread_start ()
#17 0x00007fd1bdced6db in start_thread (arg=0x7fd193fff700) at pthread_create.c:463
#18 0x00007fd1bd26ca3f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 12 (Thread 0x7fd1a4b66700 (LWP 26828)):
#0  0x00007fd1bdcf39f3 in futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x37e6328) at ../sysdeps/unix/sysv/linux/futex-internal.h:88
#1  __pthread_cond_wait_common (abstime=0x0, mutex=0x7fd1ac01af60, cond=0x37e6300) at pthread_cond_wait.c:502
#2  __pthread_cond_wait (cond=0x37e6300, mutex=0x7fd1ac01af60) at pthread_cond_wait.c:655
#3  0x0000000001c61b71 in caml_condition_wait ()
---Type <return> to continue, or q <return> to quit---
#4  0x000000000184363c in camlSqueue__wait_not_empty_373 () at condition.ml:18
#5  0x00000000018438d1 in camlSqueue__run_511 () at squeue/src/squeue.ml:97
#6  0x0000000001b62cf1 in camlBase__Exn__protectx_306 () at src/exn.ml:57
#7  0x0000000001840fc6 in camlThread_pool__loop_7572 () at thread_pool/src/thread_pool.ml:437
#8  0x00000000018fce06 in camlCore__Core_thread__f_181 () at src/core_thread.ml:16
#9  0x00000000019038a9 in camlThread__anon_fn$5bthread$2eml$3a37$2c4$2d$2d146$5d_55 () at thread.ml:39
#10 0x0000000001c86f42 in caml_start_program ()
#11 0x0000000001c60d8c in caml_thread_start ()
#12 0x00007fd1bdced6db in start_thread (arg=0x7fd1a4b66700) at pthread_create.c:463
#13 0x00007fd1bd26ca3f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 11 (Thread 0x7fd191eff700 (LWP 26808)):
#0  0x00007fd1bdcf39f3 in futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x7fd19c003018) at ../sysdeps/unix/sysv/linux/futex-internal.h:88
#1  __pthread_cond_wait_common (abstime=0x0, mutex=0x7fd19c00b150, cond=0x7fd19c002ff0) at pthread_cond_wait.c:502
#2  __pthread_cond_wait (cond=0x7fd19c002ff0, mutex=0x7fd19c00b150) at pthread_cond_wait.c:655
#3  0x0000000001c61b71 in caml_condition_wait ()
#4  0x000000000184363c in camlSqueue__wait_not_empty_373 () at condition.ml:18
#5  0x00000000018438d1 in camlSqueue__run_511 () at squeue/src/squeue.ml:97
#6  0x0000000001b62cf1 in camlBase__Exn__protectx_306 () at src/exn.ml:57
#7  0x0000000001840fc6 in camlThread_pool__loop_7572 () at thread_pool/src/thread_pool.ml:437
#8  0x00000000018fce06 in camlCore__Core_thread__f_181 () at src/core_thread.ml:16
#9  0x00000000019038a9 in camlThread__anon_fn$5bthread$2eml$3a37$2c4$2d$2d146$5d_55 () at thread.ml:39
#10 0x0000000001c86f42 in caml_start_program ()
#11 0x0000000001c60d8c in caml_thread_start ()
#12 0x00007fd1bdced6db in start_thread (arg=0x7fd191eff700) at pthread_create.c:463
#13 0x00007fd1bd26ca3f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 10 (Thread 0x7fd1a5a10700 (LWP 26807)):
#0  0x00007fd1bdcf39f3 in futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x29b06d8 <caml_master_lock+88>)
    at ../sysdeps/unix/sysv/linux/futex-internal.h:88
#1  __pthread_cond_wait_common (abstime=0x0, mutex=0x29b0680 <caml_master_lock>, cond=0x29b06b0 <caml_master_lock+48>) at pthread_cond_wait.c:502
#2  __pthread_cond_wait (cond=0x29b06b0 <caml_master_lock+48>, mutex=0x29b0680 <caml_master_lock>) at pthread_cond_wait.c:655
#3  0x0000000001c60e02 in st_masterlock_acquire.constprop ()
#4  0x0000000001c60e49 in caml_thread_leave_blocking_section ()
#5  0x0000000001c6b756 in caml_leave_blocking_section () at signals.c:167
#6  0x0000000001c61b78 in caml_condition_wait ()
#7  0x000000000184363c in camlSqueue__wait_not_empty_373 () at condition.ml:18
#8  0x00000000018438d1 in camlSqueue__run_511 () at squeue/src/squeue.ml:97
#9  0x0000000001b62cf1 in camlBase__Exn__protectx_306 () at src/exn.ml:57
#10 0x0000000001840fc6 in camlThread_pool__loop_7572 () at thread_pool/src/thread_pool.ml:437
#11 0x00000000018fce06 in camlCore__Core_thread__f_181 () at src/core_thread.ml:16
#12 0x00000000019038a9 in camlThread__anon_fn$5bthread$2eml$3a37$2c4$2d$2d146$5d_55 () at thread.ml:39
#13 0x0000000001c86f42 in caml_start_program ()
#14 0x0000000001c60d8c in caml_thread_start ()
#15 0x00007fd1bdced6db in start_thread (arg=0x7fd1a5a10700) at pthread_create.c:463
#16 0x00007fd1bd26ca3f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 9 (Thread 0x7fd1a691e700 (LWP 26806)):
#0  0x00007fd1bdcf39f3 in futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x7fd1b0002bf8) at ../sysdeps/unix/sysv/linux/futex-internal.h:88
#1  __pthread_cond_wait_common (abstime=0x0, mutex=0x7fd1b0002ba0, cond=0x7fd1b0002bd0) at pthread_cond_wait.c:502
#2  __pthread_cond_wait (cond=0x7fd1b0002bd0, mutex=0x7fd1b0002ba0) at pthread_cond_wait.c:655
#3  0x0000000001c61b71 in caml_condition_wait ()
#4  0x000000000184363c in camlSqueue__wait_not_empty_373 () at condition.ml:18
#5  0x00000000018438d1 in camlSqueue__run_511 () at squeue/src/squeue.ml:97
#6  0x0000000001b62cf1 in camlBase__Exn__protectx_306 () at src/exn.ml:57
#7  0x0000000001840fc6 in camlThread_pool__loop_7572 () at thread_pool/src/thread_pool.ml:437
#8  0x00000000018fce06 in camlCore__Core_thread__f_181 () at src/core_thread.ml:16
#9  0x00000000019038a9 in camlThread__anon_fn$5bthread$2eml$3a37$2c4$2d$2d146$5d_55 () at thread.ml:39
#10 0x0000000001c86f42 in caml_start_program ()
#11 0x0000000001c60d8c in caml_thread_start ()
#12 0x00007fd1bdced6db in start_thread (arg=0x7fd1a691e700) at pthread_create.c:463
#13 0x00007fd1bd26ca3f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 8 (Thread 0x7fd1a77fe700 (LWP 26805)):
#0  0x00007fd1bdcf39f3 in futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x7fd19c002bac) at ../sysdeps/unix/sysv/linux/futex-internal.h:88
#1  __pthread_cond_wait_common (abstime=0x0, mutex=0x7fd19c002b50, cond=0x7fd19c002b80) at pthread_cond_wait.c:502
#2  __pthread_cond_wait (cond=0x7fd19c002b80, mutex=0x7fd19c002b50) at pthread_cond_wait.c:655
#3  0x0000000001c61b71 in caml_condition_wait ()
#4  0x000000000184363c in camlSqueue__wait_not_empty_373 () at condition.ml:18
#5  0x00000000018438d1 in camlSqueue__run_511 () at squeue/src/squeue.ml:97
#6  0x0000000001b62cf1 in camlBase__Exn__protectx_306 () at src/exn.ml:57
#7  0x0000000001840fc6 in camlThread_pool__loop_7572 () at thread_pool/src/thread_pool.ml:437
#8  0x00000000018fce06 in camlCore__Core_thread__f_181 () at src/core_thread.ml:16
#9  0x00000000019038a9 in camlThread__anon_fn$5bthread$2eml$3a37$2c4$2d$2d146$5d_55 () at thread.ml:39
#10 0x0000000001c86f42 in caml_start_program ()
#11 0x0000000001c60d8c in caml_thread_start ()
#12 0x00007fd1bdced6db in start_thread (arg=0x7fd1a77fe700) at pthread_create.c:463
#13 0x00007fd1bd26ca3f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 7 (Thread 0x7fd1a7fff700 (LWP 26804)):
#0  0x00007fd1bdcf39f3 in futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x7fd1ac0232bc) at ../sysdeps/unix/sysv/linux/futex-internal.h:88
#1  __pthread_cond_wait_common (abstime=0x0, mutex=0x7fd1ac023260, cond=0x7fd1ac023290) at pthread_cond_wait.c:502
#2  __pthread_cond_wait (cond=0x7fd1ac023290, mutex=0x7fd1ac023260) at pthread_cond_wait.c:655
#3  0x0000000001c61b71 in caml_condition_wait ()
#4  0x000000000184363c in camlSqueue__wait_not_empty_373 () at condition.ml:18
#5  0x00000000018438d1 in camlSqueue__run_511 () at squeue/src/squeue.ml:97
#6  0x0000000001b62cf1 in camlBase__Exn__protectx_306 () at src/exn.ml:57
#7  0x0000000001840fc6 in camlThread_pool__loop_7572 () at thread_pool/src/thread_pool.ml:437
#8  0x00000000018fce06 in camlCore__Core_thread__f_181 () at src/core_thread.ml:16
#9  0x00000000019038a9 in camlThread__anon_fn$5bthread$2eml$3a37$2c4$2d$2d146$5d_55 () at thread.ml:39
#10 0x0000000001c86f42 in caml_start_program ()
#11 0x0000000001c60d8c in caml_thread_start ()
#12 0x00007fd1bdced6db in start_thread (arg=0x7fd1a7fff700) at pthread_create.c:463
#13 0x00007fd1bd26ca3f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 6 (Thread 0x7fd1b4b3b700 (LWP 26803)):
#0  0x00007fd1bdcf39f3 in futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x7fd1ac002bd8) at ../sysdeps/unix/sysv/linux/futex-internal.h:88
#1  __pthread_cond_wait_common (abstime=0x0, mutex=0x7fd1ac002b80, cond=0x7fd1ac002bb0) at pthread_cond_wait.c:502
#2  __pthread_cond_wait (cond=0x7fd1ac002bb0, mutex=0x7fd1ac002b80) at pthread_cond_wait.c:655
#3  0x0000000001c61b71 in caml_condition_wait ()
#4  0x000000000184363c in camlSqueue__wait_not_empty_373 () at condition.ml:18
#5  0x00000000018438d1 in camlSqueue__run_511 () at squeue/src/squeue.ml:97
#6  0x0000000001b62cf1 in camlBase__Exn__protectx_306 () at src/exn.ml:57
#7  0x0000000001840fc6 in camlThread_pool__loop_7572 () at thread_pool/src/thread_pool.ml:437
#8  0x00000000018fce06 in camlCore__Core_thread__f_181 () at src/core_thread.ml:16
#9  0x00000000019038a9 in camlThread__anon_fn$5bthread$2eml$3a37$2c4$2d$2d146$5d_55 () at thread.ml:39
#10 0x0000000001c86f42 in caml_start_program ()
#11 0x0000000001c60d8c in caml_thread_start ()
#12 0x00007fd1bdced6db in start_thread (arg=0x7fd1b4b3b700) at pthread_create.c:463
#13 0x00007fd1bd26ca3f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 5 (Thread 0x7fd1b535d700 (LWP 26802)):
#0  0x00007fd1bdcf39f3 in futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x29b06d8 <caml_master_lock+88>)
    at ../sysdeps/unix/sysv/linux/futex-internal.h:88
#1  __pthread_cond_wait_common (abstime=0x0, mutex=0x29b0680 <caml_master_lock>, cond=0x29b06b0 <caml_master_lock+48>) at pthread_cond_wait.c:502
#2  __pthread_cond_wait (cond=0x29b06b0 <caml_master_lock+48>, mutex=0x29b0680 <caml_master_lock>) at pthread_cond_wait.c:655
#3  0x0000000001c60e02 in st_masterlock_acquire.constprop ()
#4  0x0000000001c60e49 in caml_thread_leave_blocking_section ()
#5  0x0000000001c6b756 in caml_leave_blocking_section () at signals.c:167
#6  0x0000000001c588cd in writev_in_blocking_section (count=1, iovecs=<optimized out>, v_iovecs=140538696227504, v_fd=<optimized out>) at bigstring_stubs.c:553
#7  bigstring_writev_stub (v_fd=<optimized out>, v_iovecs=<optimized out>, v_count=<optimized out>) at bigstring_stubs.c:562
#8  0x00000000018d94da in camlCore__Bigstring__writev_1234 () at src/bigstring.ml:205
#9  0x00000000017b53e4 in camlAsync_unix__Syscall__loop_68 () at src/syscall.ml:19
#10 0x0000000001b650d5 in camlBase__Result__try_with_499 () at src/result.ml:179
#11 0x00000000017c9432 in camlAsync_unix__In_thread__doit_113 () at src/in_thread.ml:27
#12 0x0000000001841082 in camlThread_pool__loop_7572 () at thread_pool/src/thread_pool.ml:448
#13 0x00000000018fce06 in camlCore__Core_thread__f_181 () at src/core_thread.ml:16
#14 0x00000000019038a9 in camlThread__anon_fn$5bthread$2eml$3a37$2c4$2d$2d146$5d_55 () at thread.ml:39
#15 0x0000000001c86f42 in caml_start_program ()
#16 0x0000000001c60d8c in caml_thread_start ()
#17 0x00007fd1bdced6db in start_thread (arg=0x7fd1b535d700) at pthread_create.c:463
#18 0x00007fd1bd26ca3f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 4 (Thread 0x7fd1b5b5e700 (LWP 26801)):
#0  0x00007fd1bdcf7394 in __libc_read (fd=fd@entry=0, buf=buf@entry=0x7fd1ac002ea0, nbytes=nbytes@entry=32768) at ../sysdeps/unix/sysv/linux/read.c:27
#1  0x0000000001c57466 in read (__nbytes=32768, __buf=0x7fd1ac002ea0, __fd=0) at /usr/include/x86_64-linux-gnu/bits/unistd.h:44
#2  bigstring_read_stub (v_min_len=<optimized out>, v_fd=<optimized out>, v_pos=<optimized out>, v_len=<optimized out>, v_bstr=140537421550048) at bigstring_stubs.c:136
#3  0x00000000018d8ad6 in camlCore__Bigstring__read_inner_619 () at src/bigstring.ml:40
#4  0x0000000001807a71 in camlAsync_unix__Reader0__anon_fn$5bsrc$2freader0$2eml$3a299$2c49$2d$2d162$5d_4049 () at src/reader0.ml:300
#5  0x00000000017b53e4 in camlAsync_unix__Syscall__loop_68 () at src/syscall.ml:19
#6  0x0000000001b650d5 in camlBase__Result__try_with_499 () at src/result.ml:179
#7  0x00000000017c9432 in camlAsync_unix__In_thread__doit_113 () at src/in_thread.ml:27
#8  0x0000000001841082 in camlThread_pool__loop_7572 () at thread_pool/src/thread_pool.ml:448
#9  0x00000000018fce06 in camlCore__Core_thread__f_181 () at src/core_thread.ml:16
#10 0x00000000019038a9 in camlThread__anon_fn$5bthread$2eml$3a37$2c4$2d$2d146$5d_55 () at thread.ml:39
#11 0x0000000001c86f42 in caml_start_program ()
#12 0x0000000001c60d8c in caml_thread_start ()
#13 0x00007fd1bdced6db in start_thread (arg=0x7fd1b5b5e700) at pthread_create.c:463
#14 0x00007fd1bd26ca3f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 3 (Thread 0x7fd1b635f700 (LWP 26800)):
#0  0x00007fd1bd26213f in __GI___select (nfds=0, readfds=0x0, writefds=0x0, exceptfds=0x0, timeout=0x7fd1b635ee10) at ../sysdeps/unix/sysv/linux/select.c:41
---Type <return> to continue, or q <return> to quit---
#1  0x0000000001c60929 in caml_thread_tick ()
#2  0x00007fd1bdced6db in start_thread (arg=0x7fd1b635f700) at pthread_create.c:463
#3  0x00007fd1bd26ca3f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 2 (Thread 0x7fd1b6b60700 (LWP 26799)):
#0  0x00007fd1bdcf39f3 in futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x29b06d8 <caml_master_lock+88>)
    at ../sysdeps/unix/sysv/linux/futex-internal.h:88
#1  __pthread_cond_wait_common (abstime=0x0, mutex=0x29b0680 <caml_master_lock>, cond=0x29b06b0 <caml_master_lock+48>) at pthread_cond_wait.c:502
#2  __pthread_cond_wait (cond=0x29b06b0 <caml_master_lock+48>, mutex=0x29b0680 <caml_master_lock>) at pthread_cond_wait.c:655
#3  0x0000000001c60e02 in st_masterlock_acquire.constprop ()
#4  0x0000000001c60e49 in caml_thread_leave_blocking_section ()
#5  0x0000000001c6b756 in caml_leave_blocking_section () at signals.c:167
#6  0x0000000001c61b78 in caml_condition_wait ()
#7  0x000000000184363c in camlSqueue__wait_not_empty_373 () at condition.ml:18
#8  0x00000000018438d1 in camlSqueue__run_511 () at squeue/src/squeue.ml:97
#9  0x0000000001b62cf1 in camlBase__Exn__protectx_306 () at src/exn.ml:57
#10 0x0000000001840fc6 in camlThread_pool__loop_7572 () at thread_pool/src/thread_pool.ml:437
#11 0x00000000018fce06 in camlCore__Core_thread__f_181 () at src/core_thread.ml:16
#12 0x00000000019038a9 in camlThread__anon_fn$5bthread$2eml$3a37$2c4$2d$2d146$5d_55 () at thread.ml:39
#13 0x0000000001c86f42 in caml_start_program ()
#14 0x0000000001c60d8c in caml_thread_start ()
#15 0x00007fd1bdced6db in start_thread (arg=0x7fd1b6b60700) at pthread_create.c:463
#16 0x00007fd1bd26ca3f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 1 (Thread 0x7fd1befe8780 (LWP 26796)):
#0  0x00007fd1bdcf39f3 in futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x29b06d8 <caml_master_lock+88>)
    at ../sysdeps/unix/sysv/linux/futex-internal.h:88
#1  __pthread_cond_wait_common (abstime=0x0, mutex=0x29b0680 <caml_master_lock>, cond=0x29b06b0 <caml_master_lock+48>) at pthread_cond_wait.c:502
#2  __pthread_cond_wait (cond=0x29b06b0 <caml_master_lock+48>, mutex=0x29b0680 <caml_master_lock>) at pthread_cond_wait.c:655
#3  0x0000000001c60e02 in st_masterlock_acquire.constprop ()
#4  0x0000000001c60e49 in caml_thread_leave_blocking_section ()
#5  0x0000000001c6b756 in caml_leave_blocking_section () at signals.c:167
#6  0x0000000001c617e3 in caml_thread_yield ()
#7  0x00000000017c79e9 in camlAsync_unix__Raw_scheduler__check_file_descr_watcher_8000 () at thread.ml:26
#8  0x00000000017c7faf in camlAsync_unix__Raw_scheduler__one_iter_8380 () at src/raw_scheduler.ml:865
#9  0x00000000017c832d in camlAsync_unix__Raw_scheduler__loop_8428 () at src/raw_scheduler.ml:881
#10 0x00000000017c80e9 in camlAsync_unix__Raw_scheduler__be_the_scheduler_inner_8415 () at src/raw_scheduler.ml:885
#11 0x00000000015badaa in camlMy_foo__anon_fn$5basync_command$2fsrc$2fasync_command$2eml$3a46$2c21$2d$2d1183$5d_59271 () at async_command/src/async_command.ml:75
#12 0x00000000015d1769 in camlMy_foo__anon_fn$5bsrc$2fcommand$2eml$3a2931$2c33$2d$2d659$5d_19757 () at src/command.ml:2937
#13 0x00000000015cd195 in camlMy_foo__run_3502 () at src/exn.ml:102
#14 0x00000000010a1b58 in camlDune__exe__my_exe__entry () at my_exe.ml:2
#15 0x0000000001090a09 in caml_program ()
#16 0x0000000001c86f42 in caml_start_program ()
#17 0x0000000001c6a3c5 in caml_startup_common (argv=0x7ffc6c461788, pooling=<optimized out>, pooling@entry=0) at startup_nat.c:160
#18 0x0000000001c6a41b in caml_startup_exn (argv=<optimized out>) at startup_nat.c:165
#19 caml_startup (argv=<optimized out>) at startup_nat.c:170
#20 0x0000000001c6a442 in caml_main (argv=<optimized out>) at startup_nat.c:177
#21 0x000000000108e1dc in main (argc=<optimized out>, argv=<optimized out>) at main.c:44

This is compiled against jane v0.13.0, OCaml 4.08 flambda.

It's in a long-running process and the deadlock happens pretty unpredictably so I'm a bit slow to iterate the debugging.

The most interesting part is the program is long-running but watches stdin for input using a Reader.read_line and when the deadlock happens, the only thread that's not blocked on waiting for the OCaml runtime lock is that Reader.read_line. Instead, it's blocked in libc read, and hitting Enter actually resumes everything.

Working backwards, that libc read appears to be Bigstring.read, suggesting it's both blocked in a libc read and holding the runtime lock. Though, looking at the code, it's not obvious how it could actually acquire the lock but still block in read. This code seems pretty tight:

CAMLprim value bigstring_read_stub(
  value v_min_len, value v_fd, value v_pos, value v_len, value v_bstr)
{
  CAMLparam1(v_bstr);
    size_t min_len = Long_val(v_min_len);
    int fd = Int_val(v_fd);
    size_t init_len = Long_val(v_len);
    size_t len = init_len;
    ssize_t n_read;
    char *bstr_start = get_bstr(v_bstr, v_pos);
    char *bstr = bstr_start;
    char *bstr_min = bstr_start + min_len;
    caml_enter_blocking_section();
      do {
        do {
          n_read = read(fd, bstr, len);
        } while (n_read == -1 && errno == EINTR);
        if (n_read <= 0) {
          value v_n_good = Val_long(bstr - bstr_start);
          caml_leave_blocking_section();
          if (n_read == 0) {
            if (init_len == 0) CAMLreturn(Val_long(0));
            else raise_eof_io_error(v_n_good);
          }
          else raise_unix_io_error(v_n_good, "read", Nothing);
        } else {
          bstr += n_read;
          len -= n_read;
        }
      } while (bstr < bstr_min);
    caml_leave_blocking_section();
  CAMLreturn(Val_long(bstr - bstr_start));
}

As said above, hitting Enter will actually resume all of the threads that are blocked waiting on the OCaml runtime/masterlock waits. Confirmed it by looking at the call stacks again in gdb.

Speculating wildly: You could imagine contriving a situation where the inner while loop exits due to some arcane errno condition, the blocking section is left (acquiring the runtime lock), but the raises somehow don't terminate the C function and it instead nexts the outer loop. Then the read is re-entered while holding the runtime lock, and the only way out is to press Enter (assuming the same thread can call leave blocking section twice). Though that would be really contrived.

I don't mean to overly implicate what must be a very well worn and tested Bigstring.read, just seems like the only lead so far.

Anyway, to make this even more strange, these deadlocks don't appear at all on Ubuntu 16.04 but do appear on Ubuntu 18.04.

Thanks for any insight you can offer.

mbacarella commented 4 years ago

The "wild speculation" part is probably not a lead. I tried a toy program that attempts to have the same thread acquire the master lock (leave blocking section) twice and it hangs trying to lock the second time.

mbacarella commented 4 years ago

This may be due to an OS bug. More discussion here: https://discuss.ocaml.org/t/is-there-a-known-recent-linux-locking-bug-that-affects-the-ocaml-runtime/6542/5

mbacarella commented 4 years ago

This may actually be a still unpatched bug in glibc.

https://sourceware.org/bugzilla/show_bug.cgi?id=25847

yminsky commented 4 years ago

That's some impressive detective work! Thanks mbac!

seliopou commented 4 years ago

Looks like there's plenty of action to be taken related to this, but luckily not within async-related repos. Folks have been notified of the issue internally, so hopefully we can avoid this glibc version.

Thanks for reporting this, and for your persistence in tracking down the ultimate cause.

mbacarella commented 4 years ago

Thanks!

@seliopou To be clear, the way to avoid it is to not upgrade (or to patch your glibc yourself). The bug was introduced into glibc 2.27 about 4 years ago and persists to this day.