ocaml-multicore / multicoretests

PBT testsuite and libraries for testing multicore OCaml
https://ocaml-multicore.github.io/multicoretests/
BSD 2-Clause "Simplified" License

[ocaml5-issue] Abort / crash on thread_joingraph and thread_createtree under debug runtime #353

Closed by jmid 11 months ago

jmid commented 1 year ago

Yesterday, I saw an abort on src/thread/thread_joingraph using the debug runtime build on trunk: https://github.com/ocaml-multicore/multicoretests/actions/runs/5135424605/jobs/9240823033

random seed: 24146066
generated error fail pass / total     time test name

[ ]    0    0    0    0 /  100     0.0s Thread.create/join - tak work
[00] file runtime/domain.c; line 1526 ### Assertion failed: (uintnat)dom_st->young_ptr > (uintnat)dom_st->young_trigger
File "src/thread/dune", line 6, characters 7-23:
6 |  (name thread_joingraph)
           ^^^^^^^^^^^^^^^^
(cd _build/default/src/thread && ./thread_joingraph.exe --verbose)
Command got signal ABRT.
[ ]    0    0    0    0 /  100     0.0s Thread.create/join - tak work (generating)

After building a fresh local trunk switch, I was able to reproduce the abort locally by rerunning the test in a loop:

$ while OCAMLRUNPARAM="v=1,V=1" _build/default/src/thread/thread_joingraph.exe --verbose -s 24146066; do :; done

[... 12 iterations go by without error]

### OCaml runtime: debug mode ###
random seed: 24146066
generated error fail pass / total     time test name
[ ]   79    0    0   79 /  100    66.9s Thread.create/join - tak work[00] file runtime/domain.c; line 1526 ### Assertion failed: (uintnat)dom_st->young_ptr > (uintnat)dom_st->young_trigger
Aborted (core dumped)
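
The loop above runs until the abort shows up. A bounded variant can be sketched as follows. This is only an illustration: `repeat_until_failure` is a hypothetical helper name, not part of multicoretests, and the run limit of 50 in the usage comment is arbitrary.

```shell
# Hypothetical helper (not part of multicoretests): rerun a command until it
# fails or until MAX_RUNS successful runs have gone by.
# Usage: repeat_until_failure MAX_RUNS COMMAND [ARGS...]
repeat_until_failure() {
  max_runs=$1
  shift
  run=1
  while [ "$run" -le "$max_runs" ]; do
    # Capture the exit status without tripping `set -e`, if it is enabled.
    "$@" && status=0 || status=$?
    if [ "$status" -ne 0 ]; then
      echo "run $run failed with exit status $status" >&2
      return "$status"
    fi
    run=$((run + 1))
  done
  echo "no failure in $max_runs runs" >&2
  return 0
}

# Example with the binary and seed from the report above (left commented out):
# repeat_until_failure 50 env OCAMLRUNPARAM="v=1,V=1" \
#   _build/default/src/thread/thread_joingraph.exe --verbose -s 24146066
```

When the test process is killed by SIGABRT, common shells report an exit status above 128 (typically 134), so the helper stops at the first aborting run and returns that status.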

jmid commented 1 year ago

Spotted this again on a Linux debug runtime run (trunk, 5.2): https://github.com/ocaml-multicore/multicoretests/actions/runs/5224077321/jobs/9431819887

random seed: 381509558
generated error fail pass / total     time test name

[ ]    0    0    0    0 /  100     0.0s Thread.create/join - tak work
[00] file runtime/domain.c; line 1526 ### Assertion failed: (uintnat)dom_st->young_ptr > (uintnat)dom_st->young_trigger
File "src/thread/dune", line 6, characters 7-23:
6 |  (name thread_joingraph)
           ^^^^^^^^^^^^^^^^
(cd _build/default/src/thread && ./thread_joingraph.exe --verbose)
Command got signal ABRT.
[ ]    0    0    0    0 /  100     0.0s Thread.create/join - tak work (generating)

jmid commented 1 year ago

On the branch https://github.com/ocaml-multicore/multicoretests/tree/unify-thread-domain where I'm playing with a reusable Work module, I can reproduce this pretty reliably:

$ dune build src/thread/thread_joingraph.exe --profile=debug-runtime
$ OCAMLRUNPARAM="s=1024,v=1,V=1" _build/default/src/thread/thread_joingraph.exe -v -s 107463665
### OCaml runtime: debug mode ###
random seed: 107463665
generated error fail pass / total     time test name
[ ]   22    0    0   22 /  200     6.8s Thread.create/join[00] file runtime/domain.c; line 1526 ### Assertion failed: (uintnat)dom_st->young_ptr > (uintnat)dom_st->young_trigger
Aborted (core dumped)

gdb reports the following backtrace:

[00] file runtime/domain.c; line 1526 ### Assertion failed: (uintnat)dom_st->young_ptr > (uintnat)dom_st->young_trigger
[Thread 0x7fffde1bc640 (LWP 2022924) exited]
[Thread 0x7fffdd1ba640 (LWP 2022926) exited]
[Thread 0x7ffff61ec640 (LWP 2022879) exited]
[Thread 0x7fffdc9b9640 (LWP 2022927) exited]

Thread 353 "thread_joingrap" received signal SIGABRT, Aborted.
[Switching to Thread 0x7ffff59eb640 (LWP 2022880)]
__pthread_kill_implementation (no_tid=0, signo=6, threadid=140737314207296) at ./nptl/pthread_kill.c:44
44  ./nptl/pthread_kill.c: No such file or directory.
(gdb) bt
#0  __pthread_kill_implementation (no_tid=0, signo=6, threadid=140737314207296) at ./nptl/pthread_kill.c:44
#1  __pthread_kill_internal (signo=6, threadid=140737314207296) at ./nptl/pthread_kill.c:78
#2  __GI___pthread_kill (threadid=140737314207296, signo=signo@entry=6) at ./nptl/pthread_kill.c:89
#3  0x00007ffff7c42476 in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
#4  0x00007ffff7c287f3 in __GI_abort () at ./stdlib/abort.c:79
#5  0x00005555556e6770 in caml_failed_assert (expr=expr@entry=0x5555556f9310 "(uintnat)dom_st->young_ptr > (uintnat)dom_st->young_trigger", 
    file_os=file_os@entry=0x5555556f8e59 "runtime/domain.c", line=line@entry=1526) at runtime/misc.c:56
#6  0x00005555556cd1d8 in caml_reset_young_limit (dom_st=<optimized out>) at runtime/domain.c:1526
#7  0x00005555556ce43b in caml_poll_gc_work () at runtime/domain.c:1647
#8  0x00005555556eb991 in caml_do_pending_actions_exn () at runtime/signals.c:308
#9  0x00005555556eba91 in caml_process_pending_actions_with_root_exn (root=<optimized out>) at runtime/signals.c:342
#10 0x00005555556ebc52 in caml_process_pending_actions_with_root (root=1) at runtime/signals.c:351
#11 caml_process_pending_actions () at runtime/signals.c:362
#12 <signal handler called>
#13 0x0000555555610f76 in camlWork.tak_1052 () at src/work.ml:41
#14 0x0000555555610f5e in camlWork.tak_1052 () at src/work.ml:41
#15 0x0000555555610f5e in camlWork.tak_1052 () at src/work.ml:41
#16 0x0000555555610f26 in camlWork.tak_1052 () at src/work.ml:41
#17 0x0000555555610f5e in camlWork.tak_1052 () at src/work.ml:41
#18 0x0000555555610f42 in camlWork.tak_1052 () at src/work.ml:41
#19 0x0000555555610f26 in camlWork.tak_1052 () at src/work.ml:41
#20 0x0000555555610f5e in camlWork.tak_1052 () at src/work.ml:41
#21 0x0000555555610f42 in camlWork.tak_1052 () at src/work.ml:41
#22 0x00005555556110b4 in camlWork.run_1125 () at src/work.ml:54
#23 0x000055555564eb07 in camlThread.fun_843 () at thread.ml:48
#24 <signal handler called>
#25 0x00005555556cabc8 in caml_callback_exn (closure=<optimized out>, closure@entry=140737347848384, arg=<optimized out>, arg@entry=1)
    at runtime/callback.c:197
#26 0x00005555556bae19 in caml_thread_start (v=<optimized out>) at st_stubs.c:552
#27 0x00007ffff7c94ac3 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:442
#28 0x00007ffff7d26a40 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81
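
A backtrace like the one above can also be captured non-interactively. The following is a minimal sketch, assuming `gdb` is installed. `capture_backtrace` is a hypothetical helper name, and the `GDB` override variable is an assumption added for illustration.

```shell
# Hypothetical helper (not part of multicoretests): run a program under gdb in
# batch mode and print a backtrace of the thread that stopped, e.g. on SIGABRT.
# The GDB variable is an illustrative override for the debugger binary.
capture_backtrace() {
  "${GDB:-gdb}" -batch -ex run -ex bt --args "$@"
}

# Example with the binary, runtime parameters, and seed from the comment above
# (left commented out):
# export OCAMLRUNPARAM="s=1024,v=1,V=1"
# capture_backtrace _build/default/src/thread/thread_joingraph.exe -v -s 107463665
```

If the run completes without hitting the assertion, gdb has no stopped thread to inspect, so no backtrace is printed. The failure is intermittent, so the call may need to be repeated.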

jmid commented 1 year ago

On a Linux 5.2/trunk debug run of #409, we triggered a related error, this time in thread_createtree: https://github.com/ocaml-multicore/multicoretests/actions/runs/6800766105/job/18490073637

random seed: 242456814
generated error fail pass / total     time test name

[ ]    0    0    0    0 / 1000     0.0s thread_createtree - with Atomic
[00] file runtime/domain.c; line 1621 ### Assertion failed: (uintnat)dom_st->young_ptr > (uintnat)dom_st->young_trigger
File "src/thread/dune", line 14, characters 7-24:
14 |  (name thread_createtree)
            ^^^^^^^^^^^^^^^^^
(cd _build/default/src/thread && ./thread_createtree.exe --verbose)
Command got signal ABRT.
[ ]    0    0    0    0 / 1000     0.0s thread_createtree - with Atomic (generating)

jmid commented 1 year ago

This is fixed on trunk by ocaml/ocaml/pull/12742, but it will still show up in 5.1 tests.