Closed by jmid 11 months ago
Spotted this again on the Linux debug runtime for trunk (5.2): https://github.com/ocaml-multicore/multicoretests/actions/runs/5224077321/jobs/9431819887
random seed: 381509558
generated error fail pass / total time test name
[ ] 0 0 0 0 / 100 0.0s Thread.create/join - tak work
[00] file runtime/domain.c; line 1526 ### Assertion failed: (uintnat)dom_st->young_ptr > (uintnat)dom_st->young_trigger
File "src/thread/dune", line 6, characters 7-23:
6 | (name thread_joingraph)
^^^^^^^^^^^^^^^^
(cd _build/default/src/thread && ./thread_joingraph.exe --verbose)
Command got signal ABRT.
[ ] 0 0 0 0 / 100 0.0s Thread.create/join - tak work (generating)
On the branch https://github.com/ocaml-multicore/multicoretests/tree/unify-thread-domain, where I'm playing with a reusable Work module, I can reproduce this pretty reliably:
$ dune build src/thread/thread_joingraph.exe --profile=debug-runtime
$ OCAMLRUNPARAM="s=1024,v=1,V=1" _build/default/src/thread/thread_joingraph.exe -v -s 107463665
### OCaml runtime: debug mode ###
random seed: 107463665
generated error fail pass / total time test name
[ ] 22 0 0 22 / 200 6.8s Thread.create/join[00] file runtime/domain.c; line 1526 ### Assertion failed: (uintnat)dom_st->young_ptr > (uintnat)dom_st->young_trigger
Aborted (core dumped)
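Since the abort is intermittent, a small wrapper that reruns the test until it fails can help when trying seeds other than the one above. This is only a sketch: `repeat_until_fail` is a hypothetical helper, not something from multicoretests, and dropping `-s` in the usage example assumes the test runner then picks a fresh random seed on each run.

```shell
#!/bin/sh
# Hypothetical helper (not part of multicoretests): rerun a command until it
# fails or the attempt budget is used up.
# Usage: repeat_until_fail MAX_ATTEMPTS COMMAND [ARGS...]
repeat_until_fail() {
  max=$1
  shift
  i=1
  while [ "$i" -le "$max" ]; do
    "$@"
    status=$?
    if [ "$status" -ne 0 ]; then
      # A process killed by SIGABRT is reported by the shell as a
      # non-zero status (typically 128 + 6 = 134).
      echo "attempt $i failed with exit status $status" >&2
      return "$status"
    fi
    i=$((i + 1))
  done
  echo "no failure in $max attempts" >&2
  return 0
}

# Example, reusing the command line from the report above:
# repeat_until_fail 100 env OCAMLRUNPARAM="s=1024,v=1,V=1" \
#   _build/default/src/thread/thread_joingraph.exe -v
```

Passing `OCAMLRUNPARAM` through `env` rather than as a prefix assignment avoids relying on how a particular shell exports variables to commands run from inside a function.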
gdb reports the following backtrace:
[00] file runtime/domain.c; line 1526 ### Assertion failed: (uintnat)dom_st->young_ptr > (uintnat)dom_st->young_trigger
[Thread 0x7fffde1bc640 (LWP 2022924) exited]
[Thread 0x7fffdd1ba640 (LWP 2022926) exited]
[Thread 0x7ffff61ec640 (LWP 2022879) exited]
[Thread 0x7fffdc9b9640 (LWP 2022927) exited]
Thread 353 "thread_joingrap" received signal SIGABRT, Aborted.
[Switching to Thread 0x7ffff59eb640 (LWP 2022880)]
__pthread_kill_implementation (no_tid=0, signo=6, threadid=140737314207296) at ./nptl/pthread_kill.c:44
44 ./nptl/pthread_kill.c: No such file or directory.
(gdb) bt
#0 __pthread_kill_implementation (no_tid=0, signo=6, threadid=140737314207296) at ./nptl/pthread_kill.c:44
#1 __pthread_kill_internal (signo=6, threadid=140737314207296) at ./nptl/pthread_kill.c:78
#2 __GI___pthread_kill (threadid=140737314207296, signo=signo@entry=6) at ./nptl/pthread_kill.c:89
#3 0x00007ffff7c42476 in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
#4 0x00007ffff7c287f3 in __GI_abort () at ./stdlib/abort.c:79
#5 0x00005555556e6770 in caml_failed_assert (expr=expr@entry=0x5555556f9310 "(uintnat)dom_st->young_ptr > (uintnat)dom_st->young_trigger",
file_os=file_os@entry=0x5555556f8e59 "runtime/domain.c", line=line@entry=1526) at runtime/misc.c:56
#6 0x00005555556cd1d8 in caml_reset_young_limit (dom_st=<optimized out>) at runtime/domain.c:1526
#7 0x00005555556ce43b in caml_poll_gc_work () at runtime/domain.c:1647
#8 0x00005555556eb991 in caml_do_pending_actions_exn () at runtime/signals.c:308
#9 0x00005555556eba91 in caml_process_pending_actions_with_root_exn (root=<optimized out>) at runtime/signals.c:342
#10 0x00005555556ebc52 in caml_process_pending_actions_with_root (root=1) at runtime/signals.c:351
#11 caml_process_pending_actions () at runtime/signals.c:362
#12 <signal handler called>
#13 0x0000555555610f76 in camlWork.tak_1052 () at src/work.ml:41
#14 0x0000555555610f5e in camlWork.tak_1052 () at src/work.ml:41
#15 0x0000555555610f5e in camlWork.tak_1052 () at src/work.ml:41
#16 0x0000555555610f26 in camlWork.tak_1052 () at src/work.ml:41
#17 0x0000555555610f5e in camlWork.tak_1052 () at src/work.ml:41
#18 0x0000555555610f42 in camlWork.tak_1052 () at src/work.ml:41
#19 0x0000555555610f26 in camlWork.tak_1052 () at src/work.ml:41
#20 0x0000555555610f5e in camlWork.tak_1052 () at src/work.ml:41
#21 0x0000555555610f42 in camlWork.tak_1052 () at src/work.ml:41
#22 0x00005555556110b4 in camlWork.run_1125 () at src/work.ml:54
#23 0x000055555564eb07 in camlThread.fun_843 () at thread.ml:48
#24 <signal handler called>
#25 0x00005555556cabc8 in caml_callback_exn (closure=<optimized out>, closure@entry=140737347848384, arg=<optimized out>, arg@entry=1)
at runtime/callback.c:197
#26 0x00005555556bae19 in caml_thread_start (v=<optimized out>) at st_stubs.c:552
#27 0x00007ffff7c94ac3 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:442
#28 0x00007ffff7d26a40 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81
On a Linux 5.2/trunk debug run of #409 we triggered a related error, but this time in thread_createtree:
https://github.com/ocaml-multicore/multicoretests/actions/runs/6800766105/job/18490073637
random seed: 242456814
generated error fail pass / total time test name
[ ] 0 0 0 0 / 1000 0.0s thread_createtree - with Atomic
[00] file runtime/domain.c; line 1621 ### Assertion failed: (uintnat)dom_st->young_ptr > (uintnat)dom_st->young_trigger
File "src/thread/dune", line 14, characters 7-24:
14 | (name thread_createtree)
^^^^^^^^^^^^^^^^^
(cd _build/default/src/thread && ./thread_createtree.exe --verbose)
Command got signal ABRT.
[ ] 0 0 0 0 / 1000 0.0s thread_createtree - with Atomic (generating)
This is fixed in ocaml/ocaml/pull/12742 for trunk, but will still show up in 5.1 tests.
Yesterday, I saw an abort on src/thread/thread_joingraph using the debug runtime build on trunk: https://github.com/ocaml-multicore/multicoretests/actions/runs/5135424605/jobs/9240823033
After having built a fresh local trunk switch, I was just able to recreate it locally in a loop: