jmid opened this issue 5 months ago
I wonder if it is related to #12889, the only recent change to Domain.DLS that I can think of. (I hope not!)
This just triggered again on 32-bit 5.3.0+trunk by the merge to main of #460:
https://github.com/ocaml-multicore/multicoretests/actions/runs/9169655398/job/25210472949
random seed: 103830913
generated error fail pass / total time test name
[ ] 0 0 0 0 / 1000 0.0s STM Domain.DLS test sequential
File "src/domain/dune", line 31, characters 7-20:
31 | (name stm_tests_dls)
^^^^^^^^^^^^^
(cd _build/default/src/domain && ./stm_tests_dls.exe --verbose)
Command got signal SEGV.
[ ] 0 0 0 0 / 1000 0.0s STM Domain.DLS test sequential (generating)
Is there more information that we can use to try to investigate this? "There is a segfault somewhere in Domain.DLS on 32bit" is not that much.
First off, this is a collection of failures we observe. Once we have fleshed out reproducible steps, these are reported upstream. Help is very welcome, snarky remarks less so.
"There is a segfault somewhere in Domain.DLS on 32bit" is not that much.
Come on, there are QCheck seeds that caused the failures, GA workflows listing the steps taken, and links to 2 CI run logs, with full information about versions.
Run opam exec -- ocamlc -config
opam exec -- ocamlc -config
opam config list
opam exec -- dune printenv
opam list --columns=name,installed-version,repository,synopsis-or-target
opam clean --all-switches --unused-repositories --logs --download-cache --repo-cache
shell: /usr/bin/bash -e {0}
env:
QCHECK_MSG_INTERVAL: 60
DUNE_PROFILE: dev
OCAMLRUNPARAM:
DUNE_CI_ALIAS: runtest
COMPILER: ocaml-variants.5.3.0+trunk,ocaml-option-32bit
OCAML_COMPILER_GIT_REF: refs/heads/trunk
CUSTOM_COMPILER_VERSION:
CUSTOM_COMPILER_SRC:
CUSTOM_OCAML_PKG_VERSION:
OPAMCLI: 2.0
OPAMCOLOR: always
OPAMERRLOGLEN: 0
OPAMJOBS: 4
OPAMPRECISETRACKING: 1
OPAMSOLVERTIMEOUT: 1000
OPAMYES: 1
DUNE_CACHE: enabled
DUNE_CACHE_TRANSPORT: direct
DUNE_CACHE_STORAGE_MODE: copy
CLICOLOR_FORCE: 1
version: 5.3.0+dev0-2023-12-22
standard_library_default: /home/runner/work/multicoretests/multicoretests/_opam/lib/ocaml
standard_library: /home/runner/work/multicoretests/multicoretests/_opam/lib/ocaml
ccomp_type: cc
c_compiler: gcc -m32
ocamlc_cflags: -O2 -fno-strict-aliasing -fwrapv -pthread -fPIC -pthread
ocamlc_cppflags: -D_FILE_OFFSET_BITS=64
ocamlopt_cflags: -O2 -fno-strict-aliasing -fwrapv -pthread -fPIC -pthread
ocamlopt_cppflags: -D_FILE_OFFSET_BITS=64
bytecomp_c_compiler: gcc -m32 -O2 -fno-strict-aliasing -fwrapv -pthread -fPIC -pthread -D_FILE_OFFSET_BITS=64
native_c_compiler: gcc -m32 -O2 -fno-strict-aliasing -fwrapv -pthread -fPIC -pthread -D_FILE_OFFSET_BITS=64
bytecomp_c_libraries: -lzstd -latomic -lm -lpthread
native_c_libraries: -latomic -lm -lpthread
native_ldflags:
native_pack_linker: ld -r -o
native_compiler: false
architecture: i386
model: default
int_size: 31
word_size: 32
system: linux
asm: i386-linux-as
asm_cfi_supported: false
with_frame_pointers: false
ext_exe:
ext_obj: .o
ext_asm: .s
ext_lib: .a
ext_dll: .so
os_type: Unix
default_executable_name: a.out
systhread_supported: true
host: i386-pc-linux-gnu
target: i386-pc-linux-gnu
flambda: false
safe_string: true
default_safe_string: true
flat_float_array: true
function_sections: false
afl_instrument: false
tsan: false
windows_unicode: false
supports_shared_libraries: true
native_dynlink: false
naked_pointers: false
exec_magic_number: Caml1999X035
cmi_magic_number: Caml1999I035
cmo_magic_number: Caml1999O035
cma_magic_number: Caml1999A035
cmx_magic_number: Caml1999Y035
cmxa_magic_number: Caml1999Z035
ast_impl_magic_number: Caml1999M035
ast_intf_magic_number: Caml1999N035
cmxs_magic_number: Caml1999D035
cmt_magic_number: Caml1999T035
linear_magic_number: Caml1999L035
<><> Global opam variables ><><><><><><><><><><><><><><><><><><><><><><><><><><>
arch x86_64 # Inferred from system
exe # Suffix needed for executable filenames (Windows)
jobs 4 # The number of parallel jobs set up in opam configuration
make make # The 'make' command to use
opam-version 2.1.6 # The currently running opam version
os linux # Inferred from system
os-distribution ubuntu # Inferred from system
os-family debian # Inferred from system
os-version 22.04 # Inferred from system
root /home/runner/.opam # The current opam root directory
switch /home/runner/work/multicoretests/multicoretests # The identifier of the current switch
sys-ocaml-arch # Target architecture of the OCaml compiler present on your system
sys-ocaml-cc # Host C Compiler type of the OCaml compiler present on your system
sys-ocaml-libc # Host C Runtime Library type of the OCaml compiler present on your system
sys-ocaml-version # OCaml version present on your system independently of opam, if any
<><> Configuration variables from the current switch ><><><><><><><><><><><><><>
prefix /home/runner/work/multicoretests/multicoretests/_opam
lib /home/runner/work/multicoretests/multicoretests/_opam/lib
bin /home/runner/work/multicoretests/multicoretests/_opam/bin
sbin /home/runner/work/multicoretests/multicoretests/_opam/sbin
share /home/runner/work/multicoretests/multicoretests/_opam/share
doc /home/runner/work/multicoretests/multicoretests/_opam/doc
etc /home/runner/work/multicoretests/multicoretests/_opam/etc
man /home/runner/work/multicoretests/multicoretests/_opam/man
toplevel /home/runner/work/multicoretests/multicoretests/_opam/lib/toplevel
stublibs /home/runner/work/multicoretests/multicoretests/_opam/lib/stublibs
user runner
group docker
<><> Package variables ('opam var --package PKG' to show) <><><><><><><><><><><>
PKG:name # Name of the package
PKG:version # Version of the package
PKG:depends # Resolved direct dependencies of the package
PKG:installed # Whether the package is installed
PKG:enable # Takes the value "enable" or "disable" depending on whether the package is installed
PKG:pinned # Whether the package is pinned
PKG:bin # Binary directory for this package
PKG:sbin # System binary directory for this package
PKG:lib # Library directory for this package
PKG:man # Man directory for this package
PKG:doc # Doc directory for this package
PKG:share # Share directory for this package
PKG:etc # Etc directory for this package
PKG:build # Directory where the package was built
PKG:hash # Hash of the package archive
PKG:dev # True if this is a development package
PKG:build-id # A hash identifying the precise package version with all its dependencies
PKG:opamfile # Path of the current opam file
(flags
(-w
@1..3@5..28@30..39@43@46..47@49..57@61..62-40
-strict-sequence
-strict-formats
-short-paths
-keep-locs))
(ocamlc_flags (-g))
(ocamlopt_flags (-g))
(melange.compile_flags (-g))
(c_flags
(-m32
-O2
-fno-strict-aliasing
-fwrapv
-pthread
-fPIC
-pthread
-m32
-D_FILE_OFFSET_BITS=64
-fdiagnostics-color=always))
(cxx_flags
(-x
c++
-m32
-O2
-fno-strict-aliasing
-fwrapv
-pthread
-fPIC
-pthread
-fdiagnostics-color=always))
(link_flags ())
(menhir_flags ())
(menhir_explain ())
(coq_flags (-q))
(coqdoc_flags (--toc))
(js_of_ocaml_flags
(--pretty --source-map-inline))
(js_of_ocaml_build_runtime_flags
(--pretty --source-map-inline))
(js_of_ocaml_link_flags (--source-map-inline))
# Packages matching: installed
# Name # Installed # Repository # Synopsis
base-bigarray base default
base-domains base default
base-nnp base default Naked pointers prohibited in the OCaml heap
base-threads base default
base-unix base default
dune 3.15.2 default Fast, portable, and opinionated build system
ocaml 5.3.0 default The OCaml compiler (virtual package)
ocaml-config 3 default OCaml Switch Configuration
ocaml-option-32bit 1 default Set OCaml to be compiled in 32-bit mode for 64-bit Linux and OS X hosts
ocaml-option-bytecode-only 1 default Compile OCaml without the native-code compiler
ocaml-variants 5.3.0+trunk default Current trunk
qcheck-core 0.21.3 default Core qcheck library
No snark intended, I genuinely wonder how you work with these failures. For example I'm not sure if it is reasonably easy to extract a backtrace, and/or to observe the same failure within the debug runtime. (Is this segfault due to a memory corruption, or an assert failure?)
If you prefer to work on this without upstream looking over your shoulder for now, I am happy to let you work your magic and wait for easier reproduction instructions.
OK, fair enough. Some of these remaining ones are just hard to reproduce - I suspect because they are timing or signal related.
I've been trying today for this one: https://github.com/ocaml-multicore/multicoretests/actions?query=branch%3Adomain-dls-32-bit-focus
I finally managed to reproduce this one - on 5.2.0 - and only once for now. It is indeed a sequential fault! :open_mouth: https://github.com/ocaml-multicore/multicoretests/actions/runs/9180414541/job/25244781126#step:18:762
Starting 74-th run
random seed: 103830913
generated error fail pass / total time test name
[ ] 0 0 0 0 / 1000 0.0s STM Domain.DLS test sequential
/usr/bin/bash: line 1: 197189 Segmentation fault (core dumped) ./focusedtest.exe -v -s 103830913
[ ] 0 0 0 0 / 1000 0.0s STM Domain.DLS test sequential (generating)
Switching hard-coded seed to 107236932 (the first one) works much better!
Across 500 repetitions this triggered 56 segfaults on 5.2.0
https://github.com/ocaml-multicore/multicoretests/actions/runs/9181424522/job/25248130724
and 50 segfaults on 5.3.0+trunk
https://github.com/ocaml-multicore/multicoretests/actions/runs/9181424534/job/25248130821
I've made a bit of progress on this.
Using trunk as of this morning on the CI, I tried running under the debug runtime to see if an assertion would catch the issue before causing a crash, and indeed it does - 6-8 times out of 200:
random seed: 107236932
generated error fail pass / total time test name
[ ] 0 0 0 0 / 1000 0.0s STM Domain.DLS test sequential
[00] file runtime/shared_heap.c; line 787 ### Assertion failed: Has_status_val(v, caml_global_heap_state.UNMARKED)
/usr/bin/bash: line 1: 394730 Aborted (core dumped) ./focusedtest.exe -v -s 107236932
[ ] 0 0 0 0 / 1000 0.0s STM Domain.DLS test sequential (generating)
The assertion in question is this one: https://github.com/ocaml/ocaml/blob/23d896786adb694d39785bd7770c537a6d8c6fe6/runtime/shared_heap.c#L787 used to verify the heap at the end of a STW.
I've tried testing previous versions (5.0.0, 5.1.0, 5.1.1) too - both with the hard-coded seed 107236932 and with random ones. Result:
trunk - triggering in 9-12 cases out of 200

How easy/hard would it be for you to run the testsuite on an arbitrary patch, if I send you changes to the runtime that might be related to the crash?
Here is a proposed small patch for example: https://github.com/gasche/ocaml/commits/mysterious-dls-crash-1/
(Another obvious idea would be to revert #12889 and see whether you can still reproduce the crash. I don't see any other relevant change in the runtime, but I also read this change carefully several times without noticing anything that could explain the current observations.)
How easy/hard would it be for you to run the testsuite on an arbitrary patch, if I send you changes to the runtime that might be related to the crash?
I should be able to do that for a feature branch like the proposed one :+1: Thanks for looking into this - I'll keep you posted.
Thanks! This change is really a blind move, so it is unlikely to work. I think the reasonable next step is to revert #12889. Let me know if you need help doing that -- it should revert cleanly from 5.2 in particular, but I haven't actually tried.
No cigar unfortunately: https://github.com/ocaml-multicore/multicoretests/tree/domain-dls-32-bit-gabriel https://github.com/ocaml-multicore/multicoretests/actions/runs/9352909686/job/25742143625
On that one I also saw this (even rarer) non-crashing misbehaviour:
Starting 89-th run
random seed: 135228812
generated error fail pass / total time test name
[ ] 0 0 0 0 / 1000 0.0s STM Domain.DLS test sequential
[ ] 0 0 0 0 / 1000 0.0s STM Domain.DLS test sequential (generating)
[✗] 136 1 0 135 / 1000 0.3s STM Domain.DLS test sequential
=== Error ======================================================================
Test STM Domain.DLS test sequential errored on (21 shrink steps):
Get (-81244680)
exception Invalid_argument("List.nth")
================================================================================
failure (0 tests failed, 1 tests errored, ran 1 tests)
Despite being generated as a number between 0 and length=4 (and not performing any shrinking)
https://github.com/ocaml-multicore/multicoretests/blob/f1533b82640a3c7e26cbf1c1cf9e91427a81ce56/src/domain/stm_tests_dls.ml#L28
the Get constructor's argument somehow ends up being -81244680...
That signals some form of heap corruption - like the assertion failure indicates. What makes you suspect #12889?
My reasoning is that your test exercises the DLS primitives, and started failing in 5.2 and no older release. #12889 is the only substantial change to the implementation of DLS that happened between 5.1 and 5.2, and it touches an unsafe part of the language (a mix of C runtime code and Obj.magic on the OCaml side). This could, of course, be an entirely unrelated issue, but then I wonder why it would only fail on this test precisely -- maybe the sheer luck of picking a favorable seed?
This could, of course, be an entirely unrelated issue, but then I wonder why it would only fail on this test precisely -- maybe the sheer luck of picking a favorable seed?
With the debug-runtime strategy to trigger this, in the CI I'm now repeating a QCheck test with count=1000 200 times - with no hard-coded seeds. That makes for 200,000 arbitrary tests and gives a pretty clear signal.
I've now completed a round of git bisect CI golf, and the finger points at https://github.com/ocaml/ocaml/pull/12193, with the latest run available here: https://github.com/ocaml-multicore/multicoretests/actions/runs/9368325592/job/25790072295
Highlights:
file runtime/shared_heap.c; line 778 ### Assertion failed: Has_status_val(v, caml_global_heap_state.UNMARKED)
Get 78
Here's the log score-card from the golf round:
I accidentally kicked off a run with an even smaller heap (s=2048). Among a couple of assertion failures, this triggered the following, which confirms my suspicion of memory corruption (bytecode corruption):
https://github.com/ocaml-multicore/multicoretests/actions/runs/9386404318/job/25846922781
Starting 179-th run
random seed: 517273910
generated error fail pass / total time test name
[ ] 0 0 0 0 / 1000 0.0s STM Domain.DLS test sequential
Fatal error: bad opcode (d701d6d7)
/usr/bin/bash: line 1: 415035 Aborted (core dumped) ./focusedtest.exe -v
[ ] 0 0 0 0 / 1000 0.0s STM Domain.DLS test sequential (generating)
I've tried to run under gdb batch mode in the CI: https://github.com/ocaml-multicore/multicoretests/actions/runs/9388010576/job/25852264722
For the 6 assertion failures this doesn't add much new info:
Starting 49-th run
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
random seed: 246561918
generated error fail pass / total time test name
[ ] 0 0 0 0 / 1000 0.0s STM Domain.DLS test sequential
[00] file runtime/shared_heap.c; line 778 ### Assertion failed: Has_status_val(v, caml_global_heap_state.UNMARKED)
[ ] 0 0 0 0 / 1000 0.0s STM Domain.DLS test sequential (generating)
Thread 1 "focusedtest.exe" received signal SIGABRT, Aborted.
0xf7fc4579 in __kernel_vsyscall ()
This is a failure in verify_object called from caml_verify_heap_from_stw:
void caml_verify_heap_from_stw(caml_domain_state *domain) {
struct heap_verify_state* st = caml_verify_begin();
caml_do_roots (&caml_verify_root, verify_scanning_flags, st, domain, 1);
caml_scan_global_roots(&caml_verify_root, st);
while (st->sp) verify_object(st, st->stack[--st->sp]);
caml_addrmap_clear(&st->seen);
caml_stat_free(st->stack);
caml_stat_free(st);
}
For the 2 clean segfaults it reveals a little:
Starting 39-th run
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
random seed: 66516876
generated error fail pass / total time test name
[ ] 0 0 0 0 / 1000 0.0s STM Domain.DLS test sequential
571 runtime/interp.c: No such file or directory.
[ ] 0 0 0 0 / 1000 0.0s STM Domain.DLS test sequential (generating)
Thread 1 "focusedtest.exe" received signal SIGSEGV, Segmentation fault.
0x565a8d88 in caml_interprete (prog=<optimized out>, prog_size=<optimized out>) at runtime/interp.c:571
with the crash happening in the pc = Code_val(accu); line:
Instruct(RETURN): {
sp += *pc++;
if (extra_args > 0) {
extra_args--;
pc = Code_val(accu);
env = accu;
Next;
} else {
goto do_return;
}
}
Isn't the common theme to both of these "stack corruption"? :thinking:
I've dug some more into this issue.
Experiments reveal that this can still trigger without split_from_parent, and with either Get or Set commands omitted entirely. This indicates an issue not directly tied to either of these (though there may be another one... :shrug:)
Using tmate I've also managed to log into the GitHub action runner machines, reproduce crashes there, and observe backtraces.
A backtrace for an assertion failure run with each thread annotated with its role:
Thread 4 (Thread 0xf03ffac0 (LWP 172957)): ##### A waiting backup thread for the child domain
#0 0xf1a6c579 in __kernel_vsyscall ()
#1 0xf1683243 in ?? () from /lib/i386-linux-gnu/libc.so.6
#2 0xf168a06a in pthread_mutex_lock () from /lib/i386-linux-gnu/libc.so.6
#3 0x5922ab80 in caml_plat_lock_blocking (m=0x5a9df4c0) at runtime/caml/platform.h:457
#4 backup_thread_func (v=<optimized out>) at runtime/domain.c:1076
#5 0xf1686c01 in ?? () from /lib/i386-linux-gnu/libc.so.6
#6 0xf172372c in ?? () from /lib/i386-linux-gnu/libc.so.6
Thread 3 (Thread 0xf1880740 (LWP 171712)): ##### Main thread paused during blocked C_CALL2 to caml_ml_condition_wait
#0 0xf1a6c579 in __kernel_vsyscall ()
#1 0xf1715336 in ?? () from /lib/i386-linux-gnu/libc.so.6
#2 0xf1682e81 in ?? () from /lib/i386-linux-gnu/libc.so.6
#3 0xf1686079 in pthread_cond_wait () from /lib/i386-linux-gnu/libc.so.6
#4 0x592572d1 in sync_condvar_wait (m=0x5a9e3920, c=0x5a9e1620) at runtime/sync_posix.h:116
#5 caml_ml_condition_wait (wcond=<optimized out>, wmut=<optimized out>) at runtime/sync.c:172
#6 0x5925dce2 in caml_interprete (prog=<optimized out>, prog_size=<optimized out>) at runtime/interp.c:1047
#7 0x59261052 in caml_startup_code_exn (pooling=0, argv=0xffdf2e24, section_table_size=3683, section_table=0x59292020 <caml_sections> "\204\225\246\276", data_size=21404, data=0x59292ea0 <caml_data> "\204\225\246\276", code_size=528104, code=0x59298240 <caml_code>) at runtime/startup_byt.c:655
#8 caml_startup_code_exn (code=0x59298240 <caml_code>, code_size=528104, data=0x59292ea0 <caml_data> "\204\225\246\276", data_size=21404, section_table=0x59292020 <caml_sections> "\204\225\246\276", section_table_size=3683, pooling=0, argv=0xffdf2e24) at runtime/startup_byt.c:588
#9 0x59261101 in caml_startup_code (code=0x59298240 <caml_code>, code_size=528104, data=0x59292ea0 <caml_data> "\204\225\246\276", data_size=21404, section_table=0x59292020 <caml_sections> "\204\225\246\276", section_table_size=3683, pooling=0, argv=0xffdf2e24) at runtime/startup_byt.c:669
#10 0x592120b4 in main (argc=4, argv=0xffdf2e24) at camlprim.c:25901
Thread 2 (Thread 0xee0f6ac0 (LWP 172956)): ##### Child domain thread triggering major GC slice on MAKEBLOCK2
#0 0x592512cc in caml_verify_root (state=0xf10ae180, v=-245978988, p=0xf11f7410) at runtime/shared_heap.c:759
#1 0x5923013d in caml_scan_stack (f=0x592512c0 <caml_verify_root>, fflags=0, fdata=0xf10ae180, stack=0xf117f010, v_gc_regs=0x0) at runtime/fiber.c:396
#2 0x5924f826 in caml_do_local_roots (f=0x592512c0 <caml_verify_root>, fflags=0, fdata=0xf10ae180, local_roots=0xee0f61ec, current_stack=0xf117f010, v_gc_regs=0x0) at runtime/roots.c:65
#3 0x5924f865 in caml_do_roots (f=0x592512c0 <caml_verify_root>, fflags=0, fdata=0xf10ae180, d=0xf0402620, do_final_val=1) at runtime/roots.c:41
#4 0x5925343e in caml_verify_heap_from_stw (domain=0xf0402620) at runtime/shared_heap.c:804
#5 0x59240c39 in stw_cycle_all_domains (domain=<optimized out>, args=<optimized out>, participating_count=<optimized out>, participating=<optimized out>) at runtime/major_gc.c:1434
#6 0x5922af41 in caml_try_run_on_all_domains_with_spin_work (sync=<optimized out>, handler=<optimized out>, data=<optimized out>, leader_setup=<optimized out>, enter_spin_callback=<optimized out>, enter_spin_data=<optimized out>) at runtime/domain.c:1695
#7 0x5922b10a in caml_try_run_on_all_domains (handler=0x592407c0 <stw_cycle_all_domains>, data=0xee0f5ca8, leader_setup=0x0) at runtime/domain.c:1717
#8 0x5924324e in major_collection_slice (howmuch=<optimized out>, participant_count=participant_count@entry=0, barrier_participants=barrier_participants@entry=0x0, mode=<optimized out>, force_compaction=<optimized out>) at runtime/major_gc.c:1851
#9 0x59243670 in caml_major_collection_slice (howmuch=-1) at runtime/major_gc.c:1869
#10 0x5922a7d8 in caml_poll_gc_work () at runtime/domain.c:1874
#11 0x59254e67 in caml_do_pending_actions_res () at runtime/signals.c:338
#12 0x5924cc9c in caml_alloc_small_dispatch (dom_st=0xf0402620, wosize=2, flags=3, nallocs=1, encoded_alloc_lens=0x0) at runtime/minor_gc.c:896
#13 0x5925f1f9 in caml_interprete (prog=<optimized out>, prog_size=<optimized out>) at runtime/interp.c:788
#14 0x59224cbc in caml_callbackN_exn (closure=<optimized out>, narg=<optimized out>, args=<optimized out>) at runtime/callback.c:131
#15 0x59224faa in caml_callback_exn (arg1=<optimized out>, closure=<optimized out>) at runtime/callback.c:144
#16 caml_callback_res (closure=-243007372, arg=1) at runtime/callback.c:320
#17 0x59229e4a in domain_thread_func (v=<optimized out>) at runtime/domain.c:1244
#18 0xf1686c01 in ?? () from /lib/i386-linux-gnu/libc.so.6
#19 0xf172372c in ?? () from /lib/i386-linux-gnu/libc.so.6
Thread 1 (Thread 0xef3feac0 (LWP 171715)): ##### Main backup thread participating in STW
#0 caml_failed_assert (expr=0x5926bf18 "Has_status_val(v, caml_global_heap_state.UNMARKED)", file_os=0x5926b995 "runtime/shared_heap.c", line=784) at runtime/misc.c:48
#1 0x59253709 in verify_object (v=-298832060, st=0xf1074c70) at runtime/shared_heap.c:784
#2 caml_verify_heap_from_stw (domain=0x5a9e0120) at runtime/shared_heap.c:807
#3 0x59240c39 in stw_cycle_all_domains (domain=<optimized out>, args=<optimized out>, participating_count=<optimized out>, participating=<optimized out>) at runtime/major_gc.c:1434
#4 0x5922aa28 in stw_handler (domain=0x5a9e0120) at runtime/domain.c:1486
#5 handle_incoming (s=<optimized out>) at runtime/domain.c:351
#6 0x5922ac9a in caml_handle_incoming_interrupts () at runtime/domain.c:364
#7 backup_thread_func (v=<optimized out>) at runtime/domain.c:1057
#8 0xf1686c01 in ?? () from /lib/i386-linux-gnu/libc.so.6
#9 0xf172372c in ?? () from /lib/i386-linux-gnu/libc.so.6
The bisection pointing at https://github.com/ocaml/ocaml/pull/12193, combined with stw_cycle_all_domains in the above, made me suspect 32-bit-relevant changes in that PR, but I've not been able to find anything so far.
Here's another backtrace, this one from a pure seg fault run:
Thread 2 (Thread 0xf17feac0 (LWP 122715)):
#0 0xf3f47579 in __kernel_vsyscall ()
#1 0xf3b15336 in ?? () from /lib/i386-linux-gnu/libc.so.6
#2 0xf3a82e81 in ?? () from /lib/i386-linux-gnu/libc.so.6
#3 0xf3a86079 in pthread_cond_wait () from /lib/i386-linux-gnu/libc.so.6
#4 0x648daa9d in caml_plat_wait (cond=0x64d693b4, mut=0x64d6939c) at runtime/platform.c:127
#5 0x648b6c1a in backup_thread_func (v=<optimized out>) at runtime/domain.c:1068
#6 0xf3a86c01 in ?? () from /lib/i386-linux-gnu/libc.so.6
#7 0xf3b2372c in ?? () from /lib/i386-linux-gnu/libc.so.6
Thread 1 (Thread 0xf3d5b740 (LWP 122712)): ##### Segfault during RETURN instruction in pc = Code_val(accu);
#0 0x648ea29f in caml_interprete (prog=<optimized out>, prog_size=<optimized out>) at runtime/interp.c:573
#1 0x648ed052 in caml_startup_code_exn (pooling=0, argv=0xff8ca1e4, section_table_size=3683, section_table=0x6491e020 <caml_sections> "\204\225\246\276", data_size=21404, data=0x6491eea0 <caml_data> "\204\225\246\276", code_size=528104, code=0x64924240 <caml_code>) at runtime/startup_byt.c:655
#2 caml_startup_code_exn (code=0x64924240 <caml_code>, code_size=528104, data=0x6491eea0 <caml_data> "\204\225\246\276", data_size=21404, section_table=0x6491e020 <caml_sections> "\204\225\246\276", section_table_size=3683, pooling=0, argv=0xff8ca1e4) at runtime/startup_byt.c:588
#3 0x648ed101 in caml_startup_code (code=0x64924240 <caml_code>, code_size=528104, data=0x6491eea0 <caml_data> "\204\225\246\276", data_size=21404, section_table=0x6491e020 <caml_sections> "\204\225\246\276", section_table_size=3683, pooling=0, argv=0xff8ca1e4) at runtime/startup_byt.c:669
#4 0x6489e0b4 in main (argc=4, argv=0xff8ca1e4) at camlprim.c:25901
The common theme is still stack memory corruption (caml_scan_stack and RETURN). The issue also seems restricted to trunk and 5.2 (incl. compaction and https://github.com/ocaml/ocaml/pull/12889).
I've still not been able to reproduce locally.
I've finally managed to reproduce this locally by using musl. It turns out this is not restricted to 32-bit but reproduces with a musl bytecode compiler.
I've shared repro steps on the upstream issue: https://github.com/ocaml/ocaml/issues/13402
In the CI run for #445 on 32-bit trunk, the STM Domain.DLS test sequential triggered a segfault: https://github.com/ocaml-multicore/multicoretests/actions/runs/8436771284/job/23104952265?pr=445
This may be another case of a 32-bit/bytecode issue showing up in a couple of different tests:
#440
#412
Surprisingly, this case triggered in a sequential (single-domain) test! :open_mouth: