ocaml-multicore / multicoretests

PBT testsuite and libraries for testing multicore OCaml
https://ocaml-multicore.github.io/multicoretests/
BSD 2-Clause "Simplified" License

[ocaml5-issue] Segfault in `STM Domain.DLS test sequential` on bytecode #446

Open jmid opened 5 months ago

jmid commented 5 months ago

In the CI-run for #445 on 32-bit trunk the STM Domain.DLS test sequential triggered a segfault https://github.com/ocaml-multicore/multicoretests/actions/runs/8436771284/job/23104952265?pr=445

random seed: 107236932
generated error fail pass / total     time test name

[ ]    0    0    0    0 / 1000     0.0s STM Domain.DLS test sequential
File "src/domain/dune", line 31, characters 7-20:
31 |  (name stm_tests_dls)
            ^^^^^^^^^^^^^
(cd _build/default/src/domain && ./stm_tests_dls.exe --verbose)
Command got signal SEGV.
[ ]    0    0    0    0 / 1000     0.0s STM Domain.DLS test sequential (generating)

This may be another case of a 32-bit/bytecode issue showing up in a couple of different tests:

Surprisingly, however, this case triggered in a sequential (single-domain) test! :open_mouth:

gasche commented 5 months ago

I wonder if it is related to #12889, the only recent change to Domain.DLS that I can think of. (I hope not!)

jmid commented 3 months ago

This just triggered again on 32-bit 5.3.0+trunk with the merge of #460 to main: https://github.com/ocaml-multicore/multicoretests/actions/runs/9169655398/job/25210472949

random seed: 103830913
generated error fail pass / total     time test name

[ ]    0    0    0    0 / 1000     0.0s STM Domain.DLS test sequential
File "src/domain/dune", line 31, characters 7-20:
31 |  (name stm_tests_dls)
            ^^^^^^^^^^^^^
(cd _build/default/src/domain && ./stm_tests_dls.exe --verbose)
Command got signal SEGV.
[ ]    0    0    0    0 / 1000     0.0s STM Domain.DLS test sequential (generating)
gasche commented 3 months ago

Is there more information that we can use to try to investigate this? "There is a segfault somewhere in Domain.DLS on 32bit" is not that much.

jmid commented 3 months ago

First off, this is a collection of failures we observe. Once we have fleshed out reproducible steps, these are reported upstream. Help is very welcome, snarky remarks less so.

"There is a segfault somewhere in Domain.DLS on 32bit" is not that much.

Come on, there are the QCheck seeds that caused the failures, GitHub Actions workflows listing the steps taken, and links to two CI run logs with full information about versions.

Run opam exec -- ocamlc -config
  opam exec -- ocamlc -config
  opam config list
  opam exec -- dune printenv
  opam list --columns=name,installed-version,repository,synopsis-or-target
  opam clean --all-switches --unused-repositories --logs --download-cache --repo-cache
  shell: /usr/bin/bash -e {0}
  env:
    QCHECK_MSG_INTERVAL: 60
    DUNE_PROFILE: dev
    OCAMLRUNPARAM: 
    DUNE_CI_ALIAS: runtest
    COMPILER: ocaml-variants.5.3.0+trunk,ocaml-option-32bit
    OCAML_COMPILER_GIT_REF: refs/heads/trunk
    CUSTOM_COMPILER_VERSION: 
    CUSTOM_COMPILER_SRC: 
    CUSTOM_OCAML_PKG_VERSION: 
    OPAMCLI: 2.0
    OPAMCOLOR: always
    OPAMERRLOGLEN: 0
    OPAMJOBS: 4
    OPAMPRECISETRACKING: 1
    OPAMSOLVERTIMEOUT: 1000
    OPAMYES: 1
    DUNE_CACHE: enabled
    DUNE_CACHE_TRANSPORT: direct
    DUNE_CACHE_STORAGE_MODE: copy
    CLICOLOR_FORCE: 1
version: 5.3.0+dev0-2023-12-22
standard_library_default: /home/runner/work/multicoretests/multicoretests/_opam/lib/ocaml
standard_library: /home/runner/work/multicoretests/multicoretests/_opam/lib/ocaml
ccomp_type: cc
c_compiler: gcc -m32
ocamlc_cflags:  -O2 -fno-strict-aliasing -fwrapv -pthread -fPIC  -pthread
ocamlc_cppflags:  -D_FILE_OFFSET_BITS=64 
ocamlopt_cflags:  -O2 -fno-strict-aliasing -fwrapv -pthread -fPIC  -pthread
ocamlopt_cppflags:  -D_FILE_OFFSET_BITS=64 
bytecomp_c_compiler: gcc -m32  -O2 -fno-strict-aliasing -fwrapv -pthread -fPIC  -pthread  -D_FILE_OFFSET_BITS=64 
native_c_compiler: gcc -m32  -O2 -fno-strict-aliasing -fwrapv -pthread -fPIC  -pthread  -D_FILE_OFFSET_BITS=64 
bytecomp_c_libraries: -lzstd  -latomic -lm  -lpthread
native_c_libraries:  -latomic -lm  -lpthread
native_ldflags: 
native_pack_linker: ld -r -o 
native_compiler: false
architecture: i386
model: default
int_size: 31
word_size: 32
system: linux
asm: i386-linux-as
asm_cfi_supported: false
with_frame_pointers: false
ext_exe: 
ext_obj: .o
ext_asm: .s
ext_lib: .a
ext_dll: .so
os_type: Unix
default_executable_name: a.out
systhread_supported: true
host: i386-pc-linux-gnu
target: i386-pc-linux-gnu
flambda: false
safe_string: true
default_safe_string: true
flat_float_array: true
function_sections: false
afl_instrument: false
tsan: false
windows_unicode: false
supports_shared_libraries: true
native_dynlink: false
naked_pointers: false
exec_magic_number: Caml1999X035
cmi_magic_number: Caml1999I035
cmo_magic_number: Caml1999O035
cma_magic_number: Caml1999A035
cmx_magic_number: Caml1999Y035
cmxa_magic_number: Caml1999Z035
ast_impl_magic_number: Caml1999M035
ast_intf_magic_number: Caml1999N035
cmxs_magic_number: Caml1999D035
cmt_magic_number: Caml1999T035
linear_magic_number: Caml1999L035

<><> Global opam variables ><><><><><><><><><><><><><><><><><><><><><><><><><><>
arch              x86_64                                          # Inferred from system
exe                                                               # Suffix needed for executable filenames (Windows)
jobs              4                                               # The number of parallel jobs set up in opam configuration
make              make                                            # The 'make' command to use
opam-version      2.1.6                                           # The currently running opam version
os                linux                                           # Inferred from system
os-distribution   ubuntu                                          # Inferred from system
os-family         debian                                          # Inferred from system
os-version        22.04                                           # Inferred from system
root              /home/runner/.opam                              # The current opam root directory
switch            /home/runner/work/multicoretests/multicoretests # The identifier of the current switch
sys-ocaml-arch                                                    # Target architecture of the OCaml compiler present on your system
sys-ocaml-cc                                                      # Host C Compiler type of the OCaml compiler present on your system
sys-ocaml-libc                                                    # Host C Runtime Library type of the OCaml compiler present on your system
sys-ocaml-version                                                 # OCaml version present on your system independently of opam, if any

<><> Configuration variables from the current switch ><><><><><><><><><><><><><>
prefix   /home/runner/work/multicoretests/multicoretests/_opam
lib      /home/runner/work/multicoretests/multicoretests/_opam/lib
bin      /home/runner/work/multicoretests/multicoretests/_opam/bin
sbin     /home/runner/work/multicoretests/multicoretests/_opam/sbin
share    /home/runner/work/multicoretests/multicoretests/_opam/share
doc      /home/runner/work/multicoretests/multicoretests/_opam/doc
etc      /home/runner/work/multicoretests/multicoretests/_opam/etc
man      /home/runner/work/multicoretests/multicoretests/_opam/man
toplevel /home/runner/work/multicoretests/multicoretests/_opam/lib/toplevel
stublibs /home/runner/work/multicoretests/multicoretests/_opam/lib/stublibs
user     runner
group    docker

<><> Package variables ('opam var --package PKG' to show) <><><><><><><><><><><>
PKG:name       # Name of the package
PKG:version    # Version of the package
PKG:depends    # Resolved direct dependencies of the package
PKG:installed  # Whether the package is installed
PKG:enable     # Takes the value "enable" or "disable" depending on whether the package is installed
PKG:pinned     # Whether the package is pinned
PKG:bin        # Binary directory for this package
PKG:sbin       # System binary directory for this package
PKG:lib        # Library directory for this package
PKG:man        # Man directory for this package
PKG:doc        # Doc directory for this package
PKG:share      # Share directory for this package
PKG:etc        # Etc directory for this package
PKG:build      # Directory where the package was built
PKG:hash       # Hash of the package archive
PKG:dev        # True if this is a development package
PKG:build-id   # A hash identifying the precise package version with all its dependencies
PKG:opamfile   # Path of the curent opam file
(flags
 (-w
  @1..3@5..28@30..39@43@46..47@49..57@61..62-40
  -strict-sequence
  -strict-formats
  -short-paths
  -keep-locs))
(ocamlc_flags (-g))
(ocamlopt_flags (-g))
(melange.compile_flags (-g))
(c_flags
 (-m32
  -O2
  -fno-strict-aliasing
  -fwrapv
  -pthread
  -fPIC
  -pthread
  -m32
  -D_FILE_OFFSET_BITS=64
  -fdiagnostics-color=always))
(cxx_flags
 (-x
  c++
  -m32
  -O2
  -fno-strict-aliasing
  -fwrapv
  -pthread
  -fPIC
  -pthread
  -fdiagnostics-color=always))
(link_flags ())
(menhir_flags ())
(menhir_explain ())
(coq_flags (-q))
(coqdoc_flags (--toc))
(js_of_ocaml_flags
 (--pretty --source-map-inline))
(js_of_ocaml_build_runtime_flags
 (--pretty --source-map-inline))
(js_of_ocaml_link_flags (--source-map-inline))
# Packages matching: installed
# Name                     # Installed # Repository # Synopsis
base-bigarray              base        default
base-domains               base        default
base-nnp                   base        default      Naked pointers prohibited in the OCaml heap
base-threads               base        default
base-unix                  base        default
dune                       3.15.2      default      Fast, portable, and opinionated build system
ocaml                      5.3.0       default      The OCaml compiler (virtual package)
ocaml-config               3           default      OCaml Switch Configuration
ocaml-option-32bit         1           default      Set OCaml to be compiled in 32-bit mode for 64-bit Linux and OS X hosts
ocaml-option-bytecode-only 1           default      Compile OCaml without the native-code compiler
ocaml-variants             5.3.0+trunk default      Current trunk
qcheck-core                0.21.3      default      Core qcheck library
gasche commented 3 months ago

No snark intended, I genuinely wonder how you work with these failures. For example I'm not sure if it is reasonably easy to extract a backtrace, and/or to observe the same failure within the debug runtime. (Is this segfault due to a memory corruption, or an assert failure?)

If you prefer to work on this without upstream looking over your shoulder for now, I am happy to let you work your magic and wait for easier reproduction instructions.

jmid commented 3 months ago

OK, fair enough. Some of these remaining ones are just hard to reproduce - I suspect because they are timing- or signal-related.

I've been trying today for this one: https://github.com/ocaml-multicore/multicoretests/actions?query=branch%3Adomain-dls-32-bit-focus

jmid commented 3 months ago

I finally managed to reproduce this one - on 5.2.0 - and only once for now. It is indeed a sequential fault! :open_mouth: https://github.com/ocaml-multicore/multicoretests/actions/runs/9180414541/job/25244781126#step:18:762

Starting 74-th run

random seed: 103830913
generated error fail pass / total     time test name

[ ]    0    0    0    0 / 1000     0.0s STM Domain.DLS test sequential
/usr/bin/bash: line 1: 197189 Segmentation fault      (core dumped) ./focusedtest.exe -v -s 103830913
[ ]    0    0    0    0 / 1000     0.0s STM Domain.DLS test sequential (generating)
jmid commented 3 months ago

Switching the hard-coded seed to 107236932 (the first one) works much better! Across 500 repetitions this triggered 56 segfaults on 5.2.0 https://github.com/ocaml-multicore/multicoretests/actions/runs/9181424522/job/25248130724 and 50 segfaults on 5.3.0+trunk https://github.com/ocaml-multicore/multicoretests/actions/runs/9181424534/job/25248130821
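
For reference - an illustrative sketch, not code from the repository - the seed-pinning that the CI runs above do via the runner's -s flag can also be done from OCaml by passing an explicit random state to QCheck_base_runner.run_tests; the placeholder test below merely stands in for the STM Domain.DLS sequential test:

let placeholder =
  QCheck.Test.make ~name:"placeholder" ~count:1000 QCheck.int (fun _ -> true)

let () =
  exit
    (QCheck_base_runner.run_tests ~verbose:true
       ~rand:(Random.State.make [| 107236932 |])
       [ placeholder ])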

jmid commented 3 months ago

I've made a bit of progress on this.

The failing assertion is this one: https://github.com/ocaml/ocaml/blob/23d896786adb694d39785bd7770c537a6d8c6fe6/runtime/shared_heap.c#L787 - it is used to verify the heap at the end of a stop-the-world (STW) section.

jmid commented 3 months ago

I've tried testing previous versions (5.0.0, 5.1.0, 5.1.1) too - both with the hard-coded seed 107236932 and with random ones. Result:

gasche commented 3 months ago

How easy/hard would it be for you to run the testsuite on an arbitrary patch, if I send you changes to the runtime that might be related to the crash?

gasche commented 3 months ago

Here is a proposed small patch for example: https://github.com/gasche/ocaml/commits/mysterious-dls-crash-1/

(Another obvious idea would be to revert #12889 and see whether you can still reproduce the crash. I don't see any other relevant change in the runtime, but I also read this change carefully several times without noticing anything that could explain the current observations.)

jmid commented 3 months ago

How easy/hard would it be for you to run the testsuite on an arbitrary patch, if I send you changes to the runtime that might be related to the crash?

I should be able to do that for a feature branch like the proposed one :+1: Thanks for looking into this - I'll keep you posted.

gasche commented 3 months ago

Thanks! This change is really a blind move, so it is unlikely to work. I think the reasonable next step is to revert #12889. Let me know if you need help doing that -- it should revert cleanly from 5.2 in particular, but I haven't actually tried.

jmid commented 3 months ago

No cigar unfortunately: https://github.com/ocaml-multicore/multicoretests/tree/domain-dls-32-bit-gabriel https://github.com/ocaml-multicore/multicoretests/actions/runs/9352909686/job/25742143625

On that one I also saw this (even rarer) non-crashing misbehaviour:

Starting 89-th run

random seed: 135228812
generated error fail pass / total     time test name
[ ]    0    0    0    0 / 1000     0.0s STM Domain.DLS test sequential
[ ]    0    0    0    0 / 1000     0.0s STM Domain.DLS test sequential (generating)
[✗]  136    1    0  135 / 1000     0.3s STM Domain.DLS test sequential
=== Error ======================================================================
Test STM Domain.DLS test sequential errored on (21 shrink steps):
   Get (-81244680)
exception Invalid_argument("List.nth")
================================================================================
failure (0 tests failed, 1 tests errored, ran 1 tests)

Despite being generated as a number between 0 and length=4 (and despite no shrinking being performed) https://github.com/ocaml-multicore/multicoretests/blob/f1533b82640a3c7e26cbf1c1cf9e91427a81ce56/src/domain/stm_tests_dls.ml#L28 the Get constructor's argument somehow ends up being -81244680...
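
For context, here is a rough sketch (an assumed shape, not a verbatim copy of stm_tests_dls.ml) of how such an index is drawn with QCheck; Gen.int_bound only produces values in the range 0..length, so a negative argument like the one above cannot come out of generation itself:

let length = 4

(* 0..length inclusive, never negative *)
let gen_get_index : int QCheck.Gen.t = QCheck.Gen.int_bound length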

That signals some form of heap corruption - as the assertion failure also indicates. What makes you suspect #12889?

gasche commented 3 months ago

My reasoning is that your test exercises the DLS primitives, and started failing in 5.2 and in no older release. #12889 is the only substantial change to the implementation of DLS that happened between 5.1 and 5.2, and it touches an unsafe part of the language (a mix of C runtime code and Obj.magic on the OCaml side). This could, of course, be an entirely unrelated issue, but then I wonder why it would only fail on this test precisely -- maybe the sheer luck of picking a favorable seed?
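
(For readers following along, here is a minimal sketch - not the stdlib implementation, and with made-up names - of the DLS storage pattern referred to above: values of arbitrary type live in a single untyped slot array accessed through Obj, so any indexing or resizing bug immediately becomes memory-unsafe.)

type 'a key = { index : int; init : unit -> 'a }

(* one untyped slot array; the real stdlib grows it lazily and keeps one per domain *)
let slots : Obj.t array = Array.make 8 (Obj.repr ())

let set (k : 'a key) (v : 'a) : unit = slots.(k.index) <- Obj.repr v
let get (k : 'a key) : 'a = Obj.obj slots.(k.index)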

jmid commented 3 months ago

This could, of course, be an entirely unrelated issue, but then I wonder why it would only fail on this test precisely -- maybe the sheer luck of picking a favorable seed?

With the debug-runtime strategy to trigger this, in the CI I'm now running 200 repetitions of a QCheck test with count=1000 - with no hard-coded seeds. That makes for 200,000 arbitrary test cases and gives a pretty clear signal.

jmid commented 3 months ago

I've now completed a round of git bisect CI golf, and the finger points at:

with the latest run available here: https://github.com/ocaml-multicore/multicoretests/actions/runs/9368325592/job/25790072295 Highlights:

Here's the log score-card from the golf round:

jmid commented 3 months ago

I accidentally kicked off a run with an even smaller minor heap ("s=2048"). Among a couple of assertion failures, this triggered the following, which confirms my suspicion of memory corruption (bytecode corruption): https://github.com/ocaml-multicore/multicoretests/actions/runs/9386404318/job/25846922781

Starting 179-th run

random seed: 517273910
generated error fail pass / total     time test name

[ ]    0    0    0    0 / 1000     0.0s STM Domain.DLS test sequential
Fatal error: bad opcode (d701d6d7)
/usr/bin/bash: line 1: 415035 Aborted                 (core dumped) ./focusedtest.exe -v
[ ]    0    0    0    0 / 1000     0.0s STM Domain.DLS test sequential (generating)
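
(For reference, the "s=2048" above is OCAMLRUNPARAM's minor-heap-size parameter, given in words. A minimal sketch - not part of the test - of confirming at startup that the runtime picked it up:)

let () =
  Printf.printf "minor_heap_size = %d words\n%!" (Gc.get ()).Gc.minor_heap_size
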
jmid commented 3 months ago

I've tried running under gdb in batch mode in the CI: https://github.com/ocaml-multicore/multicoretests/actions/runs/9388010576/job/25852264722

For the 6 assertion failures this doesn't add much new info:

Starting 49-th run
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".

random seed: 246561918
generated error fail pass / total     time test name

[ ]    0    0    0    0 / 1000     0.0s STM Domain.DLS test sequential
[00] file runtime/shared_heap.c; line 778 ### Assertion failed: Has_status_val(v, caml_global_heap_state.UNMARKED)
[ ]    0    0    0    0 / 1000     0.0s STM Domain.DLS test sequential (generating)
Thread 1 "focusedtest.exe" received signal SIGABRT, Aborted.
0xf7fc4579 in __kernel_vsyscall ()

This is a failure in verify_object, called from caml_verify_heap_from_stw:

void caml_verify_heap_from_stw(caml_domain_state *domain) {
  struct heap_verify_state* st = caml_verify_begin();
  caml_do_roots (&caml_verify_root, verify_scanning_flags, st, domain, 1);
  caml_scan_global_roots(&caml_verify_root, st);
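  /* the loop below pops and checks every value reachable from the roots;
     the failing UNMARKED assertion fires inside verify_object */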

  while (st->sp) verify_object(st, st->stack[--st->sp]);

  caml_addrmap_clear(&st->seen);
  caml_stat_free(st->stack);
  caml_stat_free(st);
}

For the 2 clean segfaults it reveals a little:

Starting 39-th run
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".

random seed: 66516876
generated error fail pass / total     time test name

[ ]    0    0    0    0 / 1000     0.0s STM Domain.DLS test sequential
571 runtime/interp.c: No such file or directory.
[ ]    0    0    0    0 / 1000     0.0s STM Domain.DLS test sequential (generating)
Thread 1 "focusedtest.exe" received signal SIGSEGV, Segmentation fault.
0x565a8d88 in caml_interprete (prog=<optimized out>, prog_size=<optimized out>) at runtime/interp.c:571

with the crash happening in the pc = Code_val(accu); line:

    Instruct(RETURN): {
      sp += *pc++;
      if (extra_args > 0) {
        extra_args--;
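        /* accu is expected to hold the closure that consumes the remaining
           extra_args; Code_val(accu) reads its code pointer, so a corrupted
           accu faults right here */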
        pc = Code_val(accu);
        env = accu;
        Next;
      } else {
        goto do_return;
      }
    }

Isn't the common theme of both of these "stack corruption"? :thinking:

jmid commented 1 month ago

I've dug some more into this issue.

Experiments reveal that this can still trigger without split_from_parent and with either the Get or the Set commands omitted entirely. This indicates that there is an issue not directly tied to either of these (though there may be another one... :shrug: )
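
For reference, split_from_parent is the optional callback of Domain.DLS.new_key that derives a child domain's initial value from the parent's value; the experiments above simply create the keys without it. A minimal sketch of the two variants (the values are placeholders):

let key_plain : int Domain.DLS.key =
  Domain.DLS.new_key (fun () -> 0)

let key_split : int Domain.DLS.key =
  Domain.DLS.new_key ~split_from_parent:(fun parent -> parent + 1) (fun () -> 0)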

jmid commented 1 month ago

Using tmate I've also managed to log into the GitHub Actions runner machines, reproduce crashes there, and observe backtraces.

A backtrace from an assertion-failure run, with each thread annotated with its role:

Thread 4 (Thread 0xf03ffac0 (LWP 172957)):  ##### A waiting backup thread for the child domain
#0  0xf1a6c579 in __kernel_vsyscall ()
#1  0xf1683243 in ?? () from /lib/i386-linux-gnu/libc.so.6
#2  0xf168a06a in pthread_mutex_lock () from /lib/i386-linux-gnu/libc.so.6
#3  0x5922ab80 in caml_plat_lock_blocking (m=0x5a9df4c0) at runtime/caml/platform.h:457
#4  backup_thread_func (v=<optimized out>) at runtime/domain.c:1076
#5  0xf1686c01 in ?? () from /lib/i386-linux-gnu/libc.so.6
#6  0xf172372c in ?? () from /lib/i386-linux-gnu/libc.so.6

Thread 3 (Thread 0xf1880740 (LWP 171712)):  ##### Main thread paused during blocked C_CALL2 to caml_ml_condition_wait
#0  0xf1a6c579 in __kernel_vsyscall ()
#1  0xf1715336 in ?? () from /lib/i386-linux-gnu/libc.so.6
#2  0xf1682e81 in ?? () from /lib/i386-linux-gnu/libc.so.6
#3  0xf1686079 in pthread_cond_wait () from /lib/i386-linux-gnu/libc.so.6
#4  0x592572d1 in sync_condvar_wait (m=0x5a9e3920, c=0x5a9e1620) at runtime/sync_posix.h:116
#5  caml_ml_condition_wait (wcond=<optimized out>, wmut=<optimized out>) at runtime/sync.c:172
#6  0x5925dce2 in caml_interprete (prog=<optimized out>, prog_size=<optimized out>) at runtime/interp.c:1047
#7  0x59261052 in caml_startup_code_exn (pooling=0, argv=0xffdf2e24, section_table_size=3683, section_table=0x59292020 <caml_sections> "\204\225\246\276", data_size=21404, data=0x59292ea0 <caml_data> "\204\225\246\276", code_size=528104, code=0x59298240 <caml_code>) at runtime/startup_byt.c:655
#8  caml_startup_code_exn (code=0x59298240 <caml_code>, code_size=528104, data=0x59292ea0 <caml_data> "\204\225\246\276", data_size=21404, section_table=0x59292020 <caml_sections> "\204\225\246\276", section_table_size=3683, pooling=0, argv=0xffdf2e24) at runtime/startup_byt.c:588
#9  0x59261101 in caml_startup_code (code=0x59298240 <caml_code>, code_size=528104, data=0x59292ea0 <caml_data> "\204\225\246\276", data_size=21404, section_table=0x59292020 <caml_sections> "\204\225\246\276", section_table_size=3683, pooling=0, argv=0xffdf2e24) at runtime/startup_byt.c:669
#10 0x592120b4 in main (argc=4, argv=0xffdf2e24) at camlprim.c:25901

Thread 2 (Thread 0xee0f6ac0 (LWP 172956)):  ##### Child domain thread triggering major GC slice on MAKEBLOCK2
#0  0x592512cc in caml_verify_root (state=0xf10ae180, v=-245978988, p=0xf11f7410) at runtime/shared_heap.c:759
#1  0x5923013d in caml_scan_stack (f=0x592512c0 <caml_verify_root>, fflags=0, fdata=0xf10ae180, stack=0xf117f010, v_gc_regs=0x0) at runtime/fiber.c:396
#2  0x5924f826 in caml_do_local_roots (f=0x592512c0 <caml_verify_root>, fflags=0, fdata=0xf10ae180, local_roots=0xee0f61ec, current_stack=0xf117f010, v_gc_regs=0x0) at runtime/roots.c:65
#3  0x5924f865 in caml_do_roots (f=0x592512c0 <caml_verify_root>, fflags=0, fdata=0xf10ae180, d=0xf0402620, do_final_val=1) at runtime/roots.c:41
#4  0x5925343e in caml_verify_heap_from_stw (domain=0xf0402620) at runtime/shared_heap.c:804
#5  0x59240c39 in stw_cycle_all_domains (domain=<optimized out>, args=<optimized out>, participating_count=<optimized out>, participating=<optimized out>) at runtime/major_gc.c:1434
#6  0x5922af41 in caml_try_run_on_all_domains_with_spin_work (sync=<optimized out>, handler=<optimized out>, data=<optimized out>, leader_setup=<optimized out>, enter_spin_callback=<optimized out>, enter_spin_data=<optimized out>) at runtime/domain.c:1695
#7  0x5922b10a in caml_try_run_on_all_domains (handler=0x592407c0 <stw_cycle_all_domains>, data=0xee0f5ca8, leader_setup=0x0) at runtime/domain.c:1717
#8  0x5924324e in major_collection_slice (howmuch=<optimized out>, participant_count=participant_count@entry=0, barrier_participants=barrier_participants@entry=0x0, mode=<optimized out>, force_compaction=<optimized out>) at runtime/major_gc.c:1851
#9  0x59243670 in caml_major_collection_slice (howmuch=-1) at runtime/major_gc.c:1869
#10 0x5922a7d8 in caml_poll_gc_work () at runtime/domain.c:1874
#11 0x59254e67 in caml_do_pending_actions_res () at runtime/signals.c:338
#12 0x5924cc9c in caml_alloc_small_dispatch (dom_st=0xf0402620, wosize=2, flags=3, nallocs=1, encoded_alloc_lens=0x0) at runtime/minor_gc.c:896
#13 0x5925f1f9 in caml_interprete (prog=<optimized out>, prog_size=<optimized out>) at runtime/interp.c:788
#14 0x59224cbc in caml_callbackN_exn (closure=<optimized out>, narg=<optimized out>, args=<optimized out>) at runtime/callback.c:131
#15 0x59224faa in caml_callback_exn (arg1=<optimized out>, closure=<optimized out>) at runtime/callback.c:144
#16 caml_callback_res (closure=-243007372, arg=1) at runtime/callback.c:320
#17 0x59229e4a in domain_thread_func (v=<optimized out>) at runtime/domain.c:1244
#18 0xf1686c01 in ?? () from /lib/i386-linux-gnu/libc.so.6
#19 0xf172372c in ?? () from /lib/i386-linux-gnu/libc.so.6

Thread 1 (Thread 0xef3feac0 (LWP 171715)): ##### Main backup thread participating in STW
#0  caml_failed_assert (expr=0x5926bf18 "Has_status_val(v, caml_global_heap_state.UNMARKED)", file_os=0x5926b995 "runtime/shared_heap.c", line=784) at runtime/misc.c:48
#1  0x59253709 in verify_object (v=-298832060, st=0xf1074c70) at runtime/shared_heap.c:784
#2  caml_verify_heap_from_stw (domain=0x5a9e0120) at runtime/shared_heap.c:807
#3  0x59240c39 in stw_cycle_all_domains (domain=<optimized out>, args=<optimized out>, participating_count=<optimized out>, participating=<optimized out>) at runtime/major_gc.c:1434
#4  0x5922aa28 in stw_handler (domain=0x5a9e0120) at runtime/domain.c:1486
#5  handle_incoming (s=<optimized out>) at runtime/domain.c:351
#6  0x5922ac9a in caml_handle_incoming_interrupts () at runtime/domain.c:364
#7  backup_thread_func (v=<optimized out>) at runtime/domain.c:1057
#8  0xf1686c01 in ?? () from /lib/i386-linux-gnu/libc.so.6
#9  0xf172372c in ?? () from /lib/i386-linux-gnu/libc.so.6

The bisection pointing at https://github.com/ocaml/ocaml/pull/12193, combined with stw_cycle_all_domains appearing in the backtrace above, made me suspect 32-bit-relevant changes in that PR, but I've not been able to find anything so far.

Here's another backtrace, this one from a pure segfault run:

Thread 2 (Thread 0xf17feac0 (LWP 122715)):
#0  0xf3f47579 in __kernel_vsyscall ()
#1  0xf3b15336 in ?? () from /lib/i386-linux-gnu/libc.so.6
#2  0xf3a82e81 in ?? () from /lib/i386-linux-gnu/libc.so.6
#3  0xf3a86079 in pthread_cond_wait () from /lib/i386-linux-gnu/libc.so.6
#4  0x648daa9d in caml_plat_wait (cond=0x64d693b4, mut=0x64d6939c) at runtime/platform.c:127
#5  0x648b6c1a in backup_thread_func (v=<optimized out>) at runtime/domain.c:1068
#6  0xf3a86c01 in ?? () from /lib/i386-linux-gnu/libc.so.6
#7  0xf3b2372c in ?? () from /lib/i386-linux-gnu/libc.so.6

Thread 1 (Thread 0xf3d5b740 (LWP 122712)):  ##### Segfault during RETURN instruction in pc = Code_val(accu);
#0  0x648ea29f in caml_interprete (prog=<optimized out>, prog_size=<optimized out>) at runtime/interp.c:573
#1  0x648ed052 in caml_startup_code_exn (pooling=0, argv=0xff8ca1e4, section_table_size=3683, section_table=0x6491e020 <caml_sections> "\204\225\246\276", data_size=21404, data=0x6491eea0 <caml_data> "\204\225\246\276", code_size=528104, code=0x64924240 <caml_code>) at runtime/startup_byt.c:655
#2  caml_startup_code_exn (code=0x64924240 <caml_code>, code_size=528104, data=0x6491eea0 <caml_data> "\204\225\246\276", data_size=21404, section_table=0x6491e020 <caml_sections> "\204\225\246\276", section_table_size=3683, pooling=0, argv=0xff8ca1e4) at runtime/startup_byt.c:588
#3  0x648ed101 in caml_startup_code (code=0x64924240 <caml_code>, code_size=528104, data=0x6491eea0 <caml_data> "\204\225\246\276", data_size=21404, section_table=0x6491e020 <caml_sections> "\204\225\246\276", section_table_size=3683, pooling=0, argv=0xff8ca1e4) at runtime/startup_byt.c:669
#4  0x6489e0b4 in main (argc=4, argv=0xff8ca1e4) at camlprim.c:25901

The common theme is still stack memory corruption (caml_scan_stack and RETURN). The issue also seems restricted to trunk and 5.2 (including compaction and https://github.com/ocaml/ocaml/pull/12889).

I've still not been able to reproduce locally.

jmid commented 2 days ago

I've finally managed to reproduce this locally by using musl. It turns out this is not restricted to 32-bit: it also reproduces with a musl bytecode compiler. I've shared repro steps on the upstream issue: https://github.com/ocaml/ocaml/issues/13402