ocaml-multicore / multicoretests

PBT testsuite and libraries for testing multicore OCaml
https://ocaml-multicore.github.io/multicoretests/
BSD 2-Clause "Simplified" License

[ocaml5-issue] Windows trunk bytecode domain_spawntree crash or deadlock #354

Open jmid opened 1 year ago

jmid commented 1 year ago

A Windows trunk bytecode crash surfaced today on src/domain/domain_spawntree.ml: https://github.com/ocaml-multicore/multicoretests/actions/runs/5154525696/jobs/9283085877

random seed: 502502158
generated error fail pass / total     time test name

[ ]    0    0    0    0 /  100     0.0s domain_spawntree - with Atomic
File "src/domain/dune", line 14, characters 7-23:
14 |  (name domain_spawntree)
            ^^^^^^^^^^^^^^^^
(cd _build/default/src/domain && ./domain_spawntree.exe --verbose)
Command exited with code -1073741819.
[ ]    0    0    0    0 /  100     0.0s domain_spawntree - with Atomic (generating)

jmid commented 1 year ago

Found another occurrence of this causing a live/deadlock: https://github.com/ocaml-multicore/multicoretests/actions/runs/5242663626/jobs/9466351902

random seed: 320533040
generated error fail pass / total     time test name

[ ]    0    0    0    0 /  100     0.0s domain_spawntree - with Atomic
[ ]    0    0    0    0 /  100     0.0s domain_spawntree - with Atomic (generating)Terminate batch job (Y/N)? 
^CFatal error: exception User interruption
Error: The operation was canceled.

jmid commented 1 year ago

Observed another variant of this on the Mingw Windows 5.0.0 workflow https://github.com/ocaml-multicore/multicoretests/actions/runs/5565429087/job/15072781545

random seed: 221745155
generated error fail pass / total     time test name

[ ]    0    0    0    0 /  100     0.0s domain_spawntree - with Atomic
Fatal error: no domain lock held
File "src/domain/dune", line 14, characters 7-23:
14 |  (name domain_spawntree)
            ^^^^^^^^^^^^^^^^
(cd _build/default/src/domain && ./domain_spawntree.exe --verbose)
Command exited with code 3.
[ ]    0    0    0    0 /  100     0.0s domain_spawntree - with Atomic (generating)

jmid commented 12 months ago

Crash seen again on Mingw bytecode trunk: https://github.com/ocaml-multicore/multicoretests/actions/runs/6093487146/job/16533243147

random seed: 373304996
generated error fail pass / total     time test name

[ ]    0    0    0    0 /  100     0.0s domain_spawntree - with Atomic
File "src/domain/dune", line 14, characters 7-23:
14 |  (name domain_spawntree)
            ^^^^^^^^^^^^^^^^
(cd _build/default/src/domain && ./domain_spawntree.exe --verbose)
Command exited with code -1073741819.
[ ]    0    0    0    0 /  100     0.0s domain_spawntree - with Atomic (generating)

jmid commented 11 months ago

Just saw this as a deadlock on Mingw 5.1.0~rc3 (native, not bytecode): https://github.com/ocaml-multicore/multicoretests/actions/runs/6160240834/job/16716723779?pr=395

random seed: 238601704
generated error fail pass / total     time test name

[ ]    0    0    0    0 /  100     0.0s domain_spawntree - with Atomic
[ ]    0    0    0    0 /  100     0.0s domain_spawntree - with Atomic (generating)
[ ]   13    0    0   13 /  100    91.4s domain_spawntree - with Atomic
[ ]   21    0    0   21 /  100   199.6s domain_spawntree - with AtomicTerminate batch job (Y/N)? 
^CFatal error: exception User interruption
Error: The operation was canceled.

shym commented 8 months ago

Observed on a MSVC-restoring branch (so on current trunk): https://github.com/shym/multicoretests/actions/runs/7169794449/job/19520999732#step:17:92

random seed: 529644456
generated error fail pass / total     time test name

[ ]    0    0    0    0 /  100     0.0s domain_spawntree - with Atomic
Fatal error: Failed to create domain
Fatal error: File "src/domain/dune", line 14, characters 7-23:
14 |  (name domain_spawntree)
            ^^^^^^^^^^^^^^^^
(cd _build/default/src/domain && ./domain_spawntree.exe --verbose)
Command exited with code -1073740791.
[ ]    0    0    0    0 /  100     0.0s domain_spawntree - with Atomic (generating)

shym commented 8 months ago

Error -1073740791 seems to happen very consistently on the MSVC port, the latest instance being:

random seed: 405994358
generated error fail pass / total     time test name

[ ]    0    0    0    0 /  100     0.0s domain_spawntree - with Atomic
[ ]    0    0    0    0 /  100     0.0s domain_spawntree - with Atomic (generating)
Fatal error: Failed to create domain
File "src/domain/dune", line 14, characters 7-23:
14 |  (name domain_spawntree)
            ^^^^^^^^^^^^^^^^
(cd _build/default/src/domain && ./domain_spawntree.exe --verbose)
Command exited with code -1073740791.

but also with seeds 437567822, 428257872,...

According to MS documentation, -1073740791 (aka 0xc0000409) is:

STATUS_STACK_BUFFER_OVERRUN: The system detected an overrun of a stack-based buffer in this application. This overrun could potentially allow a malicious user to gain control of this application.

and -1073741819 (aka 0xc0000005) is:

STATUS_ACCESS_VIOLATION: The instruction at 0x%08lx referenced memory at 0x%08lx. The memory could not be %s.

Those sound like two flavors of segfault. Could the difference in error codes shed any light on the cause or, on the contrary, suggest that these are separate issues?
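
For reference, the negative exit codes in the logs are the NTSTATUS values read as signed 32-bit integers. Below is a minimal OCaml sketch of that conversion, assuming a 64-bit OCaml where `0xFFFFFFFF` fits in an `int`. The helper names `ntstatus_of_exit_code` and `describe` are made up for illustration and are not part of multicoretests or the OCaml runtime.

```ocaml
(* Hypothetical helper: reinterpret a signed 32-bit process exit code as
   the unsigned NTSTATUS value by keeping only the low 32 bits.
   Assumes a 64-bit OCaml, where 0xFFFFFFFF is a valid int literal. *)
let ntstatus_of_exit_code (code : int) : int = code land 0xFFFFFFFF

(* Hypothetical helper: name the two statuses quoted in this thread. *)
let describe (code : int) : string =
  match ntstatus_of_exit_code code with
  | 0xC0000005 -> "STATUS_ACCESS_VIOLATION"
  | 0xC0000409 -> "STATUS_STACK_BUFFER_OVERRUN"
  | _ -> "unknown"

let () =
  List.iter
    (fun code ->
      Printf.printf "%d = 0x%08X (%s)\n" code
        (ntstatus_of_exit_code code) (describe code))
    [ -1073741819; -1073740791 ]
```

This only decodes the number reported by dune. It says nothing about which part of the runtime raised the status.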

shym commented 8 months ago

Debugging this further, it seems that the 0xC0000409 errors I saw on the MSVC port were caused by the abort tracked in #428. So these would indeed be two different things.