Closed jmid closed 9 months ago
I had previously observed a problem shrinking Out_channel
tests.
That this happens without any shrinking, indicates a runtime issue instead :thinking:
Also spotted on macOS 5.0.0 https://github.com/ocaml-multicore/multicoretests/actions/runs/4542931038/jobs/8006976703?pr=316
random seed: 423960020
generated error fail pass / total time test name
[ ] 0 0 0 0 / 1000 0.0s Lin DSL In_channel test with Domain
[ ] 0 0 0 0 / 1000 0.0s Lin DSL In_channel test with Domain (generating)
[✓] 15 0 1 14 / 1000 13.8s Lin DSL In_channel test with Domain
[ ] 0 0 0 0 / 1000 0.0s Lin DSL Out_channel test with Domain
[ ] 61 0 0 61 / 1000 227.6s Lin DSL Out_channel test with Domain
[ ] 90 0 0 90 / 1000 9110.8s Lin DSL Out_channel test with Domain
[ ] 141 0 0 141 / 1000 9190.7s Lin DSL Out_channel test with Domain
[ ] 203 0 0 203 / 1000 9427.5s Lin DSL Out_channel test with Domain
[✓] 226 0 1 225 / 1000 9440.3s Lin DSL Out_channel test with Domain
This has a jump of nearly 9000s on testing 29 inputs (from 61 to 90)
I tried giving ocaml/ocaml#11903 a spin to see whether it would solve this issue. Apparently not: https://github.com/ocaml-multicore/multicoretests/actions/runs/4688674935/jobs/8309467768
random seed: 478012744
generated error fail pass / total time test name
[ ] 0 0 0 0 / 1000 0.0s Lin DSL In_channel test with Domain
[ ] 0 0 0 0 / 1000 0.0s Lin DSL In_channel test with Domain (generating)
[✓] 131 0 1 130 / 1000 17.8s Lin DSL In_channel test with Domain
[ ] 0 0 0 0 / 1000 0.0s Lin DSL Out_channel test with Domain
[ ] 23 0 0 23 / 1000 466.5s Lin DSL Out_channel test with Domain
[ ] 135 0 0 135 / 1000 586.1s Lin DSL Out_channel test with Domain
[ ] 199 0 0 199 / 1000 646.2s Lin DSL Out_channel test with Domain
[ ] 348 0 0 348 / 1000 715.3s Lin DSL Out_channel test with Domain
[ ] 349 0 0 349 / 1000 2769.2s Lin DSL Out_channel test with Domain
[ ] 491 0 0 491 / 1000 2830.5s Lin DSL Out_channel test with Domain
[ ] 655 0 0 655 / 1000 2890.5s Lin DSL Out_channel test with Domain
[ ] 675 0 0 675 / 1000 3105.7s Lin DSL Out_channel test with Domain
[ ] 695 0 0 695 / 1000 3240.3s Lin DSL Out_channel test with Domain
[ ] 823 0 0 823 / 1000 3302.3s Lin DSL Out_channel test with Domain
[ ] 862 0 0 862 / 1000 3532.9s Lin DSL Out_channel test with Domain
[ ] 980 0 0 980 / 1000 3592.9s Lin DSL Out_channel test with Domain
[✗] 1000 0 0 1000 / 1000 3601.6s Lin DSL Out_channel test with Domain
Notice the jump between input 348 and 349, taking 2000 seconds...
There is definitely something wrong in there. Trying to look into a specific case of this Out_channel test entering seemingly a livelock in the debug runtime, I finally ran into:
$ _build/default/src/io/lin_tests_dsl_domain.exe -v -s 160687358
### OCaml runtime: debug mode ###
random seed: 160687358
generated error fail pass / total time test name
[✓] 3 0 1 2 / 1000 16.2s Lin DSL In_channel test with Domain
[ ] 8 0 0 8 / 1000 18.7s Lin DSL Out_channel test with Domain (shrinking: 26.0008)
[02] file runtime/domain.c; line 310 ### Assertion failed: s->running
(on Linux, OCaml 5.0.0; the same seed seems to get stuck (one way or another) regularly also on trunk)
Just observed a similar behaviour on Windows bytecode trunk
:
https://github.com/ocaml-multicore/multicoretests/actions/runs/5174368273/jobs/9320602093
random seed: 324902162
generated error fail pass / total time test name
[ ] 0 0 0 0 / 1000 0.0s Lin DSL In_channel test with Domain
[ ] 0 0 0 0 / 1000 0.0s Lin DSL In_channel test with Domain (generating)
[✓] 6 0 1 5 / 1000 22.6s Lin DSL In_channel test with Domain
[ ] 0 0 0 0 / 1000 0.0s Lin DSL Out_channel test with Domain
[ ] 16 0 0 16 / 1000 43.6s Lin DSL Out_channel test with Domain
[ ] 47 0 0 47 / 1000 107.5s Lin DSL Out_channel test with Domain
[ ] 65 0 0 65 / 1000 169.3s Lin DSL Out_channel test with Domain
[ ] 104 0 0 104 / 1000 231.8s Lin DSL Out_channel test with Domain
[ ] 129 0 0 129 / 1000 291.9s Lin DSL Out_channel test with Domain
[ ] 166 0 0 166 / 1000 353.2s Lin DSL Out_channel test with Domain
[ ] 211 0 0 211 / 1000 418.4s Lin DSL Out_channel test with Domain
[ ] 238 0 0 238 / 1000 519.7s Lin DSL Out_channel test with Domain
[ ] 298 0 0 298 / 1000 579.8s Lin DSL Out_channel test with Domain
[ ] 367 0 0 367 / 1000 641.3s Lin DSL Out_channel test with Domain
[ ] 427 0 0 427 / 1000 701.4s Lin DSL Out_channel test with Domain
[ ] 509 0 0 509 / 1000 762.3s Lin DSL Out_channel test with Domain
[ ] 585 0 0 585 / 1000 822.7s Lin DSL Out_channel test with Domain
[ ] 619 0 0 619 / 1000 912.1s Lin DSL Out_channel test with Domain
[ ] 670 0 0 670 / 1000 1000.2s Lin DSL Out_channel test with Domain
[ ] 709 0 0 709 / 1000 1475.1s Lin DSL Out_channel test with Domain (shrinking: 0)
[ ] 709 0 0 709 / 1000 3952.3s Lin DSL Out_channel test with Domain (shrinking: 1)
[ ] 709 0 0 709 / 1000 4320.7s Lin DSL Out_channel test with Domain (shrinking: 1.0002)
[ ] 709 0 0 709 / 1000 8293.3s Lin DSL Out_channel test with Domain (shrinking: 1.0003)
^CFatal error: exception User interruption
[ ] 709 0 0 709 / 1000 8503.8s Lin DSL Out_channel test with Domain (shrinking: 2)Terminate batch job (Y/N)?
Error: The operation was canceled.
The next to last shrinking step (from 1.0002 to 1.0003) spends 3972s on 1 successful shrink step (there may be several unsuccessful ones though) - that's 66 minutes - over an hour!
Closing by replacing the Lin test with an STM one in #431
As part of #318 I spotted an
Out_channel
test taking very long on macOS withtrunk
to the point of causing a timeout: https://github.com/ocaml-multicore/multicoretests/actions/runs/4540532958/jobs/8001562877?pr=318That's 2h40m - to test 585 random triples (modulo some missing output flushing) without any shrinking going on. Of those, 80 inputs took ~100m (~6000s) from input 487 to 567. This looks like a runtime issue, doesn't it?