ocaml-multicore / multicoretests

PBT testsuite and libraries for testing multicore OCaml
https://ocaml-multicore.github.io/multicoretests/
BSD 2-Clause "Simplified" License
37 stars 16 forks source link

[ocaml5-issue] Out_channel Lin test takes very long on macOS #321

Closed jmid closed 9 months ago

jmid commented 1 year ago

As part of #318 I spotted an Out_channel test taking very long on macOS with trunk to the point of causing a timeout: https://github.com/ocaml-multicore/multicoretests/actions/runs/4540532958/jobs/8001562877?pr=318

random seed: 419543217
generated error fail pass / total     time test name

[ ]    0    0    0    0 / 1000     0.0s Lin DSL In_channel test with Domain
[ ]    0    0    0    0 / 1000     0.0s Lin DSL In_channel test with Domain (generating)
[✓]   54    0    1   53 / 1000     6.2s Lin DSL In_channel test with Domain

[ ]    0    0    0    0 / 1000     0.0s Lin DSL Out_channel test with Domain
[ ]  215    0    0  215 / 1000    92.8s Lin DSL Out_channel test with Domain
[ ]  302    0    0  302 / 1000   154.7s Lin DSL Out_channel test with Domain
[ ]  384    0    0  384 / 1000   215.6s Lin DSL Out_channel test with Domain
[ ]  407    0    0  407 / 1000   349.9s Lin DSL Out_channel test with Domain
[ ]  417    0    0  417 / 1000   741.2s Lin DSL Out_channel test with Domain
[ ]  487    0    0  487 / 1000  3128.8s Lin DSL Out_channel test with Domain
[ ]  567    0    0  567 / 1000  9321.4s Lin DSL Out_channel test with Domain
[ ]  585    0    0  585 / 1000  9552.4s Lin DSL Out_channel test with Domain
Error: The operation was canceled.

That's 2h40m - to test 585 random triples (modulo some missing output flushing) without any shrinking going on. Of those, 80 inputs took ~100m (~6000s) from input 487 to 567. This looks like a runtime issue, doesn't it?

jmid commented 1 year ago

I had previously observed a problem shrinking Out_channel tests. That this happens without any shrinking, indicates a runtime issue instead :thinking:

jmid commented 1 year ago

Also spotted on macOS 5.0.0 https://github.com/ocaml-multicore/multicoretests/actions/runs/4542931038/jobs/8006976703?pr=316

random seed: 423960020
generated error fail pass / total     time test name

[ ]    0    0    0    0 / 1000     0.0s Lin DSL In_channel test with Domain
[ ]    0    0    0    0 / 1000     0.0s Lin DSL In_channel test with Domain (generating)
[✓]   15    0    1   14 / 1000    13.8s Lin DSL In_channel test with Domain

[ ]    0    0    0    0 / 1000     0.0s Lin DSL Out_channel test with Domain
[ ]   61    0    0   61 / 1000   227.6s Lin DSL Out_channel test with Domain
[ ]   90    0    0   90 / 1000  9110.8s Lin DSL Out_channel test with Domain
[ ]  141    0    0  141 / 1000  9190.7s Lin DSL Out_channel test with Domain
[ ]  203    0    0  203 / 1000  9427.5s Lin DSL Out_channel test with Domain
[✓]  226    0    1  225 / 1000  9440.3s Lin DSL Out_channel test with Domain

This has a jump of nearly 9000s on testing 29 inputs (from 61 to 90)

jmid commented 1 year ago

I tried giving ocaml/ocaml#11903 a spin to see whether it would solve this issue. Apparently not: https://github.com/ocaml-multicore/multicoretests/actions/runs/4688674935/jobs/8309467768

random seed: 478012744
generated error fail pass / total     time test name

[ ]    0    0    0    0 / 1000     0.0s Lin DSL In_channel test with Domain
[ ]    0    0    0    0 / 1000     0.0s Lin DSL In_channel test with Domain (generating)
[✓]  131    0    1  130 / 1000    17.8s Lin DSL In_channel test with Domain

[ ]    0    0    0    0 / 1000     0.0s Lin DSL Out_channel test with Domain
[ ]   23    0    0   23 / 1000   466.5s Lin DSL Out_channel test with Domain
[ ]  135    0    0  135 / 1000   586.1s Lin DSL Out_channel test with Domain
[ ]  199    0    0  199 / 1000   646.2s Lin DSL Out_channel test with Domain
[ ]  348    0    0  348 / 1000   715.3s Lin DSL Out_channel test with Domain
[ ]  349    0    0  349 / 1000  2769.2s Lin DSL Out_channel test with Domain
[ ]  491    0    0  491 / 1000  2830.5s Lin DSL Out_channel test with Domain
[ ]  655    0    0  655 / 1000  2890.5s Lin DSL Out_channel test with Domain
[ ]  675    0    0  675 / 1000  3105.7s Lin DSL Out_channel test with Domain
[ ]  695    0    0  695 / 1000  3240.3s Lin DSL Out_channel test with Domain
[ ]  823    0    0  823 / 1000  3302.3s Lin DSL Out_channel test with Domain
[ ]  862    0    0  862 / 1000  3532.9s Lin DSL Out_channel test with Domain
[ ]  980    0    0  980 / 1000  3592.9s Lin DSL Out_channel test with Domain
[✗] 1000    0    0 1000 / 1000  3601.6s Lin DSL Out_channel test with Domain

Notice the jump between input 348 and 349, taking 2000 seconds...

shym commented 1 year ago

There is definitely something wrong in there. Trying to look into a specific case of this Out_channel test entering seemingly a livelock in the debug runtime, I finally ran into:

$ _build/default/src/io/lin_tests_dsl_domain.exe -v -s 160687358
### OCaml runtime: debug mode ###
random seed: 160687358
generated error fail pass / total     time test name
[✓]    3    0    1    2 / 1000    16.2s Lin DSL In_channel test with Domain
[ ]    8    0    0    8 / 1000    18.7s Lin DSL Out_channel test with Domain (shrinking:   26.0008)
[02] file runtime/domain.c; line 310 ### Assertion failed: s->running

(on Linux, OCaml 5.0.0; the same seed seems to get stuck (one way or another) regularly also on trunk)

jmid commented 1 year ago

Just observed a similar behaviour on Windows bytecode trunk: https://github.com/ocaml-multicore/multicoretests/actions/runs/5174368273/jobs/9320602093

random seed: 324902162
generated error fail pass / total     time test name

[ ]    0    0    0    0 / 1000     0.0s Lin DSL In_channel test with Domain
[ ]    0    0    0    0 / 1000     0.0s Lin DSL In_channel test with Domain (generating)
[✓]    6    0    1    5 / 1000    22.6s Lin DSL In_channel test with Domain

[ ]    0    0    0    0 / 1000     0.0s Lin DSL Out_channel test with Domain
[ ]   16    0    0   16 / 1000    43.6s Lin DSL Out_channel test with Domain
[ ]   47    0    0   47 / 1000   107.5s Lin DSL Out_channel test with Domain
[ ]   65    0    0   65 / 1000   169.3s Lin DSL Out_channel test with Domain
[ ]  104    0    0  104 / 1000   231.8s Lin DSL Out_channel test with Domain
[ ]  129    0    0  129 / 1000   291.9s Lin DSL Out_channel test with Domain
[ ]  166    0    0  166 / 1000   353.2s Lin DSL Out_channel test with Domain
[ ]  211    0    0  211 / 1000   418.4s Lin DSL Out_channel test with Domain
[ ]  238    0    0  238 / 1000   519.7s Lin DSL Out_channel test with Domain
[ ]  298    0    0  298 / 1000   579.8s Lin DSL Out_channel test with Domain
[ ]  367    0    0  367 / 1000   641.3s Lin DSL Out_channel test with Domain
[ ]  427    0    0  427 / 1000   701.4s Lin DSL Out_channel test with Domain
[ ]  509    0    0  509 / 1000   762.3s Lin DSL Out_channel test with Domain
[ ]  585    0    0  585 / 1000   822.7s Lin DSL Out_channel test with Domain
[ ]  619    0    0  619 / 1000   912.1s Lin DSL Out_channel test with Domain
[ ]  670    0    0  670 / 1000  1000.2s Lin DSL Out_channel test with Domain
[ ]  709    0    0  709 / 1000  1475.1s Lin DSL Out_channel test with Domain (shrinking:    0)
[ ]  709    0    0  709 / 1000  3952.3s Lin DSL Out_channel test with Domain (shrinking:    1)
[ ]  709    0    0  709 / 1000  4320.7s Lin DSL Out_channel test with Domain (shrinking:    1.0002)
[ ]  709    0    0  709 / 1000  8293.3s Lin DSL Out_channel test with Domain (shrinking:    1.0003)
^CFatal error: exception User interruption
[ ]  709    0    0  709 / 1000  8503.8s Lin DSL Out_channel test with Domain (shrinking:    2)Terminate batch job (Y/N)? 
Error: The operation was canceled.

The next to last shrinking step (from 1.0002 to 1.0003) spends 3972s on 1 successful shrink step (there may be several unsuccessful ones though) - that's 66 minutes - over an hour!

jmid commented 9 months ago

Closing by replacing the Lin test with an STM one in #431