Open jmid opened 7 months ago
Just observed this again - now on FreeBSD 5.1 trying out the latest QCheck memory usage improvement https://github.com/ocaml-multicore/multicoretests/tree/try-qcheck-mem-improvement https://ocaml-multicoretests.ci.dev:8100/job/2024-06-28/125000-ci-ocluster-build-caa2d4
random seed: 362318939
generated error fail pass / total time test name
[ ] 0 0 0 0 / 1000 0.0s STM Out_channel test sequential
[ ] 0 0 0 0 / 1000 0.0s STM Out_channel test sequential (generating)
[✓] 1000 0 0 1000 / 1000 1.2s STM Out_channel test sequential
[ ] 0 0 0 0 / 1000 0.0s STM Out_channel test parallel
[✗] 103 0 1 90 / 1000 4.7s STM Out_channel test parallel
--- Failure --------------------------------------------------------------------
Test STM Out_channel test parallel failed (1 shrink steps):
|
Output ("\011\230\197\204\173\164'\024", 9, 1)
Output_bytes "\159\238+\152]\006*\24... (truncated)
Close_noerr
Open_text
Flush
Flush
|
.----------------------------------------------------.
| |
Length Close_noerr
Output_char 'i' Output ("Dc\015\194*x\159YlA\b\247\1... (truncated)
Length Open_text
Output_byte 8 Output_char 'G'
Flush Set_binary_mode false
+++ Messages ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Messages for test STM Out_channel test parallel:
Results incompatible with linearized model
|
Output ("\011\230\197\204\173\164'\024", 9, 1) : Error (Invalid_argument("output"))
Output_bytes "\159\238+\152]\006*\24... (truncated) : Ok (())
Close_noerr : Ok (())
Open_text : Ok (())
Flush : Ok (())
Flush : Ok (())
|
.-----------------------------------------------------------------------------------------------.
| |
Length : Ok (0) Close_noerr : Ok (())
Output_char 'i' : Ok (()) Output ("Dc\015\194*x\159YlA\b\247\1... (truncated) : Error (Sys_error("Bad file descriptor"))
Length : Ok (0) Open_text : Ok (())
Output_byte 8 : Ok (()) Output_char 'G' : Ok (())
Flush : Error (Sys_error("Bad file descriptor")) Set_binary_mode false : Ok (())
================================================================================
failure (1 tests failed, 0 tests errored, ran 2 tests)
File "src/io/dune", line 40, characters 7-16:
40 | (name stm_tests)
^^^^^^^^^
(cd _build/default/src/io && ./stm_tests.exe --verbose)
Command exited with code 1.
I've created a standalone reproducer for this one:
let test () =
let path = Filename.temp_file "stm-" "" in
let channel = Atomic.make (Out_channel.open_text path) in
(* First, a bit of one-domain channel activity *)
Out_channel.output_byte (Atomic.get channel) 1;
(try Out_channel.flush (Atomic.get channel) with (Sys_error _) -> assert false);
let wait = Atomic.make true in
(* Domain 1 closes-opens-outputs repeatedly *)
let d1 = Domain.spawn (fun () ->
while Atomic.get wait do Domain.cpu_relax() done;
for _ = 1 to 50 do
Out_channel.close (Atomic.get channel);
Atomic.set channel (Out_channel.open_text path);
Out_channel.output_byte (Atomic.get channel) 1;
done;
) in
(* Domain 2 calls flush and pos repeatedly *)
let d2 = Domain.spawn (fun () ->
Atomic.set wait false;
(*
"Output functions raise a Sys_error exception when they are applied to a closed output channel,
except Out_channel.close and Out_channel.flush , which do nothing when applied to an already closed channel."
*)
for _ = 1 to 50 do
(try Out_channel.flush (Atomic.get channel)
with (Sys_error msg) -> Printf.printf "Out_channel.flush raised Sys_error %S\n%!" msg; assert false);
ignore (Out_channel.pos (Atomic.get channel));
done;
) in
let () = Domain.join d1 in
let () = Domain.join d2 in
(* Please leave the torture chamber nice and clean as you found it *)
(try Out_channel.close (Atomic.get channel) with Sys_error _ -> ());
Sys.remove path
let _ =
for i = 1 to 5_000 do
if i mod 250 = 0 then Printf.printf "#%!";
test ()
done
With it, locally on Linux I am able to trigger the assertion failure on 5.1.0+fp and 5.2.0+fp switches.
The branch https://github.com/ocaml-multicore/multicoretests/tree/flushtest-repro-focus triggers a focused test of it on the CI.
Across all runs, this reproducer only triggered in multicoretests-ci where flush
may trigger a Sys_error
in
In trying to understand this better, I ran dune exec src/io/stm_tests.exe -- -v
locally under a 5.2.0+fp switch, where it fails spectacularly - even the sequential one:
$ dune exec src/io/stm_tests.exe -- -v
random seed: 242921126
generated error fail pass / total time test name
[✗] 20 0 1 19 / 1000 0.1s STM Out_channel test sequential
[✗] 8 0 1 7 / 1000 5.9s STM Out_channel test parallel
--- Failure --------------------------------------------------------------------
Test STM Out_channel test sequential failed (19 shrink steps):
Close_noerr
Output_char 'H'
Output_char ']'
+++ Messages ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Messages for test STM Out_channel test sequential:
Results incompatible with model
Close_noerr : Ok (())
Output_char 'H' : Error (Sys_error("Bad file descriptor"))
Output_char ']' : Ok (())
--- Failure --------------------------------------------------------------------
Test STM Out_channel test parallel failed (45 shrink steps):
|
Close
Output_byte 4
Output_char 'E'
|
.---------------------.
| |
+++ Messages ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Messages for test STM Out_channel test parallel:
Results incompatible with linearized model
|
Close : Ok (())
Output_byte 4 : Error (Sys_error("Bad file descriptor"))
Output_char 'E' : Ok (())
|
.---------------------------------------------------------.
| |
================================================================================
failure (2 tests failed, 0 tests errored, ran 2 tests)
The same test however completes under the following two which seems to indicate a fp-related bug that may have since been fixed on trunk:
A standalone reproducer:
let test () =
let path = Filename.temp_file "stm-" "" in
let channel = Out_channel.open_text path in
Out_channel.close_noerr channel;
(try Out_channel.output_char channel 'A'; assert false with (Sys_error _) -> ()); (* Error (Sys_error("Bad file descriptor")) *)
(try Out_channel.output_char channel 'B'; assert false with (Sys_error _) -> ()); (* Succeeds *)
Sys.remove path
let _ =
for i = 1 to 10_000 do
if i mod 250 = 0 then Printf.printf "#%!";
test ()
done
To summarize
5.2.0
failure was actually in 5.2.0~alpha1
. It hasn't shown up since #12678 was merged (into 5.2.0~beta1)5.1.1
- not including #12678This means that we are only missing an explanation of the above 5.2.0+fp standalone reproducers to close this issue.
I ran into this too when running the test suite with Jane Street's branch of the OCaml compiler. My explanation seems to apply equally to the most recent head of ocaml/ocaml, so I thought I'd share it here.
Explanation: the STM model expects that Out_channel.flush
will never raise if interleaved with Out_channel.close
. However, Out_channel.flush
can raise if you get an unlucky concurrent interleaving with Out_channel.close
. In particular:
Out_channel.flush
is called before the close happens, and this guard is satisfied (given the channel is still open): https://github.com/ocaml/ocaml/blob/fb5e30cb40603c75b5d19987fcd7911cc003d077/runtime/io.c#L822-L823check_pending
in caml_flush_partial
relinquishes Out_channel.flush
’s lock on the channel: https://github.com/ocaml/ocaml/blob/fb5e30cb40603c75b5d19987fcd7911cc003d077/runtime/io.c#L118-L132 Out_channel.close
is called and completes, setting channel->buffer + 1 = channel->curr
(https://github.com/ocaml/ocaml/blob/fb5e30cb40603c75b5d19987fcd7911cc003d077/runtime/io.c#L721-L730)caml_flush_partial
is called, and this check indicates that there is exactly 1 byte to write (given the previous bullet point).Sys_error
If my explanation is correct, then maybe caml_flush_partial
should check if the channel is closed before attempting to write to it.
(The issue is probably more widespread than suggested by my last message: I see that caml_flush
is called by lots of functions, not just caml_ml_flush
. e.g. outputting a string. All of these operations can thus relinquish the channel lock midoperation.)
Thanks for sharing! :pray: Your explanation makes sense to me. I can't reproduce the error ATM though... I'll need to look more into it.
I've now created a reproducer and filed this upstream: https://github.com/ocaml/ocaml/issues/13586
A Cygwin 5.2 run triggered when merging #443 to main found an unexpected counterexample to
STM Out_channel parallel
: https://github.com/ocaml-multicore/multicoretests/actions/runs/8361686850/job/22890359851AFAICS, the failure can be explained by
Flush
in the "right leg" causingSys_error("Bad file descriptor")
. This goes against the specification, which says:This is currently captured in the STM Out_channel test as: