[ocaml5-issue] `Out_channel.flush` can cause `Syserror` when used in parallel

jmid commented 7 months ago

A Cygwin 5.2 run triggered when merging #443 to main found an unexpected counterexample to STM Out_channel parallel: https://github.com/ocaml-multicore/multicoretests/actions/runs/8361686850/job/22890359851

random seed: 292949458
generated error fail pass / total     time test name

[ ]    0    0    0    0 / 1000     0.0s STM Out_channel test sequential
[ ]    0    0    0    0 / 1000     0.0s STM Out_channel test sequential (generating)
[✓] 1000    0    0 1000 / 1000     2.9s STM Out_channel test sequential

[ ]    0    0    0    0 / 1000     0.0s STM Out_channel test parallel
[ ]  620    0    0  542 / 1000    57.1s STM Out_channel test parallel
[ ] 1116    0    0  966 / 1000   117.5s STM Out_channel test parallel (shrinking:    1.0011)
[✗] 1117    0    1  966 / 1000   129.8s STM Out_channel test parallel

--- Failure --------------------------------------------------------------------

Test STM Out_channel test parallel failed (1 shrink steps):

                                                              |                         
                                     Output_substring ("\225\185!\188Q\142\138", 0, 96) 
                                                       Output_byte 36                   
                                     Output_bytes "\158\199e\220\253\148\... (truncated)
                                                            Flush                       
                                                            Close                       
                                                   Output ("Z\197j", 3, 3)              
                                                          Open_text                     
                                                         Is_buffered                    
                                                    Set_binary_mode false               
                                              Output_substring ("{8)g", 18, 1)          
                                                           Length                       
                                                         Close_noerr                    
                                                          Open_text                     
                                                      Set_buffered true                 
                                                              |                         
                                   .----------------------------------------------------.
                                   |                                                    |                         
                              Close_noerr                                     Set_binary_mode true                
                               Open_text                                               Pos                        
                Output_substring ("\to\128N\186", 7, 6)                       Set_binary_mode true                
                            Output_char '9'                                      Output_byte 54                   
                         Set_binary_mode true                                        Length                       
                         Output_bytes "\182U"                                         Flush                       
                            Output_char '5'                    Output_substring ("\014\236\187\146^... (truncated)
              Output ("(|$\217\203\206\162t\235", 18, 4)                      Set_binary_mode true                
                                  Pos                                              Is_buffered                    
                         Set_binary_mode false                                    Output_byte 1                   

+++ Messages ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

Messages for test STM Out_channel test parallel:

  Results incompatible with linearized model

                                                                                                       |                                                
                                                       Output_substring ("\225\185!\188Q\142\138", 0, 96) : Error (Invalid_argument("output_substring"))
                                                                                           Output_byte 36 : Ok (())                                     
                                                                         Output_bytes "\158\199e\220\253\148\... (truncated) : Ok (())                  
                                                                                                Flush : Ok (())                                         
                                                                                                Close : Ok (())                                         
                                                                         Output ("Z\197j", 3, 3) : Error (Invalid_argument("output"))                   
                                                                                              Open_text : Ok (())                                       
                                                                                            Is_buffered : Ok (true)                                     
                                                                                        Set_binary_mode false : Ok (())                                 
                                                                Output_substring ("{8)g", 18, 1) : Error (Invalid_argument("output_substring"))         
                                                                                                Length : Ok (0)                                         
                                                                                             Close_noerr : Ok (())                                      
                                                                                              Open_text : Ok (())                                       
                                                                                          Set_buffered true : Ok (())                                   
                                                                                                       |                                                
                                                     .--------------------------------------------------------------------------------------------------.
                                                     |                                                                                                  |                                                
                                           Close_noerr : Ok (())                                                                         Set_binary_mode true : Ok (())                                  
                                            Open_text : Ok (())                                                                                   Pos : Ok (0)                                           
          Output_substring ("\to\128N\186", 7, 6) : Error (Invalid_argument("output_substring"))                                         Set_binary_mode true : Ok (())                                  
                                         Output_char '9' : Ok (())                                                                          Output_byte 54 : Ok (())                                     
                                      Set_binary_mode true : Ok (())                                                                             Length : Ok (0)                                         
                                      Output_bytes "\182U" : Ok (())                                                            Flush : Error (Sys_error("Bad file descriptor"))                         
                                         Output_char '5' : Ok (())                                                        Output_substring ("\014\236\187\146^... (truncated) : Ok (())                  
              Output ("(|$\217\203\206\162t\235", 18, 4) : Error (Invalid_argument("output"))                            Set_binary_mode true : Error (Sys_error("Bad file descriptor"))                 
                                               Pos : Ok (4)                                                                                  Is_buffered : Ok (true)                                     
                                      Set_binary_mode false : Ok (())                                                                        Output_byte 1 : Ok (())                                     

================================================================================
failure (1 tests failed, 0 tests errored, ran 2 tests)
File "src/io/dune", line 40, characters 7-16:
40 |  (name stm_tests)
            ^^^^^^^^^
(cd _build/default/src/io && ./stm_tests.exe --verbose)
Command exited with code 1.

AFAICS, the failure can be explained by Flush in the "right leg" causing Sys_error("Bad file descriptor"). This goes against the specification, which says:

       val close : t -> unit

       Close the given channel, flushing all buffered write operations.  Output functions raise a Sys_error exception when they are  applied  to  a  closed  output
       channel, except Out_channel.close and Out_channel.flush , which do nothing when applied to an already closed channel. [...]

This is currently captured in the STM Out_channel test as:

    | Flush, Res ((Result (Unit,Exn),_), r) ->
      (match s,r with
       | Closed, Error (Sys_error _) -> false (* should not generate an error *)
       | Closed, Ok () -> true
       | Open _, Ok () -> true
       | _ -> false)

jmid commented 4 months ago

Just observed this again - now on FreeBSD 5.1 trying out the latest QCheck memory usage improvement https://github.com/ocaml-multicore/multicoretests/tree/try-qcheck-mem-improvement https://ocaml-multicoretests.ci.dev:8100/job/2024-06-28/125000-ci-ocluster-build-caa2d4

random seed: 362318939
generated error fail pass / total     time test name

[ ]    0    0    0    0 / 1000     0.0s STM Out_channel test sequential
[ ]    0    0    0    0 / 1000     0.0s STM Out_channel test sequential (generating)
[✓] 1000    0    0 1000 / 1000     1.2s STM Out_channel test sequential

[ ]    0    0    0    0 / 1000     0.0s STM Out_channel test parallel
[✗]  103    0    1   90 / 1000     4.7s STM Out_channel test parallel

--- Failure --------------------------------------------------------------------

Test STM Out_channel test parallel failed (1 shrink steps):

                                                              |                         
                                       Output ("\011\230\197\204\173\164'\024", 9, 1)   
                                     Output_bytes "\159\238+\152]\006*\24... (truncated)
                                                         Close_noerr                    
                                                          Open_text                     
                                                            Flush                       
                                                            Flush                       
                                                              |                         
                                   .----------------------------------------------------.
                                   |                                                    |                         
                                Length                                             Close_noerr                    
                            Output_char 'i'                    Output ("Dc\015\194*x\159YlA\b\247\1... (truncated)
                                Length                                              Open_text                     
                             Output_byte 8                                       Output_char 'G'                  
                                 Flush                                        Set_binary_mode false               

+++ Messages ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

Messages for test STM Out_channel test parallel:

  Results incompatible with linearized model

                                                                                                   |                                               
                                                          Output ("\011\230\197\204\173\164'\024", 9, 1) : Error (Invalid_argument("output"))      
                                                                     Output_bytes "\159\238+\152]\006*\24... (truncated) : Ok (())                 
                                                                                         Close_noerr : Ok (())                                     
                                                                                          Open_text : Ok (())                                      
                                                                                            Flush : Ok (())                                        
                                                                                            Flush : Ok (())                                        
                                                                                                   |                                               
                                                   .-----------------------------------------------------------------------------------------------.
                                                   |                                                                                               |                                               
                                            Length : Ok (0)                                                                              Close_noerr : Ok (())                                     
                                       Output_char 'i' : Ok (())                                     Output ("Dc\015\194*x\159YlA\b\247\1... (truncated) : Error (Sys_error("Bad file descriptor"))
                                            Length : Ok (0)                                                                               Open_text : Ok (())                                      
                                        Output_byte 8 : Ok (())                                                                        Output_char 'G' : Ok (())                                   
                            Flush : Error (Sys_error("Bad file descriptor"))                                                        Set_binary_mode false : Ok (())                                

================================================================================
failure (1 tests failed, 0 tests errored, ran 2 tests)
File "src/io/dune", line 40, characters 7-16:
40 |  (name stm_tests)
            ^^^^^^^^^
(cd _build/default/src/io && ./stm_tests.exe --verbose)
Command exited with code 1.

jmid commented 3 months ago

I've created a standalone reproducer for this one:

let test () =
  let path = Filename.temp_file "stm-" "" in
  let channel = Atomic.make (Out_channel.open_text path) in

  (* First, a bit of one-domain channel activity *)
  Out_channel.output_byte (Atomic.get channel) 1;
  (try Out_channel.flush (Atomic.get channel) with (Sys_error _) -> assert false);

  let wait = Atomic.make true in

  (* Domain 1 closes-opens-outputs repeatedly *)
  let d1 = Domain.spawn (fun () ->
      while Atomic.get wait do Domain.cpu_relax() done;
      for _ = 1 to 50 do
        Out_channel.close (Atomic.get channel);
        Atomic.set channel (Out_channel.open_text path);
        Out_channel.output_byte (Atomic.get channel) 1;
      done;
    ) in

  (* Domain 2 calls flush and pos repeatedly *)
  let d2 = Domain.spawn (fun () ->
      Atomic.set wait false;
      (*
       "Output functions raise a Sys_error exception when they are applied to a  closed  output  channel,
        except  Out_channel.close  and Out_channel.flush , which do nothing when applied to an already closed channel."
      *)
      for _ = 1 to 50 do
        (try Out_channel.flush (Atomic.get channel)
         with (Sys_error msg) -> Printf.printf "Out_channel.flush raised Sys_error %S\n%!" msg; assert false);
        ignore (Out_channel.pos (Atomic.get channel));
      done;
    ) in

  let () = Domain.join d1 in
  let () = Domain.join d2 in
  (* Please leave the torture chamber nice and clean as you found it *)
  (try Out_channel.close (Atomic.get channel) with Sys_error _ -> ());
  Sys.remove path

let _ =
  for i = 1 to 5_000 do
    if i mod 250 = 0 then Printf.printf "#%!";
    test ()
  done

With it, locally on Linux I am able to trigger the assertion failure on 5.1.0+fp and 5.2.0+fp switches.

jmid commented 3 months ago

The branch https://github.com/ocaml-multicore/multicoretests/tree/flushtest-repro-focus triggers a focused test of it on the CI.

Across all runs, this reproducer only triggered in multicoretests-ci where flush may trigger a Sys_error in

linux-arm64-5.1 https://ocaml-multicoretests.ci.dev:8100/job/2024-08-09/141442-ci-ocluster-build-d1fac4
linux-s390x-5.1 https://ocaml-multicoretests.ci.dev:8100/job/2024-08-09/141442-ci-ocluster-build-96b82e

jmid commented 3 months ago

In trying to understand this better, I ran dune exec src/io/stm_tests.exe -- -v locally under a 5.2.0+fp switch, where it fails spectacularly - even the sequential one:

$ dune exec src/io/stm_tests.exe -- -v
random seed: 242921126
generated error fail pass / total     time test name
[✗]   20    0    1   19 / 1000     0.1s STM Out_channel test sequential
[✗]    8    0    1    7 / 1000     5.9s STM Out_channel test parallel

--- Failure --------------------------------------------------------------------

Test STM Out_channel test sequential failed (19 shrink steps):

   Close_noerr
   Output_char 'H'
   Output_char ']'

+++ Messages ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

Messages for test STM Out_channel test sequential:

  Results incompatible with model

   Close_noerr : Ok (())
   Output_char 'H' : Error (Sys_error("Bad file descriptor"))
   Output_char ']' : Ok (())

--- Failure --------------------------------------------------------------------

Test STM Out_channel test parallel failed (45 shrink steps):

                            |            
                          Close          
                      Output_byte 4      
                     Output_char 'E'     
                            |            
                 .---------------------.
                 |                     |            

+++ Messages ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

Messages for test STM Out_channel test parallel:

  Results incompatible with linearized model

                                                             |                            
                                                      Close : Ok (())                     
                                  Output_byte 4 : Error (Sys_error("Bad file descriptor"))
                                                 Output_char 'E' : Ok (())                
                                                             |                            
                                .---------------------------------------------------------.
                                |                                                         |                            

================================================================================
failure (2 tests failed, 0 tests errored, ran 2 tests)

The same test however completes under the following two which seems to indicate a fp-related bug that may have since been fixed on trunk:

5.2.0 (no fp)
5.3.0+fp

A standalone reproducer:

let test () =
  let path = Filename.temp_file "stm-" "" in
  let channel = Out_channel.open_text path in

  Out_channel.close_noerr channel;
  (try Out_channel.output_char channel 'A'; assert false with (Sys_error _) -> ()); (* Error (Sys_error("Bad file descriptor")) *)
  (try Out_channel.output_char channel 'B'; assert false with (Sys_error _) -> ()); (* Succeeds *)

  Sys.remove path

let _ =
  for i = 1 to 10_000 do
    if i mod 250 = 0 then Printf.printf "#%!";
    test ()
  done

jmid commented 3 months ago

To summarize

the Cygwin 5.2.0 failure was actually in 5.2.0~alpha1. It hasn't shown up since #12678 was merged (into 5.2.0~beta1)
the FreeBSD 5.1 failure was in 5.1.1 - not including #12678
linux-arm64-5.1 - same
linux-s390x-5.1 - same

This means that we are only missing an explanation of the above 5.2.0+fp standalone reproducers to close this issue.

ncik-roberts commented 3 weeks ago

I ran into this too when running the test suite with Jane Street's branch of the OCaml compiler. My explanation seems to apply equally to the most recent head of ocaml/ocaml, so I thought I'd share it here.

Explanation: the STM model expects that Out_channel.flush will never raise if interleaved with Out_channel.close. However, Out_channel.flush can raise if you get an unlucky concurrent interleaving with Out_channel.close. In particular:

Out_channel.flush is called before the close happens, and this guard is satisfied (given the channel is still open): https://github.com/ocaml/ocaml/blob/fb5e30cb40603c75b5d19987fcd7911cc003d077/runtime/io.c#L822-L823
check_pending in caml_flush_partial relinquishes Out_channel.flush’s lock on the channel: https://github.com/ocaml/ocaml/blob/fb5e30cb40603c75b5d19987fcd7911cc003d077/runtime/io.c#L118-L132
Out_channel.close is called and completes, setting channel->buffer + 1 = channel->curr (https://github.com/ocaml/ocaml/blob/fb5e30cb40603c75b5d19987fcd7911cc003d077/runtime/io.c#L721-L730)
The execution of flush continues. caml_flush_partial is called, and this check indicates that there is exactly 1 byte to write (given the previous bullet point).
The write fails, and so this code raises Sys_error

If my explanation is correct, then maybe caml_flush_partial should check if the channel is closed before attempting to write to it.

ncik-roberts commented 3 weeks ago

(The issue is probably more widespread than suggested by my last message: I see that caml_flush is called by lots of functions, not just caml_ml_flush. e.g. outputting a string. All of these operations can thus relinquish the channel lock midoperation.)

jmid commented 3 weeks ago

Thanks for sharing! :pray: Your explanation makes sense to me. I can't reproduce the error ATM though... I'll need to look more into it.

jmid commented 2 weeks ago

I've now created a reproducer and filed this upstream: https://github.com/ocaml/ocaml/issues/13586

ocaml-multicore / multicoretests

[ocaml5-issue] `Out_channel.flush` can cause `Syserror` when used in parallel #444