ocaml-multicore / multicoretests

PBT testsuite and libraries for testing multicore OCaml
https://ocaml-multicore.github.io/multicoretests/
BSD 2-Clause "Simplified" License

Add Out_channel STM tests #431

Closed jmid closed 7 months ago

jmid commented 8 months ago

The Lin Out_channel test is a sore point of the test suite:

Out_channels buffer internally, meaning the result of length can vary with flushing. This is a bad fit for Lin's sequential-consistency test, because we may end up finding counterexamples that merely reflect different buffering (between a parallel run and any interleaved sequential run) - or just spend a long time searching for one. As such, a model-based STM test is a better fit - and this PR offers one.

The new model-based test

Many things thus speak for switching to it.
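
To make the switch concrete, here is a minimal hypothetical sketch of the kind of model such an STM test can build on - not the PR's actual code, and the cmd constructors and field names are made up. The point is that the model only tracks what stays predictable under buffering:

(* Hypothetical model sketch: track only whether the channel is
   open and how many bytes have been written, flushed or not. *)
type cmd = Output_byte of int | Length | Flush | Close
type state = { open_ : bool; written : int }

let init_state = { open_ = true; written = 0 }

let next_state c s = match c with
  | Output_byte _ -> { s with written = s.written + 1 }
  | Length | Flush -> s
  | Close -> { s with open_ = false }

With such a model, a postcond could, e.g., accept any Length result between 0 and written, side-stepping the buffering issue that trips up the Lin test.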

On the other hand, despite its downsides, the old Lin test has also managed to stress the runtime into triggering defects, cf. #412

Technically, there are a few nuggets in the test code:

Closes #378 and #401

jmid commented 8 months ago

CI summary for 2150652:

Out of 59 workflows, 7 failed: 1 genuine failure and 6 CI setup issues.

jmid commented 8 months ago

I've rebased the PR on main after the merge of #429.

jmid commented 8 months ago

CI summary for 024dd85:

Out of 44 workflows, 5 failed, all 5 being genuine failures.

shym commented 8 months ago

> On the other hand, despite its downsides, the old Lin test has also managed to stress the runtime into triggering defects, cf. #412

Do you think the STM tests are less likely to trigger such defects?

jmid commented 8 months ago

Thanks for the review!

Discussing this PR with you prompted me to look up previous counterexamples reported by the Lin Out_channel test. Indeed, the last merge to main, a571972, triggered, e.g., the following in the Linux 5.2 debug workflow:

Messages for test Lin Out_channel test with Domain:

  Results incompatible with sequential execution

                                           |                  
                         Out_channel.output_byte t 2 : Ok (())
                                           |                  
                       .--------------------------------------.
                       |                                      |                  
         Out_channel.length t : Ok (1)         Out_channel.close_noerr t : ()   

This should have a possible sequentialization (below) which we seem to miss:

Out_channel.output_byte t 2;;
Out_channel.length t;;
Out_channel.close_noerr t;;

Looking at previous counterexamples also made me realize that…

Our discussion also made me check my recollection regarding locking: https://github.com/ocaml/ocaml/blob/cedff5854ac91dccf5847b527192223ef506b1e2/runtime/io.c#L60

> All operations on channels first take the channel lock.

As such, the Out_channel operations should act atomically when used in parallel and be sequentially consistent.

> On the other hand, despite its downsides, the old Lin test has also managed to stress the runtime into triggering defects, cf. #412
>
> Do you think the STM tests are less likely to trigger such defects?

Possibly, but I'm unsure.

Since the number of interleavings can be quite high, a corner-case bug is more likely to trigger the more the test is run. On the other hand, when considering this behaviour across a count of 1000, the negative Lin test will finish sooner (and report a "counterexample") whereas the current STM test runs to completion (1000 inputs * 25 repetitions).

jmid commented 8 months ago

In 8236876 I extended the STM tests to be able to generate close and close_noerr in state Closed too.
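
Continuing the hypothetical sketch from earlier (with an added Close_noerr constructor and a show_cmd printer assumed), a state-dependent generator permitting this could look roughly as follows:

(* Sketch: also generate the close cmds in state Closed - the spec
   says close and flush do nothing on an already closed channel. *)
let arb_cmd s =
  let open QCheck in
  let gens =
    if s.open_
    then [ Gen.map (fun i -> Output_byte i) Gen.small_nat;
           Gen.return Length;
           Gen.return Flush;
           Gen.return Close;
           Gen.return Close_noerr ]
    else [ Gen.return Close;        (* now reachable when closed *)
           Gen.return Close_noerr;
           Gen.return Flush ]
  in
  make ~print:show_cmd (Gen.oneof gens)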

shym commented 8 months ago

> This should have a possible sequentialization (below) which we seem to miss:

In the sequentialized version, the length will most probably be 0, as the buffer is not flushed; the output we see is due to the close_noerr being half-run, having flushed but not closed yet (as those two steps are explicitly not run atomically), isn’t it?

This is a gray zone specification-wise, which does not offer an immediate benefit AFAICS.

To make sure that I understand correctly what this is doing (because I think I missed it in my first review): due to the postcond function, we’ll never test close or close_noerr in one of the parallel branches except when the other branch is only doing open and close calls - or am I mistaken in how postcond is used in all_interleavings_ok? This would mean those tests would never run into issues such as ocaml/ocaml#11878, I suppose?

jmid commented 8 months ago

> This should have a possible sequentialization (below) which we seem to miss:
>
> In the sequentialized version, the length will most probably be 0, as the buffer is not flushed; the output we see is due to the close_noerr being half-run, having flushed but not closed yet (as those two steps are explicitly not run atomically), isn’t it?

This is a probable explanation indeed! :+1: However, it seems a bit unsatisfying that this stops testing and is reported as a counterexample of parallelism-safety.

> This is a gray zone specification-wise, which does not offer an immediate benefit AFAICS.
>
> To make sure that I understand correctly what this is doing (because I think I missed it in my first review): due to the postcond function, we’ll never test close or close_noerr in one of the parallel branches except when the other branch is only doing open and close calls - or am I mistaken in how postcond is used in all_interleavings_ok? This would mean those tests would never run into issues such as ocaml/ocaml#11878, I suppose?

Point well taken - and yes, you are right :+1: Following the Erlang version described in the ICFP'09 paper, STM generates candidate "Y"-triples and only keeps those which satisfy all preconditions in all interleavings (which is what all_interleavings_ok should express). This indeed leaves out two parallel close* cmds.

I've therefore taken a pass over the STM test and extended the cmds so that all of them can also occur in state Closed. The only generator and precond limitation now is that Open_text is not allowed in state Open (this was also a limitation of the Lin test).
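
For concreteness, and assuming a nullary Open_text constructor along the lines of the test's, that remaining restriction might be expressed as in this sketch. Since all_interleavings_ok requires every cmd's precond to hold in every interleaving, such a precond prunes entire candidate triples, not just individual runs:

(* Sketch: the single remaining precondition. *)
let precond c s = match c with
  | Open_text -> not s.open_   (* no re-opening of an open channel *)
  | _ -> true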

This revealed a divergence from the spec as I shared offline:

val close : t -> unit
(** Close the given channel, flushing all buffered write operations.  Output
    functions raise a [Sys_error] exception when they are applied to a closed
    output channel, except {!close} and {!flush}, which do nothing when applied
    to an already closed channel.  Note that {!close} may raise [Sys_error] if
    the operating system signals an error when flushing or closing. *)

Yet output_string, output_bytes, output, and output_substring of length 0 on a Closed channel do not fail, e.g.:

--- Failure --------------------------------------------------------------------

Test STM Out_channel test sequential failed (22 shrink steps):

   Close
   Output_string ""

+++ Messages ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

Messages for test STM Out_channel test sequential:

  Results incompatible with model

   Close : Ok (())
   Output_string "" : Ok (())

For now, I've adjusted postcond to accept these corner cases. We should consider submitting a PR to adjust the specification, though...
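
In sketch form, the adjusted acceptance could look like the following - again hypothetical: it assumes an Output_string cmd whose run wraps the call in STM's protect, so the result matches qcheck-stm's (unit, exn) result patterns:

open STM   (* Res/Result/Unit/Exn patterns from qcheck-stm *)

(* Sketch: accept both behaviours of zero-length writes on a
   closed channel, matching the observed spec divergence. *)
let postcond c (s : state) res = match c, res with
  | Output_string str, Res ((Result (Unit, Exn), _), r) ->
    (match r, s.open_ with
     | Ok (), true -> true                  (* normal write *)
     | Error (Sys_error _), false -> true   (* closed channel: expected *)
     | Ok (), false -> str = ""             (* corner case: "" succeeds *)
     | _, _ -> false)
  | _ -> true   (* other cmds elided in this sketch *)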

jmid commented 8 months ago

> This would mean those tests would never run into issues such as ocaml/ocaml#11878, I suppose?

We can now experimentally confirm that the latest changes can indeed find issues like that. All 5.2 and 5.3/trunk workflows show a regression on outputting to a closed Out_channel, where only the first cmd now triggers an exception :open_mouth:

+++ Messages ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

Messages for test STM Out_channel test sequential:

  Results incompatible with model

   Close_noerr : Ok (())
   Output_string "}\169)gN\255\017" : Error (Sys_error("Bad file descriptor"))
   Output_byte 48 : Ok (())

jmid commented 8 months ago

Some overdue CI summaries...

For 9c63209:

Out of 44 workflows, 5 failed and 1 was cancelled, all 6 due to genuine issues.

For 8236876:

Out of 44 workflows, 5 failed, all 5 being genuine issues.

For 0c8fdcb:

Out of 44 workflows, 24 failed with a total of 28 failures (4 workflows failed two tests), all of them genuine.

jmid commented 8 months ago

The PR's latest test revision is so effective at triggering the #432 regression that we should consider temporarily relaxing the test to accept the current behaviour, to avoid having to check ~20 failing workflows on each CI run... :grimacing:

jmid commented 7 months ago

I rebased on main to avoid more false alarms and added 86f2ab2 to temporarily disable the #432 regression reports.

jmid commented 7 months ago

CI summary:

Out of 44 workflows, 2 failed, both being genuine issues.

jmid commented 7 months ago

CI summary for merge to main:

Out of 45 workflows, 1 failed with a genuine issue.