ocaml-multicore / picos

Interoperable effects based concurrency
https://ocaml-multicore.github.io/picos/doc/picos/index.html
ISC License
86 stars 3 forks source link

`Picos_mpmcq` failure #224

Closed polytypic closed 1 month ago

polytypic commented 2 months ago

Seems that Picos_mpmcq is not working correctly:

random seed: 966237474453890113
generated error fail pass / total     time test name
[✓]  128    0    0  128 /  128     0.0s Picos_mpmcq sequential
[✗]   97    0    1   96 /  128     3.4s Picos_mpmcq parallel

--- Failure --------------------------------------------------------------------

Test Picos_mpmcq parallel failed (0 shrink steps):

                           |
                       Push 764
                       Push 203
                           |
                .---------------------.
                |                     |
            Push 300               Pop_opt
             Pop_opt               Pop_opt
          Push_head 107           Push 626
            Push 395              Push 355
                                Push_head 892

+++ Messages ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

Messages for test Picos_mpmcq parallel:

  Results incompatible with linearized model

                                 |
                           Push 764 : ()
                           Push 203 : ()
                                 |
              .------------------------------------.
              |                                    |
        Push 300 : ()                     Pop_opt : Some (764)
     Pop_opt : Some (203)                    Pop_opt : None
      Push_head 107 : ()                     Push 626 : ()
        Push 395 : ()                        Push 355 : ()
                                           Push_head 892 : ()

================================================================================
failure (1 tests failed, 0 tests errored, ran 2 tests)

A DSCheck based test should be added and the issue fixed.

polytypic commented 1 month ago

I experimented with a few different DSCheck based tests, but couldn't reproduce a failure. Below is one based on the above case.

module Atomic = Dscheck.TracedAtomic
module Q = Picos_mpmcq

let popped_some q =
  let b =
    match Picos_mpmcq.pop_exn q with
    | _ -> true
    | exception Picos_mpmcq.Empty -> false
  in
  Atomic.check (fun () -> b)

let test_multi_push_pop () =
  Atomic.trace @@ fun () ->
  let q = Q.create () in
  Q.push q 764;
  Q.push q 203;
  begin
    Atomic.spawn @@ fun () ->
    Q.push q 300;
    popped_some q;
    Q.push_head q 107;
    Q.push q 395
  end;
  begin
    Atomic.spawn @@ fun () ->
    popped_some q;
    popped_some q;
    Q.push q 626
    (*Q.push q 355;*)
    (*Q.push_head q 892*)
  end;
  Atomic.final @@ fun () ->
  for _ = 1 to 3 do
    popped_some q
  done

let () =
  Alcotest.run "Picos_mpmcq DSCheck"
    [
      ( "Multiple pushes and pops",
        [ Alcotest.test_case "" `Quick test_multi_push_pop ] );
    ]

This suggests to me that the problem might be with the use of fenceless (or relaxed) atomic operations, because the DSCheck checker does not know about such operations (and treats them as sequentially consistent operations).

In the pop logic there was a case where the head and tail were both read with a fenceless operation and then a third read of the head was made with a sequentially consistent read. If the third read matched the previous read of the head then the queue was determined to be empty. Here is a relevant snippet:

let rec pop t backoff = function
  | H (Cons cons_r as cons) -> (* ... *)
  | H (Head head_r as head) -> begin
      match Atomic.fenceless_get t.tail with
      | T (Snoc snoc_r as move) -> (* ... *)
      | T (Tail tail_r as tail) -> begin
          match tail_r.move with
          | Used -> pop_emptyish t backoff head
          | Snoc _ as move -> (* ... *)
        end
    end

(* ... *)

and pop_emptyish t backoff head =
  let new_head = Atomic.get t.head in
  if new_head == H head then raise_notrace Empty else pop t backoff new_head

let pop_exn t = pop t Backoff.default (Atomic.fenceless_get t.head)

This is clearly not correct, because the read of the tail might happen before the first read of the head. Changing the read of the tail to be Atomic.get prevents that.