ocaml / ocaml

The core OCaml system: compilers, runtime system, base libraries
https://ocaml.org
Other
5.4k stars 1.09k forks source link

TSan reports a race with the GC #13427

Open talex5 opened 1 week ago

talex5 commented 1 week ago

Running an Eio program with 5.2.0-tsan, I get:

WARNING: ThreadSanitizer: data race (pid=56677)
  Write of size 8 at 0x7f6fea842f48 by main thread (mutexes: write M117):
    #0 camlEio__core__Switch.dec_fibers_1041 lib_eio/core/switch.ml:71 (main.exe+0xdeb20)
    #1 camlStdlib__Fun.finally_no_exn_330 /home/user/.opam/5.2.0-tsan/.opam-switch/build/ocaml-variants.5.2.0+options/stdlib/fun.ml:30 (main.exe+0x17a352)
    #2 camlStdlib__Fun.protect_326 /home/user/.opam/5.2.0-tsan/.opam-switch/build/ocaml-variants.5.2.0+options/stdlib/fun.ml:35 (main.exe+0x17a2ae)
    #3 caml_runstack <null> (main.exe+0x207f29)
    #4 camlStdlib__Fun.protect_326 /home/user/.opam/5.2.0-tsan/.opam-switch/build/ocaml-variants.5.2.0+options/stdlib/fun.ml:34 (main.exe+0x17a23a)
    #5 camlEio_mock__Backend.fun_965 lib_eio/mock/backend.ml:107 (main.exe+0xba184)
    #6 camlDune__exe__Main.entry examples/hello/main.ml:20 (main.exe+0xb813b)
    #7 caml_program <null> (main.exe+0xb519e)
    #8 caml_start_program <null> (main.exe+0x20774f)
    #9 caml_startup_common runtime/startup_nat.c:132 (main.exe+0x206e15)
    #10 caml_startup_common runtime/startup_nat.c:88 (main.exe+0x206e15)
    #11 caml_startup_exn runtime/startup_nat.c:139 (main.exe+0x206ed7)
    #12 caml_startup runtime/startup_nat.c:144 (main.exe+0x206ed7)
    #13 caml_main runtime/startup_nat.c:151 (main.exe+0x206ed7)
    #14 main runtime/main.c:37 (main.exe+0xb4965)

  Previous atomic read of size 8 at 0x7f6fea842f48 by thread T13 (mutexes: write M152):
    #0 mark_stack_push_block runtime/major_gc.c:895 (main.exe+0x1e5868)
    #1 caml_darken runtime/major_gc.c:1230 (main.exe+0x1e762e)
    #2 caml_darken runtime/major_gc.c:1208 (main.exe+0x1e762e)
    #3 scan_stack_frames runtime/fiber.c:264 (main.exe+0x1d1076)
    #4 caml_scan_stack runtime/fiber.c:285 (main.exe+0x1d1076)
    #5 caml_darken_cont runtime/major_gc.c:1199 (main.exe+0x1e7421)
    #6 do_some_marking runtime/major_gc.c:1033 (main.exe+0x1e8507)
    #7 mark runtime/major_gc.c:1148 (main.exe+0x1e86bb)
    #8 major_collection_slice runtime/major_gc.c:1725 (main.exe+0x1e8fb7)
    #9 caml_major_collection_slice runtime/major_gc.c:1905 (main.exe+0x1ea014)
    #10 caml_poll_gc_work runtime/domain.c:1751 (main.exe+0x1c8e5d)
    #11 stw_handler runtime/domain.c:1401 (main.exe+0x1c9315)
    #12 handle_incoming runtime/domain.c:340 (main.exe+0x1c9315)
    #13 caml_handle_incoming_interrupts runtime/domain.c:353 (main.exe+0x1ca4bb)
    #14 caml_handle_gc_interrupt runtime/domain.c:1774 (main.exe+0x1ca4bb)
    #15 caml_check_urgent_gc runtime/minor_gc.c:903 (main.exe+0x1f0f20)
    #16 caml_alloc_string runtime/alloc.c:188 (main.exe+0x1b8944)
    #17 caml_create_bytes runtime/str.c:78 (main.exe+0x1fca97)
    #18 caml_c_call <null> (main.exe+0x2075d5)
    #19 camlStdlib__Bytes.make_282 /home/user/.opam/5.2.0-tsan/.opam-switch/build/ocaml-variants.5.2.0+options/stdlib/bytes.ml:42 (main.exe+0x12d7d2)
    #20 camlDune__exe__Main.fun_663 /workspace_root/string.ml:36 (main.exe+0xb7c0a)
    #21 camlEio__core__Fiber.fun_1434 lib_eio/core/fiber.ml:21 (main.exe+0xe6716)
    #22 camlStdlib__Fun.protect_326 /home/user/.opam/5.2.0-tsan/.opam-switch/build/ocaml-variants.5.2.0+options/stdlib/fun.ml:34 (main.exe+0x17a23a)
    #23 caml_runstack <null> (main.exe+0x207f29)
    #24 camlStdlib__Fun.protect_326 /home/user/.opam/5.2.0-tsan/.opam-switch/build/ocaml-variants.5.2.0+options/stdlib/fun.ml:34 (main.exe+0x17a23a)
    #25 camlEio_mock__Backend.fun_965 lib_eio/mock/backend.ml:107 (main.exe+0xba184)
    #26 camlStdlib__Fun.protect_326 /home/user/.opam/5.2.0-tsan/.opam-switch/build/ocaml-variants.5.2.0+options/stdlib/fun.ml:34 (main.exe+0x17a23a)
    #27 camlStdlib__Domain.body_735 /home/user/.opam/5.2.0-tsan/.opam-switch/build/ocaml-variants.5.2.0+options/stdlib/domain.ml:263 (main.exe+0x15861f)
    #28 caml_start_program <null> (main.exe+0x20774f)
    #29 caml_callback_exn runtime/callback.c:201 (main.exe+0x1c30eb)
    #30 domain_thread_func runtime/domain.c:1215 (main.exe+0x1c8390)

My application code is just decrementing a counter in a record. It seems reasonable to me that the GC in another domain might be scanning that field at the same time. I assume this should be suppressed somehow.

The test-case doesn't trigger very reliably for me, but for reference here is the code:

module Switch = Eio__core__Switch
module Fiber = Eio__core__Fiber
module Backend = Eio_mock.Backend

let run_worker () =
  Eio_mock.Backend.run @@ fun () ->
  Switch.run ~name:"run_worker" @@ fun sw ->
  while true do
    Fiber.fork ~sw (fun () ->
        for _ = 0 to 1000 do
          ignore (String.make 10000 'a')
        done
      );
  done

let domains = ref []

let () =
  try
    Backend.run @@ fun () ->
    Switch.run @@ fun sw ->
    for _ = 1 to 7 do
      Fiber.fork ~sw (fun () ->
          Eio.Private.Suspend.enter "run-domain" (fun _ctx k ->
              let d = (Domain.spawn @@ fun () ->
                       Fun.protect run_worker ~finally:(fun () -> if false then k (Ok ()))
                      )
              in
              domains := d :: !domains
            );
        )
    done;
    for _ = 1 to 10000 do
      Fiber.fork ~sw (fun () ->
          for _ = 1 to 1000 do
            ignore (Sys.opaque_identity ())
          done
        )
    done
  with Backend.Deadlock_detected -> print_endline "Done"

This is using the "mock" backend to simplify things (the mock backend avoids most C stubs and OS interactions).

I did capture an rr recording of tsan reporting the warning if that's useful.

(original report from @avsm in https://github.com/ocaml-multicore/eio/issues/751)

OlivierNicole commented 1 week ago

This does look like a false positive. Or, rather, like a case where our definition of mixed model data races breaks down (see the last paragraph of this section of the manual). @fabbing and I made it so that a non-atomic access in OCaml code can race with an atomic access from the runtime (in this case, a relaxed atomic / volatile read from the GC).

It looks like you just uncovered a legitimate case where such accesses can happen concurrently.

Now, this should not have any real consequences, because in our mental model (for lack of a formalized framework), non-atomic accesses in OCaml can be considered equivalent to relaxed atomics in C. TSan support is a compromise and I’m not sure we can avoid this report without generating false negatives. So yes, my advice would be to probably suppress it.

gasche commented 1 week ago

It's not clear to me why there is a race: line 895 in major_gc.c is

https://github.com/ocaml/ocaml/blob/74a76cc2f2588eb18a443a60918f040838292ac7/runtime/major_gc.c#L895

so it reads block using the Field macro, but Field casts its argument into a volatile value * before its dereference:

https://github.com/ocaml/ocaml/blob/74a76cc2f2588eb18a443a60918f040838292ac7/runtime/caml/mlvalues.h#L231

I thought that this volatile value * cast should suffice to avoid having this read participate in data races. What am I missing?

OlivierNicole commented 1 week ago

I thought that this volatile value * cast should suffice to avoid having this read participate in data races. What am I missing?

It will not race with atomic writes but can still race with non-atomic ones, which is the case of the other access here. It’s a non-atomic write from OCaml, which is TSan-instrumented as a plain load.