avsm opened 2 months ago
The fact that it's racing with a read by the GC suggests these tsan reports are both false positives. I would expect it's perfectly reasonable for one domain to be examining the block's fields during marking in a major slice while another one is updating the count.
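For intuition, the same access shape can be written at the OCaml level. This is a hypothetical sketch (not from the testcase): one domain keeps writing a mutable field while another reads it. A TSan-enabled switch reports such unsynchronized accesses as data races even though OCaml's memory model gives them defined (if unspecified) behaviour:

```ocaml
(* Hypothetical sketch, not from the testcase: an unsynchronized
   read/write pair on a plain mutable field across two domains. *)
type t = { mutable count : int }

let () =
  let r = { count = 0 } in
  let writer =
    Domain.spawn (fun () -> for i = 1 to 100_000 do r.count <- i done)
  in
  let reader =
    Domain.spawn (fun () ->
        let seen = ref 0 in
        for _ = 1 to 100_000 do seen := max !seen r.count done;
        ignore (Sys.opaque_identity !seen))
  in
  Domain.join writer;
  Domain.join reader;
  (* Both domains have been joined, so this read is well-synchronized. *)
  Printf.printf "final count = %d\n" r.count
```

A benign-by-design race inside the runtime (GC marking vs. mutator update) would look the same to TSan, which is why a report alone doesn't establish a bug.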
@OlivierNicole might know for sure.
Does it actually crash, or just give warnings from tsan? (on my machine, I only see the warnings)
I'm trying to reproduce the crash but have only gotten it once on the bigger program from which I extracted this testcase
```
ThreadSanitizer:DEADLYSIGNAL
==800191==ERROR: ThreadSanitizer: SEGV on unknown address 0x557e838d93a4 (pc 0x557e83a6f782 bp 0x722c0001ff80 sp 0x7f5624dff160 T800204)
==800191==The signal is caused by a WRITE memory access.
    #0 oldify_one runtime/minor_gc.c:252 (main.exe+0x413782) (BuildId: 15be3c35d13b78db78a598b67ef2ce684d8ad79d)
    #1 caml_empty_minor_heap_promote runtime/minor_gc.c:561 (main.exe+0x41495a) (BuildId: 15be3c35d13b78db78a598b67ef2ce684d8ad79d)
    #2 caml_stw_empty_minor_heap_no_major_slice runtime/minor_gc.c:792 (main.exe+0x414fc0) (BuildId: 15be3c35d13b78db78a598b67ef2ce684d8ad79d)
    #3 caml_stw_empty_minor_heap runtime/minor_gc.c:823 (main.exe+0x4152b6) (BuildId: 15be3c35d13b78db78a598b67ef2ce684d8ad79d)
    #4 stw_handler runtime/domain.c:1486 (main.exe+0x3eb508) (BuildId: 15be3c35d13b78db78a598b67ef2ce684d8ad79d)
    #5 handle_incoming runtime/domain.c:351 (main.exe+0x3eb508)
    #6 caml_handle_incoming_interrupts runtime/domain.c:364 (main.exe+0x3ec18b) (BuildId: 15be3c35d13b78db78a598b67ef2ce684d8ad79d)
    #7 caml_handle_gc_interrupt runtime/domain.c:1897 (main.exe+0x3ec18b)
    #8 caml_do_pending_actions_res runtime/signals.c:338 (main.exe+0x41e06c) (BuildId: 15be3c35d13b78db78a598b67ef2ce684d8ad79d)
    #9 caml_alloc_small_dispatch runtime/minor_gc.c:896 (main.exe+0x415514) (BuildId: 15be3c35d13b78db78a598b67ef2ce684d8ad79d)
    #10 caml_garbage_collection runtime/signals_nat.c:86 (main.exe+0x42dd78) (BuildId: 15be3c35d13b78db78a598b67ef2ce684d8ad79d)
    #11 caml_call_gc <null> (main.exe+0x42a165) (BuildId: 15be3c35d13b78db78a598b67ef2ce684d8ad79d)
    #12 camlBase64_rfc2045.pp_base64_761 src/base64_rfc2045.ml:252 (main.exe+0x2ba8a6) (BuildId: 15be3c35d13b78db78a598b67ef2ce684d8ad79d)
    #13 camlMrmime__B64.parser_898 src/base64_rfc2045.ml:278 (main.exe+0x27d255) (BuildId: 15be3c35d13b78db78a598b67ef2ce684d8ad79d)
    #14 camlMrmime__B64.parser_898 lib/b64.ml:98 (main.exe+0x27d3a3) (BuildId: 15be3c35d13b78db78a598b67ef2ce684d8ad79d)
    #15 camlMrmime__B64.parser_898 lib/b64.ml:98 (main.exe+0x27d3a3) (BuildId: 15be3c35d13b78db78a598b67ef2ce684d8ad79d)
    #16 camlMrmime__B64.parser_898 lib/b64.ml:98 (main.exe+0x27d3a3) (BuildId: 15be3c35d13b78db78a598b67ef2ce684d8ad79d)
    #17 camlMrmime__B64.parser_898 lib/b64.ml:98 (main.exe+0x27d3a3) (BuildId: 15be3c35d13b78db78a598b67ef2ce684d8ad79d)
```
Here's a simpler way to trigger the tsan error (no I/O involved; it works with both the Linux and POSIX backends):
```ocaml
open Eio.Std

let run_worker () =
  Switch.run ~name:"run_worker" @@ fun sw ->
  while true do
    Fiber.fork ~sw (fun () ->
        for _ = 0 to 1000 do
          ignore (String.make 10000 'a')
        done)
  done

let () =
  Eio_main.run @@ fun env ->
  Eio.Switch.run @@ fun sw ->
  let domain_mgr = Eio.Stdenv.domain_mgr env in
  for _ = 1 to 7 do
    Fiber.fork_daemon ~sw (fun () -> Eio.Domain_manager.run domain_mgr run_worker)
  done;
  for _ = 1 to 10000 do
    Fiber.fork ~sw (fun () ->
        for _ = 1 to 1000 do
          ignore (Sys.opaque_identity ())
        done)
  done
```
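For comparison, a stdlib-only stress test in the same spirit (a sketch; whether it reproduces the report is untested) avoids Eio entirely, which can help establish whether the library is involved at all:

```ocaml
(* Hypothetical stdlib-only variant of the reproducer above: several
   domains allocate short-lived strings to force frequent minor
   collections, with no Eio involved. *)
let worker () =
  for _ = 1 to 1_000 do
    ignore (Sys.opaque_identity (String.make 10_000 'a'))
  done

let () =
  let domains = Array.init 4 (fun _ -> Domain.spawn worker) in
  Array.iter Domain.join domains;
  print_endline "done"
```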
Still trying to track down exactly what is causing this, but it now triggers with a small testcase on 5.2.0 and 5.3.0+trunk on x86_64 with the following program:
built with:
It triggers the following data race reliably on Linux:
It happens with both the uring and POSIX backends.