publish.ml memory model failure in bytecode on Windows

dra27 commented 1 year ago

tests/memory-model/publish.ml is failing too frequently in AppVeyor. I think we should disable the test, so this issue is to track that fact (if necessary, the stale bot can act as a trigger to run more tests manually).

I've done some limited bulk runs of the test on two of my machines. On my laptop (Intel i7-8650U) I can relatively easily get this failure in bytecode (the most recent took 152 runs):

@@ -1,4 +1,5 @@
 Observation Publish+intref Never
-Observation Publish+string Never
+Observation Publish+string Sometimes
+Invalid behaviour on test Publish+string
 Observation Publish+intref Never
 Observation Publish+string Never

I ran the test for several hours with just the native code version (> 500 runs) and did not get a failure.

On my desktop (AMD Threadripper 3990X), I left it running the test overnight (> 700 runs) and haven't had a crash in either bytecode or native code.

AppVeyor seems to be getting a segfault rather than the test failure I was seeing. In all these instances, OCAML_TEST_SIZE should be unset, so these should be executed 2 core versions of the test.

gasche commented 1 year ago

Why is the test failing? Is this a bug with the Windows implementation, or more generally an OCaml bug?

The test code is cryptic, but it is here:

https://github.com/ocaml/ocaml/blob/23bf9056b9ec79a1675ab57cb24dcb64e6c9c646/testsuite/tests/memory-model/publish.ml#L125-L137

My understanding is that, in this test, one domain repeatedly updates a reference pointing to a string, while another domain accesses the reference and checks that the observed string is "valid" (one the values that the first domain writes).

This failure does look like something bad to me (violating my expectation about concurrent programming in OCaml).

I have no clue, but my guess as what is happening would be that some string-construction functions are internally using in-place update on a bytes value before "releasing" the value as an immutable, shared string value, and that we would need a barrier at the time of the release (so in Bytes.unsafe_to_string I guess?) to ensure that these updates are not observable/delayed from the point of view of other domains.

gasche commented 1 year ago

(cc @maranget I guess, who wrote the memory-model tests in the first place.)

dra27 commented 1 year ago

I'm curious to see whether MSVC can trigger this as well - the purpose in opening the tracking issue is to let all the other Windows core developers loose on the problem, while stopping Windows CI being completely ignored on other PRs 🙂

maranget commented 1 year ago

Hi @dra27, it is not easy for me to run the test on windows. Would you mind attempting to reproduce the problem, running the test directly with verbose output enabled?

More precisely, you can compile the test with make byte and then run it as follows:

% ../../../boot/ocamlrun ./publish.byt  -v
...
Test Publish+string Forbidden
Histogram (8 states)
   99914:> r0=*null*
      27:> r0=0
       7:> r0=1
       5:> r0=2
       2:> r0=3
       3:> r0=4
       3:> r0=6
      39:> r0=7
Observation Publish+string Never 0 100000
...

If your machine has N cores, it may be interesting to add option -aN.

Having a verbose output may help. Thanks.

dra27 commented 1 year ago

No problem at all, @maranget! Here's with the default (-a 2):

Test Publish+intref Forbidden
Histogram (2 states)
   39637:> q=p; r0=1;
   60363:> q=null; r0=-1;
Observation Publish+intref Never 0 100000

Test Publish+string Forbidden
Histogram (9 states)
   99445:> r0=*null*
     239:> r0=0
      75:> r0=1
      72:> r0=2
      77:> r0=3
      31:> r0=4
      11:> r0=5
       7*> r0=
      43:> r0=7
Observation Publish+string Sometimes 7 99993
Invalid behaviour on test Publish+string

Test MP+addr Allowed
Histogram (3 states)
   50123:> idx=0; v=0;
     359:> idx=0; v=1;
   49518:> idx=1; v=1;
Observation MP+addr Never
            0 100000

and here's with -a 4 (that's the core count for the i7-8650U):

Test Publish+intref Forbidden
Histogram (2 states)
   75567:> q=p; r0=1;
  124433:> q=null; r0=-1;
Observation Publish+intref Never 0 200000

Test Publish+string Forbidden
Histogram (12 states)
  198557:> r0=*null*
     876:> r0=0
     156:> r0=1
     131:> r0=2
     103:> r0=3
       7:> r0=4
       4:> r0=5
       4*> r0=
      63:> r0=4
      20:> r0=5
       9:> r0=6
      70:> r0=7
Observation Publish+string Sometimes 4 199996
Invalid behaviour on test Publish+string

Test MP+addr Allowed
Histogram (3 states)
  100046:> idx=0; v=0;
    6218:> idx=0; v=1;
   93736:> idx=1; v=1;
Observation MP+addr Never
            0 200000

Interestingly, at -a 8 (that's the thread count for the i7-8650U) I just managed to get a segfault, as seen on AppVeyor.

gasche commented 1 year ago

For the record:

the faulty code is using Printf.sprintf "%d" which does not use Bytes.unsafe_to_string, it uses Bytes.content which performs a copy of the bytes into a string using Bytes.sub_string.
I tried to reproduce the failure with a simpler, handcrafted program, but failed -- I suspect that my machine's memory model is not weak enough to observe any potential issue.

My repro code was as follows.

(* non-atomic references *)
module Ref = struct
  let make v = ref v
  let get r = !r
  let set r v = r := v
end
(* (* to get atomic references, comment above and uncomment below *)
module Ref = Atomic
*)

let digits = List.init 10 (fun digit ->
  Char.chr (digit + Char.code '0'))

let msg_len = 32
let values = List.map (fun c -> String.make msg_len c) digits

let valid =
  let module SSet = Set.Make(String) in
  let values = SSet.of_list values in
  fun s -> SSet.mem s values

let shared = Ref.make (List.hd values)
let producing = Ref.make true

let runs = 100_000

let producer () =
  for _ = 1 to runs do
    digits |> List.iter @@ fun digit ->
    let msg = String.make msg_len digit in
    let copy = String.sub msg 0 msg_len in
    Ref.set shared copy
  done;
  Ref.set producing false

let consumer () =
  while Ref.get producing do
    let msg = Ref.get shared in
    assert (valid msg);
    Domain.cpu_relax ();
  done

let () =
  let doms =
    List.init (Domain.recommended_domain_count () - 1)
      (fun _ -> Domain.spawn consumer) in
  producer ();
  List.iter Domain.join doms;

I sort of hoped to observe an assert failure, but did not.

(The test uses a very short msg_len of basically 1, but I thought that a larger value would make issues more likely.)

xavierleroy commented 1 year ago

It looks like you're suspecting weak memory model issues to be the cause for this issue, but from a discussion yesterday with @maranget :

the "store release" in caml_modify should suffice to guarantee that the string is completely filled by the time it's fetched from the reference, no matter how it is produced;
on x86, even plain stores give this guarantee (TSO model).

gasche commented 1 year ago

@maranget is currently jumping up and down in my temporary office, excited at the idea that memmove (used in String.sub) may use "non-temporal writes". His cavern of x86 memory behaviors unfortunately only contains NT-NT tests, not a non-temporal write followed by a normal write, so he has new tests to write during the night.

xavierleroy commented 1 year ago

memmove (used in String.sub) may use "non-temporal writes"

The plot thickens... I'm afraid we're about to conclude that the C standard library is unusable in the (Multicore) OCaml runtime system and we need to rewrite everything ourselves...

xavierleroy commented 1 year ago

But let's not forget the possibility that this is a "plain" concurrency bug somewhere else in the runtime system, of the kind that can happen in an SC execution.

maranget commented 1 year ago

Primitive memmove using non-temporal writes may explain the observed behaviour. Namely, the following test is observed on some x86_64 machine:

X86_64 MP+NT+TT
{
0:rax=x; 0:rcx=y;
}
 P0                  | P1            ;
 movl $1,%ebx        | movl (y),%eax ;
 movntil %ebx,(%rax) | movl (x),%ebx ;
 movl $1,%edx        |               ;
 movl %edx,(%rcx)    |               ;
exists (1:rax=1 /\ 1:rbx=0)

Said otherwise, a non-temporal write (possibly issued by memmove) and an ordinary write (issued by caml_modify) can be observed in reverse order.

dra27 commented 1 year ago

I've managed to trigger the segfault on the same machine, but using the msvc64 port instead. I was rather hoping the Just-in-Time debugger would attach properly, but it didn't so it's running again. Anyhow, what's interesting from that is mingw-w64 uses a different, much older CRT from the msvc64 port, which is using UCRT.

There are two possible things going in - with -a 8 the first failure does seem to be a segmentation fault, where at lower values the program is terminating normally. Trying the equivalent on my desktop (with -a 64, therefore) takes ages, but that run terminated with no output, but exit code -1.

Were I a gambler, I think I'd be putting my chips on a concurrency bug, too 🤔

gasche commented 1 year ago

Can you find the implementation of memmove on the CRT versions that you are using?

dra27 commented 1 year ago

Yes for the msvc64 one (turns out that's actually in vcruntime rather than UCRT :shrug:) - it's in C:\Program Files\Microsoft Visual Studio\2022\Preview\VC\Tools\MSVC\14.35.32124\crt\src\x64\memcpy.asm on my installation. I don't think I have the amd64 of the equivalent file for the mingw-w64 case.

dra27 commented 1 year ago

Small update. I've altered the test to run itself infinitely many times, which finally allowed me to attach a debugger to the segfaulting version. The crash is at mark_slice_darken which manages to be called with child = (value)NULL from do_some_marking - it appears therefore that somehow a NULL pointer has made it to the start field of the mark stack entry. As far as I can tell, this would happen via a call to mark_stack_push_range. I'm seeing if the crash can be still be triggered with the debug runtime, as it looks like there should be enough assertions to prevent that value from ever getting onto the mark stack?

github-actions[bot] commented 8 months ago

This issue has been open one year with no activity. Consequently, it is being marked with the "stale" label. What this means is that the issue will be automatically closed in 30 days unless more comments are added or the "stale" label is removed. Comments that provide new information on the issue are especially welcome: is it still reproducible? did it appear in other contexts? how critical is it? etc.

Octachron commented 8 months ago

@dra27, what is the current status of this bug? (I am liberally applying my policies that the stale bot should not close bugs without human intervention).

gasche commented 8 months ago

My understanding is that the bug is still failing randomly (see #12425 for the list of known CI failure, the last one was three weeks ago), but we have no clue why and how to approach the issue.

dra27 commented 8 months ago

This issue was about publish.ml failing - but that most recent failure is the forbidden.ml test. For publish.ml, at the time we were seeing two failures - either the test failing, or a segfault. The location of the crash when segfaulting at least makes me wonder if #12525 is related, although that feels unlikely.

Offering pointers, as I don't have the time myself at the moment to keep digging with this:

It'd be worthwhile, though tedious, to trawl the AppVeyor logs and update this issue and #12425 with failures (and which kind - the test actually failing, or the test actually segfaulting)
Likewise for Jenkins (I've got a lot of logs in my email, if they're not available for download from elsewhere).
When testing this in a loop until it segfaulted, it was useful to be able to drop into the debugger to get some kind of information on the actual crash. I've not managed (but also not tried very hard) to use gdb for this on Windows, but it's easier to do with MSVC. This may therefore be worth returning to after the MSVC merge, when it's easier to give repro intructions for the segfault.

@maranget, @gasche and I had a brief private thread looking at the content of memcpy.asm (the thread was offline because it's not clear that we can post snippets of the Microsoft CRT sources online without violating its copyright), which similarly may be worth revisiting post-MSVC merge.

ocaml / ocaml

publish.ml memory model failure in bytecode on Windows #11853