Open dra27 opened 1 year ago
Why is the test failing? Is this a bug with the Windows implementation, or more generally an OCaml bug?
The test code is cryptic, but it is here:
My understanding is that, in this test, one domain repeatedly updates a reference pointing to a string, while another domain accesses the reference and checks that the observed string is "valid" (one the values that the first domain writes).
This failure does look like something bad to me (violating my expectation about concurrent programming in OCaml).
I have no clue, but my guess as what is happening would be that some string-construction functions are internally using in-place update on a bytes
value before "releasing" the value as an immutable, shared string value, and that we would need a barrier at the time of the release (so in Bytes.unsafe_to_string
I guess?) to ensure that these updates are not observable/delayed from the point of view of other domains.
(cc @maranget I guess, who wrote the memory-model tests in the first place.)
I'm curious to see whether MSVC can trigger this as well - the purpose in opening the tracking issue is to let all the other Windows core developers loose on the problem, while stopping Windows CI being completely ignored on other PRs 🙂
Hi @dra27, it is not easy for me to run the test on windows. Would you mind attempting to reproduce the problem, running the test directly with verbose output enabled?
More precisely, you can compile the test with make byte
and then run it as follows:
% ../../../boot/ocamlrun ./publish.byt -v
...
Test Publish+string Forbidden
Histogram (8 states)
99914:> r0=*null*
27:> r0=0
7:> r0=1
5:> r0=2
2:> r0=3
3:> r0=4
3:> r0=6
39:> r0=7
Observation Publish+string Never 0 100000
...
If your machine has N cores, it may be interesting to add option -a
N.
Having a verbose output may help. Thanks.
No problem at all, @maranget! Here's with the default (-a 2
):
Test Publish+intref Forbidden
Histogram (2 states)
39637:> q=p; r0=1;
60363:> q=null; r0=-1;
Observation Publish+intref Never 0 100000
Test Publish+string Forbidden
Histogram (9 states)
99445:> r0=*null*
239:> r0=0
75:> r0=1
72:> r0=2
77:> r0=3
31:> r0=4
11:> r0=5
7*> r0=
43:> r0=7
Observation Publish+string Sometimes 7 99993
Invalid behaviour on test Publish+string
Test MP+addr Allowed
Histogram (3 states)
50123:> idx=0; v=0;
359:> idx=0; v=1;
49518:> idx=1; v=1;
Observation MP+addr Never
0 100000
and here's with -a 4
(that's the core count for the i7-8650U):
Test Publish+intref Forbidden
Histogram (2 states)
75567:> q=p; r0=1;
124433:> q=null; r0=-1;
Observation Publish+intref Never 0 200000
Test Publish+string Forbidden
Histogram (12 states)
198557:> r0=*null*
876:> r0=0
156:> r0=1
131:> r0=2
103:> r0=3
7:> r0=4
4:> r0=5
4*> r0=
63:> r0=4
20:> r0=5
9:> r0=6
70:> r0=7
Observation Publish+string Sometimes 4 199996
Invalid behaviour on test Publish+string
Test MP+addr Allowed
Histogram (3 states)
100046:> idx=0; v=0;
6218:> idx=0; v=1;
93736:> idx=1; v=1;
Observation MP+addr Never
0 200000
Interestingly, at -a 8
(that's the thread count for the i7-8650U) I just managed to get a segfault, as seen on AppVeyor.
For the record:
Printf.sprintf "%d"
which does not use Bytes.unsafe_to_string, it uses Bytes.content
which performs a copy of the bytes into a string using Bytes.sub_string
.My repro code was as follows.
(* non-atomic references *)
module Ref = struct
let make v = ref v
let get r = !r
let set r v = r := v
end
(* (* to get atomic references, comment above and uncomment below *)
module Ref = Atomic
*)
let digits = List.init 10 (fun digit ->
Char.chr (digit + Char.code '0'))
let msg_len = 32
let values = List.map (fun c -> String.make msg_len c) digits
let valid =
let module SSet = Set.Make(String) in
let values = SSet.of_list values in
fun s -> SSet.mem s values
let shared = Ref.make (List.hd values)
let producing = Ref.make true
let runs = 100_000
let producer () =
for _ = 1 to runs do
digits |> List.iter @@ fun digit ->
let msg = String.make msg_len digit in
let copy = String.sub msg 0 msg_len in
Ref.set shared copy
done;
Ref.set producing false
let consumer () =
while Ref.get producing do
let msg = Ref.get shared in
assert (valid msg);
Domain.cpu_relax ();
done
let () =
let doms =
List.init (Domain.recommended_domain_count () - 1)
(fun _ -> Domain.spawn consumer) in
producer ();
List.iter Domain.join doms;
I sort of hoped to observe an assert failure, but did not.
(The test uses a very short msg_len
of basically 1
, but I thought that a larger value would make issues more likely.)
It looks like you're suspecting weak memory model issues to be the cause for this issue, but from a discussion yesterday with @maranget :
caml_modify
should suffice to guarantee that the string is completely filled by the time it's fetched from the reference, no matter how it is produced;@maranget is currently jumping up and down in my temporary office, excited at the idea that memmove
(used in String.sub
) may use "non-temporal writes". His cavern of x86 memory behaviors unfortunately only contains NT-NT tests, not a non-temporal write followed by a normal write, so he has new tests to write during the night.
memmove (used in String.sub) may use "non-temporal writes"
The plot thickens... I'm afraid we're about to conclude that the C standard library is unusable in the (Multicore) OCaml runtime system and we need to rewrite everything ourselves...
But let's not forget the possibility that this is a "plain" concurrency bug somewhere else in the runtime system, of the kind that can happen in an SC execution.
Primitive memmove
using non-temporal writes may explain the observed behaviour. Namely, the following test is observed on some x86_64 machine:
X86_64 MP+NT+TT
{
0:rax=x; 0:rcx=y;
}
P0 | P1 ;
movl $1,%ebx | movl (y),%eax ;
movntil %ebx,(%rax) | movl (x),%ebx ;
movl $1,%edx | ;
movl %edx,(%rcx) | ;
exists (1:rax=1 /\ 1:rbx=0)
Said otherwise, a non-temporal write (possibly issued by memmove) and an ordinary write (issued by caml_modify
) can be observed in reverse order.
I've managed to trigger the segfault on the same machine, but using the msvc64 port instead. I was rather hoping the Just-in-Time debugger would attach properly, but it didn't so it's running again. Anyhow, what's interesting from that is mingw-w64 uses a different, much older CRT from the msvc64 port, which is using UCRT.
There are two possible things going in - with -a 8
the first failure does seem to be a segmentation fault, where at lower values the program is terminating normally. Trying the equivalent on my desktop (with -a 64
, therefore) takes ages, but that run terminated with no output, but exit code -1.
Were I a gambler, I think I'd be putting my chips on a concurrency bug, too 🤔
Can you find the implementation of memmove
on the CRT versions that you are using?
Yes for the msvc64 one (turns out that's actually in vcruntime rather than UCRT :shrug:) - it's in C:\Program Files\Microsoft Visual Studio\2022\Preview\VC\Tools\MSVC\14.35.32124\crt\src\x64\memcpy.asm
on my installation. I don't think I have the amd64 of the equivalent file for the mingw-w64 case.
Small update. I've altered the test to run itself infinitely many times, which finally allowed me to attach a debugger to the segfaulting version. The crash is at mark_slice_darken
which manages to be called with child = (value)NULL
from do_some_marking
- it appears therefore that somehow a NULL pointer has made it to the start
field of the mark stack entry. As far as I can tell, this would happen via a call to mark_stack_push_range
. I'm seeing if the crash can be still be triggered with the debug runtime, as it looks like there should be enough assertions to prevent that value from ever getting onto the mark stack?
This issue has been open one year with no activity. Consequently, it is being marked with the "stale" label. What this means is that the issue will be automatically closed in 30 days unless more comments are added or the "stale" label is removed. Comments that provide new information on the issue are especially welcome: is it still reproducible? did it appear in other contexts? how critical is it? etc.
@dra27, what is the current status of this bug? (I am liberally applying my policies that the stale bot should not close bugs without human intervention).
My understanding is that the bug is still failing randomly (see #12425 for the list of known CI failure, the last one was three weeks ago), but we have no clue why and how to approach the issue.
This issue was about publish.ml
failing - but that most recent failure is the forbidden.ml
test. For publish.ml
, at the time we were seeing two failures - either the test failing, or a segfault. The location of the crash when segfaulting at least makes me wonder if #12525 is related, although that feels unlikely.
Offering pointers, as I don't have the time myself at the moment to keep digging with this:
@maranget, @gasche and I had a brief private thread looking at the content of memcpy.asm
(the thread was offline because it's not clear that we can post snippets of the Microsoft CRT sources online without violating its copyright), which similarly may be worth revisiting post-MSVC merge.
tests/memory-model/publish.ml
is failing too frequently in AppVeyor. I think we should disable the test, so this issue is to track that fact (if necessary, the stale bot can act as a trigger to run more tests manually).I've done some limited bulk runs of the test on two of my machines. On my laptop (Intel i7-8650U) I can relatively easily get this failure in bytecode (the most recent took 152 runs):
I ran the test for several hours with just the native code version (> 500 runs) and did not get a failure.
On my desktop (AMD Threadripper 3990X), I left it running the test overnight (> 700 runs) and haven't had a crash in either bytecode or native code.
AppVeyor seems to be getting a segfault rather than the test failure I was seeing. In all these instances,
OCAML_TEST_SIZE
should be unset, so these should be executed 2 core versions of the test.