A miscompilation bug in LICMPass (concurrency)

llvm / llvm-project

The LLVM Project is a collection of modular and reusable compiler and toolchain technologies.

http://llvm.org

Other

29.41k stars 12.15k forks source link

A miscompilation bug in LICMPass (concurrency) #64188

Open sunghwanl opened 1 year ago

sunghwanl commented 1 year ago

A1:  int* bug(int N, atomic<int>& a) {
A2:    int* p = (int*)malloc(sizeof(int));
A3:    
A4:    for (int i = 1; i <= N; i++) {
A5:      *p = i;
A6:      foo(i, a);
A7:    }
A8:  
A9:    return p;
A10: }

LICMPass of Clang++ (13.0.0+) optimizes the above function into the following (C++ syntax is used for brevity, see godbolt for the IR code):

B1:  int* bug(int N, atomic<int>& a) {
B2:    int* p = (int*)malloc(sizeof(int));
B3:    
B4:    for (int i = 1; i <= N; i++) {
B5:      foo(i, a);
B6:    }
B7:  
B8:    *p = N;
B9:  
B10:   return p;
B11: }

This is a miscompilation bug. By compiling and running a concurrent program that involves the bug function on Armv8 processors Apple M1 and Cavium ThunderX2, we actually observed certain program behaviors forbidden by the original program before the optimization. The actual program we tested can be found here. (The forbidden behavior can be observed couple of times throughout millions~billions of program runs. The test code includes some intricate crafts, e.g., that helps repeat the test many times.)

In the following, we explain a simplified version of the concurrent program we used and show why the above LICM is a miscompilation bug.

To see a buggy execution, let's consider two threads where the first thread calls the bug function.

Thread 1:

T1.1: void thread1(int N, atomic<int>& a, atomic<int*>& X) {
T1.2:   int* p = bug(N, a);
T1.3:   X.store(p, memory_order_relaxed);
T1.4: }

Thread 2:

T2.1: void thread2(int N, atomic<int>& a, atomic<int*>& X) {
T2.2:   while (a.load(memory_order_acquire) < N);
T2.3:   int *p = X.load(memory_order_relaxed);
T2.4:   if (p != NULL) {
T2.5:     if (*p != N) {
T2.6:       cout << "bug!!!" << endl;
T2.7:     }
T2.8:   }
T2.9: }

Also suppose that the foo function is given as follows:

F1: void foo(int i, atomic<int>& a) {
F2:   a.store(i, memory_order_release);
F3: }

Now, if we create two threads as follows, the program should never print bug!!! before the problematic LICM in the bug function.

atomic<int> a;
atomic<int*> X;
int N = 1;
a.store(0, memory_order_relaxed);
X.store(NULL, memory_order_relaxed);
thread* th1 = new thread(thread1, N, ref(a), ref(X));
thread* th2 = new thread(thread2, N, ref(a), ref(X));

Indeed, once thread2 reads N from a in an acquire access mode (line T2.2), the last call to the foo function (line A6, writing N to a with a release mode) synchronizes with the acquire read. Therefore, the last write to p (line A5, *p = N) by thread1 happens-before the read from *p (line T2.5, *p != N) by thread2, and thread2 can only read N from *p. Note that there is no data race in this program (before the LICM).

However, after the above LICM, the program becomes racy on *p, and thread2 can read a value other than N from *p. We actually observed this program printing "bug!!!" by compiling and running it on Armv8 processors Apple M1 and ThunderX2!

Roughly, the above LICM is wrong because it moves a non-atomic write *p = i (line A5) after a release write (line A6). This is wrong even though p is never leaked before it is returned. Specifically, LLVM applies the above LICM and inlines the bug function in thread1, resulting in the following code:

    ...
O1: for (int i = 1; i <= N; i++) {
O2:   foo(i, a);
O3: }
O4:
O5: *p = N;
O6: X.store(p, memory_order_relaxed);
    ...

Now, the write to p (line O5) is not properly synchronized anymore, and thread2 can read an uninitialized value from p (line T2.5). In particular, such behavior is allowed by the Arm architecture because it can reorder the two independent stores O5 and O6. Therefore, this program can print "bug!!!", which is originally forbidden.

We conclude that the pointer p in the function bug should not be treated as completely local even before it is leaked (line A9).

nikic commented 1 year ago

cc @efriedma-quic

efriedma-quic commented 1 year ago

This is basically the same issue discussed in the review of https://reviews.llvm.org/D153932 ; our escape analysis doesn't properly account for atomic fences. (See also https://reviews.llvm.org/D137707.) I'm still a little surprised to see it actually show up like this. CC @jdoerfert @aeubanks

nikic commented 1 year ago

The example here doesn't use fences though. Is the example valid as given?

If this is just about fences, then it goes straight on the "technically a bug, but we're not going to fix it" pile as far as I'm concerned.

efriedma-quic commented 1 year ago

I didn't mean "fence" instructions specifically; that's just what I was thinking about looking at the patch I mentioned. The example is valid, as far as I can tell.

The key construct here is the "relaxed" store of the address (X.store(p, memory_order_relaxed);). This allows the pointer to reach another thread without a happens-before edge. So the ordering has to be enforced some other way... but the store/load of "a" is enough to enforce ordering according to the standard. The location of the store of the pointer in the code doesn't matter; it could be before or after the "release" operation that enforces the ordering.

Note that if the store of the address is not atomic, another barrier would be necessary, and if the store of the address is "release", it naturally enforces its own ordering. So the miscompile is only visible with a "relaxed" store of newly allocated pointer, I think.

As a practical matter, it seems relatively unlikely anyone would stumble over this by accident; lockless datastructures that allocate memory would normally either use a "release" store of the pointer, or use some other form of locking.

I'm a little uncomfortable assuming nobody will use atomics this way, but the potential performance regressions from restricting escape analysis around calls are also sort of nasty.

jdoerfert commented 1 year ago

I'm a little uncomfortable assuming nobody will use atomics this way, but the potential performance regressions from restricting escape analysis around calls are also sort of nasty.

For LICM, we'd need to do a peel of one iteration for this to be handled. FWIW, atomics do not have to show anywhere for this to appear. I'm not sure if the situation is really common enough to show up if we forbid LICM to do this if the pointer of the final store escapes later.

sunghwanl commented 1 year ago

An ad hoc remedy for this particular example is to move *p = i before the loop instead of moving it after. It is safe to move *p = i before foo even if foo contains some acquire accesses/fences. In general, I would say we should avoid moving memory accesses after unknown functions (or release accesses/fences) if the address being accessed can possibly be leaked in some continuation.

RalfJung commented 9 months ago

In general, I would say we should avoid moving memory accesses after unknown functions (or release accesses/fences) if the address being accessed can possibly be leaked in some continuation.

This could be made legal by adding a release fence before the address gets leaked, right? So this kind of code motion becomes a trade-off, where the cost of the fence is weighed against the benefit of moving the stores around.

davidtgoldblatt commented 8 months ago

This isn't really about LICM in my opinion; it's some sort of combination provenance + concurrency issue (alias analysis will tell you that your locals haven't escaped before you pass them to extern functions; but this isn't true in a world where concurrency can cause time travel).

Brief example with no loops: https://gcc.godbolt.org/z/4qKs4TYG3

I think the right fix for this is at the language level rather than LLVM though (and that the language should retroactively bless LLVM's behavior); assuming non-escaped locals don't alias random pointers passed to you seems like it must be fundamental.

Edit: fixed typo in godbolt link

efriedma-quic commented 8 months ago

The new testcase is interesting because there is no release barrier... which means no other thread can access the variable "local". But the thread that allocated it is fine, I guess? There's a happens-before relationship between the accesses (and the deallocation) because they're on the same thread.

I can imagine a language rule somehow trying to address this, but it would be weird. We'd have to divide up memory into "single-threaded" and "multi-threaded" variables. A variable starts out "single-threaded", then becomes "multi-threaded" when its address is stored to memory visible to other threads (or the memory where the address is stored becomes visible to other threads, or the address is converted to an integer). And then we require a release-acquire edge between the store of the address, and any load that accesses the address. Or something along those lines; the key here is that "publishing" the address of a variable becomes a special operation.

The current C++ model, by contrast, only requires that there's a release-acquire edge between the allocation/initialization of the variable and any accesses to the variable's value from another thread; the address itself is irrelevant.

This could be made legal by adding a release fence before the address gets leaked, right?

This doesn't help; a "release" only has an effect if there's an "acquire" on the other side.

davidtgoldblatt commented 8 months ago

I can imagine a language rule somehow trying to address this, but it would be weird. We'd have to divide up memory into "single-threaded" and "multi-threaded" variables. A variable starts out "single-threaded", then becomes "multi-threaded" when its address is stored to memory visible to other threads (or the memory where the address is stored becomes visible to other threads, or the address is converted to an integer). And then we require a release-acquire edge between the store of the address, and any load that accesses the address. Or something along those lines; the key here is that "publishing" the address of a variable becomes a special operation.

The thing I was thinking of is applying something like a happens-before rule not just to the accesses, but also to pointers' provenances. So e.g. in the original example, we declare that at thread2's load of p, even though p's value was obtained via a valid happens-before chain, its provenance was not. This sort of captures the notion of publication in a way that meshes with the rest of the memory model.

FWIW, the C++ committee concurrency study group will talk about this in the next couple of days and see if anyone has anything clever to say.

efriedma-quic commented 8 months ago

The simplest way to "solve" the issue is just to forbid relaxed loads/stores of pointers: without those, you can't do anything weird (ignoring inttoptr conversions; not sure how we'd handle those).

Saying that relaxed loads/stores of pointers don't have provenance is basically the same thing as forbidding them: a pointer without provenance is just an integer. If you want to allow them, you need some way to transfer the provenance alongside the pointer.

The notion of "publishing" is a way out of this conflict: it means you need a barrier at exactly one point, the point of first publication. After that, the provenance transfers like you'd normally expect. This minimizes the amount of code made undefined, and keeps existing compiler optimizations working.

hboehm commented 8 months ago

Some (evolving) slides for an upcoming C++ concurrency group presentations are here.

So far, I like David's suggestion, as I understand it. Can we paraphrase that as disallowing copying of a pointer unless the lifetime-start of the object it refers to happens-before the pointer is copied?

efriedma-quic commented 8 months ago

disallowing copying of a pointer unless the lifetime-start of the object it refers to happens-before the pointer is copied?

The C++ memory model already forbids accessing an object before it's created, or after it's destroyed.

The issue here is atomic operations that happen between the object being created and the object's address "escaping". LLVM optimizations assume such atomic operations are irrelevant... but the C++ memory model doesn't attach any special significance to the point where the pointer escapes. So the fix here is to make the point where the pointer escapes special: there has to be a happens-before edge between the escape point, and any use of the address in other threads.

RalfJung commented 8 months ago

@davidtgoldblatt isn't yours a completely different issue? It doesn't involve any release/acquire. It seems more like the example from this recent talk.

In your example, t2 should always return 1, no matter the value of allocated, shouldn't it? Except load buffering can "send values into the past" and that is incompatible with basic freshness assumptions. But this is a very different issue from the OP where no load buffering is involved. Or did I misunderstand something?

So e.g. in the original example, we declare that at thread2's load of p, even though p's value was obtained via a valid happens-before chain, its provenance was not. This sort of captures the notion of publication in a way that meshes with the rest of the memory model.

Usually the provenance is part of the pointer value itself. So how would this work, where does the gap in the provenance hb-chain come from? Under the models of provenance I am aware of, a fresh provenance is generated by malloc, but that is happens-before the access in thread2. The issue is that the compiler kind of wants to not make that provenance part of the "release" in thread1... but how is that supposed to work in general, where I do need to rely on "release" publishing the entire contents of memory, including all its provenance?

@efriedma-quic

This could be made legal by adding a release fence before the address gets leaked, right?

This doesn't help; a "release" only has an effect if there's an "acquire" on the other side.

Ah, I was fooled by the acquire in thread2 in the original example. The publishing is via X which is only used relaxed... So what happens is that the data at *p gets published via a and release/acquire, but the value of p itself gets published later via relaxed accesses -- the two are basically "taking different paths" to the other thread. Very cool, but also quite concerning.

The issue here is atomic operations that happen between the object being created and the object's address "escaping". LLVM optimizations assume such atomic operations are irrelevant... but the C++ memory model doesn't attach any special significance to the point where the pointer escapes. So the fix here is to make the point where the pointer escapes special: there has to be a happens-before edge between the escape point, and any use of the address in other threads.

The issue is that the "escape point" is not really a well-defined notion, is it? As compiler analyses get better and better, they may be able to push the "escape point" further and further down program execution. How did you plan to define "escape point"?

davidtgoldblatt commented 8 months ago

@davidtgoldblatt isn't yours a completely different issue? It doesn't involve any release/acquire. It seems more like the example from this recent talk.

In your example, t2 should always return 1, no matter the value of allocated, shouldn't it? Except load buffering can "send values into the past" and that is incompatible with basic freshness assumptions. But this is a very different issue from the OP where no load buffering is involved. Or did I misunderstand something?

I think my take is that the fundamental issue here is that the presence of concurrency can "move up" (in sequenced-before) the effective time when some thread's private-seeming data has escaped. That's done with load buffering in my example, but can also happen with store reordering (as in the example here). In this view, the release/acquire synchronization in the original example is a detail -- it's just a way to give defined behavior to our ability to detect that early-escape, but it's not really the cause.

So e.g. in the original example, we declare that at thread2's load of p, even though p's value was obtained via a valid happens-before chain, its provenance was not. This sort of captures the notion of publication in a way that meshes with the rest of the memory model.

Usually the provenance is part of the pointer value itself. So how would this work, where does the gap in the provenance hb-chain come from? Under the models of provenance I am aware of, a fresh provenance is generated by malloc, but that is happens-before the access in thread2. The issue is that the compiler kind of wants to not make that provenance part of the "release" in thread1... but how is that supposed to work in general, where I do need to rely on "release" publishing the entire contents of memory, including all its provenance?

I think that I would say something like "at T2.3, the load should return a pointer value lacking provenance, because the store of the pointer value does not happen before the load". This isn't quite workable (because of standalone fences), but something in that direction at least feels plausible to me as a language-level fix.

Maybe more succinctly: the root cause of all these issues is that some line of code gets a pointer value whose bits are equal to a valid pointer, but which they shouldn't be allowed to use. That's the same situation that we want provenance rules to solve; you have to come by your pointer values honestly. Use relaxed accesses to pass them to other threads is cheating, in the same way abusing the 1-past-the-end-of-an-array exception to get a pointer that happens to be equal to some other object's address is cheating.

RalfJung commented 8 months ago

Maybe more succinctly: the root cause of all these issues is that some line of code gets a pointer value whose bits are equal to a valid pointer, but which they shouldn't be allowed to use. That's the same situation that we want provenance rules to solve; you have to come by your pointer values honestly. Use relaxed accesses to pass them to other threads is cheating, in the same way abusing the 1-past-the-end-of-an-array exception to get a pointer that happens to be equal to some other object's address is cheating.

That makes sense, thanks. Though I do worry it can be fiendishly difficult in practice to reason about this... except ofc by just always using release-acquire when pointers are involved.^^

However, does that capture all cases of this issue? Here is another example, which I got from a colleague (Chung-Kil Hur):

<<Thread1>>

T1.1:   void thread1(atomic<int*>& X, atomic<int*>& Y) {
T1.2:     int* p = (int*)malloc(sizeof(int));
T1.3:     *p = 42;
T1.4:     rel_fence();
T1.5:     X.store(p, memory_order_relaxed);
T1.6:   }
T1.7:  void rel_fence() {
T1.8:    atomic_thread_fence(memory_order_release)
T1.9:  }

<<Thread2>>

T2.1:   void thread2(atomic<int*>& X) {
T2.2:     int*p = X.load(memory_order_relaxed);
T2.3:     atomic_thread_fence(memory_order_acquire);
T2.4:     if (p != NULL) {
T2.5:       if (*p != 42) {
T2.6:         cout << "bug!!!" << endl;
T2.7:       }
T2.8:     }
T2.9:   }

This should be equivalent to using release stores / acquire loads on X, we are following the usual pattern: "release fence then relaxed store" and "relaxed load then acquire fence". And yet, the compiler may want to move the *p = 42; down below the rel_fence() call, which would seem to break this program.

I guess you could say that at T2.2 there is no happens-before to T1.5 since relaxed doesn't induce HB -- but that would be basically saying fences don't work when doing the standard message passing idiom with pointers. Which seems bad?

I suppose that's what you meant when you said your proposal isn't quite workable because of fences.

efriedma-quic commented 8 months ago

The issue is that the "escape point" is not really a well-defined notion, is it? As compiler analyses get better and better, they may be able to push the "escape point" further and further down program execution. How did you plan to define "escape point"?

The point where the pointer becomes accessible to another thread. This is, as you note, basically impossible for the compiler to compute precisely; it's a dynamic property of the code at runtime. But the compiler doesn't need to precisely compute it; a conservative approximation is sufficient. Standard compiler escape analysis is such an approximation.

RalfJung commented 8 months ago

Even at runtime, this would be a very strange notion to introduce in the very definition of what is Defined Behavior and what not. How is it defined? A global scan of all memory transitively reachable through all pointers in a thread's stack? That's a hell of a property to reason about, if I want to justify precisely why my program is correct.

efriedma-quic commented 8 months ago

How is it defined? A global scan of all memory transitively reachable through all pointers in a thread's stack?

Not quite sure. :)

This sort of property isn't completely unprecedented; proposals for defining inttoptr are sort of similar.

RalfJung commented 8 months ago

The proposals for defining inttoptr that I am involved in (mostly on the Rust side) don't look anything like that. ;)

davidtgoldblatt commented 8 months ago

I suppose that's what you meant when you said your proposal isn't quite workable because of fences.

Yeah, exactly. (I mean, not limited to just fences of course -- any sort of opaque library synchronization could work too). But that at least feels like something we could put some words around, because now we have some synchronization primitive to hang it off of. Maybe introducing "provisional provenance"; it starts acting like empty provenance but can get promoted if certain synchronization conditions are met. Something like:

If there are atomic pointer store S of value V and atomic pointer load load L, where L takes its value from S[1], then if S happens before L, L obtains V. If S does not happen before L, then L obtains V', which has the same address as V but a provisional version of V's provenance. On dereference of V', if there is some evaluation P such that the object to which V refers' lifetime start happens before P, L happens before P, and P happens before the dereference, then the dereference acts as though it has V's provenance. Otherwise, it acts as though it has empty provenance.

The intuition being, we pretend that the provenance for some storage lives with the storage, being stored to at its lifetime start. At some unspecified point after we load a pointer value, we try to load its provenance from the storage and update our pointer value with it; these are the conditions we'd need to avoid UB in that case.

I think that this blesses LLVM's current behavior on the various weirdo test cases without rendering too much sane code UB.

[1] Not this exactly -- we probably need something about release sequences here [2] I'm being sloppy here with happens before vs simply happens before vs. strongly happens before. I think where I say happens before we mostly actually want simply happens before, but I haven't thought it through all the way.

I'm not 100% sure this could really be formalized because of the operational semantics / axiomatic semantics mismatch in provenance and the memory model, but OTOH we already have issues there anyways.

RalfJung commented 8 months ago

I'm not 100% sure this could really be formalized because of the operational semantics / axiomatic semantics mismatch in provenance and the memory model, but OTOH we already have issues there anyways.

There's also the "there is some evaluation P" quantification which doesn't seem a good fit for formalization -- it seems to require running hypothetical alternative executions. Can this be stated in terms of just the current execution and its memory model graphs?

comex commented 8 months ago

I think the right fix for this is at the language level rather than LLVM though (and that the language should retroactively bless LLVM's behavior); assuming non-escaped locals don't alias random pointers passed to you seems like it must be fundamental.

Is it fundamental, though? With your example, neither GCC nor MSVC performs the problematic optimization, though Clang and ICC do. If I change the relaxed store to an opaque function call, then ICC stops performing the optimization, leaving Clang as the odd one out.

In other words, other compilers seem to apply "non-escaped locals" optimizations only to locals that never escape, as opposed to identifying when each local escapes and applying the optimizations at points where they haven't escaped yet.

(In ICC's case, the relaxed store should be enough to block the optimization, but the fact an opaque call does block it suggests that the relaxed-store case may be a relatively localized bug in ICC, as opposed to proving that ICC follows a general policy of "aliasing can't happen until after the escape", as LLVM currently does. Maybe. ICC is closed source, so I'm only speculating.)

Now to speculate even more loosely:

If I had to guess what kind of scenario benefits most from this kind of optimization, I'd guess something like this: A long function contains a local that's manipulated many times, but mostly by value. However, at some point a pointer to it is passed to another, opaque function. In reality, the second function only accesses the pointer for the duration of the call (i.e. it could be nocapture), but LLVM doesn't know that. Ideally the local could be optimized for the rest of the function, e.g. by storing it in registers.

But even in such a scenario, LLVM's time-based escape analysis only provides a benefit until the first escape. After that, the local must be treated as forever aliased. That is a pretty limited benefit! It doesn't sound so bad to forgo it, especially if other C/C++ compilers already do so.

If this were Rust code, then under most circumstances, the compiler should be allowed to treat the local as non-aliased both before and after the call. The specifics are complicated, and the spec-permitted optimizations are mostly unimplemented, but if they were implemented, I believe they would render time-based escape analysis mostly redundant for Rust code.

davidtgoldblatt commented 8 months ago

There's also the "there is some evaluation P" quantification which doesn't seem a good fit for formalization -- it seems to require running hypothetical alternative executions. Can this be stated in terms of just the current execution and its memory model graphs?

This is just an artifact of the C++ standardese weasel wording we use to avoid being precise about where a memory or synchronization operation happens (see e.g. here ) -- not really meant as a quantification across multiple executions.

efriedma-quic commented 8 months ago

I should say, despite the discussion here, I'm not convinced modifying the standard is the right approach. The discussion here indicates that any modification we make would be confusing. And we don't know what the actual performance penalty is for just following the standard as it's currently written.

But even in such a scenario, LLVM's time-based escape analysis only provides a benefit until the first escape.

Yes. The relevant code is llvm::PointerMayBeCapturedBefore.

If this were Rust code, then under most circumstances, the compiler should be allowed to treat the local as non-aliased both before and after the call.

I don't see how local variables are any better off here. Sending a reference to another thread without a fence would require unsafe code, but it's not clear to me it would be undefined behavior.

I believe they would render time-based escape analysis mostly redundant for Rust code.

You can directly translate the original example to Rust by replacing "malloc" with "Box::new()".

davidtgoldblatt commented 8 months ago

I should say, despite the discussion here, I'm not convinced modifying the standard is the right approach. The discussion here indicates that any modification we make would be confusing. And we don't know what the actual performance penalty is for just following the standard as it's currently written.

Do you mean that in full generality, including the load-buffering-like examples? My impression was that everyone thought that those were too weird / harmful to be supported, while most people thought that the "establish release/acquire synchronization through some opaque function call" ones should remain correct per the standard (and that all we disagreed about was which set this example fell into).

efriedma-quic commented 8 months ago

Do you mean that in full generality, including the load-buffering-like examples?

You mean, specifically, the examples that don't involve any fences (https://github.com/llvm/llvm-project/issues/64188#issuecomment-1998918385)? I hadn't really thought about it before... but if the standard could change something there, it might be helpful. Specifically, to remove the happens-before relationship between loads/stores on the same thread if the pointer operand for one of the operations comes from another thread.

From the perspective of LLVM optimizations, it would be straightforward to make llvm::PointerMayBeCapturedBefore() treat function calls that aren't marked "nosync" as escape points for pointers that eventually escape. That solves the cases involving fences, but not the cases that only involve relaxed load/store.

efriedma-quic commented 8 months ago

From the perspective of LLVM optimizations, it would be straightforward to make llvm::PointerMayBeCapturedBefore() treat function calls that aren't marked "nosync" as escape points for pointers that eventually escape

In particular, we currently don't track whether a function contains any relaxed load/store ops, so if we need to track that, it would make the implementation a bit more complicated. The performance impact is 99% the same either way, though: the primary impact is treating unknown functions as an escape point for pointers that eventually escape.

davidtgoldblatt commented 8 months ago

You mean, specifically, the examples that don't involve any fences (#64188 (comment))? I hadn't really thought about it before... but if the standard could change something there, it might be helpful. Specifically, to remove the happens-before relationship between loads/stores on the same thread if the pointer operand for one of the operations comes from another thread.

I think yeah, we should should change the standard to make LLVM's current behavior allowed in those cases.

I also think, though, that this specific case is more like the ones you describe as not involving any fences than it seems like at first glance. I think the pattern we want to allow is the one you mentioned upthread, discussed in https://reviews.llvm.org/D153932 . In the usages everyone agrees should be allowed, the pattern is:

Publishing thread: P.1. initialize pointee P.2. Establish synchronization with consuming thread (maybe via a fence in some un-analyzable function) P.3. publish pointer

Consuming thread: C.1 Get pointer C.2 Establish synchronization for pointee C.3 Access pointee

In this example, though, the underlying pattern is: C1. Establish synchronization for pointee C2. Get pointer C3. Access pointee (no synchronization established between C2 and C3)

Which is the same root issue in some of the unsynchronized, relaxed-only examples we're looking at elsewhere. I'm OK jettisoning it to make alias analysis (in the sense of "your arguments don't alias storage you allocate") more viable.

In particular, we currently don't track whether a function contains any relaxed load/store ops, so if we need to track that, it would make the implementation a bit more complicated. The performance impact is 99% the same either way, though: the primary impact is treating unknown functions as an escape point for pointers that eventually escape.

It gets subtle fast -- I think you need to consider not just function calls, but things that happen after your function returns, so you can't just infer some nosync-variant in a bottom-up call graph traversal (without giving up on sret/noalias optimizations).

efriedma-quic commented 8 months ago

I was thinking that between the allocation of a variable, and the first atomic operation after the allocation, we could treat treat the variable as "non-escaped" relative to any other memory accesses. That said, in the fenceless variants, I guess it's possible to load the address of a variable before the variable is actually allocated. Which would cause problems for alias analysis.

In this example, though, the underlying pattern is: C1. Establish synchronization for pointee C2. Get pointer C3. Access pointee (no synchronization established between C2 and C3)

Which is the same root issue in some of the unsynchronized, relaxed-only examples we're looking at elsewhere. I'm OK jettisoning it to make alias analysis (in the sense of "your arguments don't alias storage you allocate") more viable.

The problem is that it's hard to establish what, exactly, it means to "get a pointer". For the case where the load and the store are on the same thread, I think we can just forbid loading a pointer before you store it. For the cross-thread cases, the definition gets more complicated.

And even if we do manage to establish a definition, what have we actually accomplished? The handling required for cases like D153932 still basically requires treating unknown functions as escape points.

Permik commented 7 months ago

That said, in the fenceless variants, I guess it's possible to load the address of a variable before the variable is actually allocated. Which would cause problems for alias analysis.

But that would essentially be a volatile access in that case? No provenance, because the pointer was "incorrectly" created and all the bets are off. And if my understanding is correct, heavy de-optimization because you can't really optimize around volatile memory access.

davidtgoldblatt commented 6 months ago

The aforementioned provenance-y C++ paper is here, if anyone is curious: https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2024/p3292r0.html . I think probably to the people on this thread none of the content is new; but perhaps the framing may help focus some thinking.

(I don't really have a fix, but the ideal outcome in my mind is that things like this issue get fixed in the compiler, but things like pointers having escaped before their allocation get fixed in the standard, which I think most people agree with; the thing I'd like to narrow down on is the moral justification for a difference).

JonasOberhauser commented 6 months ago

Really interesting thread. I'm no expert so I hope if I share some thoughts you can point out what I misunderstood.

I was thinking that between the allocation of a variable, and the first atomic operation after the allocation, we could treat the variable as "non-escaped" relative to any other memory accesses.

I liked what you said before more, with the "treat function calls that aren't marked "nosync" as escape points for pointers that eventually escape". Treating the variable as non-escaped sounds a bit too permissive.

The problem is that it's hard to establish what, exactly, it means to "get a pointer". For the case where the load and the store are on the same thread, I think we can just forbid loading a pointer before you store it.

I don't understand this sentence at all. What does it mean to load a pointer before you store it?

For the cross-thread cases, the definition gets more complicated.

You'd probably need a more strict form of a semantic address dependency - a kind of semantic allocation dependency, which determines which allocation to access.

For example,

  int i = load(x); // no allocation dependency, but address dependency
  int *p = load(ptr); // allocation dependency, and address dependency
  p[i] = 5;

If we have this kind of semantic allocation dependency, then we could say that having no hb between an allocation and a read on which an access to that allocation has a semantic allocation dependency, is UB, based on the fact that we shouldn't have known about the allocation in the first place.

Like, if (s-alloc ; alloc^-1) \ hb^-1 is not empty, then we have a "provenance race", where s-alloc are the semantic allocation dependencies and alloc relates an allocation to all accesses to it. (could maybe be defined as alloc = [Allocation] ; (same-loc \ hb ; [Allocation] ; (hb & same-loc)) ; [Access] )

In this case, leaking the pointer ahead of time is forbidden. But publishing the pointer after a fence (maybe in an opaque function) is allowed.

And you'd know that the allocation can't escape "out of thin air", but only through proper synchronization - so if the pointer never "eventually escapes", then unknown functions can't make it escape. They can only add proper synchronization for when the pointer eventually does escape, so your suggestion from before about treating functions that might help to synchronize as escape points for allocations that do eventually escape sounds spot-on.

The point of distinguishing this from address dependencies is that for example a hash table ought to be allowed, even if the key is not synchronized with the allocation. That's fine because the key just tells us which part of the allocation to access, but doesn't tell us which allocation to access. The question is how to define this s-alloc correctly, since the hash might point us to a linked list... So in some sense we would have an allocation dependence on the key, but we would have to exclude that somehow to avoid false positives.

Perhaps the provenance rules can be used here (in the sense that s-alloc = "get-provenance-from") but I don't have it at my fingertips if it can cover all the corner cases.

EDIT: I just saw the comment by @davidtgoldblatt discussing this approach and that it doesn't work because of how the fences create the hb relationship. An ugly solution might be to switch to a different model of hb just for the sake of defining "provenance races" - something along the ppo based models, where sb;[F>rel];sb would give hb' but sb would not - and require that the allocation hb' the load, and the load hb' the access unless the allocation full-on hb the load.

Perhaps better but still ugly, one could define a relation hb[r] for each read r which contains those hb relations that "use r" (i.e., where the proof of the hb relation makes a reference to r at some point), and require that if r --s-alloc--> b, then alloc^-1(b) --hb[r]--> b

Sending a reference to another thread without a fence would require unsafe code, but it's not clear to me it would be undefined behavior.

I am not too familiar with Rust but my assumption would be that if you have a mutable reference during some event e1 and any other reference during some event e2, and the events e1 and e2 aren't ordered by hb (e.g., because there are no fences), then you immediately have UB.

efriedma-quic commented 6 months ago

The problem is that it's hard to establish what, exactly, it means to "get a pointer". For the case where the load and the store are on the same thread, I think we can just forbid loading a pointer before you store it.

I don't understand this sentence at all. What does it mean to load a pointer before you store it?

Basically the example at https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2024/p3292r0.html#early-escape .

efriedma-quic commented 6 months ago

The problem is that it's hard to establish what, exactly, it means to "get a pointer". For the case where the load and the store are on the same thread, I think we can just forbid loading a pointer before you store it.

I don't understand this sentence at all. What does it mean to load a pointer before you store it?

Something like https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2024/p3292r0.html#early-escape

Sending a reference to another thread without a fence would require unsafe code, but it's not clear to me it would be undefined behavior.

I am not too familiar with Rust but my assumption would be that if you have a mutable reference during some event e1 and any other reference during some event e2, and the events e1 and e2 aren't ordered by hb (e.g., because there are no fences), then you immediately have UB.

You don't need a mutable reference in this context, I think; just an UnsafeCell or something like that.

JonasOberhauser commented 6 months ago

The problem is that it's hard to establish what, exactly, it means to "get a pointer". For the case where the load and the store are on the same thread, I think we can just forbid loading a pointer before you store it.

I don't understand this sentence at all. What does it mean to load a pointer before you store it?

Basically the example at https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2024/p3292r0.html#early-escape .

Thanks, now I understand what it means. The example you have shared is even more "crass" that loading the pointer before storing it - it is loading the pointer even before the allocation.

One could make a more precise example with something in this direction:

   p = new int;
   q = y.load(relaxed);
   *q = 1;
    *p = 2;
   ...

With sufficient barriers in the remainder of the code one could ensure that every load &store of the pointer in the program happens after the allocation. But I assume the alias analysis would still allow reordering them, right?

Perhaps one needs something stronger than ordering with the allocation as I thought - if I get provenance from some event, then I need to have ordering with that event. The allocation is just one place I could get the provenance from...

jasoncarr0 commented 6 months ago

I can't say I've read all of this here. I think a release fence as mentioned earlier would suffice. Specifically if the pointer is written atomically with release ordering instead of relaxed, then there is a dependency ordering (on everything but DEC Alpha).

But I guess that's just the issue here that the arbitrary function could have a release ordering store.

JonasOberhauser commented 6 months ago

I can't say I've read all of this here. I think a release fence as mentioned earlier would suffice. Specifically if the pointer is written atomically with release ordering instead of relaxed, then there is a dependency ordering (on everything but DEC Alpha).

But I guess that's just the issue here that the arbitrary function could have a release ordering store.

I guess I was stuck in thinking about what the standard allows. But of course you're right, just because the standard allows having an alias does not mean that it can happen in the implementation, and that's what's relevant for the optimization.

I don't think DEC is a problem in the examples I've seen, since we have load to store dependencies, which IIRC DEC does not reorder. (There might be examples like that too though).

So from a purely technical standpoint, it's maybe fine to say that we just need a kind of release ordering (from a fence or release store) when storing the pointer at a time where otherwise we can assume that the pointer is unique (I think restrict may have the same issues as new).

Nevertheless, morally speaking, it seems more reasonable (and easier to remember) to have a definition that mirrors the way hb works. Something in the direction of allocations need to "happen-before" pointer stores and accesses to the allocation, or you are not storing the provenance/racing the allocation, for some good definition of "happen-before".

I think one of the main things I'm confused about is whether "accidental synchronization" (e. g., with a store release of a flag, rather than a fence) should still provide provenance. Perhaps it makes not too much difference for which optimizations are allowed or not?

jasoncarr0 commented 6 months ago

I think one of the main things I'm confused about is whether "accidental synchronization" (e. g., with a store release of a flag, rather than a fence) should still provide provenance. Perhaps it makes not too much difference for which optimizations are allowed or not?

Well accidental synchronization is rarely accidental. E.g. a shared UnsafeCell guarded by a lock.

This one is a bit weird since it's not quite that pattern, it relies on a dependency ordering given by the fact that we've read the value N rather than N - 1

JonasOberhauser commented 6 months ago

I think one of the main things I'm confused about is whether "accidental synchronization" (e. g., with a store release of a flag, rather than a fence) should still provide provenance. Perhaps it makes not too much difference for which optimizations are allowed or not?

Well accidental synchronization is rarely accidental. E.g. a shared UnsafeCell guarded by a lock.

I should be more precise about what I mean by accidental synchronization. Informally, I mean that the pointer store->load are not synchronized, but in the very end you happen to be synchronized anyways. Or from another view, you are allowed to leak and receive pointers to objects before they are allocated, as long as you later wait through some other means until the allocation finishes.

Formally, I am thinking of something like a

hb(s,l) = hb*;sw(s,l);hb* | hb*;[=s];hb*;[=l];hb*

with

sw(s,l) = ([Release] | sb;[F & Release];sb); [=s] ; release-sequence; rfe; [=l] ; (sb;[F&Acquire];sb | [Acq])

, i.e. a happens-before where the store->load appear somewhere in the proof - if you have this kind of hb, then it is "properly synchronized", otherwise "accidentally synchronized".

In the case of an UnsafeCell protected by a lock, the pattern is

allocate ; store pointer ; unlock() ----> lock() ; load pointer ; access

which would be "properly synchronized" by this definition.

On the other hand,

allocate; lock();unlock(); store pointer relaxed ---> load pointer relaxed ; lock() ; access

(or the example with the flag) would be "accidentally synchronized", since at the time of the access we do have happens-before, but we actually get the pointer without any synchronization. I.o.w. the synchronization has nothing to do with passing the pointer.

From my point of view, only allowing the pattern with "proper synchronization" would theoretically allow more optimizations, since if there was any way to know that foo() does not contain a standalone release fence would mean that we can reoder non-leaked accesses around it.

And indeed, with good LTO or with some hypothetical markings like a 'no_release_fence', you could sometimes know that.

But is that an important optimization in practice?

If "accidental hb" is enough to allow the pattern (as in David's proposal), then it becomes much easier to formalize, but those optimizations would be forbidden - any release operation, including an unlock(), would suffice to prevent reordering across the function.

One issue is that checking "accidental hb" only at the end points forbids reasonable compiler optimizations if we have a full loop and return the pointer back to the original thread, like in one of David's example without fences. But if we check it "every step of the way" then inserting steps, like in David's example with the new intermediate thread that just passes the pointer, can suddenly violate the check and introduce UB, which is also hard to understand:

allocate; lock();unlock(); store pointer relaxed ---> load pointer relaxed ; lock() ; (* check here - accidental HB, fine *) access

becomes

allocate; lock();unlock(); store pointer relaxed ---> load pointer relaxed ; (* check here - no HB, UB! *) store pointer relaxed ---> load pointer relaxed ; lock() ; (* check here - accidental HB, fine *) access

and the additional step introduces UB.

Another option discussed above (if I understood correctly) would be along the lines of checking any hb on final access, but additionally if it is the same thread as the allocation, then the load must be sequenced after a store that stores the same provenance.

I find myself often forgetting about all the examples discussed here and which ones are intended to be allowed or forbidden. I'll try to summarize my current understanding:

option	flag with 2 threads	+1 thread	foo() is relaxed	foo() is release store	foo() includes release fence
only "proper synchronization"	UB	UB	reordering allowed	allowed	forbidden
David's proposal (check any hb, but at every step)	fine	UB	allowed	forbidden	forbidden
check any hb on final access	fine	fine	forbidden (and alias analysis breaks)	forbidden	forbidden
check any hb on final access + need earlier store if on same thread	fine	fine	allowed (?)	forbidden	forbidden

This one is a bit weird since it's not quite that pattern, it relies on a dependency ordering given by the fact that we've read the value N rather than N - 1

I think the dependency is a red herring, since there are also much simpler examples with just a binary flag; and the synchronization actually seems to come from normal release->acquire ordering. I believe it's just that Sung-Hwan make sure to synchronize with the last of these stores to avoid a data race.

JonasOberhauser commented 6 months ago

About restrict, I'm still trying to untangle all the words in the standard definition (looking at C11 now), but I think it may really be a similar situation.

void evil(int * restrict p) {
    int *magic = extern_func();
    *magic = *p; // magic is "based on" p  !!
    y.store(p, std::memory_order_relaxed);

This does not violate "Every other lvalue used to access the value of X shall also have its address based on P. ", does it? I don't see any other clause that would render this UB, but I would assume alias analysis will tell you that magic and p are not the same (EDIT: I checked with David's example and it ends up doing the same optimization as with new).

I think this will make "has been stored before" really really hard to make precise, since before calling evil(p) I could store p somewhere.