rust-lang / unsafe-code-guidelines

Forum for discussion about what unsafe code can and can't do
https://rust-lang.github.io/unsafe-code-guidelines
Apache License 2.0
650 stars 57 forks source link

Expressivity Gap: concurrent ABA-safe stack (atomics + provenance = pain) #480

Open RalfJung opened 9 months ago

RalfJung commented 9 months ago

Alice Ryhl brings up the following data structure: a concurrent stack where all elements can only be popped at once.

Stack declaration:

static STACK_TOP: AtomicPtr<Elem>;

struct Elem {
    next: *mut Elem,
}

Push:

let elem = malloc();
elem.next = STACK_TOP.load(Acquire);
while let Err(new_next) = STACK_TOP.cas(elem.next, elem, AcqRel) {
    elem.next = new_next;
}

Pop all:

let next = STACK_TOP.swap(null);
while !next.is_null() {
    let curr = next;
    next = curr.next;
    use(curr);
    free(curr);
}

This can run into ABA situations in push if, between the load and the cas, the stack gets popped, freed, new elements get pushed, and they re-use the same address. Then STACK_TOP will contain an Elem whose next points to the right element but still uses the provenance of the old element. Adding any code between the load and the cas cannot fix this since the new element might bet allocated after all of that code runs, and from_exposed_addr cannot "grab" a provenance that does not even exist yet.

IMO what should be done here is that next should be declared to have type usize, and one just passes the pointers through exposed_addr/from_exposed_addr when they enter/leave the stack. However some people seem to consider it very important that the stack uses a normal pointer type and can be used in the rest of the code without being aware that there are concurrency shenanigans going on. I am not sure why that is so important, given in particular that I think this will at best be an utter nightmare to specify, but here are some options I have seen or thought of:

  1. I have seen a proposal to fix this for C++ with some sot of operation that "refreshes the provenance of a pointer and every other pointer reachable through it". That operation sounds like a nightmare for program verification since it can change state that is very far away from where that operation happens; we don't currently have any operation that reaches transitively through pointers and I think that kind of locality is important.
  2. Make it so that from_exposed_addr actually works here. For instance, rather than angelically guessing the provenance when from_exposed_addr get called, we could generate a fresh symbolic provenance and gather constraints on how that provenance is used, and then raise UB when the constraints become unsatisfiable. (This is basically the "udi" part of PNVI-ae-udi.) Unfortunately I have no great idea how to do this with a complex aliasing model.
  3. Introduce a kind of provenance that has the following effect: when doing a load through a pointer with such provenance, all provenance that was loaded is refreshed by passing the pointer through |p| from_exposed_addr(p.exposed_addr()), and the loaded provenance also recursively obtains this effect. However, now loads through such pointers cannot be DCE'd, since they have the side-effect of exposing some provenance, so this doesn't seem practical either.

So to me (2) is the only realistic option I have seen so far -- which conveniently would also get rid of angelic choice again, if we can make it work...

Darksonn commented 9 months ago

However some people seem to consider it very important that the stack uses a normal pointer type and can be used in the rest of the code without being aware that there are concurrency shenanigans going on.

The swap establishes a happens-before to anyone who touched any element in the linked list we got from pop_all, so my intuitive feeling is that we should end up with a linked list that isn't involved in concurrency shenanigans anymore.

RalfJung commented 9 months ago

Yeah you don't have to worry about race conditions, I agree.

ABA is still a problem though. Fundamentally the issue is that cas on pointers is a rather weak operation -- it checks that the addresses are still the same, but cannot check the provenance. Load-link / store-conditional could actually give us a better semantics here where the CAS is guaranteed to fail if there was any write between the load and the store, including a write that changed the provenance but not the address...

But without LLSC, all we got is this weak signal that the address didn't change, and that's unfortunately insufficient to avoid ABA issues in the Abstract Machine. The stack is not involved in concurrency shenanigans any more, but it is involved in provenance shenanigans. We can either do something global that has an unbounded effect (hard to reason about, both for program verification and optimizations), or we acknowledge that the stack is still a "strange stack" and indicate that to the compiler each time we access next.

Diggsey commented 9 months ago

I think there is another class of solutions (or at least variations on 1.) which involve making CAS special:

When a CAS operation succeeds, "refresh" the provenance on all pointers with the address involved in the CAS (it doesn't matter if it's the source or destination address since they're the same if the CAS succeeded).

Diggsey commented 9 months ago

Out of curiosity, does a CAS on CHERI take into account the permissions on the pointer?

RalfJung commented 9 months ago

"refresh" the provenance on all pointers with the address involved in the CAS

As in, scan the entire memory of everything and update provenance? That's even worse than an intrinsic to "refresh provenance of everything reachable". I don't consider such non-local actions to be acceptable. They also interact poorly with other parts of Rust, e.g. you might be refreshing (as in, changing) the provenance of pointers stored behind shared references, thus violating the immutability of shared references.

Diggsey commented 9 months ago

Also with (3) it's not obvious to me why the effect needs to be recursive?

Surely the following code is valid if such an erase_provenance() operation existed:

let elem = malloc();
elem.next = erase_provenance(STACK_TOP.load(Acquire));
while let Err(new_next) = STACK_TOP.cas(elem.next, elem, AcqRel) {
    elem.next = erase_provenance(new_next);
}

(Where erase_provenance() creates a pointer whose provenance is "refreshed" on load)

Darksonn commented 9 months ago

Your erase_provenance is essentially what I suggested originally on zulip. The problem is that it needs to be able to choose a provenance that doesn't exist yet (because the call to malloc hasn't happened yet).

RalfJung commented 9 months ago

(Where erase_provenance() creates a pointer whose provenance is "refreshed" on load)

Ah true, it doesn't have to be recursive since we can apply it to all relevant pointers when putting them into the list.

elem itself should also get that treatment though.

Your erase_provenance is essentially what I suggested originally on zulip. The problem is that it needs to be able to choose a provenance that doesn't exist yet (because the call to malloc hasn't happened yet).

It's not quite what you proposed, or at least not what I understood you proposed. The idea with this kind of "deeply erased" pointer is that doing a load through this pointer refreshes the provenance of the pointers it loads. So we never need to make a choice that picks a future allocation.

However, now each load becomes a possible "expose", which kills this proposal immediately -- that's a too big side-effect for a load.

Darksonn commented 9 months ago

True, it's not entirely the same.

Diggsey commented 9 months ago

elem itself should also get that treatment though.

Why would elem need that treatment? Its provenance is never in doubt.

Imagine we had a magic defer_provenance_until_cas() operator which could be applied to the result of a load to split that load into two components:

  1. Load the address
  2. At the next CAS operation, also load the provenance for (1) and apply it everywhere the loaded address was used, in the same atomic operation as the CAS so there is no race.
let elem = malloc();
elem.next = defer_provenance_until_cas(STACK_TOP.load(Acquire));
while let Err(new_next) = STACK_TOP.cas(elem.next, elem, AcqRel) {
    elem.next = defer_provenance_until_cas(new_next);
}

I think this code would clearly be valid without needing any special treatment for elem. I would argue that erase_provenance is equivalent to this magic operator.

However, now each load becomes a possible "expose", which kills this proposal immediately -- that's a too big side-effect for a load.

Perhaps with a small tweak it doesn't need to be an "expose" though:

let elem = malloc();
elem.next = erase_provenance(STACK_TOP.load(Acquire));
loop {
    match STACK_TOP.cas(elem.next, elem, AcqRel) {
        Ok(prev) => {
            expose_addr(prev);
            break;
        }
        Err(new_next) => {
            elem.next = erase_provenance(new_next);
        }
    }
}

Now the expose is an explicit part of the algorithm, so loading a pointer with a "cleared provenance" doesn't require exposing the address of that pointer, it only requires the "from_exposed_addr" part.

Diggsey commented 9 months ago

I have just realised that this does introduce a race condition if the pointer is loaded before the previous address is exposed, so exposing the address would have to be an atomic part of the CAS operation, rather than something done explicitly as part of the algorithm.

I think the above example is still useful to explain what I'm suggesting though.

comex commented 9 months ago

Out of curiosity, does a CAS on CHERI take into account the permissions on the pointer?

On Morello at least, the manual says the capability has to be bitwise identical for CAS to succeed, which I'd assume means the entire 129 bits including address, metadata, and valid bit (tag).

In other words, in practice, this kind of stack should 'just work' on CHERI… as long as you use pointer types. Which makes me skeptical of the idea of recommending usize instead.

However, even on CHERI this would be a problem from an Abstract Machine perspective, since even two bitwise-identical capabilities may not have the same abstract provenance. (If I understand correctly, the original example wouldn't work as-is, because a CHERI allocator isn't supposed to reuse memory before tag-GC clears the valid bits of all dangling pointers. But I think you can encounter a similar problem without involving the allocator?)

Which makes me also skeptical of solutions that rely on expose_addr and from_exposed_addr, whether implicitly or explicitly, because those primitives don't exist on CHERI.

But perhaps the Abstract Machine could add some sort of weaker equivalent of expose_addr/from_exposed_addr that only works for capabilities that are already fully equal from a hardware perspective?

RalfJung commented 8 months ago

Why would elem need that treatment? Its provenance is never in doubt.

I was thinking it's needed in preparation for the next push. But maybe that's not actually the case.

apply it everywhere the loaded address was used

That's a frighteningly global operation. How is "address was used" even defined? It's an integer, it doesn't have provenance, so we can't determine where this value has flowed.

I wonder though... instead of "everywhere it was used", at least for this particular operation we'd actually just have to put it in one place: elem.next. So what if the CAS took a mutable pointer to the "current" value for the comparison, and then it atomically does this

This would ensure that right after a successful CAS, the value in STACK_TOP (which did not change) and the value in elem.next would indeed be fully equal (equal address and equal provenance). So I think it completely fixes this particular algorithm?

However I am concerned that this is fixing a special case instead of a larger problem. That said, the fact that this ensures "after CAS the two values are momentarily equal" does seem like more than just a band-aid.

RalfJung commented 6 months ago

However I am concerned that this is fixing a special case instead of a larger problem. That said, the fact that this ensures "after CAS the two values are momentarily equal" does seem like more than just a band-aid.

I was told there are more examples in https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2021/p1726r5.pdf, but I have not yet taken a closer look.

chorman0773 commented 6 months ago

I have seen a proposal to fix this for C++ with some sot of operation that "refreshes the provenance of a pointer and every other pointer reachable through it". That operation sounds like a nightmare for program verification since it can change state that is very far away from where that operation happens; we don't currently have any operation that reaches transitively through pointers and I think that kind of locality is important.

This sounds like std::launder, which was added in C++17.

RalfJung commented 6 months ago

Reading the docs for launder, I don't see a close relation. The fifth point of the docs say "the result"; I don't know what that means since we are in the process of defining what the result will be, so we can't refer to the result as if it were an already defined object. I would say the docs are ill-defined.

Judging from the examples, all launder can do is "extend" the provenance of a pointer to its surrounding array. In Rust that's entirely unnecessary if we just say that array elements do not have subobject provenance to begin with. No arrays are involved in the example above so I don't think it can be helpful here.

launder does not say anything about changing the provenance of all pointers that can be found by recursively traversing memory, so I think it does not help at all with the question discussed in this issue.

RalfJung commented 6 months ago

So turns out maybe this really isn't ad-hoc at all! We debugged a nasty RwLock concurrency issue and tracked it down to an ABA problem very similar to this one. Except that one needs a slightly more powerful version of this "provenance-transferring compare-exchange": the pointer that should acquire the provenance might be different from the one that we are comparing with, since there might be some bitmasking applied.

fn compare_exchange_transfer_provenance(
    &self,
    current: *mut T,
    new: *mut T,
    provenance_target: *mut *mut T,
    success: Ordering,
    failure: Ordering
) -> Result<*mut T, *mut T>

Would atomically execute:

let old = self.read();
if old == current {
  // Transfer provenance of old value.
  // Doesn't change the actual bytes in memory, so backends can ignore this.
  provenance_target.write(old.with_addr(provenance_target.read().addr()));
  // Store new value.
  self.write(new);
  // Done.
  Ok(old)
} else {
  Err(old)
}

Then the push at the top could be written as

let elem = malloc();
elem.next = STACK_TOP.load(Acquire);
while let Err(new_next) = STACK_TOP.compare_exchange_transfer_provenance(elem.next, elem, addr_of_mut!(elem.next), ...) {
    elem.next = new_next;
}

I guess there are some thorny questions about that extra write we are doing. like, how does it show up in the memory model? What if it races with reads/writes to the same location? We probably want to forbid that, even if the other reads/writes are atomic. You can't synchronize via this provenance write.

gereeter commented 2 months ago

I personally am quite scared by introducing an extra write in concurrent code with unclear semantics. I'm not even sure forbidding races is viable, since for example another thread could gain access to the pointer via a successful compare_exchange_transfer_provenance(..., Ordering::Relaxed), then not have any method to establish happens-before to avoid racing.

As an alternative that entirely avoids this extra write, what if we added a function fn future_provenance<T>(ptr: *mut T) -> *mut T that "magically" (read: non-deterministically that is later constrained by a no-behavior guard or laziness) replaces the provenance of a pointer with the provenance that will be attached to the pointer at the time of a successful compare_exchange? Then the original example would simply be changed to

Push:

let elem = malloc();
elem.next = future_provenance(STACK_TOP.load(Acquire)); // This line changes
while let Err(new_next) = STACK_TOP.cas(elem.next, elem, AcqRel) {
    elem.next = future_provenance(new_next); // As does this
}

Similarly, the lockless queue ABA problem can be solved with let mut state = future_provenance(queue.state.load()); and Err(new) => state = future_provenance(new),.

This should be very easy to implement: before lowering, future_provenance can be viewed as an opaque function, and at the moment of lowering to a machine / language without provenance (e.g. most machine code) it can be translated into a noop. I can't speak about CHERI with confidence, but if its compare_exchange covers provenance bits, I feel like it ought to be possible to do the same there.

There are still some funky details to work out about what this means on the abstract machine when the usage isn't quite the straightforward CAS loop (e.g. what if the pointer is accessed before the CAS or passed to multiple CASes or...), but I think those can be forbidden or made okay fairly easily. For a concrete, executable, and deterministically checkable proposal:

I don't love everything about this proposal, but I think it or something like it has promise: it doesn't involve changing code very much, it's very flexible, it doesn't complicate the memory model, and it has clear implementation strategies.

RalfJung commented 2 months ago

As an alternative that entirely avoids this extra write, what if we added a function fn future_provenance(ptr: mut T) -> mut T that "magically" (read: non-deterministically that is later constrained by a no-behavior guard or laziness) replaces the provenance of a pointer with the provenance that will be attached to the pointer at the time of a successful compare_exchange? Then the original example would simply be changed to

I am very strongly opposed to a semantics that requires predicting the future if less extreme alternatives exist. What you are describing is called "prophecy variables" in the literature, and they can be useful in verification, but they should at best be a last resort in the specification of a language.

One of the reasons why they are so problematic is that they make it entirely impossible to have a tool like Miri. Your notion of what is needed for an "implementation strategy" is clearly quite different from mine. They also need to be carefully constraint or else you end up with "time travel paradoxes", which in this case would surface as an impossibe-to-compile program (namely if the program uses the return value of future_provenance to influence what happens in the compare_exchange -- not sure if that can happen, but the very fact that this provenance can be used at all before the compare_exchange has significant potential for trouble).

chorman0773 commented 2 months ago

Agreed - preserving linearity as a property is something I'd like (and I'd expect any compiler implementor would as well).

As for a paradox, here's one:


let mut x = 0;
let mut rx = &mut x as *mut i32;
let mut px = core::ptr::addr_of_mut!(rx);
*px = core::ptr::addr_of_mut!(px).cast();

let mut p = core::ptr::invalid(px.addr()); // Destroy provenance, but not to worry, we'll get it back later.

let fp = core::ptr::future_provenance_mut(p);

let ap = AtomicPtr::new(fp.read());

let _ = ap.compare_exchange(p, &mut x, Ordering::Relaxed, Ordering::Relaxed); // we know this succeeds because p and fp compare bitwise equal ignoring provenance
gereeter commented 2 months ago

I agree that predicting the future and prophecy is dangerous here, but I think it's a much more manageable case than usual because I can't think of an application where you need to actually observe the future provenance until it's no longer in the future. That's why in my concrete semantics I specified that it is undefined behavior to dereference a pointer with lazy provenance before that provenance is initialized; @chorman0773, your example immediately hits that undefined behavior. I very much want it to be checkable in something like Miri. Perhaps I should have used less time travel-y framing because that is unnecessary; it simply was the direction that I initially approached it from.

chorman0773 commented 2 months ago

Fair enough.

Although, I think the way this will break is with an OOTA set up and a couple more threads.

I agree that predicting the future and prophecy is dangerous here, but I think it's a much more manageable case than usual because I can't think of an application where you need to actually observe the future provenance until it's no longer in the future.

Still, if you use future_provenance first, then allocate something at a particular address, I think you'll find that any optimizing compiler will assume a new allocation won't ever be pointed to by a pointer that was created before it (even if you prove the addresses are the same). In general, the linearity property preserves the fact that you can't get a pointer to an allocation before that allocation becomes live.

RalfJung commented 2 months ago

I very much want it to be checkable in something like Miri. Perhaps I should have used less time travel-y framing because that is unnecessary; it simply was the direction that I initially approached it from.

After reading your comment in more detail, I have to apologize -- my reaction was based on "I don't have much time, but I read the summary and that sounds like a direction I am not happy with". I entirely missed the part of your post where you actually give an operational semantics that realizes the idea (and that is not fully equivalent to actual prophecy). I shouldn't have posted before reading your entire comment.

Okay, so... this introduces a new kind of provenance, and a new kind of global state related to that. But from what I can tell it is entirely agnostic over the underlying provenance model, and in particular it does not require any notion of "union provenance" or so -- it should be compatible with Stacked Borrows, Tree Borrows, everything. That's neat.

The while let loop in dereferenceable does raise some alarm bells: are we sure this loop will always terminate? It would be unfortunate if one could introduce a new kind of infinite loop to a program by setting up a cycle of lazy provenance IDs... so I wonder: in CAS, could we enforce that the provenance we set (current.provenance) is non-lazy? Or would that be an undue limitation?

One new aspect of this is that CAS now also mutates this new global state, the lazy provenance table. So technically lowering this to LLVM requires CAS to also be made an opaque operation, or else LLVM might reorder it with pointer derefs that are guaranteed not to alias the CAS... except unfortunately the deref depends on the lazy provenance table state and so moving it up across the CAS makes it UB.

So, I am not convinced yet that this will all work out. OTOH my proposal also needs LLVM changes so it's not clear either proposal has a clear lead when it comes to "how well can this be compiled to LLVM".

It is amusing how close this is getting to @Darksonn's original question that started this thread -- "can an int2ptr cast grab a provenance that is exposed in the future?"

digama0 commented 2 months ago

The while let loop in dereferenceable does raise some alarm bells: are we sure this loop will always terminate? It would be unfortunate if one could introduce a new kind of infinite loop to a program by setting up a cycle of lazy provenance IDs... so I wonder: in CAS, could we enforce that the provenance we set (current.provenance) is non-lazy? Or would that be an undue limitation?

I half-suspect that you can argue that the lazy provenances are acyclic by induction over one of the acyclic orderings of the memory model (assuming OOTA can be excluded by some TBD mechanism, else the whole memory model is sunk because you can forge private provenances), but it's not really necessary to resolve that issue to avoid this particular infinite loop: a basic cycle detection here (keep a stack of visited nodes and fail if you revisit one) will correctly and completely detect this issue (and this can't run forever because you can only have finitely many provenances at any given point in the program). Presumably dereferencing a lazy provenance cycle is also UB.