Removing allocations is technically unsound in OOM (out-of-memory) situations

RalfJung commented 2 years ago

The following program might be considered to be guaranteed to run out of memory, but LLVM actually compiles it in a way that it does not:

int *x = malloc(SIZE_MAX);
int *y = malloc(SIZE_MAX);
if (x && y) {
  printf("This is unreachable. I could deref a NULL pointer and this program would still be fine.");
}

It is not clear how to formally justify this.

@JakobDegen also has an example for recursion exhausting memory via stack allocations and being similarly optimized away.

JakobDegen commented 2 years ago

I've since improved upon the example:

fn bad(limit: usize) {
    if limit == 0 {
        return;
    }
    let mut x: usize = limit;
    let r = &mut x;
    bad(limit - 1);
    if *r != limit {
        panic!();
    }
    return;
}

fn main() {
    bad(usize::MAX);
    println!("unreachable");
}

The interesting things about it are that 1) it never does anything to inspect the values of the pointers and 2) all of the values in the locals are genuinely live when the recursion bottoms out, so no amount of liveness analysis will get us out of this.

JakobDegen commented 2 years ago

My original idea for a way to solve this was to not require that addresses of memory in different allocations be unique in general, but to magically guarantee that this is indeed the case for any two allocations that have the property of both being live and having at least one exposed (via ptr.expose_addr()) item in their borrow stacks. That runs into this cursed example though:

let p1 = alloca(8);
let p2 = alloca(8);

if abs_diff(p1.addr(), p2.addr()) < 8 {
    // Dead code, but removing it is unsound
    p1.expose_addr();
    p2.expose_addr();
    println!("Unreachable");
}

if abs_diff(p1.addr(), p2.addr()) < 8 {
    println!("Unreachable");
}

dealloc(p1);
dealloc(p2);

This means that even if we are willing to give up unconditional address uniqueness (which I think we have to given the example above), it is still entirely unclear how to make this work

JakobDegen commented 2 years ago

I've written up a proposal for a fix here: https://hackmd.io/@2S4Crel_Q9OwC_vamlwXmw/HkQaMnB49

RalfJung commented 2 years ago

At a high level, it seems like that proposal moves the OOM to when expose_addr is done? That's when OOM errors occur, and hence removing an expose_addr now becomes impossible for the same reason that removing OOM-causing operations was impossible previously.

So, doesn't this example still have the same problem?

let x = malloc(isize::MAX);
let y = malloc(isize::MAX);
let z = malloc(isize::MAX);

if x.expose_addr() != 0 && y.expose_addr() != 0 && z.expose_addr() != 0 {
  unreachable!();
}

free(x);
free(y);
free(z);

It should be possible to remove this expose_addr, since we can see the allocation being removed, so no valid code can ever pick this provenance in a future from_exposed_addr.

RalfJung commented 2 years ago

(After talking with @JakobDegen on Zulip.)

Indeed this doesn't entirely solve the problem of removing dead allocations -- one can still not remove dead allocations that are being exposed. However, that sounds like a big step forwards! Most allocations are not being exposed so removing them can be justified by this model. (And if we keep something like AllocId, we don't need many different "domains" either, just a single per-allocation boolean indicating whether it is part of the "bottom" domain.)

Basically, if we can carry the AllocId around reliably, we don't need "non-overlapping addresses" to figure out which allocation an access is intended to access.

The downside is that expose_addr becomes even more of an optimization barrier. Previously there were some conditions where it could be removed, that seems a lot harder now.

What remains somewhat fuzzy to me is that the address gets picked when the allocation is created, but if it gets exposed then that address may later fail to be disjoint and the program is doomed for OOM. Basically, it seems impossible for a programmer to prove, under this model, that their program does not go OOM. However, these kinds of models usually have to allow any allocation to non-deterministically OOM anyway, so... that doesn't really make things any worse, I think? Certainly some use-cases do need to prove absence of OOM, but they need a different model to begin with. I wonder if there are good ways to relate the two...

Now I wonder: what goes wrong if we just remove the "addresses must be disjoint" part entirely, without replacing it by some condition on expose_addr like @JakobDegen proposed? In from_exposed_addr we just non-deterministically guess which allocation is meant, as before. Shouldn't that still work?

JakobDegen commented 2 years ago

My third example was intended to show that this is potentially problematic:

// May these get overlapping addresses?
let p = alloca(8); *p = 0;
let q = alloca(8); *q = 0;

let a = p.expose_addr();
q.expose_addr();

let x = a as *mut usize;
*x = 5;
dbg!((*p, *q));

Legal behaviors for this program that users definitely expect would seem to be for it to OOM, or for it to print (5, 0). I believe your suggestion also allows it to print (0, 5), which seems extremely surprising (or even (0, 0) if there is any other exposed pointer anywhere else in the AM)

RalfJung commented 2 years ago

That's a fair point. Basically this boils down to, the angelic choice that is magically made should only be allowed to affect UB, not observable behavior.

RalfJung commented 2 years ago

I think I just realized there is a problem with this scheme: it breaks code like PartialEq for slices that has a fast-path for when both slices have the same address and length. Basically that would have to use expose_addr rather than addr, which (a) nobody will get right, and (b) makes optimizations hard, somewhat defeating the point of the fast-path.

JakobDegen commented 2 years ago

I think there are even more problems than that. For example, how does this work with FFI: We don't guarantee that this code

extern "C" { fn foo(a: &u8, b: &u8) }

let x = 0;
let y = 0;
foo(&x, &y);

will pass distinct pointers to FFI. This was not a concern with other situations, since in those cases the compiler had to assume that FFI code does the worst crimes it possibly could, but here the compiler has no such concerns. This can be fixed by saying something akin to: On FFI calls, we find all the pointers reachable from the arguments/globals and .expose_addr() them; this ensures that OOMs happen before the call, and so if the call succeeds everything has disjoint allocations.

This is ugly though, and I would absolutely love to have a model that does it better. I just don't know one

RalfJung commented 2 years ago

I don't see why things would be any different for an FFI function than for a Rust function.

JakobDegen commented 2 years ago

Oh, yeah, nevermind. This probably doesn't work then, which means we probably have to scrap a lot of this model. Really I think the most important thing we need is some way to address the annoying circular reasoning you can do in Example 2. I don't know how to go about it though.

JakobDegen commented 2 years ago

I thought about this some more, and can make the issue somewhat more concrete. For slightly weird reasons, I do think this is actually an issue specifically with external functions (such as FFI). Consider

fn bar() {
    foo(&1);
}

#[inline(never)]
fn foo(a: &i32) {
    a.expose_addr();
}

A naive implementation of the Rust AM under my model would require us to include a check in foo to ensure that the addresses of a does not overlap with any other exposed allocation. Obviously this is borderline impossible to practically implement. However, a compiler could be smarter. It could instead do things in two stages:

In the first stage (roughly resembling MIR) that check is implicit as a part of the .expose_addr(). We don't worry about how to implement this in actual hardware, we just pretend we can. This allows us to do all the optimizations that are enabled by this memory model.
In the second stage (roughly corresponding to asm) that check is no longer implicit. However, now our IR instead includes a promise that all allocations are created in the exposed address domain. This means that when we lower to this IR, we no longer have any need to include the check - it would come back false anyway. Of course, by doing this, we now once more have only finite memory and in general removing allocations at this stage is unsound - but that's ok, we've already gotten a chance to do our optimizations.

This kind of model almost works, except when you go and make foo pub. Now you're in trouble, because you can't expect that whatever will call foo, which may be arbitrary Rust code, will uphold the expectation of the lowered IR. In other words, when all the code we cared about was still under the control of this compilation session, our compiler could do smart tricks to avoid having to implement the check. Now that that is no longer the case, we're in trouble.

I don't see a way to fix this. As I said above, at this point I'm mostly just trying to think of a brand new idea.

digama0 commented 2 years ago

I think this is why I was saying that it needs to be a compiler obligation that the OOM case of expose_addr is never actually codegenned. In the case of foo, I believe this means that we would need to mark the function as potentially-exposing (or actually-exposing if the body is visible), and in general for FFI calls we can't know whether addresses are exposed and must assume that they are.

We can therefore lift the exposure into bar, i.e. replace it with let x = &1; x.expose_addr(); foo(x). This is pessimistic since we might be prematurely exposing x and also the exposure could be conditional in foo if it is opaque, but since we are permitted to spuriously OOM, this is a legal transformation.

You can therefore optimize bar using the same two-stage approach - it's just that all calls to previously optimized and opaque functions are treated as exposing everything that is reachable. In short, I don't think FFI causes any additional issues.

JakobDegen commented 2 years ago

You could certainly have such a requirement for your compiler, but I don't see how we can in general enforce this requirement on other compilers - if we try and write a rule that says they must obey it, then we essentially once more end up with this being a part of the AM, which has problems as we've seen

digama0 commented 2 years ago

I'm not sure I follow. The model has all code which is linked together cooperating under the rules of the Rust AM, which is perhaps a little self-centered but should be fine. Different compilers communicating across FFI just need to make pessimistic assumptions about the other side - in this case, that FFI calls require all allocations to already be exposed when they are passed in, meaning that the caller should call expose_addr if necessary first. These calls are essentially fictional, in that they will be optimized away during your stage-2 compilation. For compilers that assume the allocator always gives disjoint ranges, they are already in stage-2 from the get-go so there is no special handling needed.

JakobDegen commented 2 years ago

Different compilers communicating across FFI just need to make pessimistic assumptions about the other side

In what sense is this a pessimistic assumption? Of course the compiler has to assume that the other side may emit .expose_addr(), but under this model that is not it's problem. Now you suggest adding a rule that says that the OOM on .expose_addr() must never happen, but making this promise to the user is imo not an option.

Now if we instead make a rule that says "you must .expose_addr() before FFI calls on everything reachable from that call" we do turn the .expose_addr() into our responsibility and prevent the other side from having to possibly codegen the OOM on .expose_addr(). The problem with this rule is this though: When you say "FFI," I assume you mean "across compilation units," since that is as far as the compiler can see. Unfortunately, this means that if we inline any code across compilation units (which rustc does today even in non-xLTO situations) we now have to add an .expose_addr() in the middle, which is obviously bad.

digama0 commented 2 years ago

I mean the second thing: "you (the downstream compiler) must .expose_addr() before opaque function calls", i.e. the extern function has an implicit precondition that it receives only exposed allocations, which justifies the upstream compiler's decision to not emit any OOM calls in the function if it does call .expose_addr(). (Obviously the intent here is that this obligation gets pushed all the way up so that no one needs any OOM calls at all and we have effectively achieved what you say in the first paragraph: the compiler has ensured that expose_addr never results in an OOM call.)

If you add the expose_addr in the middle and then optimize with e.g. LTO, then yes you will have to keep the expose. If you can inline in stage-1 though you can avoid it. I think a lot of inline functions in rustc are stage-1 compilation artifacts anyway (i.e. MIR) so they would not hit this issue.

JakobDegen commented 2 years ago

I don't see how this works. When you talk about this sort of obligation, who is this obligation to, where is it written down, and in terms of what is it defined?

It seems like you're suggesting that this obligation is just in the language itself and is defined in the spec, but the spec has no notion of stage 1 or stage 2. The stages are a useful distinction when talking about the internal reasoning a compiler can do, I don't think we can add a rule in terms of them.

digama0 commented 2 years ago

The AM would just have a rule about overall behavior, and passes right through any FFI calls. The compiler produces a number of compilation artifacts like .rlib files, and these can have additional compiler-specific invariants on them, as long as the final linked executable satisfies the AM rules. In this case, the obligation to expose allocations on stage 2 artifacts and not on stage 1 artifacts is a property of these compiler-specific intermediate files, as part of its implementation of linking pursuant to the rules of the AM.

The linking rules are still somewhat under the control of the compiler, since it is only the linked executable that has AM dynamic semantics. So for instance the exact repr(Rust) calling convention does not need to be part of the spec, even though it is important for linking .rlibs together. The compiler needs to ensure that when it links everything together, the composite artifact does not violate the AM rules, but that does not pose any problem to inlining across compilation units if it has stage 1 compilation artifacts which contain information about whether they assume their inputs are exposed in the compiler metadata.

JakobDegen commented 2 years ago

I mean, this is how it works right now (more or less), but that's not really enough. For example, this would make it unsound for xLTO to remove dead locals.

More generally though, the problem is that it's not possible to implement the AM without imposing additional rules. Right now, in order for two Rust implementations to be able to link to each other, the only things they need to agree on are the AM parameters - layout, calling convention, things like that. But with this model in place, even with all AM parameters agreed upon, the side trying to compile foo is in trouble because it cannot actually implement foo with the full generality required by the AM. Instead, it has to implement a reduced version of foo and then nicely ask the thing it's linking to to please not do things that would make the non-compliant implementation observable. I don't think that's a direction we want to go in - the AM should be "trivially" implementable, without the need to impose additional global rules.

comex commented 2 years ago

I don't see the problem.

At the actual assembly level, "all addresses passed across [standard-calling-convention] function call boundaries must be exposed" is naturally true. After all, all objects that still exist at the assembly level must have unique addresses that don't overlap with other simultaneously-live objects.

Meanwhile, the LLVM IR level is at least one place where the compiler currently removes unused allocations. (I think it's currently the only place where it does so.) Assuming we want the compiler to continue to be able to perform that optimization, "all addresses passed across calls must be exposed" must be false.

So why not posit that, in the current implementation, LLVM IR is stage 1 and assembly is stage 2? Then the 'additional rule' only applies to assembly, where it's automatically satisfied. And LTO is no longer a special case, since that just involves LLVM's optimizer performing the same optimizations across compilation units that it normally performs within a single compilation unit, which are presumptively valid.

JakobDegen commented 2 years ago

Let me try and present this a different way. Say that I am a Rust compiler looking to emit a static library corresponding to the following Rust crate:

pub extern "C" fn foo(a: &i32) {
    (a as *const i32).expose_addr();
}

Specifically, the goal is for me to emit code that has behavior identical to that of the abstract machine - that is what means to compile Rust after all. But that is not possible. In order to compile this, I must make additional requests on top of the rules associated with the language.

digama0 commented 2 years ago

Such a rust compiler could emit two versions of the function:

A stage 1 artifact in an internal format, like an .rlib record for an #[inline] library function. This would be something like MIR, and would include explicit information suitable for interprocedural analysis of expose_addr calls. In this case, the call would be marked as "a must be exposed" because the function calls expose_addr on a, so it would effectively be part of the (internal and unstable) calling convention that a call to foo requires that a be exposed. The code itself does nothing.
A stage 2 artifact because it is an extern "C" function which can be linked to e.g. C compiler output. This has a stable and documented calling convention that requires that all passed parameters are exposed. The code itself does nothing.

Compare this to the following function:

pub extern "C" fn bar(a: &i32) {
    (a as *const i32).addr();
}

The stage 1 artifact now contains metadata saying that a does not need to be exposed before calling the function, which forms part of the (internal and unstable) calling convention. The code itself does nothing.
The stage 2 artifact is unchanged: a must still be exposed before calling the function in order to conform to the stable extern "C" calling convention.

Of course, the compiler can choose to only emit stage 2 artifacts if it wants, or to not include this extra bit of interprocedural analysis data in the stage 1 artifact and always require exposure. In either case the loss is that inlining across codegen units may result in unnecessary exposure of locals.

In fact, the stage 1 artifact need not contain an explicit annotation with the exposure analysis. If it is an #[inline] function, the compiler is free to just inline the whole thing and observe that no call to expose_addr is made, regardless of the actual calling convention. (Alternatively stated, the real calling convention of such an artifact is the "weakest precondition" of the contained code, and the type signature of the function is simply a sound underapproximation of this predicate. Inlining the code is making use of the weaker precondition directly.)

digama0 commented 2 years ago

Regarding Ralf's point about PartialEq on slices, I think we already have the necessary API to do the right thing here: ptr::guaranteed_eq can be specced to always return false for addresses in different domains. That way if the slice pointers are compared using guaranteed_eq, if the compiler can prove they are from different domains it can fold it to false, if it can prove they are the same domain then it can fold it to p1.addr() == p2.addr(), and otherwise it can reduce it to p1.expose_addr() == p2.expose_addr(). In any case this is a valid refinement of the nondeterminism.

JakobDegen commented 2 years ago

in order to conform to the stable extern "C" calling convention

Where is this rule coming from? If "extern "C' requires exposed pointers is just a language rule, then we can't inline the following without adding a .expose_addr()

extern "C" fn bar(a: &i32) -> i32 {
    *a
}

fn foo() {
    bar(&2);
}

In any case, I don't think I'm making my point clear here, so I'll probably stop arguing it. Maybe someone else can explain more elegantly than me.

digama0 commented 2 years ago

(I'm using "calling convention" in a somewhat loose way to also include all the machine preconditions for soundness, not just data layout. I don't think we've miscommunicated on this point but it seems worth calling out.) Let me elaborate on the weakest precondition thing I mentioned in the previous post.

The weakest precondition of a piece of code C is the least you need to assume in order to ensure that the code does not have UB and some postcondition Q holds afterwards. This predicate is defined so as to make this property a tautology: If before running C we know property WP(C, Q) := "running C in this state will not cause UB and will result in a state satisfying Q", then of course we will not cause UB and result in a state satisfying Q. Normally we would assume something stronger and less tautological, like "these variables exist in memory at these locations and types", although compilers will generally always assume as a precondition that running the code will not lead to UB.

Relating this to the present case, if we have a non-opaque function (i.e. we have code for it), then there are essentially two ways call it: we can actually emit a call instruction, or we can inline the code. The minimum requirements we need to satisfy the AM execution we are after is the weakest precondition of the contained code, while the construction of the function as a callable thing has ensured that the function's calling convention is a stronger than this weakest precondition, so calling the function also satisfies the AM requirements as long as the calling convention is upheld in the call.

So if we find a call to an extern "C" function, then that means that we have to uphold the calling convention if we call the function, which means exposing the address. But importantly, exposing the address is not an operation that happens as a result of the call, it is only a convention between the caller and callee when they interface using an explicit call. If you don't do a call, the calling convention doesn't matter, the contained code matters, and if there isn't an expose in the inlined code then the obligation to expose the address disappears.

in order to conform to the stable extern "C" calling convention

Where is this rule coming from? If "extern "C' requires exposed pointers is just a language rule, then we can't inline the following without adding a .expose_addr()

This is all to say that "extern "C' requires exposed pointers" is a language rule, but it is also not the case that we must add an expose_addr in that example code. We only need to expose the address if we actually call bar as an extern C function, not if we inline it.

RalfJung commented 2 years ago

@digama0

Regarding Ralf's point about PartialEq on slices, I think we already have the necessary API to do the right thing here: ptr::guaranteed_eq can be specced to always return false for addresses in different domains.

I am not saying this cannot be fixed, but I think it is too much of a footgun.

@comex

Meanwhile, the LLVM IR level is at least one place where the compiler currently removes unused allocations. (I think it's currently the only place where it does so.)

Don't we ever remove unused MIR locals during MIR transformation? I would think we do. That is removing an unused allocation.

@JakobDegen

However, a compiler could be smarter. It could instead do things in two stages:

The systematic way to think about this is that the memory model changes when translating (for example) from LLVM IR to whatever the next IR is (I heard terms like SelectionDAG but I don't actually know the LLVM pipeline).

I don't quite see the FFI problem here -- when doing FFI you have to declare on which level you are doing the linking (are you linking two programs in LLVM IR, or in assembly), and the answers to various questions depend on that but it should all be coherent from what I can tell.

The most tricky bit is how exactly to formalize the condition that when you call expose_addr, the address must not conflict with any other exposed allocation. I think this needs some funky concepts like what this paper calls "no behavior" (NB), which in turn requires care while doing the proof when translating from this memory model to the one where allocations never overlap.

Basically:

when doing an allocation, non-deterministically (daemonically) pick an address; it can arbitrarily overlap with everything else (there also is always the option to fail with OOM, again via daemonic non-determinism)
in expose_addr, if this allocation overlaps with a previously exposed allocation, trigger NB

NB means it is the compiler's responsibility to prove that this never happens -- this is the exact opposite of UB. When arguing for correctness of the translation to a memory model without overlapping allocations, where usually we can assume that the source program has no UB, here we have to prove that the source program has no NB. But this is easy; in this translation we pick addresses to never overlap (we trigger OOM if they would overlap) so we are good.

JakobDegen commented 2 years ago

Don't we ever remove unused MIR locals during MIR transformation? I would think we do. That is removing an unused allocation.

We definitely do, in SimplifyLocals.

Wrt everything else, @RalfJung and @digama0 : I'm still not getting my point across properly, since I agree with everything that you said but still believe there is a problem. So let me instead try getting closer to a point where I can explain what is going on by asking a question. Consider, these two snippets:

Version Easy

extern "C" {
    fn foo_a(i32) -> i32;
}

pub fn bar_a() {
    foo_a(5);
}

Version Hard

extern "C" {
    fn foo_b(*mut i32);
}

pub fn bar_b() {
    let mut x = 5;
    let p: *mut i32 = &mut x as *mut _;
    foo_b(p);
}

There has already been quite a bit of discussion about FFI that asked questions of the form "what are the obligations that foo_a and foo_b must uphold?" The answer to this has been that - to the extend observable by the AM - they must have the behavior of some non-UB Rust function. That's all well and good, and I am happy with that requirement.

I'm instead talking about "what are the obligations that the Rust implementation must uphold wrt these function calls?" This is tricky, because we cannot just do this define this "within the AM" the way we define what let a = 5 + 10; does. After all, the extern call very much has to "leave the AM," because it interacts with non-abstract things on the hardware/OS (which we must now acknowledge exists).

How does the language around the answer to the previous question look in the spec? Below I'm going to describe a way that I imagine this could be defined, but you are free (and encouraged!) to completely ignore my description if it makes answering this question easier. To emphasize: I am by no means advocating for what I wrote below. I am only trying to explain the model that I have in my head.

How I thought this could work

One way I believe this could be defined is by defining a map of AM states to what I'll call "hardware states." The best way I think is to explain this by giving an example of the language that defines the behavior of the `foo_a(5)` function call. It might look something like this: When evaluating a call, if the target of that function call has the type of a function declared in an `extern` block, then, for each argument value: 1. We find the "location" where that value should be placed, as required by the ABI - this is either a register, or some location in hardware memory. 2. At this location, we perform a `store` of the argument value, mapping AM bytes to hardware bytes as follows: Int(val) => val, PtrFragment(val, tag) => val, Uninit => unspecified. 3. We call the `extern "C"` function, again as specified by the ABI. This is well and good for `foo_a(5)`, but it is insufficient for `foo_b(p)`. That's because we additionally need to guarantee that if the `extern "C"` function goes and dereferences the pointer it receives, it finds `5`. The easiest and most aggressive way we could fix this is by adding a clause like the following: 4a. Additionally, for every currently live AM allocation, the contents of that allocation are stored to the hardware memory with the same address. This is enough to guarantee that dereferencing the pointer has the desired behavior. Unfortunately though, it brings back the problem of "removing unused allocations is unsound," because now this is being miscompiled: #### Snippet D ```rust fn bad(limit: usize) { if limit == 0 { println!("Bottomed out"); // Pretend it's just an `extern` function return; } let mut x: usize = limit; let r = &mut x; bad(limit - 1); if *r != limit { panic!(); } return; } fn main() { bad(usize::MAX); println!("unreachable"); } ``` The cause of this is obviously that the phrase "for every currently live AM allocation" is too broad; most of these allocations can't be accessed by the extern function after all. So we could instead use a clause like this: 4b. For every particular byte in AM memory: If there exists an execution of the `extern` function which could read this byte, it is written to hardware memory at the same address. This kind of clause is [what I meant above](https://github.com/rust-lang/unsafe-code-guidelines/issues/328#issuecomment-1104466822) by > On FFI calls, we find all the pointers reachable from the arguments/globals and `.expose_addr()` them This recovers the soundness of the optimization in Snippet D, because the allocations do not have to be spilled and so we do not have to deal with finite memory. However, now that I've written all this out, this last version actually doesn't sound so bad. It's mostly workable - `extern "Rust"` is not affected as long as we don't ever give it a stable calling convention (and so can just say that the call happens in the AM, even across compilation units). The only critical point that I see with it is that with this version, removing allocations during LTO is unsound. That seems unfortunate, but maybe workable?

digama0 commented 2 years ago

I'm instead talking about "what are the obligations that the Rust implementation must uphold wrt these function calls?" This is tricky, because we cannot just do this define this "within the AM" the way we define what let a = 5 + 10; does. After all, the extern call very much has to "leave the AM," because it interacts with non-abstract things on the hardware/OS (which we must now acknowledge exists).

The shorthand I've been using here is that the answer to this question is that the rust implementation must ensure that the arguments are loaded according to the calling convention, so for example extern "C" functions must follow (our interpretation of) the native C calling convention for the platform.

Now this doesn't really answer the important questions, it just re-frames it as "what is the C calling convention according to the rust spec?". I'm not entirely sure if this needs to be a part of the spec proper, but it would appear if we consider the specification of what e.g. rustc would do to turn a bunch of .rs files into a .so/.dll shared library then certainly the details of the C calling convention would have to appear when explaining what is in those library files and how those functions behave when called.

I think your explanation about mapping AM states to hardware states is on point, although I would point out that it's a function not a relation (which is how things like "unspecified" work), but also (and possibly more importantly) this relation only needs to concern itself with things local to the call itself - your rule 4a is too broad. Here's a sketch of such a definition:

When an extern "C" fn(T1, ..., Tn) -> R function is called with AM arguments (v1, ..., vn):

For each i, the value vi of type Ti must be placed in registers [RDI, RSI, ...] (insert register convention here)
For i >= 6 the value vi of type Ti must be placed at architecture memory [RSP+8(i-6)] with argument convention
Furthermore, every global with external linkage must be reified

The way a value v of type T is placed in register R depends on T:

An integral value is stored in native endian according to processor conventions (insert ISA details here)
A pointer value Ptr(val, tag) is stored the same way as the integral value val would be, but tag must be reified in memory (see below)
Memory of less than sizeof(R) bytes can also be passed in a register by reifying the memory to a byte sequence and zero extending the result.

A value v of type T is stored with argument convention at address [a..a+8]:

If sizeof(T) <= 8 then v is reified to byte sequence [a..a+sizeof(T)]
otherwise the [a..a+8] contains a pointer value b and v is reified to byte sequence [b..b+sizeof(T)]

A tag tag is said to be reified in memory if the following conditions hold:

An invalid or expired tag is vacuously reified
If tag is valid and has access to the AM allocation M[a..b] in address domain d, then tag is reified in memory if d = bot (alternatively: d's 'exposed' flag is set in the AM) and M[a..b] is reified to some byte sequence v.

A sequence of AM bytes M[a..b] is reified to a byte sequence v if the following holds:

The range a..b in the architecture is allocated
v is the contents of architecture memory a..b
For each byte in a..b, the byte M[a+i] is reified to v[i]

An AM byte x is reified to a machine byte y when:

If x = Int(val) then y = val
If x = PtrFragment(val, tag) then y = val and tag is reified
If x = Uninit then true (y has any value)

The definition of reification here is intended to capture the idea of transitive reachability of an allocation by following tagged pointers starting from the arguments and the globals, each of which gives access to some allocation, which might contain more tags in it which must also be reified and so on.

Note that this is the calling convention: it only pertains to extern "C" functions, other calling conventions can have different rules. Moreover, the compiler can still make assumptions beyond this if it can still abide by the as-if rule: if we know additional things about the callees (for example if we know they come from a finite enumerated list such as during LTO), then we do not have to strictly adhere to the calling convention as long as the observable behavior is the same as if we did. (For example, if none of the potential callees access a particular global then it is okay to not reify that global even though the calling convention says it is required.)

JakobDegen commented 2 years ago

I'm glad that we agree on much of this.

I would like to focus on one point though: What you describe as "reification" is not a state, rather it is an operation. Specifically, the phrasing of this:

but tag must be reified in memory (see below)

Should be changed to

and tag is reified in memory (see below)

Reification being an operation rather than a state is critically important, because otherwise the phrase above is not representative of an operational semantics, but rather of a "magic promise" of the same kind as I keep referring to in Snippet B.

But as soon as we accept that reification is an operation, we run into all the problems that I was talking about. Specifically, I believe the first half of your paragraph is all correct:

Note that this is the calling convention: it only pertains to extern "C" functions, other calling conventions can have different rules. Moreover, the compiler can still make assumptions beyond this if it can still abide by the as-if rule: if we know additional things about the callees (for example if we know they come from a finite enumerated list such as during LTO), then we do not have to strictly adhere to the calling convention as long as the observable behavior is the same as if we did.

But then this example is not:

(For example, if none of the potential callees access a particular global then it is okay to not reify that global even though the calling convention says it is required.)

Reification is an operation with a global side effect - that is, after all, the entire point of this issue. Consider:

fn extern "C" nop(p: *mut u8) {}

static Foo: u16 = 0;

fn bar() {
    let p = alloca(usize::MAX - 1); // or `malloc(usize::MAX - 1)`
    nop(p);
}

If we guarantee reification of Foo and p at the nop call site, then this program must OOM. Removing the nop(p); call is unsound, because it also removes the reification, which means that it is permitted to no longer OOM.

More generally, if we have a finite bottom/exposed address space, then any operation that interacts with the amount of that address space that is available (for example by reducing it) has an observable side-effect.

Of course, this optimization not being allowed is really bad. But I also don't see how you can recover this without making compromises on the "reification is an operation" part.

Dante-Broggi commented 2 years ago

So, the way I see this being resolved is that there is actually both extern "C-reify" and extern "C-no-reify", and the compiler can choose per call which to lower extern "C" to. However, in order to implement the AM it is necessary for calls which cannot see the actually called MIR to use extern "C-reify", in case the AM function implementation uses expose_addr / from_exposed_addr.

Thus, reification is an operation, but it is only necessary when extern "C-reify" is used, which is when expose_addr / from_exposed_addr might be called.

JakobDegen commented 2 years ago

I mean, clearly it can't just choose which to lower it to without restriction, since then user's can't write correct code that calls foreign functions. Can you be more specific about what you imagine the rules will actually be?

Dante-Broggi commented 2 years ago

For calls where the compilation of the caller cannot see the exact MIR implementing the function, the compiler must use extern "C-reify" to call the function because the MIR might contain provenance-relevant operations. Then, if the function uses provenance-relevant operations, the compiler must use extern "C-reify" for the same reason. Otherwise, the compiler can use extern "C-no-reify" to call the function.

Finally, since both extern "C-reify" and extern "C-no-reify" lower to the same machine code for the callee, there is no codegen difference for having 2 applicable calling conventions.

JakobDegen commented 2 years ago

So, first of all note that PGO is still unsound under this suggestion. That being said though, what are "provenance relevant operations." Are those all derefs? Does that mean that if I put a *p; into nop above inlining it is once more unsound?

digama0 commented 2 years ago

Okay, this is good, I think we're getting to the heart of the disagreement because I'm not willing to admit the claim that reification is an operation. I like Ralf's terminology of "no behavior" (NB) and will use it here: the calling convention requires that the compiler arrange for certain facts to be true about the state at the time of the call. It does not magically change the state to make the facts true, and if those constraints are impossible to satisfy then the call is NB.

That's what is happening in your bar() example. The calling convention requires that you have to be in a state where p is a pointer to allocated memory of size usize::MAX - 1, and also Bar must be in memory, and since it is 2 bytes this is impossible; therefore the compiler is required not to emit a call to nop(), because the calling convention of the function can't be satisfied. Calling nop() is NB.

However, this function has a perfectly well defined behavior on the AM, and main() does nothing, so the compiler is permitted to optimize the entire function to an empty body. Since a function with an empty body doesn't call nop(), NB does not occur.

Generally, NB is a tricky thing to work with, because if you put impossible requirements on the compiler then nothing will be conforming, which is an undesirable situation. The way out is to make sure that each bit of NB is balanced out with the appropriate amount of compiler-chosen nondeterminism so that the compiler can choose a path that avoids the NB. In this case, we can say that the compiler is permitted to give an allocation error when allocating p, as well as a linker error when allocating Foo, so these are potential options if it wants to avoid calling nop() but has not inlined the call away.

Non-extern functions with a calling convention are an interesting situation here because it is not observable whether we actually called that function specifically: we could call any other function with equivalent AM semantics. That is, we could take the body of nop(), out-line it to another function nop2() with a different calling convention, and then inline nop() into all callers. If nop() is not public we can then delete it because it is now unused. Effectively this allows us to change the calling convention for all local callers and so drop the reification requirement; each of these changes is legal per the AM semantics. If nop() was pub we could still do this, but we would need to keep it around for external callers; those callers would be obligated to follow the calling convention if they emit a call to nop(), but if they had enough visibility they would also have the option to inline it and call nop2() instead.

JakobDegen commented 2 years ago

So I have two issues with what is written above. I'm intentionally going to avoid discussing one of them in favor of the other (I think the first can be better resolved later).

Even if we accept that reification is a state, I claim that I (as a user) can still prove that this code may not print "unreachable":

fn extern "C" nop(p: *mut u8) {
    println!("Unreachable");
}

static Foo: u16 = 0;

fn bar() {
    let p = alloca(usize::MAX - 1); // or `malloc(usize::MAX - 1)`
    nop(p);
}

Specifically, the spec guarantees that, at them moment we enter the nop function, we are in a reified state with respect to the Foo and p allocations. This is not possible with finite memory, hence this is NB. Therefore, the println can never be reached.

This is obviously bad. I would like to resolve this by resorting to the following: The spec defines a notion of "truly foreign functions" (TFFs). TFFs contain at least 1) extern functions for which there is no LTO data or 2) inline asm. We only promise reification on calls to TFFs, and not on any other calls, regardless of the ABI they have.

digama0 commented 2 years ago

So the way this breaks down is as follows:

No, you can't prove that the "Unreachable" cannot be reached. This program is well defined in the AM (which doesn't care about calling conventions as it models function calls "directly" with value passing) and can print "Unreachable".
Instead, you can prove that if "Unreachable" is reached then the function must not have been called "normally", using a regular function call with C calling convention. This is because the C calling convention imposes impossible constraints on such a call.
Contrapositively, if the compiler decides (either because it didn't want to inline or because this is spread across separate compilation units) that it will do a function call with C calling convention, then it must be forced to exit early with allocation failure, and the function call will never actually happen.

Specifically, the spec guarantees that, at them moment we enter the nop function, we are in a reified state with respect to the Foo and p allocations. This is not possible with finite memory, hence this is NB. Therefore, the println can never be reached.

The calling convention only applies to actual calls at the architectural level, not function calls in the AM. There is no promise that we are in a reified state when we enter the nop function, any more than we promise that the values of specific registers are anything in particular. The promise only applies at the function call boundary of a real call associated to a CALL instruction. There are plenty of ways a compiler can "call functions" without using the CALL instruction; this is the whole point of inlining.

RalfJung commented 2 years ago

I claim that I (as a user) can still prove that this code may not print "unreachable":

I don't think you can. Everything you wrote happens inside the Rust AM, using the Rust memory model, so (assuming the model proposed in this thread is adopted), allocations can overlap in memory.

(So, I agree with @digama0.)

digama0 commented 2 years ago

I want to ask a question here, which could be a separate issue but it is depends on the assumption that we are adopting this model so I'll put it here.

Which statics need to be reified at the time of a call? Jakob's example relies on a private static being visible, and in the example well formed rust code could in fact access that static since it's in the same module. What about code from another module, or another crate? We could say that if it is an extern "C" { fn foo(); } with the body of foo on the other side of FFI / in another crate, then it should respect the visibility constraints of the code, meaning that any private static variables don't have to exist in memory at any known address.

It seems a bit odd to have visibility involved in the calling convention like that, but I suppose there is some precedent for private functions not getting linked since they do not have any external linker symbols attached to them so they are "unobservable".

RalfJung commented 2 years ago

Which statics need to be reified at the time of a call?

I think this question applies even without this model, when considering things like inline assembly. For which statics do we guarantee that the inline assembly sees a state which represents the AM state at the same moment?

I think the answer is as for everything else -- for all statics which Rust code could access at this point. If the inline assembly "legitimately" got access to a static, it can use it. If there is some module whose contents we don't (entirely) know, it might leak the address of its static, so that also has to be reified. But if there is a module that we can "see" entirely and it definitely keeps the address of its static private, then it does not have to be reified. Visibility is involved in the sense that public states are obviously not kept private, but even private statics can be "leaked" so visibility is not sufficient.

digama0 commented 2 years ago

Note that with the rules for reification that I gave earlier, it already covers the case of "leaking" statics: if you pass a pointer to a static or the static is otherwise transitively reachable from the parameters or public statics, then it must be reified in memory anyway. So it sounds like it would suffice to only directly require reification of public (or otherwise visible to the callee) statics.

I think you also need a rule that any static that has been previously exposed must still be exposed and moreover must have its value updated to reflect the latest information from the AM at that allocation, even if that static is not being transitively passed during the current call.

RalfJung commented 2 years ago

FWIW I would think all these rules are just internal details of the refinement proof between Rust and assembly. They shouldn't have to be spelled out and are subject to change as Rust evolves and might adjust the way it maps the AM to real hardware.

digama0 commented 2 years ago

I disagree; there are variations on this kind of thing for extern "Rust" calls that are arguably just internal details that don't have to be specced, but if the "final product" is not just a complete executable but rather something with an externally visible FFI interface, then the spec has to say what that code blob "means" even though there is no final executable to do AM simulation on.

Certainly the details are subject to change (if the ABI of the platform itself changes, or our understanding of it changes, or the AM model changes), but this is not an internal implementation choice alone because the generated rust code has to play well with other things. It would be as stable as anything else in the spec.

To make this more concrete: we should be able to conclude from the spec alone that a conforming rust implementation will correctly link the code in the libc crate to the platform's libc implementation and will not commit any type errors across the function call boundary. If the details of extern "C" fn are not spelled out in the spec, then a conforming rust implementation could choose to use a different calling convention for such calls, and then they would not correctly link with, say, glibc. Under such conditions you would need to know rustc implementation details to be able to write FFI shim crates correctly, which is undesirable.

RalfJung commented 2 years ago

Well, okay, we'll have to specify parts of this reification relation.

I don't think we'll ever want to completely specify this though -- e.g., that would require specifying the layout of vtables. Some things remain in the land of "FFI/asm must not touch this".

JakobDegen commented 2 years ago

Why are vtable layouts special, ie different from all other #[repr(Rust)] things? Is it not sufficient to treat them as some kind of opaque memory that gets reified like everything else, but users trying to rely on that having a particular meaning are inevitably relying on unstable implementation details?

RalfJung commented 2 years ago

We could try this, but the space of possible reifications is larger: for repr(Rust), we basically say "the compiler will pick an offset for each field, and the fields won't overlap and will be suitably aligned, and that is all we tell you". Formally, the abstract machine is parameterized by a function that specifies the layout of (each monomorphization of) each repr(Rust) type, and that function is subject to these constrains.

For vtables, it's hard to say what the space of possible vtables is: sometimes we add entirely new things there, like for upcasting.

JakobDegen commented 2 years ago

Hmm, this is fair, I have some thoughts about it, but let me go and open a new issue.

Edit: Having thought about it some more, I think this is mostly a question of with "what even are vtables in the AM" and should not really be relevant here unless we have a definition there that actually causes us problems

JakobDegen commented 2 years ago

Edit: What's written below has changed following discussion on Zulip and follow-up comments

Oh, I had forgotten to mention this, but I have adjusted the writeup to reflect the most recent version. One notable thing I've done is I've renamed the "this allocation gets a non-overlapping address" operation to .force_addr(). Nothing else has really changed, it's just that we now have a forced flag on the allocation (instead of many address domains), and .force_addr() now possibly does NB instead of OOMing. With that cleared up, I don't know of any fundamental outstanding issues that make me believe this model is not an option.

Of course, that now leaves us with all the non-fundamental issues that this implies. I think most of these are of the form "when does .force_addr() get called." There are a lot of things we could potentially discuss here, but I'm going to bring up just one in an attempt to keep things manageable:

pub fn fast_eq<T: Eq>(a: &T, b: &T) -> bool {
    if (a as *const T) == (b as *const T) {
        return true;
    }

    *a == *b
}

Under the current proposal, this function is wrong (in the sense of "it doesn't do what you think it does"). In particular, fast_eq(&1, &2) may return true. I see three options for what we can do:

This is fine, and we don't need to do anything. Arguably this is a violation of our stability guarantees.
Equality comparison on pointers calls .force_addr() on each argument. The above code is now correct, but == on pointers possibly does NB and so might have to be treated as side-effecting by optimizations. No idea how bad that is in practice, probably not too bad.
We give pointer equality comparison a weird definition to make it pure again. Something like:
Pointer equality: (p_addr, p_alloc) == (q_addr, q_alloc) behaves as follows:
```
if p_addr != q_addr {
    return false;
}

if p_alloc != q_alloc {
    if !p_alloc.forced || !q_alloc.forced {
        return false;
    }
}

true
```
This should also recover the correctness of the above example. I should note that the precise definition I gave here may have issues in that it misses some edge cases, but I'm not too concerned about that right now. I think it's more important to first consider whether we might want something like this at all.

I am fairly strongly opposed to option 1. I think option 3 might be fine if we don't discover other weirdness with it (which seems very possible), and option 2 is almost definitely fine.

digama0 commented 2 years ago

Equality comparison on pointers calls .force_addr() on each argument. The above code is now correct, but == on pointers is side-effecting. No idea how bad that is in practice, probably not too bad.

Nit: NB is not a side effect, and so as a consequence neither is force_addr(). NB is a constraint, but once the constraint is satisfied NB just disappears. There is no temporal component to NB, so you can lift it up and down however you like. This is in fact the major advantage of the NB-based formulation over your original version, which was a side effect. (In case this is confusing: compare to the user's perspective of the claim that *ptr is a side effecting operation because it makes ptr not null. No it doesn't; it constrains ptr to not be null but it had better already have been nonnull when you do the operation.)

I think guaranteed_eq and guaranteed_ne need to be mentioned in conjunction with this question. What are their semantics? Are we about to define them to be equivalent to ==?

I think in practice option 3 will not be much different from option 2, because the compiler can't compile fn foo(x: *const (), y: *const ()) -> bool { x == y } without forcing x and y. But it's not as crazy as I thought at first. It would be nice if the definition were simplified to just clause (a), what is the issue with that definition? The idea would be that unless the compiler has enough visibility to determine whether p_alloc != q_alloc, it will just force both pointers to give you the equivalent of option 2, but it might be able to avoid this in some cases.

JakobDegen commented 2 years ago

If we have just clause (a), then

let p = alloca(8);
let q = alloca(8);
p.force_addr();
q.force_addr();

p.with_addr(q.addr()) == q

would be expected to yield false, which is clearly impossible to implement.

rust-lang / unsafe-code-guidelines

Removing allocations is technically unsound in OOM (out-of-memory) situations #328

Version Easy

Version Hard