Provenance: storing to/loading from arbitrary addresses in an interpreter's registers

anp commented 6 months ago

Many interpreters operate as "register machines" where the bytecode operations reference one of several named registers for operands, performing stores and loads using register values as addresses.

For example, the rbpf interpreter loop represents registers as u64, and converts them to pointer types as appropriate for the operation.[^1]

How can this kind of memory access be made sound under pointer provenance? My impression is that the work on provenance has so far focused on embedded use cases like statically-known MMIO addresses, but I'm not sure how to use something like the foreign_addr proposal here.

cc @qsr

[^1]: Note that eBPF interpreters are typically paired with a static verifier which prevents execution of the program if it would introduce data races or otherwise perturb Rust-owned state. I believe that eBPF's semantics will require leaning heavily on a verifier for soundness in a Rust implementation, I'm just not sure how to account for provenance in that implementation.

Lokathor commented 6 months ago

The u64 address values are within the context of the interpreter's memory space, right? So they're basically indexes into a Vec<u8> or whatever that holds all the interpreter memory?

RalfJung commented 6 months ago

I don't quite follow, could you be a bit more concrete? Where are these pointers that the interpreted program manages pointing to? How does the interpreter ensure they remain valid?

One option here is to not use u64 as the register representation. If you truly want the interpreted program to be able to hold and manipulate pointers of the host machine, then you can't use an integer type to represent interpreter state. You need a type that is like "union of integer and pointer", and in languages with provenance like C or Rust, that union is not just an integer type. In Rust, in fact a pointer type works for that purpose: a *const () can hold all values that usize can hold, and furthermore it can also hold provenance.

saethlin commented 6 months ago

I'm rather confused by the question. As far as I can tell, the rbpf interpreter loop has no need of unsafe code at all, and might not even be gaining any performance from using unsafe code because it implements its own bounds checking then does raw pointer reads/writes.

anp commented 6 months ago

The u64 address values are within the context of the interpreter's memory space, right? So they're basically indexes into a Vec or whatever that holds all the interpreter memory?

Sorry, I should have provided more background about eBPF, and perhaps rbpf wasn't the best example. When it's implemented in a context similar to the Linux kernel, the addresses are AIUI arbitrary kernel memory. A static verifier is used to limit that access to scratch memory but the address space is 1:1 with the kernel.

I don't quite follow, could you be a bit more concrete? Where are these pointers that the interpreted program manages pointing to? How does the interpreter ensure they remain valid?

The pointers point to arbitrary host memory, for example the Linux kernel's address space. AIUI, the eBPF verifier is used to ensure they remain valid, but the virtual machine is technically able to store/load from arbitrary addresses in the host program's address space.

In Rust, in fact a pointer type works for that purpose: a *const () can hold all values that usize can hold, and furthermore it can also hold provenance.

This is good to know, but I'm not clear on how provenance can be established in the first place when the pointers are being stored into registers by arbitrary bytecode instructions that may or may not have provenance over the pointed-to allocations (or may point to addresses that are unallocated).

As far as I can tell, the rbpf interpreter loop has no need of unsafe code at all

Yeah upon further reading rbpf may not be the best example here because it limits the memory that the instructions can access. However the use case I care about is implementing eBPF support for a Linux compatibility layer where AIUI the support needs to be able to access arbitrary memory.

CAD97 commented 6 months ago

In short, you would need to determine what bytecode operations need to preserve provenance, then implement the interpreter in such a way that sufficient provenance is maintained. This primarily means using MaybeUninit and/or something like iptr when interacting with untyped memory that may carry provenance.

So for something like compiling eBPF or WASM using native heap instead of a managed heap while their VM doesn't distinguish between pointers and integers, you'll need to preserve provenance for all integer operations. I fully expect a managed heap approach to be more straightforward, though.

anp commented 6 months ago

Perhaps the crux of my question is how can a bytecode interpreter establish provenance in the first place?

So for something like compiling eBPF

FWIW it's not always compiled, most eBPF implementations include an interpreter AFAIK.

I fully expect a managed heap approach to be more straightforward, though.

Is there a path to writing a sound interpreter in Rust without a managed heap? I don't think the VM's semantics will allow for that but would accept a correction from someone more familiar with eBPF.

Lokathor commented 6 months ago

How would this be done soundly in C? Rust has to follow a provenance model exactly because C already has followed one for ages. If this can be done in C, probably Rust can "just do that".

RalfJung commented 6 months ago

Wow, that's an Extremely Cursed architecture.^^

The pointers point to arbitrary host memory, for example the Linux kernel's address space. AIUI, the eBPF verifier is used to ensure they remain valid, but the virtual machine is technically able to store/load from arbitrary addresses in the host program's address space.

Surely it can't be arbitrary. How does the validator determine which objects can be accessed? Is there such a notion as "publicly accessible objects"? How is the eBPF program prevented from just taking a pointer, offsetting it by N (going out of bounds, or into the bounds of an object that is not "publicly accessible") and accessing that? Surely the validator is placing some tight limits on how pointers can be computed?

Also for doing stores, how does this ensure that the store doesn't completely destroy the state of the object being written to?

anp commented 6 months ago

Did a bit more background research, and it seems that verified eBPF programs should only interact with their own stack and with pointers provided as input parameters to the hook/program or as returned from helper functions. It's technically possible for an incorrect program to do loads/stores on other addresses, but this should be prevented by the verifier before execution. So I think that addresses the question of how to establish provenance in the first place -- as long as those argument/return types are provenance-preserving we should be fine?

Regarding the types to use for registers/parameters/return values, is *const () actually what we want? I don't think that can be the type used for registers because that would require casting non-pointer u64's to the pointer type when the register is holding a value other than a pointer. Is there a type that will preserve provenance if a pointer is stored into it but which can also store an integer? Maybe something like MaybeUninit<[u8; 8]>?

saethlin commented 6 months ago

Is there a type that will preserve provenance if a pointer is stored into it but which can also store an integer?

A pointer.

anp commented 6 months ago

A pointer.

I thought provenance meant we had to stop doing int2ptr casts?

bjorn3 commented 6 months ago

There are two distinct operations that accept an integer and return a pointer:

ptr::from_exposed_addr: This requires a pointer to be previously exposed using ptr.expose_addr(). This operation is not compatible with CHERI.
ptr::without_provenance: This takes an integer and produces a pointer with invalid provenance. You can later turn it back into an identical integer, but are not allowed to dereference or otherwise operate on it in a way that requires correct provenance.

Lokathor commented 6 months ago

Any int can be safely casted to a pointer, and if you later need to do int stuff instead of pointer stuff you can just cast it back to an int.

But if you make the "base" value of your interpreter be pointers it'll be able to store all that you need to store.

anp commented 6 months ago

OK, that makes sense! I don't see ptr::without_provenance in the current unstable APIs, if we implement these registers stored at *const () now, will we need to follow up to make int2ptr casts use ptr::without_provenance or something similar later? Or will we be OK to continue doing u64_value as *const ()?

Lokathor commented 6 months ago

Honestly, the real answer is probably "just use as casts, and we'll have a better answer soon". There's ongoing discussion and an effort to push this forward, but even if all the methods are fixed up and perfect then as casts will take a long time for the ecosystem to migrate away from so they'll keep working without so much as a warning for quite some time.

saethlin commented 6 months ago

The casts are at least as well-defined as doing any other conversions, so yes in some sense it is okay.

We generally caution against it because:

When you cast an integer back to a pointer it's not entirely clear what accesses that pointer is allowed to do. Our current approach in Miri and probably for all checkers will be to have false negatives. So if you use integer-to-pointer casts, you can't rely on checkers the way you usually do.
The implementation of LLVM does not acknowledge that pointer-to-integer casts have a side effect, which means that if you rely too much on the permissive semantics that we will probably assign to pointer-to-integer-to-pointer casts, you may be betrayed by the optimizer. In practice this is unlikely to be an issue, if your application gets miscompiled and you manage to find that before it causes trouble, I'm sure LLVM or GCC would be willing to do something about a specific example. See https://www.ralfj.de/blog/2022/04/11/provenance-exposed.html for more information about the problem, which also links to https://github.com/llvm/llvm-project/issues/34577 and https://github.com/llvm/llvm-project/issues/33896

saethlin commented 6 months ago

Here is an example of current Miri missing UB due to pointer-integer-pointer casts:

fn main() {
    let alloc1 = 1u8;
    let alloc2 = 2u8;
    // Expose both allocations
    let addr1: usize = &alloc1 as *const u8 as usize;
    let addr2: usize = &alloc2 as *const u8 as usize;

    // Cast addr1 back to a pointer. In Miri, this gives it Wildcard provenance.
    let wildcard = addr1 as *const u8;
    unsafe {
        // Read through the wildcard
        println!("{}", *wildcard);
        // Offset the pointer to another allocation.
        // Note that we are doing this arithmetic that does not require we stay within bounds of the allocation.
        let wildcard = wildcard.wrapping_offset(addr2 as isize - addr1 as isize);
        // This should report UB:
        println!("{}", *wildcard);
        // ... but it doesn't. A pointer's provenance specifies a single allocation that it is allowed to read from.
        // And wrapping_offset only modifies the address, not the provenance.
        // So which allocation is wildcard allowed to access? It cannot be both.
    }
}

https://play.rust-lang.org/?version=stable&mode=debug&edition=2021&gist=32e14abd12395b84cdc86fcaf6be2180

This specific example is probably fixable in Miri, because it only relies on allocation-level knowledge. But more nuanced examples are probably not fixable.

Lokathor commented 6 months ago

It's the interpreted program that's doing the pointer manipulation, right? If so, then as far as I can tell the interpreter literally can't know what the correct provenance to tag pointers it creates with. The interpreted language would have to itself have a provenance model (I suspect it does not).

anp commented 6 months ago

It's the interpreted program that's doing the pointer manipulation, right? If so, then as far as I can tell the interpreter literally can't know what the correct provenance to tag pointers it creates with. The interpreted language would have to itself have a provenance model (I suspect it does not).

Correct, I think the best we can do is pass values through a "maybe has provenance" type and only operate on them when the static verifier says they're OK to use.

anp commented 6 months ago

Here is an example of current Miri missing UB due to pointer-integer-pointer casts:

Just to confirm my understanding, we shouldn't hit this problem if we're doing integer-point-integer casts without doing any pointer operations on the middle state, right? We'd only hit issues if we send a pointer through an integer and lose provenance?

saethlin commented 6 months ago

Just to confirm my understanding, we shouldn't hit this problem if we're doing integer-point-integer casts without doing any pointer operations on the middle state, right? We'd only hit issues if we send a pointer through an integer and lose provenance?

Yes.

anp commented 6 months ago

This is all very helpful information, thanks! I think we have a path forward:

use pointer representations for all "either an integer or pointer" values (registers, parameters, return values)
if we need to perform arithmetic on integer values that may have already had provenance, use addr()/with_addr() to produce pointers for storage once those APIs are stable
ensure the static verifier only allows load/store operations on memory locations for which the interpreter has provenance (program parameters, the stack, return values from helper functions), and pointers derived from those

Does that sound right?

Re (1), is there a meaningful difference between *const () and *const u8?

chorman0773 commented 6 months ago

Actually, there's an issue with doing typeless arithmetic with pointer operations:

What's the semantics of add r0, r0, r1. More specifically, which operand do you preserve provenance from?

Lokathor commented 6 months ago

well since you're doing it in u64 there's no provenance at all. It would be the wildcard provenance when you as *const() the u64 back to a pointer. Or the provenance of a specific source pointer if you use p.with_addr(u)

chorman0773 commented 6 months ago

Using with_addr though, if the pointers can have different provenance, which pointer operand would you use as the source pointer?

On Thu, Mar 14, 2024 at 16:05 Lokathor @.***> wrote:

well since you're doing it in u64 there's no provenance at all. It would be the wildcard provenance when you as *const() the u64 back to a pointer. Or the provenance of a specific source pointer if you use p.with_addr(u)

— Reply to this email directly, view it on GitHub https://github.com/rust-lang/unsafe-code-guidelines/issues/497#issuecomment-1998331462, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABGLD22RZAF6WHRE7IRKTWDYYH7J3AVCNFSM6AAAAABEQ4KPFGVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTSOJYGMZTCNBWGI . You are receiving this because you commented.Message ID: @.***>

Lokathor commented 6 months ago

Yeah, if the interpreter is working with more than one allocation chunk that sounds really hard to handle. As I said before, I think the interpreter literally can't know the answer, because it depends on the program being interpreted. The only useful suggestion I would have is to "expose" every single allocation for ptr2int and then "from_exposed" when doing int2ptr, then at least there's the best possible chance to avoid UB. Even then, UB is entirely possible. This interpreter's entire requirement set seems fundamentally unsound.

anp commented 6 months ago

The only useful suggestion I would have is to "expose" every single allocation for ptr2int and then "from_exposed" when doing int2ptr, then at least there's the best possible chance to avoid UB. Even then, UB is entirely possible.

Can you say more about what kind of UB would be possible if we treated every pointer as exposed?

Lokathor commented 6 months ago

What I'm imagining is that if there's a single allocation then it's easy and cheap to do bounds checks on each access, but if there's all sorts of allocations then the interpreter doesn't have a way to bounds check each access without massive performance loss.

Also, what saethlin demonstrated in https://github.com/rust-lang/unsafe-code-guidelines/issues/497#issuecomment-1995917111. If you've got more than one exposed allocation, an int cast back to a pointer is still "supposed" to only actually act like it's from a single specific allocation. However, miri does not currently have this level of granularity. Even if you did check some programs in miri, you definitely can't be running miri with every program the interpreter interprets. Unless the language being interpreted somehow guards against this (i suspect it does not), then it will always be a problem source.

anp commented 6 months ago

the interpreter doesn't have a way to bounds check each access without massive performance loss.

Correct, but eBPF's bytecode is designed to have guaranteed termination and allow for up-front analysis of memory accesses. The verifier should prevent any OOB accesses before execution.

Lokathor commented 6 months ago

Well then assuming all that, you should be covered

riking commented 6 months ago

You could send information through from the verifier about what memory areas each instruction accesses to the interpreter so that the execution engine can select the correct base pointer among (parameters..., stack, accessed_from_helper...).

anp commented 6 months ago

Would using exposed addresses inhibit important optimizations in code outside of the eBPF interpreter? Part of the goal for eBPF is to integrate closely with host data structures, and it would be good to not take a soundness approach that prevented other parts of the program from being optimized more aggressively.

Lokathor commented 6 months ago

Very broadly, not exposing an address can allow more optimizations to happen. eg: a pointer that doesn't escape can't be modified by a function call, so you can skip reloading the value after the function is called if it was already in a register before the call.

However, once you're merging two or more libraries you realistically won't really get those sorts of micro-optimizations anyway, so I'm doubtful it would have any major impact in practice.

saethlin commented 6 months ago

I'm still concerned that either you (@anp) or I are incorrectly reasoning about eBPF specifically, which is why I've been trying to just answer questions about provenance in general, and that latest comment makes me even more concerned.

I think eBPF is best modeled here like FFI. And in terms of FFI, we do not attempt the kind of reasoning that I think you are trying to engage in here. You do not reason about provenance through the FFI boundary in the manner I think you are attempting. Instead, you view the scenario as the optimizer must, because the singular purpose of provenance is for optimizers. So long as it is possible to suppose that the behavior of the other side of the FFI is implemented in terms of whatever system is on this side of the FFI, the optimizer assumes that it is and that's that. This means that if you expose a pointer to FFI then read it back later after it may have been written to, the optimizer must be very pessimistic about what provenance that pointer has. You can read Ralf's phrasing of this concept here: https://github.com/rust-lang/unsafe-code-guidelines/issues/421#issuecomment-1597313701

anp commented 6 months ago

I think eBPF is best modeled here like FFI.

Could you expand a bit on this? Do you mean that by exposing all in/out pointers touched by the interpreter that it would behave as if doing FFI?

saethlin commented 6 months ago

Do you mean that by exposing all in/out pointers touched by the interpreter that it would behave as if doing FFI?

No. I mean that from an opsem perspective, that the implications of having an eBPF interpreter running seem to me the same as doing FFI. I realize now that I probably chose to use the word "expose" above poorly. What I probably should have said is more like:

"This means that if you permit FFI to write to a pointer then read that pointer back later after it may have been written to, the optimizer must be very pessimistic about what provenance that pointer has."

Does that clarify at all?

RalfJung commented 6 months ago

Broadly summarizing what was said above, I'd say there are two strategies you could follow on the interpreter side: strict provenance or "fuzzy" provenance.

With strict provenance, you don't use as casts and instead attempt to perfectly track the provenance of pointer values during execution via with_addr. However, it seems like eBPF was designed in a way that makes this hard or impossible. @chorman0773 brought this up above: If the interpreted program is allowed to just do "x + y" and the verifier will allow either operand to be a pointer and the other to be an integer and then treat the result as "ptr + offset"... that means the interpreter has no way to know which provenance the result of the addition should have, the one of x or the one of y. IMO it's a design mistake in eBPF to mix up integers and pointers like that, but I'm not entirely surprised that this happened. (This does mean optimizing eBPF programs is extremely difficult since pointers inside eBPF apparently do not have provenance, so alias analysis has to be extremely conservative.)
The "fuzzy provenance" alternative is to expose_addr every pointer that the eBPF program may use, to use u64 throughout, and to from_exposed_addr any time an eBPF value needs to be used as a pointer -- and then rely on it "magically" picking up the right provenance like the usual ptr2int2ptr roundtrips.

Either way you need to make sure that these pointers that eBPF can hold (either via explicitly tracked provenance or via exposing the provenance) do not alias with Rust references that may exist in the kernel at the same time. Rust references assume non-aliasing (or, for shared references, read-only-aliasing) and this is true even with "fuzzy provenance". That's hopefully not too surprising, the eBPF program is after all acting on certain objects so if Rust code acted on the same objects at the same time that would violate "aliasing XOR mutation".

anp commented 6 months ago

Broadly summarizing what was said above, I'd say there are two strategies you could follow on the interpreter side: strict provenance or "fuzzy" provenance.

I think I misspoke when I asserted that each operation could be restricted to at most one pointer type. So even if we implemented a verifier that retained some information about which operand was a pointer, the verifier would need to somehow verify that two-pointer operations had the same provenance. I'm not sure how that's possible, so I think a "strict provenance" approach might not be feasible here.

(This does mean optimizing eBPF programs is extremely difficult since pointers inside eBPF apparently do not have provenance, so alias analysis has to be extremely conservative.)

My understanding is that eBPF is not designed to be optimized, it's a very low-level bytecode which can be translated to machine code without much complication. So maybe that's not an issue in practice for JIT implementations?

The "fuzzy provenance" alternative is to expose_addr every pointer that the eBPF program may use, to use u64 throughout, and to from_exposed_addr any time an eBPF value needs to be used as a pointer -- and then rely on it "magically" picking up the right provenance like the usual ptr2int2ptr roundtrips.

Good to know! Is sptr the best way to express that intent today?

That's hopefully not too surprising, the eBPF program is after all acting on certain objects so if Rust code acted on the same objects at the same time that would violate "aliasing XOR mutation".

That makes sense. Is it sound for the interpreter to mutate through pointers that were obtained from a mutable reference held by the thread which invokes the interpreter? Naively I'd assume so but this topic has enough nuance that I think it's worth double-checking.

Lokathor commented 6 months ago

Yes, the interpreter can be given a mutable reference to some memory (eg: &mut [u64]) and then access that memory through a mutable pointer made from the mutable reference

pub fn interpret_program(prog: &Program, mem: &mut [u64]) {
  let p = mem.as_mut_pointer();
  // interpret the program here and use `p`
}

RalfJung commented 6 months ago

Indeed -- but if you rely on "fuzzy provenance" you need to make sure you turn the mutable reference to a raw pointer and cast to int and then start the interpreter -- even if the interpreter is started multiple times, this needs to be done each time, as each time you do such a cast that just "publishes" that one reference for use by the interpreter.

Basically there needs to be an actual proper chain from "the one unique pointer that can access this memory right now (according to Rust)" to an expose_addr call before any from_exposed_addr that may alias that pointer. And when a new unique pointer is created, the old chain "dies".

Good to know! Is sptr the best way to express that intent today?

Either that, or just add expose_addr/from_exposed_addr methods to your crate and implement them with as casts.

I think I misspoke when I asserted that each operation could be restricted to at most one pointer type. So even if we implemented a verifier that retained some information about which operand was a pointer, the verifier would need to somehow verify that two-pointer operations had the same provenance. I'm not sure how that's possible, so I think a "strict provenance" approach might not be feasible here.

Looks like that operation is intended to return an integer, i.e., data with no provenance?

FWIW, a proper provenance-aware interpreter would still be possible if either the validator annotated the code to make explicit with values have provenance, or a bit is sacrificed at runtime to store whether any given value has provenance. But likely neither of them go well with the design ideas of eBFP, so it's probably best to discard this subthread and focus on the exposed-provenance option.

anp commented 6 months ago

OK, thanks for all the discussion folks! I think exposing the addresses makes the most sense now, if we find that we need to get more optimizations unlocked in the future and we don't have an option to implement an eBPF JIT then we'll probably have to look at storing a bit for each operand from the verifier to indicate whether the operand should have provenance.

rust-lang / unsafe-code-guidelines

Provenance: storing to/loading from arbitrary addresses in an interpreter's registers #497