immunant / c2rust

Migrate C code to Rust
https://c2rust.com/
Other
3.78k stars 219 forks source link

analyze: proposal for simplified analysis and rewriting of allocations #1097

Open spernsteiner opened 3 weeks ago

spernsteiner commented 3 weeks ago

Correct rewriting of malloc/free to Box<T> requires tracking ownership throughout the program. This proposal describes an approach that produces correct rewrites with minimal new analysis work, at the cost of introducing run-time panics when the C code does not follow Rust ownership discipline.

The main idea is to rewrite owning pointers not to Box<T> but to Option<Box<T>>. The Option is Some if ownership still resides at this location, and it is None if ownership has been transferred to some other location. Accessing a location after ownership has been transferred away will cause a run-time panic due to unwrapping None.

Definitions

A "pointer location" is a local variable or memory location of pointer type. MIR temporaries representing the results of intermediate expressions are included in this definition.

A "heap object" is a region of memory allocated by malloc or a similar function. A new object is "live" until it is freed by passing a pointer to the object to free or a similar function.

"Ownership" is a relation between objects and pointer locations. At all points during execution, each live object is owned by one or more pointer locations. (Single ownership of an object corresponds to Box<T>; multiple ownership corresponds to Rc<T>.) If a pointer location owns an object, then the value in that location is a pointer to the owned object. In some situations, copying a value from a source pointer location to a destination may also transfer or copy ownership from the source to the destination.

Rewrites

This section describes the new rewrites we would perform under this proposal. The next section describes analysis changes that enable these rewrites.

Type rewrites: Many pointer locations function as "borrowed references" and never hold ownership of an object in any execution of the program. These are handled by existing rewrites. For locations that may hold ownership, we add a new rewrite that replaces *const T/*mut T with Option<Box<T>>.

This approach to type rewrites means that for any pointer location that might hold ownership at some points in some executions, we assume that all assignments to that pointer location also grant ownership. We call such a pointer location an "owning pointer location". Note that an owning pointer location is not guaranteed to hold ownership throughout its lifetime; ownership may be transferred away during an assignment from the pointer location in question. This is why we must rewrite to Option<Box<T>> rather than Box<T>.

Expression rewrites: In every (static) assignment operation q = p on pointers, the LHS and RHS can each be either an owning pointer location or a non-owning pointer location.

Similar rewrites apply to pseudo-assignments. For example, f(p) may be rewritten to f(Some(p.take().unwrap())) if the argument of f is an owning pointer location.

We must also rewrite several libc functions:

Analysis

For distinguishing owned and non-owned pointer locations, we can use the existing FREE permission. FREE is set on the arguments of the libc free() function and is propagated backwards. This means all pointer locations from which a pointer might flow to free will be considered owning, and all others will be non-owning.

Because Box<T> cannot be used for pointers to the stack (unlike &/&mut), it may be useful for debugging to add a simple analysis to identify pointer locations that may contain pointers to the stack. This can be implemented as a forward-propagated HEAP permission to identify pointers that must point to the heap (or be null). This can be implemented like NON_NULL: set HEAP initially for all pointer locations, and remove it for pointers that result from &local or a similar expression.

In the future, we may want to implement a linearity analysis to detect cases where a single object is potentially owned by multiple pointer locations (and thus should be rewritten using Rc<T>). For this proposal, we don't use such an analysis and instead introduce panics in cases of multiple ownership.

Interaction with NON_NULL

In this proposal, we rewrite all potentially-owning pointer locations to Option<Box<T>>. This means a pointer location that is both owning and nullable would have the type Option<Option<Box<T>>>, which is confusing. For clarity, we may wish to use a different type, such as Result<Box<T>, OwnershipLost> (where OwnershipLost is a zero-sized error type), for owning locations. Then a nullable owning pointer location would have type Option<Result<Box<T>, OwnershipLost>>, where None means the pointer location was assigned null, Some(Ok(p)) means it was assigned a valid pointer and still has ownership of the object, and Some(Err(OwnershipLost)) means it was previously assigned a valid pointer but ownership was since transferred away.

ahomescu commented 2 weeks ago

Non-owning to owning: This is an error for the rewriter. There is no reasonable safe way to convert &T/&mut T to Box<T>

How about Copy or Clone where those are supported? It's not always sound, but it could work in some cases.

ahomescu commented 2 weeks ago

Some(p.take().unwrap()) could be p.take().or_else(|| panic!("Got None")) or something like that (not that it made it shorter).

ahomescu commented 2 weeks ago

we rewrite the malloc call either to Box::new(T::default())

We could also use Box::new_uninit, if we can live with it being nightly and unsafe.

ahomescu commented 2 weeks ago

Then a nullable owning pointer location would have type Option<Result<Box<T>, OwnershipLost>>

I'm generally in favor of using standard types, but this is a bit clunky. We could just use our own 3-state type e.g.

pub enum Owner<T> {
    Valid(T),
    Null,
    LostOwnership,
}
spernsteiner commented 2 weeks ago

How about Copy or Clone [to produce Box from &T] where those are supported?

In some cases this would change behavior (due to the change in aliasing), and I think it would be hard to tell whether a given instance of this rewrite changes behavior or not.

We could also use Box::new_uninit, if we can live with it being nightly and unsafe.

The fact that it's unsafe would be a problem for our safety story. Currently, if all goes well with the analysis and rewriting, then the result has no unsafe code at all. This saves us from needing more complicated analyses, like figuring out whether all fields are actually initialized following Box::new_uninit.

I'm generally in favor of using standard types, but this is a bit clunky. We could just use our own 3-state type

A downside of the 3-state type is that it's not compositional. But I agree it's clunky to write out the type in full. I could see having an alias like type Owned<T> = Result<Box<T>, OwnershipLost>, or maybe making it a custom type (Result doesn't have a method like Option::take).

spernsteiner commented 5 days ago

While implementing this, I ran into an issue involving FREE and OFFSET permissions. Example code:

let p = malloc(...) as *mut i32;
*p.offset(1) = 1;
let q = p;
free(q as *mut c_void);

p is inferred to have the FREE permission (because it flows into free) and the OFFSET permission (because it flows into p.offset(1), so its rewritten type is Box<[i32]>. q is inferred to have FREE, but not OFFSET, so its rewritten type is Box<i32>. The rewriter thus attempts to insert a cast from Box<[i32]> to Box<i32> at the q = p assignment, but that cast isn't currently supported.

Option 1: add support for the cast. For example, q = Box::new(mem::take(&mut p[0])); drop(p);. One downside of this is that it will panic if p.len() == 0 (which is likely uncommon). Another downside is that it will change the allocation behavior of the program in a way that the user may find surprising.

Option 2: unify PointerIds along the path from malloc to free. The type at drop() time generally must be the same type that was produced by Box::new, so all variables along the path from Box::new to drop must have the same type. We could enforce this by giving all of those pointers the same PointerId, so all will have identical permissions and flags. Specifically, we could unify the PointerIds of the LHS and RHS of any assignment where the output has FREE. A downside of this approach is that it would require some refactoring, as currently we handle unification before running any dataflow analysis, but running dataflow is required to propagate FREE.

Option 3: modify dataflow constraints. Assignments are normally allowed to discard any subset of permissions (the LHS permissions can be any subset of the RHS permissions). We could instead forbid assignments from discarding OFFSET while retaining FREE. That is, LHS must be a subset of RHS, but if LHS has FREE and RHS has OFFSET, then LHS must also have OFFSET. This prevents the case that requires the unsupported cast, where the RHS has OFFSET | FREE but the LHS has only FREE. A downside of this approach is that it adds a somewhat subtle special case to the dataflow analysis.

I've implemented option 3 for now, but it might be cleaner to switch to option 2.