mmtk / mmtk-core

Memory Management ToolKit
https://www.mmtk.io
Other
379 stars 69 forks source link

Rethink about whether creating Address from usize should be unsafe #1227

Open wks opened 2 weeks ago

wks commented 2 weeks ago

TL;DR: Currently, Address is unsafe to create from arbitrary values. But should it be? This paper says "Address creation from arbitrary integers is forbidden", but we should rethink about that.

The Address type has many unsafe methods. Among them, there are methods for

Understandably, memory accesses are unsafe because of potential races and unaligned/illegal memory operations, and conversions to Rust reference types should be unsafe because only the programmer can guarantee such conversions don't violate Rust's ownership and borrowing rules.

However, it's worth discussing the safety of creating Address instances. The following methods create Address instances, and some of them are marked as unsafe.

According to the 2016 paper Rust as a Language for High Performance GC Implementation by @qinsoon et al.,

Addresses and object references are two distinct abstract concepts in GC implementations: an address represents an arbitrary location in the memory space managed by the GC and address arithmetic is allowed (and necessary) on the address type, while an object reference...

and

We restrict the creation of Address to be either from raw pointers, which may be acquired from mmap and malloc, or derived from an existing Address. Address creation from arbitrary integers is forbidden, with the single exception of the constant Address::zero(). This serves as an initial value for some fields of type Address within other structs, since Rust does not allow structs with uninitialized fields...

The paper did not explain why "Address creation from arbitrary integers is forbidden", and it apparently contradicts with "an address represents an arbitrary location in the memory space managed by the GC". In the current Rust MMTk, Address::zero(), Address::max() and Address::from_usize are all marked as unsafe. Their doc comments say:

...It is unsafe and the user needs to be aware that they may create an invalid address...

It is unclear what "invalid address" is. Since creating Address from pointers and ObjectReference is considered safe, it seems to imply that Address is supposed to point to somewhere "safe", such as inside an object (But by that time the raw address of ObjectReference was not guaranteed to be inside an object, until https://github.com/mmtk/mmtk-core/pull/1195.) or inside a memory region obtained by mmap or malloc.

And since address arithmetics are considered safe, we can get an Address safely from any ObjectReference, and call addr.and(0).add(0x12345678) to "safely" create an address from an arbitrary address 0x12345678. That bypasses the unsafe annotations on the creation methods.

In fact, in the current Rust MMTk code base, we use Address to point to quite many things that are not inside objects. Things like Chunk, Block and Line are wrappers of Address. We can also derive sub-regions from their parents, such as iterating through all Blocks from a Chunk, or all Lines in a Block, and they are both considered safe. We also have linear scanning algorithms that go through every byte, and that is considered safe, too.

So I think it is pointless to mark zero(), max() and from_usize() as unsafe. Address is just what it is: an address, an arbitrary address. There is no validity guarantee of an Address anyway. It can be zero, be word-aligned or not, be inside the heap or not, be addressable by a 64-bit Intel CPU or not. There should be no restriction on creating Address. Only memory accesses and methods that create Rust references should be unsafe. And it should be unsafe to convert Address to ObjectReference, too, because ObjectReference does have the concept of validity (which can be checked by the valid-object (VO) bit).

qinsoon commented 2 weeks ago

The intent was that an Address should originate only from mmap, malloc, or similar sources, or to be derived from existing Addresses. Any attempt to create an arbitrary Address was considered "unsafe." However, this approach hasn’t worked out very well. We can remove the "unsafe".

wks commented 2 weeks ago

We discussed this topic today.

The "validity" of an Address is not well-defined in general, but there are definitions of "validity" in different contexts. For example, a block must be 32 KB in size and aligned to 32 KB, and an ObjectReference (currently defined as an address) must be within an object and must be word-aligned. In those cases, deriving addresses in one way may be safer than others. For example, we can derive the address of blocks in a chunk, given the chunk address. And we can derive chunk addresses from a memory region from mmap.

Currently, we have several types that wrap Address, usize or NonZeroUsize, including Chunk, Block (from both MarkSweepSpace and ImmixSpace), Line, ObjectReference, etc. We think it is better that whenever the code needs an address that satisfies certain definitions of validity, we use one of those wrapper types, or create one if necessary. We still use Address if there are no applicable definitions of validity.

wks commented 1 week ago

If I use the pagemap crate to parse the special file /proc/self/maps of an MMTk instance, it will return many MapsEntry structs. We can get the range of the entry by calling entry.memory_region() which returns a MemoryRegion which contains a start_address and last_address, both represented as u64. (I think it may allow a 32-bit program (-m32) to inspect the pages of a 64-bit process.)

In this scenario, I think it is very "valid" to convert the start_address and the last_address to MMTk's Address type. After that, I can compare the resulting Address against the memory ranges of spaces (i.e. CommonSpace::start which is an Address) in the MMTK instance and see whether the mmap range lies within a space of MMTk's. Currently Address::from_usize is labelled as unsafe, but there is really nothing unsafe in this scenario. Constructing Address is not unsafe (it can't crash your program), and comparing Address values is not unsafe.

qinsoon commented 1 week ago

Any address can be presented as usize or u64. We can declare fn malloc() -> usize, but we can also declare fn malloc() -> Address. Clearly the latter is more favored. In your example, you can declare fn get_memory_region() -> (Address, Address) and only do the conversion in the function, rather than doing the conversion everywhere in the code base.

I agree some 'unsafe' is unnecessary for Address. But we haven't seen a good example yet.

On the contrary, the following is an example that unsafe is useful. https://github.com/mmtk/mmtk-core/blob/8640ab89f43f33af99e61b9ec8ff2164ed2cfab7/src/util/alloc/bumpallocator.rs#L220