wks commented 1 year ago

Updates

Recent discussions changed my initial thoughts. I list the current status in this section.

Concepts

The topic is about "disjoint objects".
An alloc call creates an allocation.
- An allocation occupies a region of memory in an MMTk space. It's managed by GC.
- An allocation can be in any space. Whether it is an object or buffer is orthogonal to the kind of space it is in.
- Before the reclamation phase of a GC, we decide whether to retain an allocation, or to reclaim (a.k.a. to release, to recycle, to free) it.
An allocation can be either an object or a buffer.
- An object owns many buffers. A buffer is owned by exactly one object.
- An object plus all of its buffers appear as one "object" on the programming language level of the VM.
- An object has liveness. It is determined by the GC (by tracing or reference counting). An object is retained if it is live, or reclaimed if it is dead.
- On the contrary, a buffer doesn't have liveness. During GC, the VM decides whether to retain or reclaim a buffer based on the liveness of the object that owns it, and other heuristics (such as merging or splitting objects).

An implication of the retain and reclaim semantics is that

Each space must have its own metadata about whether to retain an allocation or not.
- Free-list space: There should be a bitmap to indicate whether each cell in a free-list should be retained. If a cell is retained, the allocation in the cell is retained.
- Copy space: allocations that should be retained shall be evacuated. No metadata is needed.
- Immix space: Each line has a bit to indicate whether a line is retained. If some lines are retained, all allocations within or spanning those lines is retained.
This metadata is orthogonal to the liveness of objects.
- Specifically, free-list spaces should not reuse the mark bit as the "reclaim bit", because
  - If we use reference counting, there may not be mark bit, and
  - Even if we use mark-sweep, the object reference may have an offset after the starting address of the cell.

To summarise in simpler words:

A space
- has many allocations
- unless it is a copy space, it has metadata to record whether to retain or reclaim any of its allocation units (cells, lines, etc.)
An allocation
- is an allocated region of memory in any GC space.
- can be retained or reclaimed
An object
- is a kind of allocation.
- can own many buffers.
- can be alive or dead. It is retained if and only if it is alive.
- has metadata about liveness.
A buffer
- is also a kind of allocation
- can be owned by exactly one object
- is decided by the VM whether to retain or reclaim it during GC.
- does not have metadata about liveness
An "object" at the programming language level
- is represented as an MMTk-level object plus all of its buffers.
- cannot see "buffers" at the language level
- may appear to be resizable
  - implemented as resizing/merging/splitting of its underlying allocations.

Implementation

Allocation

Both objects and buffers are allocations. Buffers are allocated using the same Mutator::alloc function. But we either do not need post_alloc, or we use a different post_alloc.

Retaining and copying buffers

When scanning an object, the VM identify pointers in the header object that points to buffers. mmtk-core provides an API for the VM. The signature of the function looks like this:

/// `buffer`: The address of the buffer
/// `size`: The size of the buffer
/// `alignment`: The alignment of the buffer
/// Returns: The new address of the buffer.
fn retain_buffer(buffer: Address, size: usize, alignment: usize) -> Address;

The VM needs to provide the size and the alignment to MMTk core because MMTk core cannot get those information from the buffer alone (as there is no such information in the buffer). If it is copying GC, MMTk core will use that information to copy the buffer. The buffer is copied like copying ordinary objects, via ObjectModel::copy provided by the VM. The new address of the buffer is returned.

Question: Should we provide another ObjectMode::copy_buffer? ObjectModel::copy takes an ObjectReference which assumes it is a reference to an object, not a buffer.

However, if the VM doesn't want to retain the buffer, it simply ignore the buffer, and MMTk will treat the buffer as dead.

Original thoughts

The rest of this post are my initial thoughts

"Naked" objects

Not all objects are created equal. In some virtual machines, some objects are wholly owned by other objects. For example:

In Julia and Ruby, array objects have underlying buffers. The "main" array object is small, and it contains a pointer to a separately allocated buffer. The buffer is simply a homogeneous array of values (or references), and does not have attached metadata to identify its length and its reference fields (and therefore "naked").
In Ruby, not only arrays, but also "ordinary" Ruby objects (with an unspecified number of instance variables), strings, hash tables, big integers and structs have such separately-allocated buffers. What's more, those objects can be resized. When the elements are few, it will switch to an "embedded" layout, i.e. storing the array elements, string contents, etc. into the "main" object itself; but when the data structures grow, it will move the contents out of line, in a separately allocated buffer.

Given that more than one VM have such objects, it may be worth adding support to such "naked" objects in mmtk-core.

Primary and subsidiary ("naked") objects

An object can be either primary or subsidiary. Primary objects are the objects we know before. Subsidiary objects are what we called "naked" objects.

Both kinds of objects are allocated in the GC heap.

Both are allocated through the object-allocation API (alloc).
Both can be in any space. i.e. any space can contain primary or subsidiary objects

Their differences are,

A primary object can own zero or more subsidiaries, while a subsidiary object is owned by exactly one primary object.
- We call it the "parent" of the subsidiary object, like a parent company fully owns a subsidiary company.
- (We avoid the word "header" in order not to be confused with "object headers", i.e. the flags, klass field, etc.)
- We call a parent plus all its subsidiaries an "object group" (as in the corporation setting).
A primary object cannot have reference to subsidiaries it does not own.
- In other words, a subsidiary object can only be referenced by its parent.
The lifetime of a subsidiary is determined by its parent.
- That means, during tracing, the parent can determine whether its subsidiaries live or die.
A subsidiary object may or may not have enough information to scan/mark/copy itself.
- It may not have type info to find its reference fields.
- It may not know its own length.
- It may not have an in-object mark bit to record whether it has been reached during GC.
- The parent object may hold the length of its subsidiary buffer.
- That's why we call it "naked". But in theory a subsidiary doesn't have to be "naked". It can have all the type info while still being subsidiary.

Difference from vanilla Ruby's buffers

A difference between the "subsidiary" objects defined here and the buffers of Array and String in vanilla Ruby is that both the primary and the subsidiary objects defined here are managed by the MMTk GC, while Ruby's buffers are allocated by malloc, and are freed by finalizers (obj_free).

Object graph and subsidiary objects

An object graph contains nodes and edges.

Without subsidiary objects, all object are primary. A node is a (primary) object. An object contains many reference fields, each of which represents an edge to another objects.

With primary and subsidiary objects, a node is an object group, i.e. one primary object plus all subsidiary objects it owns. Reference fields in both the primary and its subsidiary objects are the edges of the node. Edges only point to primary objects. The pointer from a parent to a subsidiary is not considered an edge in the object graph -- it's internal to a node.

Opportunity of object merging/splitting during copying

During copying, the GC has the opportunity to resize objects, and the opportunity to merge or split objects in a group.

For example,

When an array is small, the VM may choose to hold its elements inside the primary object -- the Array object itself.
When an array grows, the VM can split its elements off the primary object. It allocates a smaller primary object and a large subsidiary object, then copy the original elements to the subsidiary object, and let the new primary object point to the subsidiary.
When an array shrinks, the VM can allocate a bigger primary object, big enough to hold all elements, then copy its element from the subsidiary to the new primary object.

As in the current MMTk interface, the VM is responsible for copying objects during copying GC. Some VMs are already using this opportunity to implement address-based hashing. We can extend this mechanism and let the VM decide whether to resize, split or merge objects.

However, in concurrent copying GC, it is the VM's responsibility to handle the synchronization between the mutator and the GC.

steveblackburn commented 1 year ago

A few comments about the abstractions...

The structures discussed here are not visible at the language level. The abstraction is a single object.
For the above reason, we should not call the secondary element an "object" or even a "secondary object" or "naked object" since it is in no sense an object; it is merely a piece of allocated memory that is part of an object.
Until we can find a better name, let's call it a buffer.
Buffers are only pointed to by raw pointers, never by object references.
The ability to introspect is a function of the object header, which is not part of the buffer, so buffers can only be introspected in the context of their parent object.

Summarizing the terminology:

The concept we are discussing is "disjoint objects".
Disjoint objects are an implementation detail not visible at the language level.
A disjoint object is comprised of a "parent object" which refers to or more "buffers"
A buffer may only be referenced from its parent object, via a raw pointer.
The liveness of a buffer is determined by its parent
A buffer is not an object and therefore is typeless. The type of the data contained within a buffer is dictated by the parent object.

It is not our goal to implement Ruby's disjoint objects---one of the primary reasons for moving to MMTk is to avoid the limitation in Ruby that requires disjoint objects.

Languages will likely use disjoint objects to implement arrays that may be resized.

wks commented 1 year ago

@steveblackburn What name should we give to the thing returned by alloc? It is "a region of memory where the VM is allowed to write in, and is managed by the garbage collector". Both "parent objects" and "buffers" are allocated this way. (Or, should they?) If they are, there should be a concept that is a union of "(parent) objects" and "buffers"

qinsoon commented 1 year ago

I suggested using a special policy for buffers. I think it does not conflict with Steve's description about the disjoint objects. It is indeed one way to implement disjoint objects. Instead of having each policy deal with the special buffer object, we could make it only known to the buffer policy. The following is a comparison of implementing buffers in a special policy and allowing buffer in any policy.

	Buffers only in a special policy	Buffers in any policy
Header metadata	The policy knows it cannot use object header metadata.	Each policy needs to deal with a buffer's metadata specially, as it cannot use header metadata.
Tracing objects	The policy knows it should not trace objects in the policy.	A policy does not know if the object is a buffer or not, unless it stores some metadata to identify buffers.
Keeping buffers alive	The policy provides a method. We can implement the policy to make this easy.	Each policy needs to provide a method for this.
Querying objects (is_mmtk_object, is_live, is_movable, base pointer)	The policy knows buffers should not be introspected	A policy does not know if the object is a buffer or not, and may return incorrect wrong results, unless it store some metadata to identify buffers.
Merging objects when copying	Possible	Possible

qinsoon commented 1 year ago

With disjoint objects, in some places where we refer to objects, we need to differentiate 'object' (parent + buffer) from just 'parent'. For example, when we ask a binding for object sizes:

for copying, we ask for object size (they could tell us the parent size, or the entire object size if they want to merge the buffer with the object)
for linear scan, we need to explicitly ask for the parent size. If they tell us the entire object size, that would be wrong.

qinsoon commented 1 year ago

Some conclusions from our discussion (based on the table above):

Header metadata: a buffer does not have any header metadata, or per-object metadata. Its retention is done by some internal metadata maintained by the policy -- a policy has to be able to retain memory without per-object metadata. E.g. line mark for immix, treadmill for large object space.
Tracing objects: we do not trace buffers, and a binding should not call trace_object() otherwise it is a bug in the binding.
Keeping buffers alive: retention is different from identification. Retention by each policy should be done in their specific way without using per-object metadata.
Querying objects: A binding should not turn a buffer address into an object reference and query us. For is_mmtk_object(addr), as we will not set alloc bit (valid object bit) for the buffer, it will return true. It is not an MMTk object. Also we do not support internal reference into a buffer (as it is not an object).

A few other things that we discussed:

Copying:
- merging object: when MMTk asks a binding to copy the parent object, the binding can copy the buffer.
- not merging, i.e. retaining buffer: a corner case would be that the parent object is in a non-moving policy, and the buffer is in a moving policy, say copy space (a binding may want to avoid this though). When a binding calls MMTk to retain the buffer, MMTk may call to the binding for copying.
alloc/post_alloc: we may either have a separate alloc_buffer() method, or a separate post_alloc_buffer(). A policy may need this to set some metadata for the buffer (but not normal per-object metadata).
get_current_size(): should return the size of the parent object (we use it for linear scan or debugging). get_size_when_copied() (used by mark compact for calculating copy addresses) could return the parent size or the parent+buffer size, depending on whether a binding would merge buffers when copying.
Mark compact: we should see if the proposal works for mark compact, it is a bit special in terms of copying.

caizixian commented 1 year ago

Just to record my notes for the above discussion:

identification: we want to find reachable nodes on the object graph. A node usually has one contiguous piece of memory. but may have disjoint pieces of memory (in the case of Ruby, the main object, and another buffer only referred to by the object using a raw pointer)
reclaimation: cannot reclaim memory used by live objects (each such object may have disjoint pieces of memory)

mmtk / mmtk-core

Supporting disjoint objects #656

Updates

Concepts

Implementation

Allocation

Retaining and copying buffers

Original thoughts

"Naked" objects

Primary and subsidiary ("naked") objects

Difference from vanilla Ruby's buffers

Object graph and subsidiary objects

Opportunity of object merging/splitting during copying