WARNING: This issue contains wild and crazy ideas.

Currently mmtk-core already has Java-style weak reference processors and finaliser processors. In https://github.com/mmtk/mmtk-core/issues/544, we discussed whether we should keep Java semantics. But as we start to support other languages and VMs, it is clear that we need to go beyond what's available in Java.

Update: After discussions, it is clear that this idea is not crazy. Besides the reasons provided below, another reason for supporting reference processing in bindings is that it allows an apples-to-apples comparison between MMTk and the VM's own GC, because both will use the same reference processor.

Task list:

[X] Introduce a language-neutral API for processing references in VM bindings. (https://github.com/mmtk/mmtk-core/pull/700)
[ ] Migrate bindings away from the reference/finalizer processors in mmtk-core.
- [X] Ruby
- [ ] OpenJDK
- [ ] JikesRVM
- [ ] Julia
[ ] Deprecate and remove the reference/finalizer processors in mmtk-core.

Other languages

Java (Yes. Java.)

In addition to java.lang.ref.XxxxReference and things implemented with them (such as WeakHashMap which is implemented with WeakReference), Java also has JNI weak handles which weakly refer to an object, but are not Java objects. The current weak ref processing mechanism cannot handle those weak handles.

Ruby

ObjectSpace::WeakMap and WeakRef: In Ruby, the most basic programmer-visible weak data structure is the ObjectSpace::WeakMap type. It is a weak-key, weak-value hash map. If either the key or the value is dead, the key-value pair is removed from the map. It is used to implement the WeakRef type in the stdlib: the map stores the WeakRef as the key and the referred object as the value. If either the WeakRef or the referred object dies, the association between them is removed. Under the hood, ObjectSpace::WeakMap is implemented by adding finalisers on both the key and the value.

Global internal data structures: Some internal data structures in Ruby have weak reference semantics. Those data structures hold per-object data for live objects, but can be cleaned up when the object dies.

ID: Each Ruby object may have an ID, obtained by obj.object_id. The ID is guaranteed to be unique while the object is alive. Under the hood, the Ruby runtime maintains a global bidirectional ID-to-object and object-to-ID map. When an object is moved, the gc_move function updates the bi-directional map; when an object dies, the finaliser obj_free removes that object from the bidirectional map.
gen_ivtbl: Objects other than T_OBJECT have their instance variables held in an external table, and a global map generic_iv_tbl_ maps each object to its "gen_ivtbl". When an object dies, its associated "gen_ivtbl" is freed.

The "cleaned when object dies" semantics satisfies the definition of "weak reference". Actually, weak references are intended to be used to implement canonicalising mappings, as described in Java's documentation.

V8 and Ephemeron

V8 supports ephemerons. Simply speaking, an ephemeron is a pair:

struct Ephemeron {
    key: WeakReference,
    value: MaybeWeakReference,
}

If the object referred to by the key is alive, the value field behaves like a strong reference; otherwise the value field behaves like a weak reference.

An ephemeron behaves like an entry of java.util.WeakHashMap. If the key dies, the key-value pair is automatically removed from the WeakHashMap. Under the hood, OpenJDK implements this by using WeakReferences to point to the keys. When a key dies, its WeakReference is enqueued, and the WeakHashMap "expunges stale entries" from time to time. This is not as good as a native ephemeron, though, because with native ephemeron support the GC can clear the value field directly.

Why is the current mechanism in MMTk core not enough?

Different data structures

Different languages/VMs have different weak data structures.

Some of them are not heap objects. For example, JNI weak handles are not heap objects, but MMTk core's ReferenceProcessor assumes weak references are heap objects.

Some of them can hold multiple key-value pairs in one complex data structure. For example, in Ruby, the weak tables are hash tables implemented in C. They cannot be simply updated like the way GC updates fields when an object moves. If the hash table uses object address as the key, and the object is moved, then the table entry needs to be re-hashed because the key changed.
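The re-hashing problem can be illustrated with a minimal sketch. It assumes a hypothetical `Addr` type and a `forward` callback that reports an object's new address (or `None` if the object died); neither is part of mmtk-core or Ruby. The point is only that an address-keyed table has to be rebuilt rather than patched slot by slot.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for an object address (an assumption for this sketch).
type Addr = usize;

/// Rebuild an address-keyed table after a moving GC.
///
/// `forward` is a hypothetical callback: it returns the new address of a
/// live object, or `None` if the object is dead. Entries of dead objects
/// are dropped; entries of moved objects are re-inserted under the new
/// address, which re-hashes them.
fn rehash_after_gc<V>(
    table: HashMap<Addr, V>,
    forward: impl Fn(Addr) -> Option<Addr>,
) -> HashMap<Addr, V> {
    table
        .into_iter()
        .filter_map(|(old, value)| forward(old).map(|new| (new, value)))
        .collect()
}
```

Updating the key in place would leave the entry in the bucket chosen for the old address, which is why the sketch builds a fresh table instead.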

Different semantics

The unusual semantics of ephemerons, "when the key dies, the value becomes weak", cannot be expressed with the existing Java-style reference types.

Although both Java's WeakHashMap and Ruby's ObjectSpace::WeakMap emulate ephemeron-like behaviour using reference queues or finalisers, this is not as efficient as supporting ephemerons directly in the GC, because the weak maps still briefly keep the value "alive", and the "expunge stale entries" operations have to be executed at a later time.

Proposed interface

~~Note: this may be crazy~~ Maybe not that crazy. Wenyu is already doing something like this in the lxr branch of the mmtk-openjdk binding.

MMTk core provides a reference processing stage, RefClosure (replacing our current XxxRefClosure stages), during which two functions can be called:

is_alive(ObjectAddress) -> bool: Return whether an object is alive.
- Update: is_reachable should be a better name.
trace_object(ObjectAddress) -> ObjectAddress: Keep the object alive, trace that object, and return its new address (if moved).

And the VMBinding provides one function to be executed by GC worker threads during the new RefClosure phase:

Collection::do_ref_processing(): Do whatever the VM needs to process weak refs. MMTk core may call this multiple times if the VM keeps additional objects alive via trace_object.

MMTk doesn't care what the VM does during do_ref_processing().
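To make the shape of this proposal concrete, here is a rough sketch of the two sides as Rust traits. All names (`RefProcessingContext`, `WeakRefProcessor`, `ToyContext`, `WeakSlots`) are hypothetical and only mirror the prose above; they are not the mmtk-core API. The `bool` return value is also an assumption, used here to signal whether the binding wants another round.

```rust
use std::collections::HashSet;

/// Hypothetical object handle; not the real mmtk-core type.
#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
struct ObjectAddress(usize);

/// What MMTk core would offer to the binding during the RefClosure stage.
trait RefProcessingContext {
    fn is_reachable(&self, object: ObjectAddress) -> bool;
    /// Keep `object` alive and return its (possibly new) address.
    fn trace_object(&mut self, object: ObjectAddress) -> ObjectAddress;
}

/// What the VM binding would implement.
trait WeakRefProcessor {
    /// Returns `true` if the binding wants to be called again (assumption).
    fn do_ref_processing(&mut self, ctx: &mut dyn RefProcessingContext) -> bool;
}

/// Toy non-moving context backed by a set of reached objects.
struct ToyContext {
    reached: HashSet<ObjectAddress>,
}

impl RefProcessingContext for ToyContext {
    fn is_reachable(&self, object: ObjectAddress) -> bool {
        self.reached.contains(&object)
    }
    fn trace_object(&mut self, object: ObjectAddress) -> ObjectAddress {
        self.reached.insert(object);
        object // a moving GC would return the forwarded address here
    }
}

/// A binding with plain weak slots: update live referents, clear dead ones.
struct WeakSlots {
    slots: Vec<Option<ObjectAddress>>,
}

impl WeakRefProcessor for WeakSlots {
    fn do_ref_processing(&mut self, ctx: &mut dyn RefProcessingContext) -> bool {
        for slot in self.slots.iter_mut() {
            *slot = match *slot {
                Some(obj) if ctx.is_reachable(obj) => Some(ctx.trace_object(obj)),
                _ => None,
            };
        }
        false // nothing new was kept alive, so no further round is needed
    }
}
```

In this toy version the slots never keep anything new alive, so a single call is enough; bindings with finalisers or ephemerons would return `true` until they reach a fixed point.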

How to implement Java-style references

The VM binding maintains its own lists of "candidate" and "finalizable" objects. During do_ref_processing, the VM binding inspects each candidate.

fn do_ref_processing() {
    for obj in openjdk::soft_weak_phantom {
        if mmtk::is_alive(obj) {
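            let dst = obj.ref_field;
            if mmtk::is_alive(dst) {
                // The referent is reachable: store its (possibly forwarded) address.
                obj.ref_field = mmtk::trace_object(dst);
            } else if openjdk::is_soft_reference(obj) && !mmtk::is_emergency_collection() {
                // Soft references retain their referents unless this is an emergency GC.
                obj.ref_field = mmtk::trace_object(dst);
            } else {
                obj.ref_field = Address::NULL;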

                if openjdk::has_queue(obj) {
                    openjdk::enqueue(obj);
                }
            }
        }
    }
    for obj in openjdk::finalize_candidates {
        if !mmtk::is_alive(obj) {
            // Resurrect the object so that it survives until its finalizer has run.
            openjdk::finalizable_objects.push(mmtk::trace_object(obj));
        }
    }
}

How to implement Ephemeron

fn do_ref_processing() {
    for obj in v8::ephemerons {
        if mmtk::is_alive(obj.key) {
            // The key is reachable, so the value is held strongly; store its new address.
            obj.value = mmtk::trace_object(obj.value);
        }
    }
}
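One subtlety of the loop above is that tracing one ephemeron's value can make another ephemeron's key reachable, so a single pass is not always enough. The following self-contained toy model iterates to a fixed point and then clears the ephemerons whose keys stayed dead. `Obj`, `Ephemeron`, `trace` and `process_ephemerons` are hypothetical names for this sketch only, and the "heap" is just a map of strong edges.

```rust
use std::collections::{HashMap, HashSet};

/// Hypothetical object identity for this toy model.
type Obj = u32;

struct Ephemeron {
    key: Obj,
    value: Option<Obj>,
}

/// Mark `root` and everything reachable from it through strong `edges`.
fn trace(root: Obj, edges: &HashMap<Obj, Vec<Obj>>, live: &mut HashSet<Obj>) {
    let mut stack = vec![root];
    while let Some(obj) = stack.pop() {
        if live.insert(obj) {
            if let Some(children) = edges.get(&obj) {
                stack.extend(children.iter().copied());
            }
        }
    }
}

/// Repeat the ephemeron pass until no new object becomes live, then clear
/// the value of every ephemeron whose key is still dead.
fn process_ephemerons(
    ephemerons: &mut [Ephemeron],
    edges: &HashMap<Obj, Vec<Obj>>,
    live: &mut HashSet<Obj>,
) {
    loop {
        let before = live.len();
        for e in ephemerons.iter() {
            if let Some(value) = e.value {
                if live.contains(&e.key) {
                    trace(value, edges, live);
                }
            }
        }
        if live.len() == before {
            break; // fixed point: this round kept nothing new alive
        }
    }
    for e in ephemerons.iter_mut() {
        if !live.contains(&e.key) {
            e.value = None; // a real table would drop the whole entry
        }
    }
}
```

In the proposed interface, the outer `loop` would correspond to MMTk core calling do_ref_processing again after each transitive closure, rather than a loop inside the binding.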

How to implement global maps in Ruby

fn do_ref_processing() {
    for entry in ruby::obj_id_map {
        if !mmtk::is_alive(entry.obj) {
            ruby::obj_id_map.remove_entry(entry);
        }
    }
    for entry in ruby::gen_ivtbl_map {
        if !mmtk::is_alive(entry.obj) {
            ruby::gen_ivtbl_map.remove_entry(entry);
        }
    }
}
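The pseudocode above removes entries while iterating, which a real table usually cannot do directly. As a hedged sketch of the same idea in compilable Rust, `HashMap::retain` can filter the table in place. `Addr` and `sweep_dead_entries` are hypothetical names, and `is_alive` stands in for whatever liveness query MMTk core would provide.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for an object address (an assumption for this sketch).
type Addr = usize;

/// Remove every entry whose key object is dead, and return how many
/// entries were removed. `is_alive` is a hypothetical liveness query.
fn sweep_dead_entries<V>(
    map: &mut HashMap<Addr, V>,
    is_alive: impl Fn(Addr) -> bool,
) -> usize {
    let before = map.len();
    map.retain(|&obj, _| is_alive(obj));
    before - map.len()
}
```

The same helper could serve both the object-ID map and the gen_ivtbl map, since neither needs to keep anything alive. With a moving GC the surviving keys would additionally need to be re-hashed under their new addresses.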

Problems

Q: Can this be parallelised?
- A: MMTk can provide a callback so that do_ref_processing can create sub-tasks, while mmtk-core creates multiple work packets under the hood.
Q: How to support multiple strength levels (soft, weak, finalizer, phantom, ...)?
- A: MMTk core can call do_ref_processing multiple times, passing an integer parameter that indicates how many times MMTk has computed the transitive closure. It is up to the VM binding to interpret the integer, for example, when n = 1, handle soft references; when n = 2, handle weak references, ...
- Update: The new "sentinel" mechanism (introduced in https://github.com/mmtk/mmtk-core/pull/700) allows the GC to expand the transitive closure multiple times, and to call process_weak_refs each time a transitive closure computation finishes. The VM binding can implement a state machine to handle a different strength each time.
Q: This looks very unsafe. The VM can basically do anything here.
- A: It is just a matter of whether MMTk core or VM can do it better.
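The state machine mentioned in the answer about strength levels could look roughly like the sketch below. `Phase` and `RefStateMachine` are hypothetical, and the `bool` return value of `process_weak_refs` (meaning "compute the transitive closure again and call me back") is an assumption of this sketch rather than a statement about the actual signature. The `log` field only records which strengths were handled, in place of real reference processing.

```rust
/// Hypothetical strength levels, handled one per round in this order.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum Phase {
    Soft,
    Weak,
    Final,
    Phantom,
    Done,
}

struct RefStateMachine {
    phase: Phase,
    /// Records which strengths were processed, standing in for real work.
    log: Vec<Phase>,
}

impl RefStateMachine {
    fn new() -> Self {
        RefStateMachine { phase: Phase::Soft, log: Vec::new() }
    }

    /// Called each time a transitive closure finishes. Handles one strength
    /// level, then returns `true` if another round is needed (assumption).
    fn process_weak_refs(&mut self) -> bool {
        let current = self.phase;
        if current != Phase::Done {
            self.log.push(current); // real code would process references here
        }
        self.phase = match current {
            Phase::Soft => Phase::Weak,
            Phase::Weak => Phase::Final,
            Phase::Final => Phase::Phantom,
            Phase::Phantom | Phase::Done => Phase::Done,
        };
        self.phase != Phase::Done
    }
}
```

Keeping the ordering inside the binding means MMTk core does not need to know how many strength levels a language has.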

Update

Wenyu is already doing something similar in the lxr branch of mmtk-openjdk. https://github.com/wenyuzhao/mmtk-openjdk/blob/lxr/mmtk/src/reference_glue.rs#L243-L289

However, I think work packets (GCWork and the buckets) are an implementation detail of mmtk-core and shouldn't be exposed to the VM binding (I am still open to objections for now). In my proposed API, trace_object can be provided as a callback closure that encapsulates the logic related to work packets, and the VM binding only specifies which objects need to be kept alive.

Update: In https://github.com/mmtk/mmtk-core/pull/700, we encapsulated trace_object behind the ObjectTracer trait (which already exists to support object-enqueuing tracing), and the new ObjectTracerContext trait encapsulates the creation and flushing of ProcessEdgesWork.

Generalising weak reference processing and finalisation #694