mmtk / mmtk-core

Memory Management ToolKit
https://www.mmtk.io

Generalising weak reference processing and finalisation #694

Open wks opened 1 year ago

wks commented 1 year ago

WARNING: This issue contains wild and crazy ideas.

Currently mmtk-core already has Java-style weak reference processors and finaliser processors. In https://github.com/mmtk/mmtk-core/issues/544, we discussed whether we should keep Java semantics. But as we start to support other languages and VMs, it is clear that we need to go beyond what's available in Java.

Update: After discussions, it is clear that this idea is not crazy. Besides the reasons given below, another reason for supporting reference processing in bindings is that it allows an apples-to-apples comparison between MMTk and the VM's own GC, because both would use the same reference processor.

Task list:

Other languages

Java (Yes. Java.)

In addition to java.lang.ref.XxxxReference and things implemented with them (such as WeakHashMap which is implemented with WeakReference), Java also has JNI weak handles which weakly refer to an object, but are not Java objects. The current weak ref processing mechanism cannot handle those weak handles.

Ruby

ObjectSpace::WeakMap and WeakRef: In Ruby, the most basic programmer-visible weak data structure is the ObjectSpace::WeakMap type. It is a hash map with weak keys and weak values: if either the key or the value is dead, the key-value pair is removed from the map. It is used to implement the WeakRef type in the stdlib, storing the WeakRef as the key and the referred object as the value. If either the WeakRef or the referred object dies, the association between them is removed. Under the hood, ObjectSpace::WeakMap is implemented by adding finalisers to both the key and the value.

Global internal data structures: Some internal data structures in Ruby have weak reference semantics. Those data structures hold per-object data for live objects, and an entry can be cleaned up once its object dies.

The "cleaned when object dies" semantics satisfies the definition of "weak reference". Actually, weak references are intended to be used to implement canonicalising mappings, as described in Java's documentation.

V8 and Ephemeron

V8 supports ephemerons. Simply speaking, an ephemeron is a pair:

struct Ephemeron {
    key: WeakReference,
    value: MaybeWeakReference,
}

If the object referred to by the key is alive, the value field behaves like a strong reference; otherwise the value field behaves like a weak reference.

Ephemeron behaves like java.util.WeakHashMap entries. If the key dies, the key-value pair is automatically removed from the WeakHashMap. Under the hood, OpenJDK implements it by using WeakReferences to point to the key. When the key dies, the WeakReference is enqueued, and the WeakHashMap "expunges stale entries" from time to time. It is not as good as Ephemeron, though, because with native Ephemeron support, the GC can clear the value field directly.

Why is the current mechanism in MMTk core not enough?

Different data structures

Different languages/VMs have different weak data structures.

Some of them are not heap objects. For example, JNI weak handles are not heap objects, but MMTk core's ReferenceProcessor assumes weak references are heap objects.

Some of them can hold multiple key-value pairs in one complex data structure. For example, in Ruby, the weak tables are hash tables implemented in C. They cannot simply be updated the way the GC updates fields when an object moves. If the hash table uses the object address as the key and the object is moved, the table entry needs to be re-hashed because the key has changed.
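The re-hashing problem can be illustrated with a small, self-contained sketch. It assumes a toy table keyed by raw addresses and a hypothetical `fate_of` query that reports, for each old address, whether the object died or where it now lives. Neither `Fate` nor `rehash_weak_table` is an mmtk-core or Ruby API; they are names made up for this example.

```rust
use std::collections::HashMap;

/// Hypothetical outcome of a GC for one object, looked up by its old address.
#[derive(Clone, Copy, Debug, PartialEq)]
enum Fate {
    /// The object survived and now lives at this address (possibly unchanged).
    LiveAt(usize),
    /// The object was not reached; its entry must be dropped.
    Dead,
}

/// Rebuild an address-keyed weak table after a collection.
/// `fate_of` is an assumed stand-in for the GC's liveness/forwarding query.
fn rehash_weak_table<V>(
    table: HashMap<usize, V>,
    fate_of: impl Fn(usize) -> Fate,
) -> HashMap<usize, V> {
    let mut rebuilt = HashMap::with_capacity(table.len());
    for (old_addr, value) in table {
        match fate_of(old_addr) {
            // Re-insert under the new address so the entry lands in the right bucket.
            Fate::LiveAt(new_addr) => {
                rebuilt.insert(new_addr, value);
            }
            // Dead key: the entry is simply not carried over.
            Fate::Dead => {}
        }
    }
    rebuilt
}
```

Updating the key in place would leave the entry in the bucket chosen for the old address, which is why the sketch re-inserts every surviving entry instead.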

Different semantics

The ephemeron's unusual semantics ("when the key dies, the value becomes weak") cannot be expressed with the existing Java-style reference types.

Both Java's WeakHashMap and Ruby's ObjectSpace::WeakMap emulate ephemeron-like behaviour on top of other mechanisms (weak references with reference queues in OpenJDK, finalisers in Ruby). This is not as efficient as supporting ephemerons directly in the GC, because the weak map still keeps the value alive for a while, until the "expunge stale entries" operation runs at a later time.

Proposed interface

Note: I first thought this might be crazy, but maybe it is not that crazy. Wenyu is already doing something like this in the lxr branch of the mmtk-openjdk binding.

MMTk core provides a single reference processing stage, RefClosure (replacing our current XxxRefClosure phases), during which two functions can be called. In the examples below they appear as mmtk::is_alive and mmtk::trace_object.

The VMBinding, in turn, provides one function to be executed by GC worker threads during the new RefClosure phase: do_ref_processing().

MMTk doesn't care what the VM does during do_ref_processing().
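To make the shape of this interface concrete, here is a minimal sketch of the two sides as Rust traits, with a toy non-moving "core" and a toy binding. All names and signatures (`RefClosureContext`, `RefProcessing`, `ToyCore`, `ToyBinding`, `Obj`) are assumptions made for illustration; they are not the API that mmtk-core ended up with.

```rust
use std::collections::HashSet;

/// Hypothetical object handle; a real binding would use an object reference type.
type Obj = u32;

/// What the core offers to the binding during the RefClosure stage
/// (names and signatures are assumptions for this sketch).
trait RefClosureContext {
    /// Has `obj` been reached by the GC so far?
    fn is_alive(&self, obj: Obj) -> bool;
    /// Keep `obj` alive and return its (possibly new) handle.
    fn trace_object(&mut self, obj: Obj) -> Obj;
}

/// What the binding implements; the core calls it during the RefClosure stage.
trait RefProcessing {
    fn do_ref_processing(&mut self, ctx: &mut dyn RefClosureContext);
}

/// Toy, non-moving "core": liveness is membership in a mark set.
struct ToyCore {
    marked: HashSet<Obj>,
}

impl RefClosureContext for ToyCore {
    fn is_alive(&self, obj: Obj) -> bool {
        self.marked.contains(&obj)
    }
    fn trace_object(&mut self, obj: Obj) -> Obj {
        self.marked.insert(obj);
        obj // nothing moves in this toy model
    }
}

/// Toy binding: weak slots that are cleared when their referent dies.
struct ToyBinding {
    weak_slots: Vec<Option<Obj>>,
}

impl RefProcessing for ToyBinding {
    fn do_ref_processing(&mut self, ctx: &mut dyn RefClosureContext) {
        for slot in self.weak_slots.iter_mut() {
            if let Some(obj) = *slot {
                // Live referents are re-traced so the slot picks up any new handle;
                // dead referents are cleared.
                *slot = if ctx.is_alive(obj) {
                    Some(ctx.trace_object(obj))
                } else {
                    None
                };
            }
        }
    }
}
```

The point of the split is that the core never needs to know what a "weak slot" is; it only answers liveness queries and accepts objects to retain.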

How to implement Java-style references

The VM binding maintains its own lists of candidate references, finalization candidates and finalizable objects. During do_ref_processing, the VM binding inspects each candidate.

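fn do_ref_processing() {
    for obj in openjdk::soft_weak_phantom {
        // Dead reference objects need no processing.
        if mmtk::is_alive(obj) {
            let dst = obj.ref_field;
            if mmtk::is_alive(dst) {
                // Referent is reachable. Store back its address in case it moved
                // (assuming trace_object returns the forwarded address).
                obj.ref_field = mmtk::trace_object(dst);
            } else if openjdk::is_soft_reference(obj) && !mmtk::is_emergency_collection() {
                // Soft referents are retained unless this is an emergency collection.
                obj.ref_field = mmtk::trace_object(dst);
            } else {
                // Referent is dead: clear the reference and enqueue it if it has a queue.
                obj.ref_field = Address::NULL;

                if openjdk::has_queue(obj) {
                    openjdk::enqueue(obj);
                }
            }
        }
    }
    for obj in openjdk::finalize_candidates {
        if !mmtk::is_alive(obj) {
            // Keep the object alive until its finalizer has run.
            openjdk::finalizable_objects.push(mmtk::trace_object(obj));
        }
    }
}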
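The retain-or-clear decision in the pseudocode above can be exercised in isolation. The following is a simplified, runnable model, not OpenJDK or mmtk-core code: objects are plain integers without fields, liveness is a mark set, nothing moves, and every reference object in the list is assumed to be live. `Strength`, `Reference` and `process_references` are hypothetical names.

```rust
use std::collections::HashSet;

/// Hypothetical object handle.
type Obj = u32;

#[derive(Clone, Copy, Debug, PartialEq)]
enum Strength {
    Soft,
    Weak,
}

/// A toy reference object: its kind, its referent, and whether it was enqueued.
#[derive(Debug, PartialEq)]
struct Reference {
    strength: Strength,
    referent: Option<Obj>,
    enqueued: bool,
}

/// Apply the retain-or-clear decision to each (assumed live) reference object.
fn process_references(refs: &mut [Reference], marked: &mut HashSet<Obj>, emergency: bool) {
    for r in refs.iter_mut() {
        if let Some(referent) = r.referent {
            let retain_soft = r.strength == Strength::Soft && !emergency;
            if marked.contains(&referent) {
                // Referent is reachable by other means: leave the reference intact.
            } else if retain_soft {
                // Soft referents survive non-emergency collections.
                marked.insert(referent);
            } else {
                // Referent is dead: clear the reference and enqueue it.
                r.referent = None;
                r.enqueued = true;
            }
        }
    }
}
```

In a real collector, retaining a soft referent also means tracing everything reachable from it; the toy model skips that because its objects have no fields.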

How to implement Ephemeron

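fn do_ref_processing() {
    for obj in v8::ephemerons {
        // A live key makes the value strongly reachable.
        if mmtk::is_alive(obj.key) {
            mmtk::trace_object(obj.value);
        }
        // Otherwise the value stays weak. Tracing a value may make other keys
        // live, so a full implementation repeats this loop until nothing new is traced.
    }
}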
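One subtlety of ephemeron processing is that tracing a value can make the key of another ephemeron live, so a single pass is not always enough. The sketch below shows the usual fixpoint iteration on a toy model in which objects are integers without fields and liveness is a mark set. `Ephemeron` and `process_ephemerons` are hypothetical names for this example and do not reflect V8's or MMTk's actual data structures.

```rust
use std::collections::HashSet;

/// Hypothetical object handle.
type Obj = u32;

/// A toy ephemeron: the value is only kept alive through a live key.
struct Ephemeron {
    key: Obj,
    value: Obj,
}

/// Mark values of ephemerons with live keys until nothing changes,
/// then drop the ephemerons whose keys are still dead.
fn process_ephemerons(ephemerons: &mut Vec<Ephemeron>, marked: &mut HashSet<Obj>) {
    loop {
        let mut progress = false;
        for e in ephemerons.iter() {
            // `insert` returns true only if the value was not marked before.
            if marked.contains(&e.key) && marked.insert(e.value) {
                progress = true;
            }
        }
        if !progress {
            break; // fixpoint reached: no key became live in this pass
        }
    }
    // Entries with dead keys are removed; their values were never marked here.
    ephemerons.retain(|e| marked.contains(&e.key));
}
```

With the entries (2 -> 3) and (1 -> 2) and only object 1 initially marked, the first pass marks 2 and the second pass marks 3, which a single pass over the list in that order would miss.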

How to implement global maps in Ruby

fn do_ref_processing() {
    for entry in ruby::obj_id_map {
        if !mmtk::is_alive(entry.obj) {
            ruby::obj_id_map.remove_entry(entry);
        }
    }
    for entry in ruby::gen_ivtbl_map {
        if !mmtk::is_alive(entry.obj) {
            ruby::gen_ivtbl_map.remove_entry(entry);
        }
    }
}

Problems

Update

Wenyu is already doing something similar in the lxr branch of mmtk-openjdk. https://github.com/wenyuzhao/mmtk-openjdk/blob/lxr/mmtk/src/reference_glue.rs#L243-L289

However, I think work packets (GCWork and the buckets) are an implementation detail of mmtk-core and shouldn't be exposed to the VM binding (I am still open to objections for now). In my proposed API, trace_object can be provided as a callback closure that encapsulates the logic related to work packets, so the VMBinding only specifies which objects need to be kept alive.
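As a rough illustration of how a callback can hide work packets from the binding, the sketch below batches traced objects into fixed-size "packets" behind a closure. `PacketQueue` and `with_tracer` are hypothetical names; real mmtk-core work packets are considerably more involved, and this model only shows the encapsulation idea. It assumes a capacity of at least 1.

```rust
/// Hypothetical object handle.
type Obj = u32;

/// Stand-in for the core's internal work queue: each inner Vec is one "packet".
struct PacketQueue {
    packets: Vec<Vec<Obj>>,
    capacity: usize, // assumed to be at least 1
}

impl PacketQueue {
    /// Run `body` with a `trace_object` callback, then flush whatever is left.
    /// The caller never sees packets, only the callback.
    fn with_tracer(&mut self, body: impl FnOnce(&mut dyn FnMut(Obj))) {
        let mut buffer: Vec<Obj> = Vec::with_capacity(self.capacity);
        {
            let mut trace_object = |obj: Obj| {
                buffer.push(obj);
                // A full buffer becomes a packet; `take` leaves an empty Vec behind.
                if buffer.len() == self.capacity {
                    self.packets.push(std::mem::take(&mut buffer));
                }
            };
            body(&mut trace_object);
        }
        // Flush the partially filled buffer as a final packet.
        if !buffer.is_empty() {
            self.packets.push(buffer);
        }
    }
}
```

A binding-side caller would look like `queue.with_tracer(|trace_object| { for obj in 1..=5 { trace_object(obj); } })`, which only states which objects to retain and leaves batching and flushing to the core.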

Update: In https://github.com/mmtk/mmtk-core/pull/700, we encapsulated trace_object behind the ObjectTracer trait (which already existed to support object-enqueuing tracing). The new ObjectTracerContext trait encapsulates the creation and flushing of ProcessEdgesWork.

wks commented 1 year ago

The new interface is introduced in this PR: https://github.com/mmtk/mmtk-core/pull/700

The Ruby binding is now able to process obj_ref and finalizers using this API since the following commits:

I have an experimental branch of mmtk-openjdk (https://github.com/wks/mmtk-openjdk/tree/gen-weakref-api). It copies the reference and finalizer processors from mmtk-core into mmtk-openjdk. Benchmarks (see https://github.com/mmtk/mmtk-core/pull/700) show that reference processing implemented in the binding can reach the same performance as what we currently have. There is still room for improvement, given the limitations of our current implementation (such as its use of a mutex).

Going forward, we still need to deprecate the Java-style API in mmtk-core and reimplement reference processing for OpenJDK in a way that is native to OpenJDK. Other VM bindings that use the Java-style API should migrate to the new language-neutral API.