Closed wks closed 7 months ago
I think we had some discussion about soft reference in https://github.com/mmtk/mmtk-core/pull/564#issuecomment-1097453686 and https://github.com/mmtk/mmtk-core/pull/564#pullrequestreview-940369329. At that time, we concluded it was fine to not transitively trace soft references.
So the actual issue is that we add more soft references to the soft reference list. They are not scanned, and not retained, but we attempt to forward those references. Is this understanding correct?
I think we had some discussion about soft reference in #564 (comment) and #564 (review). At that time, we concluded it was fine to not transitively trace soft references.
Yes. The Java API allows different policies for retaining SoftReference
to be implemented. It's OK not to retain transitively reachable SoftReference
.
So the actual issue is that we add more soft references to the soft reference list. They are not scanned, and not retained, but we attempt to forward those references. Is this understanding correct?
Exactly. They should either be retained or cleared, but it's currently neither. And we attempted to forward the referent (The SoftReference
itself is safe to forward because it is marked as an ordinary object, but not scanned, yet.) which is not marked.
The easiest fix should be clearing the newly added SoftReference
instances. There is a comment in the reference processor:
fn retain<E: ProcessEdgesWork>(&self, trace: &mut E, _nursery: bool) {
// ...
for reference in sync.references.iter() {
// ...
if !reference.is_live::<E::VM>() {
// Reference is currently unreachable but may get reachable by the
// following trace. We postpone the decision.
continue;
}
// ...
}
// ...
}
We postponed the decision, but we didn't make decision at the right time. See:
pub fn scan_soft_refs<E: ProcessEdgesWork>(&self, trace: &mut E, mmtk: &'static MMTK<E::VM>) {
if !mmtk.state.is_emergency_collection() {
self.soft.retain::<E>(trace, is_nursery_gc(mmtk.get_plan()));
}
self.soft.scan::<E>(trace, is_nursery_gc(mmtk.get_plan())); // ERROR: It's too early!!!!!!!!!!!
}
The call to self.soft.scan
happens too early. By retaining soft references, we extended the transitive closure of live objects. This means two things: (1) More live objects will be discovered, and more importantly, (2) More live SoftReference
(and other forms of references) will be discovered. We should finish the transitive closure before calling self.soft.scan
.
One way to implement it is making use of the "bucket sentinel" feature I introduced for implementing VM-side reference processing. If we choose to retain soft references, we should postpone self.soft.scan
to a "sentinel packet" at the end of the SoftRefClosure
bucket.
BTW, I think the bucket names WeakRefClosure
and PhantomRefClosure
are misnomers. Weak references and phantom references never expand the transitive closure. (On the contrary, soft references and finalizers do.) Weak and phantom references should be either forwarded or cleared, but never enqueue any objects.
A related bug is that finalizable objects can also expand the transitive closure when resurrected. If a finalizable object points to a WeakReference
(or other kinds of references), the reference will not be forwarded or cleared, either. The following program is guaranteed to crash MarkCompact
and SemiSpace
, too:
import java.lang.ref.*;
class Bar {
}
class Foo {
public static Foo resurrectedFoo = null;
public WeakReference<Bar> weak;
public Foo() {
this.weak = new WeakReference<>(new Bar());
}
@Override
public void finalize() {
System.out.println("I am live again!");
resurrectedFoo = this;
}
}
public class FinalAndWeak {
public static void mkObject() {
Foo foo = new Foo();
}
public static void main(String[] args) throws InterruptedException {
System.out.println("Making object...");
mkObject();
System.out.println("Triggering GC...");
System.gc();
while (Foo.resurrectedFoo == null) {
System.out.println("Waiting for Foo.finalize()...");
Thread.sleep(100);
}
Foo theFoo = Foo.resurrectedFoo;
System.out.println(theFoo.weak.get());
}
}
To fix this problem, we need to re-scan soft and weak references after resurrecting finalizable objects.
By re-scanning soft and weak references after resurrecting finalizable objects, the crash disappeared.
However, something is still inconsistent. For example:
import java.lang.ref.*;
class Bar {
int num;
public Bar(int num) {
this.num = num;
}
@Override
public String toString() {
return String.format("Bar(%d)", this.num);
}
}
class Foo {
public static Foo resurrectedFoo = null;
public WeakReference<Bar> weak1;
public WeakReference<Bar> weak2;
public Bar bar2;
public Foo(Bar bar1, Bar bar2) {
this.weak1 = new WeakReference<>(bar1);
this.weak2 = new WeakReference<>(bar2);
this.bar2 = bar2;
}
@Override
public void finalize() {
System.out.println("I am live again!");
resurrectedFoo = this;
}
}
public class FinalAndWeak {
public static WeakReference<Bar> mkObject() {
Bar bar1 = new Bar(1);
Bar bar2 = new Bar(2);
Foo foo = new Foo(bar1, bar2);
WeakReference<Bar> ref2 = new WeakReference<>(bar2);
return ref2;
}
public static void main(String[] args) throws InterruptedException {
System.out.println("Making object...");
WeakReference<Bar> ref2 = mkObject();
System.out.println("Triggering GC...");
System.gc();
while (Foo.resurrectedFoo == null) {
System.out.println("Waiting for Foo.finalize()...");
Thread.sleep(100);
}
Foo theFoo = Foo.resurrectedFoo;
System.out.format("theFoo.weak1.get() == %s%n", theFoo.weak1.get());
System.out.format("theFoo.weak2.get() == %s%n", theFoo.weak2.get());
System.out.format("theFoo.bar2 == %s%n", theFoo.bar2);
System.out.format("ref2.get() == %s%n", ref2.get());
}
}
There is a weak reference from the root to Bar(2)
, and there is also a strong and a weak reference from Foo
to Bar(2)
. However, after the GC, the WeakReference<Bar> ref2
in main
is cleared, but the WeakReference<Bar> weak2
in Foo
is not. Here is the result when running mmtk-openjdk after my fix:
[2024-04-20T12:18:10Z INFO mmtk::scheduler::scheduler] End of GC (672/51200 pages, took 14 ms)
I am live again!
Waiting for Foo.finalize()...
theFoo.weak1.get() == null
theFoo.weak2.get() == Bar(2)
theFoo.bar2 == Bar(2)
ref2.get() == null
Strictly speaking, this is not compliant with the Java API. The Java API demands that (link)
Suppose that the garbage collector determines at a certain point in time that an object is weakly reachable. At that time it will atomically clear all weak references to that object and all weak references to any other weakly-reachable objects from which that object is reachable through a chain of strong and soft references. ...
The result clearly shows that it didn't clear all weak references.
However, the official OpenJDK behaves this way, too. Here is the result from vanilla OpenJDK 22:
Triggering GC...
Waiting for Foo.finalize()...
I am live again!
theFoo.weak1.get() == Bar(1)
theFoo.weak2.get() == Bar(2)
theFoo.bar2 == Bar(2)
ref2.get() == null
I won't worry too much about this detail because the vanilla OpenJDK is not compliant, either.
After retaining and scanning
SoftReference
, it generatesProcessEdgesWork
work packets to trace the children of retainedSoftReference
instances. During this time, moreSoftReference
instances will be discovered, but they will not be traced. That will cause those transitively discoveredSoftReference
to contain un-updated reference to the children in the from-space, or crash MarkCompact due to those object not having forwarding pointers.This bug was originally discovered when running CI tests for the PR removing
ObjectReference::NULL
when runningfop
in DaCapo 2006 using MarkCompact. See: https://github.com/mmtk/mmtk-openjdk/actions/runs/8750247972/job/24016142656?pr=265The program
How to crash SemiSpace?
Run it with
MMTK_NO_REFERENCE_TYPE=true
, preferrably with stack trace and logs enabled.It will print:
How to crash MarkCompact
We first patch mmtk-core so that
MarkCompactSpace::trace_forward_object
asserts the object has forwarding pointer. If an object is reachable during theSecondRoot
stage or theRefForwarding
stage, it must be live, and its forwarding pointer must have been computed during theCalculateForwarding
stage.Then run the same program with the same env vars except using MarkCompact. It will crash with the following log:
What went wrong?
This portion of the log reveals the cause:
A new soft candidate is discovered after processing all soft references. The OpenJDK binding discovers soft/weak/phantom references gradually during tracing. The new soft candidate 0x4066bf28 was added. Itself was traced. But when it was scanned, it added itself to the reference processor using
add_soft_candidate
. However, it never gets processed, and its referent is not traced, either.In SemiSpace, the referent pointer still pointed to the from-space. During the
Release
stage, it tried to find references to enqueue, and the debug assertion discovered that some SoftReference still pointed to the from space.In MarkCompact, things were a bit complicated. During
CalculateForwarding
, the GC computeed the forwarded addresses of all live objects. However, since the referent of the newly discovered SoftReference was not marked, it was considered dead, and was not assigned a forwarding pointer. However, duringSecondRoot
, the GC computed another transitive closure and found the SoftReference to be live. It then attempted to forward the referent (which is dead), and found it did not have a forwarding pointer. The error would have silently slipped away without checking, but after I removedObjectReference::NULL
, it started to require an explicit NULL check against the forwarding pointer, and caught the bug.