Open wks opened 11 months ago
One thing that concerns me is that most of the 'axioms' are deduced from the current code. As the code may not be correct, the conclusions drawn from the code may not be correct. We should define the semantics of ObjectReference, and then make our code consistent with the semantics, rather than the other way around.
I will discuss them one by one.
It must be able to instantiate ObjectReference from Address
Did you mean ObjectReference should be instantiated solely from Address? If so, this may not be true. A binding could use any sort of information (such as a global table) besides the Address to construct an ObjectReference, especially when it is already doing so for its own GC implementation. It may not be a performant implementation. But as long as it does not make MMTk slower, it should be a valid way to implement ObjectReference.
It must be efficient to get the start address of the object from the ObjectReference.
We do not require ref_to_object_start to be efficient. This is the reason why we have two different methods, ref_to_address and ref_to_object_start. We require the former to be efficient, while we are much more relaxed on the latter.
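To make the distinction concrete, here is a minimal sketch of the two conversions. The `Address` and `ObjectReference` types are simplified stand-ins, not the real mmtk-core definitions, and the object layout (a 16-byte header, optionally preceded by one extra word) is purely an assumption for illustration.

```rust
/// Simplified stand-ins for illustration; not the real mmtk-core types.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct Address(pub usize);

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct ObjectReference(pub usize);

/// Hypothetical layout: the reference points just past a fixed-size header,
/// and some objects have one extra word in front of that header.
pub const HEADER_BYTES: usize = 16;
pub const EXTRA_WORD_BYTES: usize = 8;

/// The cheap conversion: no memory access, just reinterpret the reference
/// as an address that lies inside the object.
pub fn ref_to_address(object: ObjectReference) -> Address {
    Address(object.0)
}

/// The conversion that is allowed to be slower: it may need per-object
/// state. The `has_extra_word` closure stands in for a header load.
pub fn ref_to_object_start(
    object: ObjectReference,
    has_extra_word: impl Fn(ObjectReference) -> bool,
) -> Address {
    let base = object.0 - HEADER_BYTES;
    if has_extra_word(object) {
        Address(base - EXTRA_WORD_BYTES)
    } else {
        Address(base)
    }
}
```

In this sketch the hot path only ever needs `ref_to_address`, while `ref_to_object_start` can afford to consult the object.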
It must be efficient to do equality test for ObjectReference. It must be hashable
These are all based on the current implementation. The discussion already mentioned alternatives that avoid equality comparison. I think the actual question we need to answer is whether we allow different object references to refer to the same object.
One thing that concerns me is that most of the 'axioms' are deduced from the current code. As the code may not be correct, the conclusions drawn from the code may not be correct. We should define the semantics of ObjectReference, and then make our code consistent with the semantics, rather than the other way around.
Yes. They are deduced from the current code, for good reasons. MMTk implements the GC, and MMTk needs a concept that satisfies the axioms I listed. For example, when copying an object, the object will have a from-space copy and a to-space copy, and they are different when seen from the GC's point of view. Currently, we say that they have two different ObjectReferences, one referring to the from-space copy, and the other referring to the to-space copy. After copying, the reference needs to be forwarded. We tell the VM that the object reference needs to be updated from the old ObjectReference to the new ObjectReference (in OpenJDK, we store the new ObjectReference into the slot).
However, people may argue that from a Java programmer's point of view, an object reference doesn't change when the GC moves objects, and in fact Java programmers are oblivious to the fact that the GC may move objects (and Java doesn't support object pinning to reveal object addresses). Java programmers may argue that the unique identity of an object is an "object reference", and it never changes as long as the object is live. Some VMs, such as Ruby, do give objects unique IDs, and Java programmers can implement such unique IDs using HashMap. However, such IDs are unsuitable to be used as the ObjectReference type in MMTk because MMTk needs a way to express that "the object moved from A to B", and needs a way to distinguish between the from-space copy and the to-space copy. This means ObjectReference cannot be 100% opaque to MMTk, but must satisfy some axioms.
Well, we may think that the thing that distinguishes the from-space copy and the to-space copy should not be called an "object reference" (yet we have called it "object reference" since the inception of MMTk (JMTk)). We may argue that "we are using 'object reference' the wrong way when we are doing copying GC and reference forwarding" so that we can change the definition of ObjectReference and use a different term for object movement. That may be worth some discussion, and that will be a very fundamental change to MMTk, but I don't object to it.
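The forwarding step described above can be sketched as follows. This is a toy model under stated assumptions, not mmtk-core's implementation: `ToyCopySpace`, its forwarding table, and `forward_slot` are hypothetical names, and no object bytes are actually copied.

```rust
use std::collections::HashMap;

/// Simplified stand-in; not the real mmtk-core type.
#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
pub struct ObjectReference(pub usize);

/// Toy copying space: a forwarding table plus a bump cursor into to-space.
pub struct ToyCopySpace {
    forwarding: HashMap<ObjectReference, ObjectReference>,
    to_space_cursor: usize,
}

impl ToyCopySpace {
    pub fn new(to_space_start: usize) -> Self {
        Self { forwarding: HashMap::new(), to_space_cursor: to_space_start }
    }

    /// Returns the reference of the to-space copy, "copying" on first visit.
    /// Only the bookkeeping is modelled; object contents are not moved.
    pub fn trace_object(&mut self, object: ObjectReference, size: usize) -> ObjectReference {
        if let Some(&new_object) = self.forwarding.get(&object) {
            return new_object;
        }
        let new_object = ObjectReference(self.to_space_cursor);
        self.to_space_cursor += size;
        self.forwarding.insert(object, new_object);
        new_object
    }
}

/// "Forwarding a reference in a slot": replace the from-space reference
/// stored in the slot with the to-space reference.
pub fn forward_slot(space: &mut ToyCopySpace, slot: &mut ObjectReference, size: usize) {
    let new_object = space.trace_object(*slot, size);
    if new_object != *slot {
        *slot = new_object;
    }
}
```

Note that the sketch relies on the from-space and to-space references being distinguishable values, which is exactly the property an opaque language-level identity would not provide.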
I will discuss them one by one.
It must be able to instantiate ObjectReference from Address
Did you mean ObjectReference should be instantiated solely from Address? If so, this may not be true. A binding could use any sort of information (such as a global table) besides the Address to construct an ObjectReference, especially when it is already doing so for its own GC implementation. It may not be a performant implementation. But as long as it does not make MMTk slower, it should be a valid way to implement ObjectReference.
It should be able to instantiate ObjectReference when given an Address. It is still allowed to load from the memory pointed to by the Address (i.e. the object header), or look up any global table using the Address as a key. If the VM defines ObjectReference as something that includes additional information, and it is possible to reconstruct such information given the Address and access to the memory and the table, it'll be OK.
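As a rough sketch of what "reconstructing the additional information from the Address" could look like, assume a hypothetical VM whose reference carries a type id that can be recovered from a global table keyed by address. `FatObjectReference` and `TypeTable` are invented for this example and do not exist in mmtk-core.

```rust
use std::collections::HashMap;

/// Simplified stand-in; not the real mmtk-core type.
#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
pub struct Address(pub usize);

/// Hypothetical VM-defined reference that carries extra information.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct FatObjectReference {
    pub addr: Address,
    pub type_id: u32,
}

/// Hypothetical global table mapping object addresses to type ids.
pub struct TypeTable {
    entries: HashMap<Address, u32>,
}

impl TypeTable {
    pub fn new() -> Self {
        Self { entries: HashMap::new() }
    }

    /// Record the type id of the object at `addr`.
    pub fn register(&mut self, addr: Address, type_id: u32) {
        self.entries.insert(addr, type_id);
    }

    /// Rebuild the full reference from an address alone by looking up the
    /// missing part. Returns `None` if no object is known at `addr`.
    pub fn from_address(&self, addr: Address) -> Option<FatObjectReference> {
        self.entries
            .get(&addr)
            .map(|&type_id| FatObjectReference { addr, type_id })
    }
}
```

The lookup makes construction slower than a plain cast, which is the trade-off discussed above: it is possible, but the cost falls on every place that turns an Address into a reference.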
It must be efficient to get the start address of the object from the ObjectReference.
We do not require ref_to_object_start to be efficient. This is the reason why we have two different methods, ref_to_address and ref_to_object_start. We require the former to be efficient, while we are much more relaxed on the latter.
OK. "It must be possible to ..." should be a more accurate expression.
It must be efficient to do equality test for ObjectReference. It must be hashable
These are all based on the current implementation. The discussion already mentioned alternatives that avoid equality comparison. I think the actual question we need to answer is whether we allow different object references to refer to the same object.
Yes. That's the concern. That is, whether two ObjectReference instances that have different bit-by-bit representation may be considered equal. Another important thing is whether we consider the from-space copy and the to-space copy as equal, as I discussed before. (I also updated the original post and added one axiom for that.)
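To illustrate the case where two references with different bit patterns are still considered equal, here is a minimal sketch assuming a hypothetical reference type whose two lowest bits are tags. `TaggedRef`, `TAG_MASK`, and `count_distinct` are names made up for this example.

```rust
use std::collections::HashSet;
use std::hash::{Hash, Hasher};

/// Assumption for this sketch: the two lowest bits are tag bits.
const TAG_MASK: usize = 0b11;

/// Hypothetical reference type that keeps the tag bits in its representation.
#[derive(Clone, Copy, Debug)]
pub struct TaggedRef(pub usize);

impl TaggedRef {
    /// The address part, with the tag bits cleared.
    pub fn untagged(self) -> usize {
        self.0 & !TAG_MASK
    }
}

/// Equality ignores the tag bits, so differently tagged references to the
/// same object compare equal.
impl PartialEq for TaggedRef {
    fn eq(&self, other: &Self) -> bool {
        self.untagged() == other.untagged()
    }
}

impl Eq for TaggedRef {}

/// Hash must agree with Eq, so it also ignores the tag bits.
impl Hash for TaggedRef {
    fn hash<H: Hasher>(&self, state: &mut H) {
        self.untagged().hash(state);
    }
}

/// De-duplicate references the way a hash set would.
pub fn count_distinct(refs: &[TaggedRef]) -> usize {
    refs.iter().copied().collect::<HashSet<_>>().len()
}
```

The manual impls are needed because a derived `PartialEq` or `Hash` would compare the raw bits, tags included.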
With issue https://github.com/mmtk/mmtk-core/issues/1170 addressed and PR https://github.com/mmtk/mmtk-core/pull/1195 merged, we now define ObjectReference as a non-zero word-aligned address within an object, and we agree with this definition at least for the current implementation of mmtk-core. This issue summarizes discussions about the definition of ObjectReference before that change so that we can come back and find our previous discussions if we discuss this topic again.
TL;DR: From time to time, we discuss the possibility of changing the current definition of ObjectReference. In theory, it should be opaque to mmtk-core. But I have discovered that not all definitions are good. This issue summarizes what a good definition should be, enumerates some popular definitions, and discusses whether they are good.
Related links:
Not all definitions of ObjectReference are good
There is an argument that ObjectReference may be opaque to mmtk-core. However, some definitions won't work at all, and others won't work efficiently. I am trying to list statements that need to be true for all good definitions of ObjectReference. Any definition that satisfies all of those statements should work.
It must be able to instantiate ObjectReference from Address
Copying GCs will copy objects, and create a new ObjectReference for the to-space copy.
Linear scanning will identify objects at addresses (using the global VO bit or a local bitmap), and generate ObjectReference for those addresses.
Conservative stack scanning scans the stack for addresses with VO bit set, then we convert the Address to ObjectReference verbatim.
Handles don't satisfy this statement. Handles are implemented with indirection tables, and creating a handle implies adding a new entry in the indirection table. This simply costs too much, and probably needs synchronization. After forwarding, we will need to delete old handles from indirection tables because they point to from-space copies. Moreover, handles are often local to mutator threads. A more rational definition is to treat the content of an indirection table entry (which is an address) as an ObjectReference. In this way, if an object is moved, mmtk-core will use the new address as the new ObjectReference, and "forwarding a reference in a slot" will become "updating the address in the indirection table entry for the handle in the slot".
It must be efficient to get the start address of the object from the ObjectReference.
Given an ObjectReference, it must be possible to get the start address of an object (i.e. whatever alloc returns). Its current API is ObjectModel::ref_to_object_start().
It must be efficient to get a unique address inside the object from the ObjectReference.
Given an ObjectReference, it must be possible to get an address that is guaranteed to be inside the object, and this address needs to be unique for the same object. Its current API is ObjectModel::ref_to_address().
That address is used for:
Checking whether an ObjectReference points to an object in a given space
In all cases, the address is guaranteed to be in the same space where the object is allocated.
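A minimal sketch of the space-membership use follows, assuming a toy space that covers one contiguous address range. `ToySpace` and its methods are hypothetical; real spaces in mmtk-core are not necessarily contiguous, and the types here are simplified stand-ins.

```rust
/// Simplified stand-ins for illustration; not the real mmtk-core types.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct Address(pub usize);

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct ObjectReference(pub usize);

/// Assumption: the reference is itself an address inside the object, so the
/// conversion is a plain reinterpretation.
pub fn ref_to_address(object: ObjectReference) -> Address {
    Address(object.0)
}

/// Toy space covering the half-open address range `[start, end)`.
pub struct ToySpace {
    pub start: usize,
    pub end: usize,
}

impl ToySpace {
    /// Range check on a raw address.
    pub fn contains_address(&self, addr: Address) -> bool {
        self.start <= addr.0 && addr.0 < self.end
    }

    /// Because the in-object address is guaranteed to lie in the space where
    /// the object was allocated, checking that one address is sufficient.
    pub fn contains_object(&self, object: ObjectReference) -> bool {
        self.contains_address(ref_to_address(object))
    }
}
```

The guarantee matters here: if the address could fall outside the object, an object at the edge of a space could be attributed to its neighbour.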
It must be efficient to do equality test for ObjectReference.
Currently we do equality tests between ObjectReference values in a few places:
In trace_object, we test if the object has been forwarded by comparing new_object == object.
We could refactor trace_object so that it returns Option<ObjectReference> so that we know if an object is forwarded without equality tests.
HashSet (Bug. See: https://github.com/mmtk/mmtk-core/issues/517)
ReferenceProcessor, where it uses HashSet to de-duplicate ObjectReference instances.
This requires ObjectReference to implement Eq.
HashSet to record visited objects.
We may refactor them to make Eq unnecessary, but it will be counterintuitive if we can't compare ObjectReference for equality.
It must be hashable
As mentioned above, we sometimes put ObjectReference inside hash sets.
When copied, the from-space copy and the to-space copy are considered different objects.
This means when copying an object, the original ObjectReference refers to the from-space copy of the object, and the to-space copy of the object will have a different ObjectReference, and they must not compare equal. The process of "forwarding a reference in a slot" means replacing the old ObjectReference in the slot with the new ObjectReference so that it now points to the to-space copy.
At the language level, "a reference to an object" does not change even if the GC moves the object. In other words, the high-level language is oblivious to object movement as a result of GC (unless object pinning is performed, which allows the user to reveal the address of an object safely). The high-level language is also oblivious to duplicated copies of objects in concurrent copying GCs, such as Shenandoah, ZGC and Sapphire. That's why the VM (or the GC?) must implement a kind of equality operator that compares them as equal at the language level during concurrent copying, when the object has two copies simultaneously. This means language-level identities (such as unique IDs of language-level objects) are not good definitions of ObjectReference.
Other statements that should be true
ObjectReference doesn't have to be the content of slots.
An object field (slot) can hold a handle, a fat pointer, an interior pointer, a tagged pointer, etc.
For handles, we can define ObjectReference as the address in the indirection table entry.
For fat pointers (e.g. (pointer, offset)), we can define ObjectReference as the pointer part of the fat pointer.
For interior pointers, we can define ObjectReference as the highest address that (1) is not higher than the interior pointer, and (2) has its VO bit set.
For tagged pointers, we can define ObjectReference as the address without the tag bits.
In all cases, we can update the slot if an object is forwarded.
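For the tagged-pointer case, the separation between slot content and ObjectReference might look like the sketch below. `TaggedSlot` is a hypothetical type, the two-bit tag layout is an assumption, and the sketch further assumes that every slot value with a non-zero address part is a reference.

```rust
/// Assumption for this sketch: the two lowest bits of a slot are tag bits.
const TAG_MASK: usize = 0b11;

/// Simplified stand-in; not the real mmtk-core type.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct ObjectReference(pub usize);

/// Hypothetical slot whose raw content is a tagged pointer.
pub struct TaggedSlot {
    pub raw: usize,
}

impl TaggedSlot {
    /// Strip the tag bits to obtain the reference the GC works with.
    /// Returns `None` when the address part is zero (a null slot).
    pub fn load(&self) -> Option<ObjectReference> {
        let addr = self.raw & !TAG_MASK;
        if addr == 0 {
            None
        } else {
            Some(ObjectReference(addr))
        }
    }

    /// Write a forwarded reference back while keeping the slot's tag bits,
    /// so the GC never needs to know what the tags mean.
    pub fn store(&mut self, object: ObjectReference) {
        debug_assert_eq!(object.0 & TAG_MASK, 0, "reference must be untagged");
        self.raw = object.0 | (self.raw & TAG_MASK);
    }
}
```

The same load/store split would apply to handles and fat pointers: the slot implementation translates between its own content and the plain reference.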
Examples of valid definitions
Starting address
Obviously. OpenJDK uses starting addresses of objects as ObjectReference.
Address at an offset from the object start.
JikesRVM does this.
Potential definitions
Tagged union of pointer and non-pointer value
Ruby does this. If a Ruby VALUE points to an object, its last three bits are all 0. The pointer will not have any tag bits. Other values (true, false, nil, small integers, etc.) are not references to objects. So we can simply define ObjectReference as "starting address" (or "address at an offset" if we add additional data in the front).
Tagged pointer without type info
V8 does this. The last bit is 1 if a slot holds a reference. The second last bit is 0 if it is a strong reference, and 1 if it is a weak reference. We may define ObjectReference as the address without the tag bits. MMTk won't be aware of those bits, and the binding is still able to update fields for forwarding.
Alternatively, we may define "the address with tag bits" (i.e. the slot content) as ObjectReference. MMTk will be able to generate such addresses, but always with 0b01 as the last two bits. It is trivial to get the starting address and an in-object address by removing the tags. However, the VM binding will need to implement the Eq and the Hash traits manually and ignore the tag bits. This may not be the most efficient way to do it.
Tagged pointer or fat pointer with embedded type info
Some VMs may embed type information inside the pointer, or fat pointer. I heard JRockit did this, but never saw its implementation. This probably will not work because MMTk will have a hard time getting the type info when constructing an ObjectReference from an Address. It's not completely impossible, but it will need to load the type information from the object body, which may be inefficient. As I mentioned above, for such VMs, we can define the address part of the tagged pointer or fat pointer as ObjectReference.
Interior pointer
Probably not a good idea because every time it needs to get the object start or the unique "in-object address", it needs to scan the VO bit bitmap backwards. We may introduce an InteriorPointer type in mmtk-core, but as I mentioned above, it is not necessary.
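To show where the cost comes from, here is a minimal sketch of the backward scan, using a toy table with one flag per 8-byte word. `ToyVoBits` is hypothetical and much simpler than mmtk-core's side metadata; it also does not check that the interior pointer falls within the found object's size.

```rust
/// Assumption for this sketch: one VO bit per 8-byte word.
const WORD_BYTES: usize = 8;

/// Toy VO-bit table covering `bits.len()` words starting at `base`.
pub struct ToyVoBits {
    base: usize,
    bits: Vec<bool>,
}

impl ToyVoBits {
    pub fn new(base: usize, words: usize) -> Self {
        Self { base, bits: vec![false; words] }
    }

    /// Mark `addr` as the reference address of a valid object.
    pub fn set(&mut self, addr: usize) {
        let index = (addr - self.base) / WORD_BYTES;
        self.bits[index] = true;
    }

    /// Find the highest word-aligned address that is not higher than
    /// `interior` and has its VO bit set, scanning backwards word by word.
    pub fn find_object(&self, interior: usize) -> Option<usize> {
        if interior < self.base {
            return None;
        }
        // Clamp to the last covered word; `?` returns None for an empty table.
        let last = ((interior - self.base) / WORD_BYTES).min(self.bits.len().checked_sub(1)?);
        (0..=last)
            .rev()
            .find(|&index| self.bits[index])
            .map(|index| self.base + index * WORD_BYTES)
    }
}
```

The scan is linear in the distance from the interior pointer to the object's VO bit, which is why paying it on every conversion is unattractive.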