munificent / craftinginterpreters

Repository for the book "Crafting Interpreters"
http://www.craftinginterpreters.com/
Other
8.87k stars 1.04k forks source link

Doubts about garbage collection #1081

Open nocturn9x opened 2 years ago

nocturn9x commented 2 years ago

After several iterations and many, many lines of code written I've finally gotten around to implementing a GC (for the first time!), in my bytecode VM. Due to the radical difference in design of my implementation compared to Lox, my GC is more independent and separate from the VM (it's basically a struct of its own which keeps track of internal state). I have recently switched to unboxed types (not NaN boxing!), so my VM's stack is basically a list of unsigned 64-bit integers. Primitive types like numbers, booleans, nil, nan and infinity are represented directly on the stack with no heap allocation (fast!), while more complex objects such as strings are wrapped into a thin tagged union (slow) and the pointer to said union is stored on the stack as, again, a 64-bit unsigned integer and casted/dereferenced as needed. Due to this new design, the VM has no notion of type. In fact, it's the compiler that's telling the VM what type is what and where and how to deal with the bits on the stack. For this reason, the GC object itself has a pointers attribute, which is a hash set of all the pointers the GC has allocated, as well as a list of pointers to HeapObject structs (I deduplicate it because I need to access items on this list by index and sets are unordered by nature). My markRoots function looks like this:

proc markRoots(self: PeonGC): seq[ptr HeapObject]  =
    ## Marks root objects *not* to be
    ## collected by the GC and returns
    ## their addresses

    # Unlike what bob does in his book,
    # we keep track of objects in a different
    # way due to how the whole thing is designed.
    # Specifically, we don't have neat structs for
    # all peon objects: When we allocate() an object, 
    # we keep track of the small wrapper it created 
    # along with its type and other metadata. Then, 
    # we can go through the various sources of roots 
    # in the VM, see if they match any pointers we 
    # already know about (we store them in a hash set so 
    # it's really fast), and then we can be sure that 
    # anything that's in the difference (i.e. mathematical
    # set difference) between our full list of pointers
    # and the live ones is not a root object, so if it's 
    # not indirectly reachable through a root itself, it 
    # can be freed. I'm not sure if I can call this GC 
    # strategy precise, since technically there is a chance 
    # for a regular value to collide with one of the pointers 
    # we allocated and that would cause a memory leak, but 
    # with a 64-bit address-space it probably hardly matters,
    # so I guess this is a mostly-precise Mark&Sweep collector
    var live = initHashSet[uint64]()
    for obj in self.vm.calls:
        if obj in self.pointers:
            live.incl(obj)
    for obj in self.vm.operands:  # I have a separate operand stack
        if obj in self.pointers:
            live.incl(obj)
    for obj in self.vm.closedOver:   # Separate stack for closure variables
        if obj in self.pointers:
            live.incl(obj)
    # We preallocate the space on the seq
    result = newSeqOfCap[ptr HeapObject](len(live))
    var obj: ptr HeapObject
    for p in live:
        obj = cast[ptr HeapObject](p)
        if obj.mark():
            result.add(obj)
        when debugMem:
            if obj.marked:
                echo &"DEBUG - GC: Marking object: {obj[]}"

As already mentioned in the comment, the issue is that while the GC knows which pointers it has allocated, it has no idea where they are (sounds like a weird mix of conservative and precise GCs!) and it cannot tell the difference between a heap-allocated object and a variable whose value happens to collide with a pointer we have allocated. Considering modern x86-64 processors have a ~48 bit address space, how likely is it for this potential memory leak to be a problem?

munificent commented 2 years ago

Considering modern x86-64 processors have a ~48 bit address space, how likely is it for this potential memory leak to be a problem?

Conservative collectors are used in practice, so it's apparently tenable. It means there may be objects that aren't actually used and could be freed, but your collector won't free them because it thinks a value of some other type is actually a pointer into that object.

nocturn9x commented 2 years ago

I see, Thanks! I was kinda afraid that the birthday paradox would cause the GC to progressively and exponentially leak more and more memory. Is that a problem in your opinion?