iu-parfunc / gibbon

A compiler for functional programs on serialized data
http://iu-parfunc.github.io/gibbon/

A GC algorithm for semi-packed data #122

Open rrnewton opened 5 years ago

rrnewton commented 5 years ago

Gibbon2 (PLDI'19) maintains the asymptotic computational complexity of input programs, unlike Gibbon1 (ECOOP'17). However, it doesn't maintain the space complexity of input programs, because it doesn't have a working garbage collection strategy that can bound dead space. In considering such a GC algorithm, there are really two ways to look at it, both of which amount to the same thing:

Either way the problem is changed substantially by having a pure language with an acyclic heap -- at the level of algebraic datatypes and likely at the level of regions as well (but that's a design choice, see below). To sketch out the space of possibilities, I describe a series of GC possibilities for Gibbon below. This builds on top of an earlier issue (#100), which had some comments about the low-level representation of regions, especially small, fixed-size regions.

Side note: Even though we have multiple possibilities below, we're leaving linear types and mutable data outside of the scope of this issue. Those are further extensions that complicate the picture.

GC1: A stop-the-world, 1-thread, non-generational copying GC

As a baseline, consider that whenever the mutator runs out of space, it invokes a traditional, mostly-textbook copying GC. Local cursors that live in stack variables and registers create the rootset: the live locations within heap data. Of course these can point into the middle of regions:

                  Root                                            
┌───────────────┐ set:                                            
│  Stack frame  │──────────────root───────┐                       
├───────────────┤                         │                       
│  Stack frame  │───root          ┌───────▼──────────────────────┐
├───────────────┤     │           │            Region            │
│  Stack frame  │     │           └──────────────────────────────┘
└───────────────┘    ┌▼─────────────────────────────┐             
                     │            Region            │             
                     └──────────────────────────────┘             

In an extreme case we could compact the heap to a SINGLE region (to space), and destroy ALL sharing.

Theorem: the resulting heap has ZERO indirection (I) nodes.

When this algorithm is done, ALL from-space regions can be dropped. In fact the region-level reference counting that normally goes along with let-region deallocation doesn't help us here.

Problem 1: maintaining sharing

The strawman algorithm above doesn't maintain sharing. It has the wrong space complexity. For example, consider the below extreme case:

data T = N T T | L Double deriving Show

shared :: Int -> T
shared 0 = L 3.3
shared n = let sub = shared (n-1) in
           N sub sub

Processing (say, summing) the resulting tree of depth D takes 2^D work, but it should only take O(D) space. But if our naive GC kicks in, it destroys sharing and consumes 2^D space.

A solution would be to go to a full Cheney algorithm for the copying GC. This means using forwarding pointers.
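
For concreteness, here is a minimal sketch in C (with invented names such as FWD_TAG, object_size, and to_alloc) of the standard forwarding-pointer step that the rest of this section adapts to packed data:

    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    #define FWD_TAG 0xFE          /* hypothetical tag marking an evacuated object */

    static uint8_t *to_alloc;     /* bump pointer into to-space */

    size_t object_size(uint8_t *obj);   /* assumed helper: size from layout info */

    /* Evacuate one object, returning its to-space address. Objects already copied
       are recognized by FWD_TAG and resolved via the stored forwarding pointer,
       which is what preserves sharing. */
    uint8_t *evacuate(uint8_t *from) {
        if (from[0] == FWD_TAG) {
            uint8_t *to;
            memcpy(&to, from + 1, sizeof to);   /* pointer stored after the tag */
            return to;
        }
        size_t   sz = object_size(from);
        uint8_t *to = to_alloc;
        memcpy(to, from, sz);
        to_alloc += sz;
        /* Overwriting the original with tag + pointer needs >= 9 bytes of room,
           which is exactly the invariant discussed below. */
        from[0] = FWD_TAG;
        memcpy(from + 1, &to, sizeof to);
        return to;
    }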

As an example of what we cannot allow, if we have an enum data E = A | B | C, these would only take one byte to represent. Putting three of these tags next to each other, K A B C, would mean three consecutive bytes. If we then built indirection nodes pointing to each of these three consecutive bytes, there would not be room to place forwarding pointers for all three.

Thus if we want to store forwarding pointers in the from-space, we need to obey a global invariant that we only point at things that are "big enough" (>= 9 bytes). Fortunately, this is compatible with the ideas about reducing fragmentation in small, fixed-size regions (#100).

Amended theorem: after collection, there will only be a minimal set of indirection nodes in the heap corresponding to "shared data" (object tags reachable by multiple paths from the root set).

GC2: Add incrementality

GC3: Add parallelism

GC4:

vollmerm commented 5 years ago

A few practical issues with adding a moving GC to Gibbon, and how we might solve them:

The typical way we'd keep track of live data while targeting C or LLVM would be a shadow stack, which wouldn't be too hard to add to our code generation. However, we don't store any type information in the heap, so simply having a list of pointers to regions is not sufficient to trace through memory (i.e., the region would start with a tag, but the GC wouldn't know the type and therefore wouldn't know how to walk through the data). We also couldn't treat region chunks as black boxes, since they contain pointers to other regions that would have to be updated when regions are moved.

One idea for handling this would be to generate tracing/copying code for each data type, and store pairs of (region_ptr,trace_func_ptr) in the shadow stack. Then the GC code would walk through the shadow stack and call the associated functions on each region pointer.
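
A minimal sketch of that idea in C (the names ShadowFrame, trace_fn, and minor_collect are invented for illustration, not Gibbon's actual runtime API):

    #include <stddef.h>

    /* One generated tracing/copying function per datatype: it evacuates the data
       the root points to, bumps the to-space allocation pointer, and updates the
       root in place to its new address. */
    typedef void (*trace_fn)(char **root, char **to_alloc);

    typedef struct ShadowFrame {
        char               *region_ptr;   /* live cursor / root into a region */
        trace_fn            trace;        /* copy function for its datatype   */
        struct ShadowFrame *next;
    } ShadowFrame;

    static ShadowFrame *shadow_stack = NULL;

    /* The collector walks the shadow stack and dispatches to the per-datatype
       functions, since the heap itself carries no type information. */
    void minor_collect(char **to_alloc) {
        for (ShadowFrame *f = shadow_stack; f != NULL; f = f->next)
            f->trace(&f->region_ptr, to_alloc);
    }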

Also, if we have a moving GC, at the point that we do collection we need to change all live cursors. This is a bit awkward because the cursors are absolute pointers into chunks of memory associated with a region, and so it might be non-trivial to figure out the new address of the cursor in the new region chunks. I guess we could keep track of all live cursors in the shadow stack and keep track of when we "reach" them in the collection, and record what the new pointer should be for each one.

rrnewton commented 3 years ago

Re: runtime type information. I don't think we'd want to go the OCaml route of making the data fully self-describing (Scheme-like), but rather the Haskell way of storing out-of-line info tables.

For Gibbon, it would be critical to not store these per-heap-object, but rather closer to per-region.

Michael's idea about storing it ONLY for roots in the (shadow) stack is really cool too. No matter how big the heap data is, if you know the types of the roots, then you know the types of everything else. You could even imagine storing that info separately on a per-function basis, so that you use the return address to look up a record for the types of the variables in the frame.
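
A toy illustration of that per-function variant (everything here -- FrameInfo, frame_table, the descriptor encoding -- is invented for the sketch):

    #include <stddef.h>
    #include <stdint.h>

    /* Hypothetical compiler-emitted table: for each call site's return address,
       the type descriptors of the live roots in the caller's frame. */
    typedef struct {
        void          *return_addr;   /* key: call-site return address        */
        const uint8_t *root_types;    /* one type-descriptor id per root slot */
        size_t         num_roots;
    } FrameInfo;

    extern const FrameInfo frame_table[];
    extern const size_t    frame_table_len;

    /* Given a frame's return address, recover the types of its roots; from those,
       the types of everything reachable follow. (A real table would be sorted and
       binary-searched.) */
    const FrameInfo *lookup_frame_info(void *return_addr) {
        for (size_t i = 0; i < frame_table_len; i++)
            if (frame_table[i].return_addr == return_addr)
                return &frame_table[i];
        return NULL;
    }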

It seems so obvious ... why is this kind of strategy not used more? I think that existential polymorphism and higher-order code are both reasons. E.g. for a runtime representation of a Haskell closure, the type doesn't tell you the type of the closed variables.

But for Gibbon... working with that lovely first order monomorphic code and acyclic heaps gives quite a few nice invariants to leverage. The only tension would be if one wanted to do truly concurrent GC (as mentioned in #100) -- where a background GC thread would be picking up and compacting regions without interrupting the mutator / touching the stack.

AndrasKovacs commented 3 years ago

@rrnewton layout computation starting from roots sounds like tag-free garbage collection. Intensional polymorphism and "type-passing polymorphism" could be related as well.

ckoparkar commented 3 years ago

A PADL-2020 paper on MLKit using generational GC: https://elsman.com/pdf/padl2020-final.pdf.

rrnewton commented 3 years ago

@ckoparkar - it was a nice result you got that the overhead of repeated consing (as in a simple reverse program, #126) is lower than expected for Gibbon vs. a malloc-list.

Moving forward, it seems less exciting to do any incremental-improvements on small regions using Gibbon's current strategy (#100). It's always going to be making a suboptimal solution less bad. What seems more exciting is spending energy on one of the following two possibilities:

(1) dupable linear lists (internally mutable list-of-vectors), or
(2) a generational collector that is copying/bump-alloc in the nursery, with the current region-refcounting strategy in the old generation.

Re (1), this gets more at what we think is the optimal runtime representation for environments in compiler passes.

Re (2): The beginnings of a GC proposal in this old issue here didn't get to the point of describing generational collection. But generational collection, with heterogeneous representations, could be the thing that manages the impedance mismatch between programs that dump huge trees into regions and programs that allocate many small ("Bounded", #79) regions.

Generational Gibbon GC

Just to spell out the proposal a bit, the idea is that we simply use a traditional nursery, and every let-region turns into a fixed-size bump-alloc in the nursery. If we can statically bound a region's size --- especially to a single heap object like a Cons cell --- then we alloc exactly that size. If not, we alloc whatever the initial chunk size is, K, a relatively small value. (Ideally we base these tunings on profiling data. I'm hoping the compiler can recognize O(1) constant-sized regions, and that K is tuned to something reasonable for the O(log(N)) regions that house a small, but dynamic, number of heap objects over their lifetime.)
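
Roughly, each let-region could then lower to something like the following sketch (all names and the value of K here are placeholders, not existing runtime symbols):

    #include <stddef.h>

    #define K 512                      /* hypothetical default initial chunk size */

    static char *nursery_top;          /* bump pointer       */
    static char *nursery_end;          /* end of the nursery */

    char *collect_and_alloc(size_t bytes);   /* slow path: minor GC, then retry */

    /* What `letregion r in ...` could compile to: a bump allocation of either a
       statically known size (e.g. one Cons cell) or the default chunk size K. */
    static inline char *alloc_region_chunk(size_t statically_known_size) {
        size_t sz = statically_known_size ? statically_known_size : K;
        if ((size_t)(nursery_end - nursery_top) >= sz) {
            char *chunk = nursery_top;
            nursery_top += sz;
            return chunk;
        }
        return collect_and_alloc(sz);        /* nursery full */
    }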

We enumerate roots using an existing strategy (like @vollmerm mentioned). When the nursery is full, we evacuate it to Gibbon's regular, reference-counted heap of regions (malloc-managed). Thus if you allocate, say, a cons list out of order, you initially have heavy pointer chasing when you read the list. But once those elements graduate to the old generation, then you get the normal copying-GC locality benefit on steroids, because the data becomes more densely packed.

One interesting wrinkle is what to do when growing a region in the nursery. There are several choices. You could keep the new (larger) chunks of the region in the nursery. But if we look carefully at data on region size distribution, I bet the Bayesian answer will be that once a region starts growing it is likely to go large. It probably makes sense to allocate everything but the first chunk of a region directly into a reference-counted generation. But this would mean that, unlike the normal region growth process (where region identity, outset, and reference count are shared by all chunks), the nursery chunk 0 and the refcounted chunk 1 would effectively be separate regions. I.e. the redirection pointer at the end of chunk 0 would be no different than pointers from any nursery region into any older region.

A potential downside of this eager promotion would be that as soon as we grow a nursery region we start incurring the overhead of tracking outsets and reference counts when writing indirections. Whereas nursery=>old pointers are very cheap as usual; just like in deferred reference counting, they don't count against any reference counts tracked in the heap.

Hopefully, the end result would be that our speed of allocating (out of order) and traversing cons-lists would be no better than, and not much worse than, existing typed functional language implementations --- which is quite good! Furthermore, in-order list operations like map should be much faster out of the box! (They would stream directly to a packed representation on the output, irrespective of how fragmented their inputs are.) From the perspective of programs like map, or the tree traversals we've focused on in our previous papers (ECOOP'17, PLDI'19, ICFP'21), there should be minimal overhead for a brief pitstop in the nursery. Just one chunk gets copied!

Complications

The nasty bit is dealing with the heterogeneity of indirection pointers. If they point into the nursery, they point directly at a (packed) heap object with no region-footer objects for metadata. Most of these objects will not get any benefit from packing, and in fact will have a little bit of inefficiency due to extra I tags compared to a traditional ML/Haskell implementation.

Pointers within the old generation will work as they do now, and pointers from the nursery to the old generation will need to make it possible to reach the footer. Right now, Cursorize manages to have available end-pointers for the from/to regions whenever an indirection is written. This makes reaching the footer objects trivial, because we use the footer pointer as the end pointer. The GC, when it is copying a pointer-to-old from the nursery to the old generation, will effectively be executing this indirection-creation code without the benefit of that compile-time information.

The GC will need a runtime way to find footers that correspond to an arbitrary address in the old generation. Of course, distinguishing nursery and old generation pointers based on addresses will be trivial, as these can be non-overlapping address ranges. We will only be allocating nice, clean, power-of-2 chunk sizes, and we can further expect that our memory management system should be able to tell us the size of the allocated chunk containing an arbitrary pointer. If we always put the footer at the end of that malloc'd block, then we can locate the footer from any interior pointer into the chunk.

There's already the not-really-portable malloc_usable_size, but I don't actually know if glibc's ptmalloc implementation of it lets you give it arbitrary pointers in the middle of allocations, or just the pointers that were returned from malloc(). (Easy to test.) There also may be some extra motivation to customize the allocator to guarantee that when we malloc(2^n) we always get back a precise allocation, with no extra overhead at the end. Otherwise we'd need to call malloc_usable_size every time on the pointers we get back from malloc, just to make sure we know the true end of the block.
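
A sketch of the lookup, under the assumption that the (possibly custom) allocator can report chunk bounds for an arbitrary interior pointer and that the footer always sits at the end of the chunk (chunk_bounds and RegionFooter are assumed names, not existing runtime functions):

    #include <stddef.h>
    #include <stdint.h>

    /* Placeholder for the real region metadata stored in a chunk footer. */
    typedef struct RegionFooter {
        uint16_t refcount;
        void    *outset;
    } RegionFooter;

    /* Assumed allocator service: given ANY pointer into an old-generation chunk,
       report that chunk's start address and size. malloc_usable_size only
       approximates this, and only for pointers returned by malloc itself. */
    int chunk_bounds(void *interior_ptr, char **start, size_t *size);

    /* With the footer always written at the very end of the chunk, locating it
       from an arbitrary interior pointer is just arithmetic. */
    RegionFooter *footer_of(void *interior_ptr) {
        char  *start;
        size_t size;
        if (chunk_bounds(interior_ptr, &start, &size) != 0)
            return NULL;                     /* not an old-generation pointer */
        return (RegionFooter *)(start + size - sizeof(RegionFooter));
    }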

Footnotes

Using tracing/copying in the nursery and reference counting after that makes this similar to Ulterior Reference Counting (2003), though with the added complications of packing and of region-level rather than object-level reference counting.

rrnewton commented 3 years ago

@rrnewton layout computation starting from roots sounds like tag-free garbage collection. Intensional polymorphism and "type-passing polymorphism" could be related as well.

Thanks! I hadn't read this. IMO, some of these older papers are so refreshing to read because they assume less context and explain the problems clearly from scratch, and their citations in turn are fewer degrees removed from bedrock foundations ;-).

We do indeed like tag-free GC for Gibbon. We also get to solve a much easier problem than Tolmach'94 as long as we lean on our closed-world assumption (whole-program compilation) and compile only monomorphic, first-order programs in the backend, and we also avoid mutable data on the heap. (Except for arrays, atm, which are managed with linear types.)

The cool thing about tag-freedom in the above generational proposal is that we really only need type info as deep as the nursery. Old generation data can be reclaimed without knowing what type it is (reference count hits zero, drop the entire region).

rrnewton commented 2 years ago

Generational GC musings

Prerequisites

The basic setting here is one of a generational collector. The nursery is bump-allocated, but still can allocate "region chunks", meaning two levels of allocation (allocating a chunk in the nursery, and then writing packed data to the chunk). Data in the old generation would still be organized into regions with refcounts and "outsets". This matches what Chai has already started implementing.

For the following, let's assume that we have a shadow stack that stores all live cursors in procedures with frames on the C stack. This doesn't mean that all cursors need to be "homed" to stack/memory, but it does mean that we should explicitly save cursors live across function call boundaries. E.g. this:

   let cursor = ...
   funcall(x,y,z)

Would become:

   let cursor = ...
   <save cursor to shadowstack>
   funcall(x,y,z)   
   <restore cursor from shadowstack>

We wouldn't save/restore across allocation sites, because the common path will be that no GC is triggered. But on the slowpath in the inline allocation code, we would push the current frame's cursors onto the stack before calling the collector.
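
Concretely, an inline allocation site could look roughly like this sketch (the helper names are invented; only the slow path touches the shadow stack, and the retry assumes the request fits in an empty nursery):

    #include <stddef.h>

    extern char *nursery_top, *nursery_end;

    /* Assumed runtime helpers for spilling/reloading this frame's live cursors. */
    void push_shadow_frame(char **live_cursors, size_t n);
    void pop_shadow_frame(char **live_cursors, size_t n);   /* reloads moved cursors */
    void run_minor_collection(void);

    static inline char *alloc_bytes(size_t n, char **live_cursors, size_t n_cursors) {
        if ((size_t)(nursery_end - nursery_top) >= n) {   /* fast path: plain bump */
            char *p = nursery_top;
            nursery_top += n;
            return p;
        }
        /* Slow path: make this frame's cursors visible to (and updatable by) the
           collector, collect, then reload them and retry the allocation. */
        push_shadow_frame(live_cursors, n_cursors);
        run_minor_collection();
        pop_shadow_frame(live_cursors, n_cursors);
        return alloc_bytes(n, live_cursors, n_cursors);
    }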

Special tags: Cauterized, Copied

A couple things quickly become clear as we think about implementing the collection algorithm itself. First, you need to know when to STOP copying data. The mutator could be in the middle of allocating a tree structure, when it triggers collection:

┌────────────────────────┐
│ N N L 3 ..uninit..     │
└────────────────────────┘

The copying GC can't just keep reading into the uninitialized data! Therefore, the first step of beginning a collection is to read all the output locations from the shadow stack (which could perhaps be stored separately from input locations), and to cauterize them by writing a special tag that means "premature end of data here".

Likewise we can have a special tag for when the data has already been copied by the GC to the to-space. (See below.)
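
A sketch of the cauterization step, with a made-up tag value:

    #include <stddef.h>

    #define CAUTERIZED_TAG ((char)0xFD)   /* hypothetical "premature end of data" tag */

    /* Before tracing anything, stamp every live *output* cursor from the shadow
       stack so the copying loop knows where initialized data stops. */
    void cauterize_output_cursors(char **out_cursors, size_t n) {
        for (size_t i = 0; i < n; i++)
            *out_cursors[i] = CAUTERIZED_TAG;
    }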

Challenge: no pointers from old=>young generation

Let's say we're running a basic tree traversal like good old add1. We allocate the first chunk, say 1024 bytes, in the nursery. Then maybe we double it a couple times as the output grows bigger, allocating new chunks. As these get bigger, it would be good to do "large object handling" by sending them straight to the malloc heap. Indeed it would further be nice to do "eager promotion" and promote those later chunks directly into the old generation.

Unfortunately, eager promotion won't work unless we could handle pointers from the old=>young generation in some kind of remembered set. There's nothing to stop us writing an indirection in the region that points to other nursery-allocated data. Alas, this problem was not recognized in my earlier comment so some of that is invalidated.

If we are satisfied with allocating all data logically to the nursery (whether it is bump-alloc'd or malloc'd), then we have to copy everything on its way to the old generation. Long-lived data must be copied at least once. At least it would get compacted in the process.

Sharing challenges

Normally GCs promise to retain any sharing present in the from-space. In a purely functional language, you can cheat on this a bit (e.g. GHC has a low probability of duplicating pure objects as a result of racing parallel GC threads). But if we write the following function, we probably don't want the space usage of the resulting tree to balloon from O(N) to O(2^N):

data Tree = Leaf Int | Node Tree Tree

sharetree :: Int -> Tree
sharetree n =
  if n == 0
  then Leaf 99
  else let child = sharetree (n-1) in
       Node child child

But maintaining sharing while evacuating packed data is HARD. You can't do the simple thing of marking all copied data with a forwarding pointer. That depends on having enough space for the forwarding pointer. But for binary tree data in the heap, you don't have it:

┌─────────────┐
│  N L 1 L 2  │
└─────────────┘

Specifically, there are 3 heap objects here. The two leaves happen to have enough room for a forwarding pointer, but the intermediate Node doesn't. We need some complicated scheme to retain sharing while evacuating this data. For example:

The correctness of this depends on guaranteeing two properties:

  1. We always end with a CopiedTo, and
  2. there are no indirections inside the interval (before or after copying), which would cause it to be a different size in the from- and to-space and thus invalidate offsets within it.

We are able to guarantee this because indirections themselves provide exactly enough room to write a CopiedTo tag and forwarding pointer instead.
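
To make those two properties concrete, here is a rough sketch (tag values and names are hypothetical) of the kind of resolution they enable: a from-space pointer that lands on a Copied byte scans right until the CopiedTo record, and the distance scanned doubles as an offset into the to-space copy, which is only valid because of property 2.

    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    #define COPIED_TAG    0xFB   /* hypothetical: evacuated byte, no room for a pointer */
    #define COPIED_TO_TAG 0xFC   /* hypothetical: followed by a forwarding pointer      */

    /* Resolve a from-space pointer that points into an already-evacuated interval. */
    char *resolve_forwarded(char *from) {
        ptrdiff_t offset = 0;
        /* Property 1: scanning right over Copied bytes always terminates at a CopiedTo. */
        while ((uint8_t)from[offset] == COPIED_TAG)
            offset++;
        char *fwd;                                      /* from[offset] is COPIED_TO_TAG */
        memcpy(&fwd, from + offset + 1, sizeof fwd);    /* pointer stored after the tag  */
        /* Property 2: no indirections inside the interval, so from-space and to-space
           distances agree, and the same offset locates this object's copy. */
        return fwd - offset;
    }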

Region/chunk granularity in the output of the collector

At one extreme, the collector could output a single big region, the entire "generation" coalescing into one region with one reference count. This would not, however, be good for fragmentation when part of that data subsequently becomes dead.

It's hard to even define what a perfect number of regions in the output would be, because it would depend on some tradeoff between the (unknowable) future lifetime of objects, and the space+time overhead of indirections. Even if two portions of the same tree had no sharing, they could still potentially do better in different regions if they had very different remaining lifetimes.

A compromise would be to try to cluster the heap into disconnected components. Each time you start tracing another root, through the nursery, you start a new chunk for it. Well, first you check what it points to---if it's already Copied you just need to update the root to point to the new to-space object. If not, you start a fresh chunk. This could also be a fresh region, but that's a policy decision.

Then when we reach data which is already copied, that's where we need to insert an indirection. That indirection can either unify the regions (different chunks, same region), or leave them separate (inter-chunk pointer which contributes to outset). Which approach to use is a tuning question. Combining regions would save overhead on region-metadata objects and on outset & refcount tracking. Leaving them separate would allow distinct lifetimes for the different chunks of data. We could use empirical profiling data to inform this choice.

The (deferred) reference counting story

In Gibbon's prior, single-generation memory management strategy, the reference counts were always accurate, but the "let-region" stood in as a single reference count representing all of the register & stack contributions to the reference count. This was a rather nice approach where a single decrement allows us to "disconnect" the registers/stack from the region, and the only thing keeping it alive will be heap-based indirections from other regions.

Once we have a nursery, we need to account for that in any old-generation reference counting scheme. At the beginning of a minor collection, the root set (shadow stack) accounts for all live data in the system. At the end of the minor collection, the root set still does, except all pointers in it are updated to point to the to-space in the old-generation.

Here we can use a standard deferred reference counting trick. The reference counts actually stored in memory, in region metadata records, count only pointers from other regions. Those reference counts are allowed to be zero. Some of these zero reference counts will be incremented by new regions created during minor collection. Add a virtual "+1" reference count to each region that is directly pointed to by a rootset pointer. Any regions that still have a zero reference count at that point can be freed (recursively decrementing their outsets, as usual). To make this deferred scheme work, regions with a zero reference count can be stored in a Zero Count Table (ZCT).
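
A sketch of that end-of-collection step (the ZCT layout and helper names are illustrative only):

    #include <stdbool.h>
    #include <stddef.h>

    typedef struct Region Region;            /* carries refcount, outset, ...          */

    extern Region *zct[];                    /* Zero Count Table: regions whose stored */
    extern size_t  zct_len;                  /* (heap-only) refcount is currently zero */

    bool refcount_is_zero(Region *r);
    bool pointed_to_by_rootset(Region *r);   /* the virtual "+1" from shadow-stack roots */
    void free_region(Region *r);             /* recursively decrements its outset        */

    /* After a minor collection: stored refcounts count only region->region pointers,
       so a region is dead exactly when its count is zero AND no root points at it. */
    void sweep_zct(void) {
        for (size_t i = 0; i < zct_len; i++)
            if (refcount_is_zero(zct[i]) && !pointed_to_by_rootset(zct[i]))
                free_region(zct[i]);
    }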

Question: Is there any code we can profitably run at the exit of a let-region in this world?

Program Analysis passes that make the story above better

The GC approach sketched out above would work better if we could statically analyze a region and classify it in one of the following categories:

  1. Indirection-free (contains only plain data)
  2. "rightward" local indirections only,
  3. "local" indirections only (to the same region, left or right)
  4. "no sharing" regions that should have no indirections after they are copied.

(1) is the strongest, and it implies (2), which in turn implies (3). (4) is a bit different: (1) also implies (4), but (2) and (3) don't.

Each one of these can enable potential optimizations. For example, (3) is strong enough to guarantee an empty outset. This could allow eager promotion into the old generation: when executing add1, we could promote, say, the second or Kth chunk to the old generation without any worry of old=>young pointers.

(1) and (4) guarantee that we can turn data into fully serialized, mmappable data. (2) and (3) only do so if we switch to relative-offset intra-region indirections rather than absolute memory addresses.

There's also a 5th property not to do with indirections:

  5. Bounded size: no more than X bytes will be written to the region.

For example, the reverse-a-list function should have the property that each region gets only a single Cons cell. In this case, there is no reason to create a sizable "chunk" for it; we can bump-allocate only the size needed for one packed Cons cell.

Open questions: Major collection

Nothing above addresses old-generation collection. We could use any strategy we want here, but any strategy that costs O(live objects), for any continuously allocating workload, is unfortunate as it counters what should be the benefits of the region-based strategy. It should be possible to get at least long-lived immutable data into a place where it is never traced again (like CNF).

Instead, we want a more targeted approach that reduces fragmentation in the old generation, separating the still-live data from the dead data intermingled with it. Without this, we know only that at least one object (one byte!) is live within a live region. But we don't bound region size, so that doesn't even give us a bound on worst-case fragmentation.

Introducing: Depth tracking

We can define the depth of a region as zero when its outset is null, and 1 + the max depth of any other region in its outset otherwise. The heap is acyclic, so every region has finite depth. Every time we collect, we add new regions with depth greater than the older regions they point to. Thus the maximum depth of the heap increases monotonically. Alas, it increases not one at a time, but probably up to O(roots), depending on the aforementioned coalescing strategy.
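
A small sketch of the definition over an assumed region record:

    #include <stddef.h>

    typedef struct Region {
        struct Region **outset;     /* regions this region points into */
        size_t          outset_len;
    } Region;

    /* depth(r) = 0 when the outset is empty, else 1 + max depth over the outset.
       This terminates because the region graph is acyclic. */
    size_t region_depth(const Region *r) {
        size_t d = 0;
        for (size_t i = 0; i < r->outset_len; i++) {
            size_t child = 1 + region_depth(r->outset[i]);
            if (child > d) d = child;
        }
        return d;
    }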

Old-generation collection could be a matter of managing depth. Programs that create fragmentation will be those that quickly increase depth, like repeated tree-insert. As long as we don't track inbound pointers into a region, we cannot asynchronously evacuate any region at random. But there are at least two places we could start:

The second approach sounds more promising. You can evacuate a single region (defragmenting) and hopefully regain space without reducing depth. Or you can smash together everything of max depth D and depth D-1, creating as little as one resulting region, and reducing max depth by one without compromising sharing. You could smash together the deepest K levels, up to and including doing a full major collection.

If you remember the frontier of outbound pointers while you do that (into depth D-2 and below), you could also potentially delay further collection/defragmentation and have a concurrent algorithm to do it in the background. This is complicated by the fact that the "frontier" isn't just between old-gen regions: old shadow stack frames can point to arbitrarily low-depth regions in the old generation.


  rootset                               frontier           old generation              
                                            │                                          
   ┌───┐                                                                               
   │ ──┼────────────────────────┐           │           ┌─────────────┐                
   │   │                        └───────────────────────┼──▶  ...     │                
   │ ──┼──┐                                 │           └─────────────┘                
   │   │  │                                                                            
   │   │  │             nursery             │                ┌────────────────────────┐
   │   │  │                                      ┌───────────▶        ▲ ...           │
   │   │  │  ┌───────────────┬───────────┐  │    │           └────────┼───────────────┘
   │   │  └──┼─▶ T S I ...   │    ... ───┼───────┘                  ┌─┘                
   │   │     ├───────────────┴───────────┤  │       ┌───────────────┼────┐             
   │   │     │           ....          ──┼──────────┼──────▶ ...    │    │             
   │   │     ├─────────┬─────────────────┤  │       └────────────────────┘             
   │   │     │   ...   │       ...  ─────┼──────┐                                      
   │   │     └─────────┴─────────────────┘  │   │           ┌────────────────────┐     
   │   │                                        └───────────▶  T S T T S I ...   │     
   │   │                                    │               └────────────────────┘     
   │   │                                                                               
   └───┘                                    │                packed data:              
                                                     tags, scalars, indirections       

The thing we don't want to do is evacuate singleton-in-set regions that have already been evacuated in the past. That is, they are depth 0 and have no fragmentation because they consist entirely of packed tree data with a single root. There's probably no way to know this without having a write-barrier that updates inbound-sets as well, which is expensive.

Maybe we could track up to one inbound pointer, and only if it is to the head of the region (1 bit). As soon as that pointer is removed, the region loses this guaranteed-zero-fragmentation status. The problem is that when we decrement the reference count because of an upstream region being freed, we don't actually have the inbound pointer to check whether it is at the head. We could go as far as splitting the outset into the union of two subsets: "loc-0" outbound pointers and others. Pointers to location 0 exempt their target regions from any further compaction/collection.

Maybe those loc-0 outbound pointers even stop contributing to "depth"...

Open questions: parallel collection

Everything above ignores parallelism. We're targeting "disentangled" executions, where each thread can have a private nursery. We cannot guarantee that all writable cursors point into the nursery. We should be able to achieve independent/parallel per-thread minor collections. But starting on the major collection probably requires synchronous minor collections on all threads, to compute the frontier into the older generation.

Based on the nested structure of let-regions (plus disentanglement), maybe we can further do something to track which regions are thread-shared and on which threads. Maybe we could dynamically mark thread-escaping regions.