justinethier / cyclone

:cyclone: A brand-new compiler that allows practical application development using R7RS Scheme. We provide modern features and a stable system capable of generating fast native binaries.
http://justinethier.github.io/cyclone/
MIT License

Question about GC #356

Closed: minad closed this issue 4 years ago

minad commented 4 years ago

Hi @justinethier,

thanks for answering on https://github.com/libtom/libtommath/issues/429. Nice to hear that you are indeed using a Doligez-Gonthier-Leroy design for the heap. I am currently experimenting with something like this as well (though not in a Scheme implementation), so I would very much like to learn a few things. In particular, I wonder how you handle mutable objects. You probably don't allocate mutable objects directly on the major heap as in Doligez et al., since mutation is pervasive?

The particularly interesting case is when you write to a global mutable object, adding a reference to a local object. Then the local object has to be promoted to the major heap. Immutable local objects can simply be copied, but for mutable local objects some forwarding has to occur. The ocaml-multicore people use a reference table to fix all pointers to the promoted object in the minor heap, and if there are too many of them, a full minor collection is triggered. Another alternative would be to access all mutable objects through a forwarding pointer (a Brooks or conditional forwarding pointer).
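To illustrate the Brooks variant with a minimal sketch (names are illustrative, not from any particular implementation) - every object carries an indirection field that points to itself until the object is relocated:

typedef struct obj {
    struct obj *forward;   /* points to itself, or to the promoted copy */
    /* ... payload ... */
} obj;

/* Unconditional read barrier: always one extra load, never a branch.
   After promotion the collector updates `forward` to the new copy. */
static inline obj *deref(obj *o) {
    return o->forward;
}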

I've seen some forward_tag uses in your code base, but I haven't dug deeper yet.

Thank you!

justinethier commented 4 years ago

Hello @minad !

That's a good question. First some background:

Cyclone's implementation is based on the paper "Implementing an On-the-fly Garbage Collector for Java" which is based on DGL. This might be better suited for a language with mutable objects.

For Cyclone, a write barrier is used to detect cases where a heap object points to a local object. We then forward the local object to the heap during minor GC. In fact, minor GC takes care of forwarding all objects from the current thread; this is how we use the forward_tag you mentioned.
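In rough pseudocode the idea is something like this (illustrative names, not the actual runtime API):

void write_barrier(gc_thread *thd, object *heap_slot, object val) {
    if (is_stack_object(thd, val)) {
        /* a heap cell is about to point at the stack: remember val so
           the next minor GC forwards it (and its children) to the heap */
        mark_for_forwarding(thd, val);
    }
    *heap_slot = val;
}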

One issue with our approach is that there is a time gap between when the local object is added to the heap and minor GC is performed. As you mentioned, it is problematic to have thread-local objects in the heap because other threads could try to access them before they are fully-copied. Currently we solve the problem in application code, either by recommending a library we provide that guarantees object safety or by following our documentation.

I would prefer to have a solution that is fully-handled by the GC, though. For objects without child objects this is easy; they can just be copied right away. But I do not have a good general-purpose solution. Our implementation is also complicated by using the C stack as the minor GC nursery, so forwarding pointers do not survive minor GC. You likely do not have such a restriction.

Anyway, does that help at all? I am interested to hear your thoughts.

Just curious, are you planning to build a GC for your own language?

minad commented 4 years ago

Cyclone's implementation is based on the paper "Implementing an On-the-fly Garbage Collector for Java" which is based on DGL. This might be better suited for a language with mutable objects.

Ah, good to know. I've seen that paper before but didn't read through much of it, since I am working in a setting with more immutable objects, more along the lines of Doligez et al. But it makes sense in your case!

One issue with our approach is that there is a time gap between when the local object is added to the heap and minor GC is performed. As you mentioned, it is problematic to have thread-local objects in the heap because other threads could try to access them before they are fully-copied. Currently we solve the problem in application code, either by recommending a library we provide that guarantees object safety or by following our documentation.

I think I've read about something like this in a paper about the Manticore system. They basically introduce a kind of reference type which is allowed to point from the major/shared heap (the heap in your MTA terminology) to the local heap (the stack in your terminology). This reference type must be unwrapped explicitly; it is basically an explicit read barrier. ~Do you throw an exception on invalid access?~ Edit: I looked at your user manual and it seems you simply allow race conditions, which can result in lost writes. That's a valid design, but I think it could be hard to analyze at the application level which objects need this kind of explicit barrier. And if you use it for too many objects, things become too costly - at that point you could use a read barrier right away.

I would prefer to have a solution that is fully-handled by the GC, though. For objects without child objects this is easy; they can just be copied right away. But I do not have a good general-purpose solution. Our implementation is also complicated by using the C stack as the minor GC nursery, so forwarding pointers do not survive minor GC. You likely do not have such a restriction.

I haven't found a good solution for the problem yet. For better understanding, I'll try to summarize the different approaches I've looked into below.

Just curious, are you planning to build a GC for your own language?

My language is functional, mostly immutable, and in the SML/OCaml family. It is still work in progress and not yet published. The approach I am using is similar to the one taken by Manticore; the difference is that my language is much more immutable, so a read barrier for the few mutable objects is hopefully not too costly. I am also thinking about adding such an explicit boxed type to explicitly allow shared->local references. There is one exception to the no-references-from-shared-to-local rule: references from shared thread objects to a local stack object are allowed.

The overall design consists of multi-generational local heaps collected by Cheney copying and a shared heap collected by a concurrent and parallel mark-sweep. Having multiple local generations adds to the complexity of the write barrier: it has to track references between the generations, and additionally it grays objects during marking (a Yuasa-style barrier). I decided not to use the Cheney-on-the-MTA design since I wanted to decouple allocation/GC from the implementation of continuations. I am using heap-allocated stacks; this might be similar to what Gambit does. Since I am also compiling to C, I have the issue of implementing the tail calls needed for continuations. I am "solving" that in a very ugly way by using a compiler plugin for gcc and llvm, so I am not strictly compiling to C, just abusing C for codegen.
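A rough sketch of what that combined write barrier has to do (all names illustrative):

void write_barrier(thread_ctx *t, object *slot, object new_val) {
    object old_val = *slot;
    /* Yuasa-style deletion barrier: while the concurrent marker runs,
       gray the overwritten value so it cannot be missed */
    if (t->marking && is_white(old_val))
        mark_gray(t, old_val);
    /* generational part: remember stores from an older local
       generation into a younger one */
    if (generation_of(slot) > generation_of(new_val))
        remembered_set_add(t, slot);
    *slot = new_val;
}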

justinethier commented 4 years ago

This is very informative, thank you!

My main concern with read barriers is how expensive they could be, given the frequency of reads vs writes, but this may be the only way to solve the problem automatically.

I see Manticore's website here: http://manticore.cs.uchicago.edu/ Do you know if the paper you were reading is linked there?

Just curious, how does your compiler plugin work?

minad commented 4 years ago

I see Manticore's website here: http://manticore.cs.uchicago.edu/ Do you know if the paper you were reading is linked there?

I can find you the reference, but I don't have it at hand right now.

My main concern with read barriers is how expensive they could be, given the frequency of reads vs writes, but this may be the only way to solve the problem automatically.

There are other approaches, but I either didn't understand them well enough or disliked them for other reasons. For example, there is also the option to always trigger a minor GC on such a promoting write. Then there are papers using some kind of cleanliness information to elide barriers, but I didn't fully understand those. See http://kcsrk.info/publications. There is also a paper by kcsrk regarding MultiMLton where the measured read barrier overhead was very large. But then, if you consider some of the newer Java GCs, e.g. Shenandoah, they use a Brooks forwarding pointer on every object; I think I've read something like a 10-15% overhead.

Just curious, how does your compiler plugin work?

It is trivial. Both compilers already support tail calls in their internal intermediate representation; the plugin only has to expose it. Functions must be tagged as __tailcallable__ and tail calls must be tagged as __tailcall__. I've put it here: https://github.com/minad/tailcall-plugin.
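Usage presumably looks something like this (a sketch based only on the two attributes mentioned above; the exact plugin syntax may differ):

#include <stdio.h>

__tailcallable__ void odd_(long n);

__tailcallable__ void even_(long n) {
    if (n == 0) { puts("even"); return; }
    __tailcall__ odd_(n - 1);   /* guaranteed tail call, constant stack */
}

__tailcallable__ void odd_(long n) {
    if (n == 0) { puts("odd"); return; }
    __tailcall__ even_(n - 1);
}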

minad commented 4 years ago

I see Manticore's website here: http://manticore.cs.uchicago.edu/ Do you know if the paper you were reading is linked there?

I can find you the reference, but I don't have it at hand right now.

Sorry, I cannot find it right now; maybe I didn't remember some things correctly. It could also have been mentioned in a MultiMLton paper. Basically, the idea I meant was to never allow shared->local references, except for a special type which can only be accessed using an explicit read operation involving a read barrier. But I am not sure something like this would be helpful in your case, since you still have mutation all over the place. I think if you want it to be automatic, the best bet would be to introduce a read barrier. Since you are working with a Scheme, you still have tag checking going on at runtime, so maybe the cost of a read barrier would even be acceptable, or could perhaps be combined with the tag checking: at the point where you check the tag, if you observe a forward tag, the tag check fails and you could load the forwarded reference instead. I have no idea, however, whether something like this would really work.

justinethier commented 4 years ago

For example, there is also the option to always trigger a minor GC on such a promoting write.

Intuitively this seems like a bad idea but that is effectively how our concurrency library works. Let me experiment a bit to measure the performance impact. Our minor GC should be cheap if there are only a few objects to move (which would be the case for mutations made in quick succession, e.g. in a loop). And if the object has no children we can just copy that single object, no further copying is necessary. This is basically the logic from the runtime function Cyc_make_shared_object.

Besides the extra GCs, the major pain point is that it will be difficult to inline the mutation primitives set-car!, set-cdr!, and vector-set!: if such an operation would introduce a bad reference, it is necessary to relocate the new object first, which affects control flow if minor GC is invoked. That is, we need a write barrier before the operation, and it is necessary to pass the current continuation to the barrier.

minad commented 4 years ago

Intuitively this seems like a bad idea but that is effectively how our concurrency library works.

This is my feeling as well, in particular in a language with pervasive mutation. In a language with very few mutations, and only very few non-local ones, I think you could live with such an approach.

But I believe everything is also very much dependent on library design. If your libraries are designed such that most things are local, you can make different choices. You could even design an Erlang-style library which keeps everything local and only promotes messages to some shared message queues. In your case, where you are providing the concurrency libraries and the language together, you have a lot of freedom to experiment, at least :)

Let me experiment a bit to measure the performance impact.

I am very much looking forward to such results!

And if the object has no children we can just copy that single object, no further copying is necessary.

This I don't understand since you still have to fix references if you are promoting mutable objects (my starting question from above). Presumably you mean the simple cases which don't require further action, like promoting immutable objects?

Besides the extra GCs, the major pain point is that it will be difficult to inline the mutation primitives set-car!, set-cdr!, and vector-set!: if such an operation would introduce a bad reference, it is necessary to relocate the new object first, which affects control flow if minor GC is invoked. That is, we need a write barrier before the operation, and it is necessary to pass the current continuation to the barrier.

Yes, this is a problem. Right now I am not inlining writes; they are calls to an external write function plus write barrier. OCaml does the same - they use a call to caml_modify. This is all very expensive, unfortunately. In my case I could try to inline a small fast path, which could be used for unboxed values or when the objects live in the same minor generation. Note that in my case I have a generational minor heap, which also requires this expensive bookkeeping. Maybe the multi-generational minor heap is not a good idea, but hopefully it keeps the pressure on the major shared heap low. And in a language without mutation you allocate much more, so the generational hypothesis holds very well.

My design has also grown organically, since I went through the full evolution:

  1. stop-the-world Cheney
  2. stop-the-world multi-generational Cheney
  3. stop-the-world multi-generational Cheney + incremental mark-sweep for the major heap
  4. stop-the-world multi-generational Cheney + concurrent parallel mark-sweep
  5. only-locally-stopping multi-generational Cheney + concurrent parallel mark-sweep

Maybe I should kick out some things and simplify. But I feel complexity is still somewhat under control ;)

justinethier commented 4 years ago

Thanks for the feedback. Let me experiment with this and see how bad it is. I have a large benchmark suite to test with, and even though this will take some work, it will be much easier to get something up and running with this change than with read barriers.

This I don't understand since you still have to fix references if you are promoting mutable objects (my starting question from above).

Good catch - objects that are either mutable (refs to them may exist on the stack) or have children (child objects may be on the stack) cannot simply be copied.
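A sketch of the resulting fast path (illustrative names; only Cyc_make_shared_object is from the discussion above):

object make_shared(gc_thread *thd, object obj) {
    if (!is_stack_object(thd, obj))
        return obj;                     /* already on the heap */
    if (is_immutable(obj) && has_no_children(obj))
        return copy_to_heap(thd, obj);  /* no references to fix up */
    /* mutable, or children may still be on the stack: forward the
       whole transitive closure by running a minor GC */
    return relocate_via_minor_gc(thd, obj);
}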

Just for my notes, a write barrier will also be needed for global_set (set! of a global variable).

Your GC sounds very interesting. Is there any more information available online or is it too early for that?

minad commented 4 years ago

Your GC sounds very interesting. Is there any more information available online or is it too early for that?

Not yet, unfortunately. But I think the overall GC design is not significantly different from your design, manticore or ocaml-multicore. Multiple pretty standard components glued together. But there are many crucial details, which are usually not talked about much. You only find them when you implement the thing - you probably observed that too. I can ping you when I am releasing something, but it still needs time.

justinethier commented 4 years ago

100% agreed, the devil is in the details. Just one bug and the whole house of cards comes crashing down :)

I suppose besides correctness the big concern is performance. There were a couple of small bugs in my GC that took a long time to find because they did not affect correctness (everything worked fine), but by holding locks too long or caching too much data the GC was not running at full speed. At a high level, using a lazy sweep significantly improved performance as well.

One thing I struggled with was when to call the major GC. I suspect this depends to an extent on both the collector and the language. I just recently found this article, which would have been helpful at the time (and is still a good introductory read): https://www.craftinginterpreters.com/garbage-collection.html#when-to-collect

Please feel free to ping me when you are further along. It's fun learning more about GC especially after having spent a considerable amount of time working on one :)

minad commented 4 years ago

100% agreed, the devil is in the details. Just one bug and the whole house of cards comes crashing down :)

Yes. My experience was also that a GC bug tends to be triggered quite often, since the code paths are executed so frequently and it impacts the whole system. However, debugging is hard since the crash often occurs much later.

At a high level, using a lazy sweep significantly improved performance as well.

This is something I would also like to know more about, in particular your performance gains. I am not doing lazy sweeping since I suspected it could lead to unexpected latencies during heap allocation. I am using concurrent marker and sweeper threads. The sweepers and mutator threads acquire whole segments of the heap, which are then used for allocation/sweeping without contention and with good locality. Each segment has its own free lists. The downside of this approach is that a little more memory is used, since each thread has its own segment, which is released from time to time. Maybe there are better approaches?

Lazy sweeping would probably mean, in my case, that a mutator performs the sweep of a yet-unswept segment as soon as it cannot find a segment which is sufficiently free? But I would like to understand why that is better than using a separate sweeper thread (assuming there are sufficiently many cores). In the single-threaded case, without thread support, I perform incremental collection; I suspect lazy sweeping would be better there?

One thing I struggled with was when to call the major GC.

I will take a look at the link later. How are you solving that now? I am using an allocation counter: after a certain accumulated allocation size (a percentage of the last used heap size), I trigger the major GC.
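In code, the trigger is roughly this (illustrative names):

void on_alloc(gc_ctx *gc, size_t bytes) {
    gc->allocated_since_gc += bytes;
    /* trigger once allocations since the last cycle exceed a fixed
       percentage of the heap size that was live after that cycle */
    if (gc->allocated_since_gc > gc->last_live_size * TRIGGER_PERCENT / 100) {
        gc->allocated_since_gc = 0;
        request_major_gc(gc);
    }
}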

It's fun learning more about GC especially after having spent a considerable amount of time working on one :)

Yes, definitely!

minad commented 4 years ago

The solution is to add a new color (purple) to indicate garbage objects on the heap. Garbage can then be swept while the collector is busy doing other work such as mark/trace.

I found this in your document https://github.com/justinethier/cyclone/blob/master/docs/Garbage-Collection-Using-Lazy-Sweeping.md. Is this one reason for the performance gain, that you can sweep concurrently?

The solution you are using sounds elegant. But as far as I understand, sweeping is performed incrementally by the mutator threads? In my design I tried to offload the work from the mutators, but this only makes sense assuming there are some spare cores with less load. If you optimize for throughput, the lazy sweeping design should win, and this is what you observe, I assume?

You also mention that the benchmarks profiting most are the ones where many short-lived objects are produced which end up in the major heap. Since I am using multiple generations for the minor heap, I can tune that a bit and delay premature promotion to the major heap. Multiple minor generations collected by copying would also be possible in your design as an intermediate step, if the complexity were worth it (but I suspect your GC is already performing very well).

I also have to admit that I don't have a benchmark suite as sophisticated as yours, so I have a hard time giving you solid numbers. I am using some event logging/tracing for plotting and to measure pause times, and I have a few simple benchmarks. At some point I also thought about doing some GC autotuning to see which settings make sense - a million things to do.

justinethier commented 4 years ago

Right now the collector initiates major GC when the amount of free memory drops below a threshold. I am not sure if this is the ideal solution but it works well enough in practice.

Regarding lazy sweeping, part of the reason for the performance gains is that after this change, allocation and sweeping no longer required any locking in my implementation. It sounds like you already handle this by dedicating each segment of the heap to either a mutator or a sweeper. Are you already performing an incremental sweep of the heap, then? You may already have realized much of the performance gain.

Several of the programs in this benchmark suite were ported from other languages. If you are looking for more benchmarks you could certainly port some of them :)

On the other hand I do not have good data on GC latency. It would be interesting to collect more information on that front.

minad commented 4 years ago

Regarding lazy sweeping, part of the reason for the performance gains is that after this change, allocation and sweeping no longer required any locking in my implementation. It sounds like you already handle this by dedicating each segment of the heap to either a mutator or a sweeper. Are you already performing an incremental sweep of the heap, then? You may already have realized much of the performance gain.

Yes, each mutator/sweeper thread acquires a dedicated segment. If a segment is full/swept, it is released back to the heap. Locking is only required when these segments are acquired/returned; no locking is needed when operating on the dedicated segments. I made this design decision quite early, since I found high contention when I used a shared heap data structure. Instead of trying to fiddle with complicated lock-free data structures for the heap, I decided to avoid the problem altogether and simply use dedicated segments, at the cost of higher memory consumption.

It is possible to configure the GC to perform incremental sweeping. In the case with concurrent sweepers, the sweepers acquire a segment, sweep it fully, and release it back to the heap; sweeping is incremental in units of segments. If no concurrent sweeper threads are used, the mutator threads perform incremental sweeping: they acquire a segment, sweep it fully (or only partially if time is up), and release it back to the heap. Time-limiting the sweeping is probably not really useful since segments are not very large (but I wanted to experiment a bit with latencies) - I think 2 MiB by default for small object sizes. Large objects are allocated much more slowly, directly from the page allocator.
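Sketched in code, the contention-free allocation path looks something like this (illustrative names):

void *alloc_small(thread_ctx *t, size_t size) {
    /* the thread owns its segment outright, so this path takes no lock */
    void *p = freelist_pop(&t->segment->free[size_class(size)]);
    if (p != NULL)
        return p;
    /* segment exhausted: the heap lock is taken only to swap segments */
    pthread_mutex_lock(&t->heap->lock);
    release_segment(t->heap, t->segment);
    t->segment = acquire_segment(t->heap);   /* e.g. a 2 MiB unit */
    pthread_mutex_unlock(&t->heap->lock);
    /* large objects take a separate path via the page allocator */
    return freelist_pop(&t->segment->free[size_class(size)]);
}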

In your case, is it possible to tune things (more/smaller segments, etc.) to save memory? Is it lockless? Are you using a more complicated data structure? I've seen papers by Ohori/Ueno which also describe heap data structures in detail, but I didn't look deeply into those.

Several of the programs in this benchmark suite were ported from other languages. If you are looking for more benchmarks you could certainly port some of them :)

I will certainly do that, but regarding the language a few things are also still in flux.

On the other hand I do not have good data on GC latency. It would be interesting to collect more information on that front.

As I mentioned before, I am using an event system which basically writes out event data structures (a timestamp and some payload) at many positions in the runtime, if enabled. I made my own implementation for that, but I could also recommend LTTng or SystemTap/DTrace; that way you should get something up and running within an hour. LTTng writes the data in a quite nice binary format called CTF, and there are other ongoing projects to analyze the resulting trace logs.

justinethier commented 4 years ago

So far the change is proceeding well. A new write barrier has been added to the applicable mutation functions that will intercept stack objects being referenced by the heap and either copy them to the heap or initiate minor GC.

Direct mutation of a global variable (e.g. (set! x (+ x 1))) is unfortunately harder to unwind, and I still need to get around to that.

I can post more numbers but across all the benchmarks there is a ~3% performance penalty for removing CPS for mutable operations and a 2% penalty for the new write barrier.

Some benchmarks are hit worse than others though. destruc was particularly affected:

Previous: 19.634847
No CPS:   24.424724
New WB:   25.876992

Note that the write barrier itself is not imposing much of a penalty. Most of the performance impact was in losing the ability to inline calls to mutation primitives.

Regarding latency, there is some initial data here for minor GC. I would be curious to know your thoughts: https://github.com/justinethier/justinethier-misc/tree/master/data/gc-latency

minad commented 4 years ago

A new write barrier has been added to the applicable mutation functions that will intercept stack objects being referenced by the heap and either copy them to the heap or initiate minor GC.

Just for me to understand - if immutable objects are referenced by the shared heap, they are copied, and in the case of mutable objects a minor GC is triggered to avoid the race condition we discussed before?

Do you have numbers on the impact of the minor GC? Before implementing the change, you triggered the GC from the library level somehow, right?

I can post more numbers but across all the benchmarks there is a ~3% performance penalty for removing CPS for mutable operations and a 2% penalty for the new write barrier.

What does it mean to remove CPS? Is the write barrier called in tail position, always longjmping to the top of the stack? But that always implies a minor GC for Cheney on the MTA, as far as I understand?

I think I don't understand this fully. It seems to me that there are three things:

  1. Losing the possibility of inlining the fast path of the write barrier (it would be interesting to see whether inlining could be restored partially or completely in some cases by doing a bit of static analysis, etc.)
  2. The change of control flow from removing CPS. It is not fully clear to me how the code looked before and after; maybe you could show some pseudo code?
  3. The additional penalty for the new, more complex barrier (more branches, etc.)

Just for comparison, in my system the CPS'd code looks something like this:

cont(foo) {
    prologue; // setting up some local variables/registers
    ...
    sp[0] = a; // stack manipulation
    sp[1] = b;
    sp += 2;
    ...
    write_barrier(&a, index, b); // slow call into the runtime, not inlined!
    ...
    a[index] = b; // if we wouldn't need the write barrier, we could do something like this
                  // initializing heap objects for example
    ...
    tailcall(bar); // this needs the gcc/llvm plugin
}
cont(bar) {
    prologue;
    ...
    tailcall(foo);
}

Regarding latency, there is some initial data here for minor GC. I would be curious to know your thoughts: https://github.com/justinethier/justinethier-misc/tree/master/data/gc-latency

I will take a look!

justinethier commented 4 years ago

Basically before the change, one of these functions such as Cyc_set_car could be called at any time because it did not affect the control flow:

   func() {
     Cyc_set_car2(...);
     foo();
     bar();
     ...
   }

Now the function receives a continuation argument, and if minor GC is triggered the function must call into that continuation - this is how minor GC works in our runtime. Apologies for the confusion, but this is what I meant when I referred to CPS. Also, I meant to write that there is a performance penalty for forcing these functions to always use CPS:

   func() {
     Cyc_set_car2(k, ...);
   }

I suppose technically it might be possible to inline these functions but it would be tricky because there would need to be two control flow paths depending upon whether GC is required.

Does that help? What are your thoughts regarding your GC?

minad commented 4 years ago

Regarding latency, there is some initial data here for minor GC. I would be curious to know your thoughts: https://github.com/justinethier/justinethier-misc/tree/master/data/gc-latency

How large did you choose the stack size / size of the nursery? Usually the pauses look very good - I guess mostly when not many objects are promoted to the major heap. However, there are these larger spikes, 30 ms etc. It would be interesting to see what the pauses are in worst-case scenarios - basically filling the whole nursery with very small objects which are all kept alive for a while, such that they are all promoted.

In my system, at least, promotion to the major heap is comparatively expensive in contrast to nursery/minor heap bump allocations. Therefore the collections of the young generations are usually relatively cheap, except for the collection of the oldest minor generation, when many objects are finally promoted to the major heap. The slowness comes from using log2-segregated free lists: the memory accesses are much more stochastic and there are many more cache misses. And I am already using per-thread dedicated heap segments for allocations.

This is an area where a bit more investigation into other malloc designs might help. But I am not so sure much can be done about it, except maybe using some method to ensure better locality, such that related data is closer together. Right now I am just trying to reduce fragmentation. Using dedicated heap segments probably also helps a bit with per-thread locality.

Another possibility could be to adjust the nursery size dynamically based on survival rates, such that latencies are kept low at the cost of throughput, if that's the goal (I implemented something like this). Furthermore, the number of generations could be adjusted dynamically, allowing older generations to be skipped and objects to be promoted directly to the major heap (I haven't tried something like this yet). This would reduce the costs during times when most things end up in the major heap anyway.

So I talked more about what I am trying now - but maybe it is still interesting for you!

minad commented 4 years ago

Does that help? What are your thoughts regarding your GC?

OK, I understand. In my design the write barrier cannot trigger a minor GC as of now. For that to work, it would also be necessary to pass a continuation as an argument, as in your case. My GC is triggered by a heap-limit check in the continuation prologue:

cont(foo) {
    prologue;
    if (hp + n > hl) {
        ...
        tailcall(interrupt); // allocate new block/run minor gc/other runtime interrupts...
    }
    ...
    val* a = hp;
    hp += n;
    ...
}

I suppose technically it might be possible to inline these functions but it would be tricky because there would need to be two control flow paths depending upon whether GC is required.

Yes it would probably be tricky and you would end up with a lot of code duplication too.

minad commented 4 years ago

About the impact of the write barrier with continuation - this is an interesting topic. If you write something like the following, instead of always using the continuation, the normal control flow would continue directly with func_cont. In comparison with your current solution, one should basically measure the difference between a ret + direct jmp and an indirect jmp.

Regarding indirect jmps - I also use them a lot when the tail call destination is unknown - I measured an unbelievable reduction in performance with Spectre/Meltdown mitigations activated. I think the kernel simply deactivates the branch predictor for indirect jumps. You probably have that problem too - I think many functional languages using some kind of CPS conversion have that issue?

func_cont() {
    ...
}
func() {
    Cyc_set_car2(func_cont, ...); // the continuation is only used if minor gc is triggered
    func_cont();
}
justinethier commented 4 years ago

OK, this is a lot to digest, thank you :)

Cyclone uses the C stack as its nursery, with a limit of 500K. This is not quite as efficient space-wise as having 500K for a bump allocator, since we need to share that space with the C runtime. So function calls eat into it, etc.

Anyway, I suspect the spikes may be related to allocation on the heap rather than nursery size. Our lazy sweeps are performed by the allocator so every so often there is a slow path where we need to sweep up a heap page before we can allocate memory. These spikes seem short enough though I wonder if they will be a problem for certain applications (graphics, pseudo real-time, etc). Need more investigation here.

Your code example is intriguing. Most of the time Cyclone's continuations are invoked through function pointers rather than an explicit function call, so I am not sure whether this type of pattern will help much with jumps. However, by returning from Cyc_set_car2 and then calling func_cont we do save a bit of stack space. That is a more efficient use of resources and is worth looking into.

My main concern right now is figuring out how to add a write barrier for set!. Not sure yet how difficult that will be...

minad commented 4 years ago

My main concern right now is figuring out how to add a write barrier for set!. Not sure yet how difficult that will be...

You could implement globals as pointers to heap-allocated boxes; then you wouldn't need a different barrier. I am not sure the overhead of that is acceptable, however. I don't know how prevalent globals are in Scheme - or is their use more or less discouraged?
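A sketch of the boxed-global idea (illustrative names):

/* The global itself is an immutable pointer to a heap-allocated cell,
   so (set! x v) becomes an ordinary mutation of a heap object and can
   reuse the existing object write barrier. */
typedef struct { object value; } box;

box *global_x;   /* cell allocated on the major heap at program start */

void global_set_x(gc_thread *thd, object v) {
    /* goes through the same write barrier as set-car! and friends,
       which performs the check and the store */
    write_barrier(thd, &global_x->value, v);
}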

Anyway, I suspect the spikes may be related to allocation on the heap rather than nursery size.

Yes, they are related to allocations on the heap. However, the larger your nursery is, the more live objects are allocated at once in the heap during the minor GC pause. In my GC, pause times scale linearly with nursery size / number of surviving objects.

Our lazy sweeps are performed by the allocator so every so often there is a slow path where we need to sweep up a heap page before we can allocate memory.

Can you measure the time needed for the sweeping separately, so that one knows how much of the pause time is due to sweeping and how much is due to heap allocations?

These spikes seem short enough though I wonder if they will be a problem for certain applications (graphics, pseudo real-time, etc). Need more investigation here.

I think pauses on the order of 1-10 ms are very much acceptable in soft real-time applications. In your data there were some pauses around 70 ms, and that might be a bit too long for certain applications - but these longer pauses were also very rare. And it all depends on your goals. Some of the longer pauses could also be due to the OS scheduler.

I would suggest that you measure how much time is consumed by allocations, in particular in the worst-case scenario of filling up the nursery with many long-lived small objects which are then moved to the heap. In that scenario my GC needs about 15 ms with a 1M nursery. If I make the nursery larger/smaller, I get higher/lower throughput and longer/shorter pauses.

minad commented 4 years ago

Btw, how do you handle equality/identity for the copied objects, if an immutable object lives on both the stack and the heap?

justinethier commented 4 years ago

Thanks for the feedback, let me look into a more careful analysis of time spent allocating / sweeping once the issues with globals are sorted out.

The same barrier will work for globals but there are a few complications with them due to how we track GC roots.

Good question about equality. Instances of the same object located in different places (e.g. stack/heap) will not be equal in the sense of eq?. But they will be equal in the sense of = / equal?, as those functions perform a deeper comparison. I do not think this is a significant issue, but it is a factor a user may need to be aware of.

minad commented 4 years ago

Good question about equality. Instances of the same object located in different places (e.g. stack/heap) will not be equal in the sense of eq?. But they will be equal in the sense of = / equal?, as those functions perform a deeper comparison. I do not think this is a significant issue, but it is a factor a user may need to be aware of.

I handle this similarly. I provide an unsafe identity primitive which might give wrong results if objects are promoted; basically, the primitive only gives correct results for objects which are always allocated directly on the major heap. However, I am not sure such an approach is acceptable for eq?, since eq? might have very strictly defined semantics in the Scheme standard - or is its behavior implementation-defined?

minad commented 4 years ago

Regarding the write barrier with the continuation argument - do you have some results on why this leads to a significant slowdown? I am now thinking about changing my write barriers so that they can execute a GC, and this would require passing a continuation argument, as in your implementation. Right now my write barrier is unable to run a local GC collection, and I have to do a lot of work to circumvent that.

Therefore I am considering whether it would not be better to trigger a minor/local GC collection when one of the bad but hopefully rare scenarios happens: 1. promoting mutable objects to the shared heap, 2. finding an already-promoted object during promotion.

You might wonder why scenario 2 is problematic - I am not using a hash table for the already-promoted objects but a simple chunked list. I tried a hash table, but it was much slower in the common case. So when I promote an object, I set a bit in the object header. When an object is discovered where the header bit is already set, I have to scan the list of already-promoted objects. This also means that I have to jump through quite a few hoops to ensure that this does not lead to quadratic overhead.

justinethier commented 4 years ago

The issue for our system is that the continuation argument implies the code will call into the continuation. This means Cyclone cannot inline the call and has to use continuation passing style. This leads to overhead since there are more C function calls being made and more stack bounds checks being performed due to Cheney on the MTA. Depending on the code it could also mean there are other optimizations we cannot perform.

So it is really more of an optimization problem. As much as it hurts to lose some performance, correctness is more important here, so I am planning to merge these changes back into my main branch once everything has been cleaned up and tested.

Anyway, the good news is that none of this is directly related to GC or the write barrier. The actual impact there is minimal; we run more minor GCs, but without many objects to move there is not much overhead - it just means we call longjmp more often.

Apologies for my confusion, but why do you have to scan the list of promoted objects again when a promoted object is found via the header bit?

minad commented 4 years ago

Anyway, the good news is that none of this is directly related to GC or the write barrier. The actual impact there is minimal; we run more minor GCs, but without many objects to move there is not much overhead - it just means we call longjmp more often.

I find it hard to believe that this leads to so many more longjmps. But probably, if you have many writes, it really adds up in stack usage. In my system I don't have this problem, since everything is a tail call via the plugin. Passing a continuation argument to the write barrier actually means pushing a continuation onto the thread stack and tail-calling the barrier; after doing its work, the continuation is popped from the thread stack.

Apologies for my confusion, but why do you have to scan the list of promoted objects again when a promoted object is found via the header bit?

The list of promoted objects is actually a list of (old, new) pairs. There is no object header; the forwarded bit is stored in a bitmap at the beginning of the memory block. If an object is forwarded, the bit is set and a pair is pushed onto the list of promoted objects. Now, while promoting the transitive closure, an old object can be found twice, due to cycles or due to a prior promotion. Then I have to find the corresponding new object, which means scanning the list. But I only want to scan the list once per promotion of a transitive closure. The way I solve this is by swapping the first word of the old object with the new pointer, and after the whole promotion procedure all of this has to be restored. This sounds like a lot of work, but things are still linear in complexity, and fortunately it happens only rarely. The way I implemented it is also isolated from the rest of the code base, so it is not that bad for now.

The reason for all this complication is that I am using headerless objects and 64-bit pointer tagging/NaN-boxing: the object tag and everything else are stored in the 64-bit pointer to the object. But it could all go away if I simply put the write barrier always in tail position, thus making it a safe point; then the garbage collector would take care of the forwarded objects.
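To make the promotion bookkeeping above concrete, a rough sketch (illustrative names):

#include <string.h>

typedef struct { value old; value copy; } fwd_pair;

value promote(gc_ctx *gc, value old) {
    block *b = block_of(old);
    size_t idx = slot_index(b, old);
    if (bitmap_test(b->forwarded_bits, idx)) {
        /* seen before (cycle or prior promotion): look up the copy in
           the chunked list of (old, copy) pairs */
        return find_promoted(gc, old);
    }
    value copy = shared_alloc(gc, object_size(old));
    memcpy(object_ptr(copy), object_ptr(old), object_size(old));
    bitmap_set(b->forwarded_bits, idx);
    fwd_list_push(&gc->promoted, (fwd_pair){ old, copy });
    /* scanning the children of the copy happens in the surrounding
       promotion loop over the transitive closure */
    return copy;
}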

minad commented 4 years ago

I made the move to safe-point write barriers. This simplifies things a lot (no read barriers, etc.). The main difficulty was that I had to replace all the calls in the runtime where write barriers had been used freely.

I think in my case the new write barrier could even perform better than the old one, because C foreign calls are a bit of a mismatch with the continuation style of the code: they usually involve a lot of register spilling, etc. You might observe something like this too, i.e., that these unrolled tail calls allow different kinds of optimizations at the assembly level, in contrast to C foreign calls, which involve a call instruction and register spilling.

However, at a higher level there could be a downside due to continuations being split and thus less amenable to optimization by the C compiler. For the optimizations my compiler performs at the higher level this is mostly irrelevant, since they happen on a direct-style lambda level before converting to CPS for code generation. Could you pin down a bit better where the slowdown in your case comes from? At which IR level does your Scheme compiler perform its optimizations?

cont(foo) {
    prologue;

    old_write_barrier(obj, field, val); // C foreign call to old write barrier

    *sp++ = cont_to_val(next); // continue in next when new_write_barrier returns
    tailcall(new_write_barrier, obj, field, val); // tail call to new write barrier
}

cont(next) {
    prologue;
    ...
}
minad commented 4 years ago

Update: I did a few experiments with the safe-point write barrier, and it seems this can sometimes really hit pathological cases - many mutable objects being created and then shared, which leads to many calls into the minor GC.

While the minor GC is pretty fast, it is tuned to perform its work at certain time intervals. In this pathological case it leads to excessive work, in particular since the minor heap is structured in multiple generations: data is prematurely promoted to the next generation, etc. I saw up to 85% of the time being spent in minor GC in one case, solely due to the promotions. Hopefully in your case the excessive GC calls are not as expensive, since you only have one minor generation on the stack.

I am now experimenting with a third solution. Basically, I have tried all of the following:

  1. Allocate mutable objects in the nursery and use a read barrier for mutable objects. My experience with this is good, but there is a penalty due to the read barrier, plus increased complexity. Write barriers are not safe points.
  2. Allocate mutable objects in the nursery and enter minor GC as soon as mutable objects are promoted to the shared heap, to fix the references. There are no read barriers, but write barriers must be safe points. The complexity is acceptable, but I had to restructure a few things to move write barriers into tail position everywhere. Unfortunately I am hitting pathological cases. This is also something I intuitively suspected, which is why I didn't consider it as a solution in the beginning.
  3. Allocate mutable objects directly in the major-local heap. There are no read barriers, and write barriers do not have to be safe points.

Overall, the third solution is the simplest I came up with. It is similar to the original Doligez-Leroy proposal, where mutable objects are allocated directly in the shared heap. The significant difference, however, is that my non-moving major heap is separated into major-local and major-shared. This allows allocating in major-local without causing unnecessary promotions, since references from major-local to the corresponding local minor heap are allowed.

My heap structure looks as follows:

processor-local/minor: nursery, tenure0, tenure1, ...
major: major-local, major-shared

Promoting from the major-local heap to the major-shared heap just requires flipping a bit in the object header.
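Concretely, that promotion is just this (sketch, illustrative names):

void promote_to_shared(object *obj) {
    if (obj->header & SHARED_BIT)
        return;                              /* already in major-shared */
    obj->header |= SHARED_BIT;               /* no move, no pointer fixing */
    for_each_child(obj, promote_to_shared);  /* share the transitive closure */
}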

@justinethier Please let me know if I should stop spamming here ;) But it might still be interesting for you, even if your design is quite a bit different. It certainly helps me think about things when I write them down ;)

justinethier commented 4 years ago

Could you pin down a bit better where the slowdown in your case comes from? At which IR level does your Scheme compiler perform its optimizations?

Cyclone transforms everything to CPS and then optimizes that IR. The issue for our system is that the CPS cannot be optimized as well when these mutation operations require a continuation. It also has the downside of preventing other optimizations further downstream. For example, simple loops containing a mutation cannot be converted to single C-function loops right now. I suspect this could be overcome with some work, since in some cases the continuation will not be used.

Unfortunately I have a couple of issues in my runtime that are preventing this feature from being merged back for release, but I am hoping to get everything in place soon :)

Allocate mutable objects directly in the major-local heap. There are no read barriers, and write barriers do not have to be safe points.

Will you have a problem with objects allocated in the nursery that are then assigned to mutable objects, though? Basically the issue you pointed out originally. How do you deal with that?

Regarding the GC latency spikes with Cyclone, I added some more instrumentation and tracked that down: there are cases where a single minor GC can invoke many sweeps of heap sub-sections. I suspect this happens when the heap is mostly full and a single sweep does not free up much space. Will have to give some thought as to what can be done. It would be nice if there could be a hint that the heap section does not have many free slots left. The collector could probably keep track of this while tracing objects, since we know the bounds of each heap region and could correlate each object back to the section that owns it.

Also, you aren't spamming, please feel free to post here :). I like hearing about your GC, and hopefully my feedback is helpful. If nothing else this could be a helpful form of rubber ducking. GC's are a specialized topic anyway so it is good to be able to talk things through and share ideas.

minad commented 4 years ago

Cyclone transforms everything to CPS and then optimizes that IR. The issue for our system is that the CPS cannot be optimized as well when these mutation operations require a continuation. It also has the downside of preventing other optimizations further downstream.

Ok, I see!

For example, simple loops containing a mutation cannot be converted to single C-function loops right now. I suspect this could be overcome with some work, since in some cases the continuation will not be used.

This is an interesting optimization. For now I am not doing that, since I wanted to have more safe points; writing things as a loop could leave a thread stuck for quite some time. In my case loops look like this:

cont(loop) {
    prologue; // this is a polling safepoint, might jump into the runtime if requested.
              // as an optimization, this could also be replaced by something like a `lea` instruction,
              // entering a signal handler via segfault.
    ...
    tailcall(loop);
}

I guess in Cyclone there is no mechanism to interrupt threads from outside, and all interrupts can be handled during stack unwinding? Is there no scenario where long-running loops create problems, or do you have some other mechanism to interrupt loops and save the state somehow?

If I wanted to optimize this at some later point, I would probably try two things:

Allocate mutable objects directly in the major-local heap. There are no read barriers, and write barriers do not have to be safe points.

Will you have a problem with objects allocated in the nursery that are then assigned to mutable objects, though? Basically the issue you pointed out originally. How do you deal with that?

The original issue was assigning a mutable object A to another mutable object B living in the major-shared heap. Then A must be promoted to A', requiring adjustment of all local references to A so that they point to A'. In solutions 1 and 2 this issue is present; in solution 3 it is not, since all mutable objects live in the major heap right away.

In more detail regarding solution 3: promotion of a mutable object A living in the major-local heap is necessary if A is assigned to another object B living in major-shared. However, promotion in this case simply toggles a shared bit in the header of A; additionally, the whole transitive closure has to be promoted too. The crucial point here is that the object A is not moved during promotion, so no reference adjustment is necessary.

Furthermore, promotion of an immutable object A living in the minor heap is necessary if A is assigned to another object B living in major-shared. Then a new object A' is created in major-shared, and the references to A are adjusted during GC at some later point. The adjustment can be done later since A is immutable and it doesn't matter if references to both A and A' are alive at the same time.

Regarding the GC latency spikes with Cyclone, I added some more instrumentation and tracked that down: there are cases where a single minor GC can invoke many sweeps of heap sub-sections. I suspect this happens when the heap is mostly full and a single sweep does not free up much space.

Interesting, so the culprit is the sweeping.

Will have to give some thought as to what can be done. It would be nice if there could be a hint that the heap section does not have many free slots left. The collector could probably keep track of this while tracing objects, since we know the bounds of each heap region and could correlate each object back to the section that owns it.

How does it help if the collector knows how many free slots are left? To distribute some work more equally?

Also, you aren't spamming, please feel free to post here :). I like hearing about your GC, and hopefully my feedback is helpful. If nothing else this could be a helpful form of rubber ducking. GC's are a specialized topic anyway so it is good to be able to talk things through and share ideas.

Cool! Yes, explaining things is very helpful and maybe something can be learned along the way!

justinethier commented 4 years ago

I guess in Cyclone there is no mechanism to interrupt threads from outside, and all interrupts can be handled during stack unwinding? Is there no scenario where long-running loops create problems, or do you have some other mechanism to interrupt loops and save the state somehow?

I should have clarified: the generated C code checks on each iteration whether our stack limit has been exceeded and, if so, the loop is interrupted to perform GC. We then call back into the continuation that contains the loop so it can be resumed.
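The generated code has roughly this shape (illustrative names, not the identifiers Cyclone actually emits):

void loop(void *data, object k, object n) {
    char local;   /* its address marks the current C stack depth */
    if (stack_limit_exceeded(data, &local)) {
        /* pack the live state into a closure and longjmp into minor GC;
           afterwards the trampoline re-enters this continuation */
        GC(data, make_closure(loop, k, n));
    }
    /* ... loop body, then a tail call back into loop via its continuation ... */
}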

minad commented 4 years ago

@justinethier I think we can close this? My implementation seems to have converged more or less and I guess the same applies to Cyclone. Thank you very much for the discussion! If something interesting comes up, feel free to ping me!

I just stumbled upon an OCaml multicore update: interestingly, it seems OCaml multicore is reconsidering synchronized stop-the-world GCs. See https://discuss.ocaml.org/t/where-is-multicore-development-happening/4997/8