mmtk / mmtk-core

Memory Management ToolKit
https://www.mmtk.io

Let MarkSweepSpace use BlockPageResource #1150

Closed wks closed 5 months ago

wks commented 5 months ago

Let MarkSweepSpace use the BlockPageResource instead of the raw FreeListPageResource. This solves a problem where multiple GC workers contend for the mutex FreeListPageResource::sync when calling FreeListPageResource::release_pages concurrently during the Release stage.
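The contention problem can be illustrated with a small sketch (not MMTk's actual API): with a single Mutex-protected free list, every release takes the shared lock once per block, whereas batching into worker-local buffers takes the lock only once per worker at flush time. All names below are hypothetical stand-ins.

```rust
use std::sync::{Arc, Mutex};
use std::thread;

// Every release acquires the shared lock, like FreeListPageResource::release_pages
// contending on FreeListPageResource::sync during the Release stage.
fn release_via_shared_mutex(n_workers: usize, per_worker: usize) -> usize {
    let shared = Arc::new(Mutex::new(Vec::new()));
    let handles: Vec<_> = (0..n_workers)
        .map(|w| {
            let shared = Arc::clone(&shared);
            thread::spawn(move || {
                for i in 0..per_worker {
                    // One lock acquisition per released block.
                    shared.lock().unwrap().push(w * per_worker + i);
                }
            })
        })
        .collect();
    for h in handles {
        h.join().unwrap();
    }
    let n = shared.lock().unwrap().len();
    n
}

// Blocks accumulate in a worker-local buffer without locking, and the shared
// structure is touched once per worker at flush time, as BlockPageResource does
// with its per-worker block queues.
fn release_via_local_buffers(n_workers: usize, per_worker: usize) -> usize {
    let shared = Arc::new(Mutex::new(Vec::new()));
    let handles: Vec<_> = (0..n_workers)
        .map(|w| {
            let shared = Arc::clone(&shared);
            thread::spawn(move || {
                // Lock-free accumulation into a local buffer...
                let local: Vec<usize> = (0..per_worker).map(|i| w * per_worker + i).collect();
                // ...then a single lock acquisition at flush time.
                shared.lock().unwrap().extend(local);
            })
        })
        .collect();
    for h in handles {
        h.join().unwrap();
    }
    let n = shared.lock().unwrap().len();
    n
}
```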

We flush the BlockPageResource in MarkSweepSpace::end_of_gc. To monitor the performance, we added a pair of USDT trace points before and after calling Plan.end_of_gc. We also count the invocation of end_of_gc into the GC time.

We changed BlockQueue so that it uses MaybeUninit to hold uninitialized elements instead of relying on B::from_aligned_address(Address::ZERO) because the Block type used by MarkSweepSpace is based on NonZeroUsize, and cannot be zero.
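The idea can be sketched as follows (hypothetical types, not mmtk-core's actual BlockQueue): a NonZeroUsize-based block has no all-zero bit pattern, so empty slots must be left uninitialized via MaybeUninit rather than filled with a zero placeholder.

```rust
use std::mem::MaybeUninit;
use std::num::NonZeroUsize;

// Hypothetical Block type mirroring MarkSweepSpace's: a non-null address,
// so there is no zero value to use as an "empty slot" marker.
#[derive(Clone, Copy, Debug, PartialEq)]
struct Block(NonZeroUsize);

// A fixed-capacity buffer whose slots start uninitialized.
struct BlockBuffer<const N: usize> {
    slots: [MaybeUninit<Block>; N],
    len: usize,
}

impl<const N: usize> BlockBuffer<N> {
    fn new() -> Self {
        Self {
            slots: [MaybeUninit::uninit(); N],
            len: 0,
        }
    }

    // Returns false when the buffer is full.
    fn push(&mut self, b: Block) -> bool {
        if self.len == N {
            return false;
        }
        self.slots[self.len].write(b);
        self.len += 1;
        true
    }

    fn get(&self, i: usize) -> Option<Block> {
        // Safety: every slot below `len` was initialized by `push`.
        (i < self.len).then(|| unsafe { self.slots[i].assume_init() })
    }
}
```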

Fixes: https://github.com/mmtk/mmtk-core/issues/1145

wks commented 5 months ago

The ImmixSpace flushes the BlockPageResource when all the SweepChunk work packets finish. This is hard to implement without significant changes because the ReleaseMutator work packet is created by the plan-agnostic Release work packet, without any context about the MarkSweepSpace. It is hard to intercept the creation of ReleaseMutator work packet and inject an Arc<FlushPageResource> into each of them. I just did what the MarkSweepSpace already does for flushing the abandoned_in_gc lists -- flushing it in end_of_gc.

The time taken to execute end_of_gc is trivial for SemiSpace and Immix (about 1us). It is still trivial (but not that trivial) for MarkSweep (8us). I'll take a look at whether the time is spent flushing the abandoned_in_gc lists or flushing the BlockPageResource.

wks commented 5 months ago

The time spent flushing the abandoned_in_gc lists is slightly more than flushing the BlockPageResource. I think it depends on the ratio of marked blocks and unmarked blocks. In this particular GC (when running MarkSweep, lusearch, 72M heap size), it took 9.691us to flush them all.

image

When the heap is bigger, the time to flush will increase. The following timeline is from the PMD benchmark (600M heap size). The time to flush abandoned_in_gc is longer, but the time to flush the BlockPageResource remains constant. I think that's because the block queue in the BlockPageResource flushes itself whenever it overflows. That increases the chance of contention during the GC, but reduces the time to flush at the end of GC.

image

k-sareen commented 5 months ago

The ImmixSpace flushes the BlockPageResource when all the SweepChunk work packets finish. This is hard to implement without significant changes because the ReleaseMutator work packet is created by the plan-agnostic Release work packet, without any context about the MarkSweepSpace. It is hard to intercept the creation of ReleaseMutator work packet and inject an Arc<FlushPageResource> into each of them. I just did what the MarkSweepSpace already does for flushing the abandoned_in_gc lists -- flushing it in end_of_gc.

The time taken to execute end_of_gc is trivial for SemiSpace and Immix (about 1us). It is still trivial (but not that trivial) for MarkSweep (8us). I'll take a look at whether it spent the time flushing the abandoned_in_gc or flushing the BlockPageResource.

Why can't we use a sentinel to flush it at the end?

wks commented 5 months ago

Why can't we use a sentinel to flush it at the end?

There are two difficulties.

(1) To know if it is at the end, we need to know the number of units of work beforehand. In ImmixSpace, the number of units is the number of SweepChunk work packets. The number is known in generate_sweep_tasks (tasks.len()).

    fn generate_sweep_tasks(&self) -> Vec<Box<dyn GCWork<VM>>> {
        self.defrag.mark_histograms.lock().clear();
        // # Safety: ImmixSpace reference is always valid within this collection cycle.
        let space = unsafe { &*(self as *const Self) };
        let epilogue = Arc::new(FlushPageResource {
            space,
            counter: AtomicUsize::new(0),
        });
        let tasks = self.chunk_map.generate_tasks(|chunk| {
            Box::new(SweepChunk {
                space,
                chunk,
                epilogue: epilogue.clone(),
            })
        });
        epilogue.counter.store(tasks.len(), Ordering::SeqCst);
        tasks
    }
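For reference, the consuming side of this pattern can be sketched like this (a sketch, not mmtk-core's exact code): each work packet signals the epilogue when it finishes, and the packet that brings the counter to zero performs the flush exactly once.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;
use std::thread;

// Sketch of the epilogue counter's consuming side. `flush_count` stands in
// for the actual BlockPageResource flush so we can observe it runs once.
struct FlushEpilogue {
    counter: AtomicUsize,
    flush_count: AtomicUsize,
}

impl FlushEpilogue {
    fn new(num_packets: usize) -> Self {
        Self {
            counter: AtomicUsize::new(num_packets),
            flush_count: AtomicUsize::new(0),
        }
    }

    // Called by each work packet as it finishes.
    fn finish_one(&self) {
        // fetch_sub returns the previous value, so exactly one packet
        // observes 1 and becomes responsible for the flush.
        if self.counter.fetch_sub(1, Ordering::SeqCst) == 1 {
            self.flush_count.fetch_add(1, Ordering::SeqCst);
        }
    }
}

// Run `num_packets` concurrent packets and report how many flushes happened.
fn run_packets(num_packets: usize) -> usize {
    let epilogue = Arc::new(FlushEpilogue::new(num_packets));
    let handles: Vec<_> = (0..num_packets)
        .map(|_| {
            let e = Arc::clone(&epilogue);
            thread::spawn(move || e.finish_one())
        })
        .collect();
    for h in handles {
        h.join().unwrap();
    }
    epilogue.flush_count.load(Ordering::SeqCst)
}
```

This only works when the counter can be preset to the final packet count up front, which is exactly what MarkSweepSpace cannot do, as explained below.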

In MarkSweepSpace, the number of units of work is the number of ReleaseMutator work packets plus the Release work packet itself. We could count the number of mutators, but as we discussed in https://github.com/mmtk/mmtk-core/issues/1146, for load balancing, it's better to break up each ReleaseMutator into fine-grained work packets, and the number of those work packets is not known until all ReleaseMutator packets have iterated through the block lists of all mutators.

We can't let some threads (such as workers executing ReleaseMutator) increment the tasks count (by generating fine-grained work packets) while other threads decrement the tasks count (by completing the fine-grained work packets). In that case, the counter may oscillate between zero and positive values. As a result, the BlockPageResource may end up being flushed multiple times, and two threads may even flush at the same time. But BlockPageResource::flush_all() is not thread-safe.

(2) In order to use the sentinel, we need to inject an Arc<FlushPageResource> into each unit of work. In ImmixSpace, generate_sweep_tasks generates all SweepChunk work packets, and assigns the epilogue field of each of them. But since ReleaseMutator is created in Release::do_work (which is plan-agnostic), we can't intercept it and add a field to the ReleaseMutator work packet (which is also plan-agnostic). We could embed the FlushPageResource data structure in MarkSweepSpace, but that still has the first problem I mentioned above.

So the easiest way is to do it in end_of_gc, like how MarkSweepSpace currently flushes abandoned_in_gc.

Alternatively, we can make a "sentinel" that is executed when the Release bucket is drained. That should be just as fast (or as slow) as doing it in end_of_gc.

wks commented 5 months ago

Performance evaluation:

Three builds: master, masterfix, and pr.

I added the masterfix build because this PR changed the way we measure GC (STW) time. The masterfix will be the representative of the old MarkSweepSpace that used the raw FreeListPageResource, and the difference between the masterfix and the pr build will be the effect of using BlockPageResource instead.

I ran lusearch from DaCapo Chopin on lynx.moma (only using the 8 big cores). I used 4 heap sizes: 51MB, 64MB, 77MB and 92MB, which are 2.428x, 3.023x, 3.678x and 4.392x min heap size w.r.t. G1 in OpenJDK. 10 invocations for each heap size. (running runbms -i 10 ... 8 3 4 5 6) I included both the MarkSweep and Immix plans.

Here are the line plots (plotty link). Each line is a build-plan tuple. The x axis is the heap factor (N/1000 min heap size). The "geomean" near 0 should be "2428", and that's probably a bug in plotty.

image

We can see a tiny amount of difference between the builds for MarkSweep, but almost no difference for Immix.

The number of GCs overflowed and was stuck at 2048 at the smallest heap factor.

The time.other is much lower at the smallest heap factor for MarkSweep. If the number is correct, it means mutators run faster when the heap is small. This is reproducible on my laptop, too. I'll investigate if that's another performance anomaly of MarkSweep.

Here is a zoomed-in line plot for time.stw.

image

And the histogram of time.stw (only MarkSweep) normalized to masterfix. (plotty link)

image

When the heap is small, the noise is large; but as the heap grows, this PR's STW time becomes progressively smaller. That's because the larger the heap is, the more blocks there are to sweep and release. The lock contention problem with FreeListPageResource::release_pages becomes more serious, while the advantage of the lock-free method BlockPageResource::release_block becomes greater.

FYI, the STW time when running Immix has no differences beyond noise between the three builds. Here is a histogram. (plotty link)

image

I'll try running the pmd benchmark which has a much larger min heap size than lusearch.

qinsoon commented 5 months ago

The improvement looks marginal. I wonder if it is because you were using only 8 GC threads and there wasn't much contention in the release phase. Did you check the pathological case you found in https://github.com/mmtk/mmtk-core/issues/1145? Does this PR improve it? Nonetheless, I think switching to BlockPageResource is the right thing to do.

wks commented 5 months ago

The improvement looks marginal. I wonder if it is because you were using only 8 GC threads and there wasn't much contention in the release phase. Did you check the pathological case you found in #1145? Does this PR improve it? Nonetheless, I think switching to BlockPageResource is the right thing to do.

It was 16 GC threads. I used taskset -c 0-15 because each big core on lynx.moma has two SMT threads, appearing to the OS as two processors.

The pathological case in #1145 was using 100M heap size, so the difference is greater in that run.

master 72M (ran on my laptop): image

master 100M (you can see the time to release mutators became much longer than 72M): image

And yes. This PR improves the speed of releasing mutators at 100M, too. image

wks commented 5 months ago

The pmd benchmark, with 7 different heap sizes ranging from 358M to 1134M, which are 1.892x to 6.000x min heap size w.r.t. G1 in OpenJDK. It ran 20 invocations per heap size instead of 10.

Line plots: (plotty link)

image

Histogram (MarkSweep-only), normalized to masterfix:

image

The result isn't as exciting as I expected. At 6x min heap size, this PR only has 6%-7% STW time reduction. It may be because pmd doesn't have such a high heap turnaround ratio as lusearch. In other words, it tends to retain more blocks rather than recycling them.

We observe a small increase in STW time from 1.8x to 3x min heap sizes, as a result of an increased number of GCs. It may be because BlockPageResource does not unmap memory, making the heap tighter for other spaces (such as LOS) and therefore triggering GC more often. There is a comment in BlockPageResource::flush_all:

    pub fn flush_all(&self) {
        self.block_queue.flush_all()
        // TODO: For 32-bit space, we may want to free some contiguous chunks.
    }

wks commented 5 months ago

I ran many other benchmarks in DaCapo Chopin (excluding some that are known to produce inconsistent results as we increase the number of iterations, likely due to the lack of reference processing). I ran with two builds (masterfix and pr), the MarkSweep plan, and three heap factors (3.023x, 3.678x and 4.392x).

There are too many plots, so I only post Plotty links.

Most benchmarks have reduced STW time with this PR.

The benchmark that has the highest STW time reduction is graphchi, with more than 60% reduction at 4.392x min heap size (w.r.t. G1 in OpenJDK). eBPF timeline shows graphchi spent a lot of time releasing mutators, likely because it simply has too many blocks to release in each GC, and the effect is amplified by the lock contention in masterfix.

masterfix: image

pr: image

The time to compute the transitive closure is also reduced, and it is likely to be a result of BlockPageResource reusing existing blocks, but I am not certain about it.

There are some benchmarks that have increased STW time instead: cassandra, fop, jython and spring. Their error bars are large, and I cannot reproduce their slowdown on my laptop. The number of GCs is small for both cassandra and fop (see this plotty link for the number of GCs), and the GCs concentrate near the end of the benchmark execution. Such benchmarks are more susceptible to random effects. For fop, some GCs may spend 50% of the time tracing descendants of finalizable objects, so the variation of STW time is more related to when and how much finalization happened than to the time spent executing ReleaseMutator. See the timeline below.

image

The effect on total time is observable, but not that dramatic.

Overall, I think this PR is effective in reducing the time of ReleaseMutator, especially in the pathological cases where the heap is large and the number of blocks to be released is high.

qinsoon commented 5 months ago

There are too many plots, so I only post Plotty links.

The performance looks good. We see up to 10% improvement of STW time in geomean across all the measured benchmarks in 3x/3.6x/4.3x heap size, about 1% improvement in total time. You can merge the PR at any time.