wks closed this pull request 5 months ago.
The `ImmixSpace` flushes the `BlockPageResource` when all the `SweepChunk` work packets finish. This is hard to implement without significant changes because the `ReleaseMutator` work packet is created by the plan-agnostic `Release` work packet, without any context about the `MarkSweepSpace`. It is hard to intercept the creation of the `ReleaseMutator` work packets and inject an `Arc<FlushPageResource>` into each of them. I just did what the `MarkSweepSpace` already does for flushing the `abandoned_in_gc` lists: flushing it in `end_of_gc`.
The time taken to execute `end_of_gc` is trivial for SemiSpace and Immix (about 1us). It is still trivial (but not that trivial) for MarkSweep (8us). I'll take a look at whether the time is spent flushing the `abandoned_in_gc` lists or flushing the `BlockPageResource`.
The time spent flushing the `abandoned_in_gc` lists is slightly more than the time spent flushing the `BlockPageResource`. I think it depends on the ratio of marked blocks to unmarked blocks. In this particular GC (running MarkSweep, lusearch, 72M heap size), it took 9.691us to flush them all.
When the heap is bigger, the time to flush will increase. The following timeline is from the pmd benchmark (600M heap size). The time to flush `abandoned_in_gc` is longer, while the time to flush the `BlockPageResource` remains constant. I think that's because the block queue in the `BlockPageResource` flushes itself whenever it overflows. That increases the chance of contention, but reduces the time to flush at the end of GC.
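The overflow behavior described above can be sketched as follows. This is a minimal single-threaded illustration, not MMTk's actual `BlockPageResource`; the names `BlockBuffer` and `SharedPool` and the spill-on-overflow policy are assumptions made for the example.

```rust
use std::sync::Mutex;

/// Hypothetical per-worker buffer of block addresses that spills into a
/// shared pool whenever it overflows, so the final flush at the end of GC
/// only has to move the small remainder.
struct BlockBuffer {
    local: Vec<usize>, // pending block addresses
    capacity: usize,   // overflow threshold
}

struct SharedPool {
    blocks: Mutex<Vec<usize>>,
}

impl BlockBuffer {
    fn new(capacity: usize) -> Self {
        Self { local: Vec::with_capacity(capacity), capacity }
    }

    /// Push a block; spill to the shared pool when the buffer is full.
    /// The overflow path contends on the shared lock *during* the GC, which
    /// is why the end-of-GC flush stays short and roughly constant.
    fn push(&mut self, block: usize, pool: &SharedPool) {
        if self.local.len() == self.capacity {
            pool.blocks.lock().unwrap().append(&mut self.local);
        }
        self.local.push(block);
    }

    /// Final flush (the end_of_gc analogue): only the remainder is left.
    fn flush(&mut self, pool: &SharedPool) {
        pool.blocks.lock().unwrap().append(&mut self.local);
    }
}

fn main() {
    let pool = SharedPool { blocks: Mutex::new(Vec::new()) };
    let mut buf = BlockBuffer::new(4);
    for b in 0..10 {
        buf.push(b, &pool);
    }
    // 8 blocks already spilled on overflow; only 2 remain for the final flush.
    assert_eq!(buf.local.len(), 2);
    assert_eq!(pool.blocks.lock().unwrap().len(), 8);
    buf.flush(&pool);
    assert_eq!(pool.blocks.lock().unwrap().len(), 10);
}
```

This matches the observation above: as the heap grows, more blocks spill during sweeping, but the amount left over for the final flush stays bounded by the buffer capacity.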
Why can't we use a sentinel to flush it at the end?
There are two difficulties.
(1) To know if it is at the end, we need to know the number of units of work beforehand. In `ImmixSpace`, the number of units is the number of `SweepChunk` work packets. The number is known in `generate_sweep_tasks` (`tasks.len()`):
```rust
fn generate_sweep_tasks(&self) -> Vec<Box<dyn GCWork<VM>>> {
    self.defrag.mark_histograms.lock().clear();
    // # Safety: ImmixSpace reference is always valid within this collection cycle.
    let space = unsafe { &*(self as *const Self) };
    let epilogue = Arc::new(FlushPageResource {
        space,
        counter: AtomicUsize::new(0),
    });
    let tasks = self.chunk_map.generate_tasks(|chunk| {
        Box::new(SweepChunk {
            space,
            chunk,
            epilogue: epilogue.clone(),
        })
    });
    epilogue.counter.store(tasks.len(), Ordering::SeqCst);
    tasks
}
```
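For context, the count-down side of this epilogue pattern can be sketched like this. `FlushPageResourceSketch` and `finish_one` are illustrative stand-ins, not MMTk's actual code: the point is that each finished packet decrements the counter, and exactly one worker (the one that brings it to zero) performs the flush.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;

/// Simplified stand-in for the FlushPageResource epilogue above.
struct FlushPageResourceSketch {
    counter: AtomicUsize,
}

impl FlushPageResourceSketch {
    /// Called by each finished work packet. Returns true for exactly one
    /// caller: the one whose decrement takes the counter from 1 to 0,
    /// i.e. the last packet to finish, which may then flush safely.
    fn finish_one(&self) -> bool {
        self.counter.fetch_sub(1, Ordering::SeqCst) == 1
    }
}

fn main() {
    // The counter is pre-set to the number of packets (tasks.len() above).
    let epilogue = Arc::new(FlushPageResourceSketch {
        counter: AtomicUsize::new(8),
    });
    let handles: Vec<_> = (0..8)
        .map(|_| {
            let e = epilogue.clone();
            std::thread::spawn(move || e.finish_one())
        })
        .collect();
    let flushes: usize = handles.into_iter().map(|h| h.join().unwrap() as usize).sum();
    // Exactly one worker observes the counter reach zero.
    assert_eq!(flushes, 1);
}
```

This only works because the total packet count is stored *before* any packet runs, which is exactly the property `MarkSweepSpace` lacks, as explained below.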
In `MarkSweepSpace`, the number of units of work is the number of `ReleaseMutator` work packets plus the `Release` work packet itself. We could count the number of mutators, but as we discussed in https://github.com/mmtk/mmtk-core/issues/1146, for load balancing, it's better to break up each `ReleaseMutator` into fine-grained work packets, and the number of those work packets is not known until all `ReleaseMutator` packets have iterated through the block lists of all mutators.
We can't let some threads (such as workers executing `ReleaseMutator`) increment the task count (by generating fine-grained work packets) while other threads decrement it (by completing the fine-grained work packets). In that case, the counter will oscillate between 0 and positive numbers. As a result, the `BlockPageResource` may end up being flushed multiple times, and two threads may flush at the same time. But `BlockPageResource::flush_all()` is not thread-safe.
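A deterministic single-threaded simulation of the interleaving described above (illustrative only, not MMTk code) shows the counter touching zero twice when increments race with decrements:

```rust
// One possible interleaving when packets can both complete (decrement) and
// generate new fine-grained packets (increment): another worker's decrement
// can take the counter to zero before the increments land, so the
// "flush at zero" condition fires more than once.
fn main() {
    let mut counter: i64 = 1; // one coarse ReleaseMutator-style packet known
    let mut flushes = 0;

    // A worker finishes the coarse packet and decrements...
    counter -= 1;
    if counter == 0 {
        flushes += 1; // premature flush: more work is about to appear
    }

    // ...but the coarse packet's two fine-grained packets are only
    // registered afterwards, then completed.
    counter += 2;
    counter -= 1;
    counter -= 1;
    if counter == 0 {
        flushes += 1; // second flush
    }

    // The flush ran twice, yet flush_all() is not thread-safe.
    assert_eq!(flushes, 2);
}
```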
(2) In order to use the sentinel, we need to inject an `Arc<FlushPageResource>` into each unit of work. In `ImmixSpace`, `generate_sweep_tasks` generates all `SweepChunk` work packets and assigns the `epilogue` field of each of them. But since `ReleaseMutator` is created in `Release::do_work` (which is plan-agnostic), we can't intercept it and add a field to the `ReleaseMutator` work packet (which is also plan-agnostic). We could embed the `FlushPageResource` data structure in `MarkSweepSpace`, but that still has the first problem I mentioned above.
So the easiest way is to do it in `end_of_gc`, like how `MarkSweep` currently flushes `abandoned_in_gc`.
Alternatively, we can make a "sentinel" that is executed when the `Release` bucket is drained. That should be just as fast (or as slow) as doing it in `end_of_gc`.
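The bucket-drain sentinel idea can be sketched as below. This is a minimal single-threaded model, not MMTk's scheduler; `Packet`, `drain`, and the closure-based packets are assumptions made for the example. Unlike a count-down counter, a drain check tolerates packets that generate more packets, because the sentinel only fires once the bucket is truly empty.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

/// A work packet that, when executed, may push new packets into the bucket.
struct Packet(Box<dyn FnOnce(&mut Vec<Packet>)>);

/// Drain the bucket, then run the sentinel exactly once. Packets added
/// while draining are simply processed before the sentinel fires.
fn drain(mut bucket: Vec<Packet>, sentinel: impl FnOnce()) {
    while let Some(Packet(work)) = bucket.pop() {
        work(&mut bucket);
    }
    sentinel();
}

static DONE: AtomicUsize = AtomicUsize::new(0);
static FLUSHED: AtomicUsize = AtomicUsize::new(0);

fn main() {
    // One coarse packet that expands into three fine-grained ones,
    // mirroring ReleaseMutator generating fine-grained work packets.
    let coarse = Packet(Box::new(|bucket: &mut Vec<Packet>| {
        for _ in 0..3 {
            bucket.push(Packet(Box::new(|_: &mut Vec<Packet>| {
                DONE.fetch_add(1, Ordering::SeqCst);
            })));
        }
    }));

    drain(vec![coarse], || {
        FLUSHED.fetch_add(1, Ordering::SeqCst);
    });

    assert_eq!(DONE.load(Ordering::SeqCst), 3); // all fine-grained work ran
    assert_eq!(FLUSHED.load(Ordering::SeqCst), 1); // sentinel fired exactly once
}
```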
Performance evaluation:

Three builds:

- `master`: The master branch.
- `masterfix`: The master branch, but counting the time of `Plan.end_of_gc` into GC time, just like this PR.
- `pr`: This PR.

I added the `masterfix` build because this PR changed the way we measure GC (STW) time. The `masterfix` build represents the old `MarkSweepSpace` that used the raw `FreeListPageResource`, and the difference between the `masterfix` and `pr` builds shows the effect of using `BlockPageResource` instead.
I ran `lusearch` from DaCapo Chopin on lynx.moma (only using the 8 big cores). I used 4 heap sizes: 51MB, 64MB, 77MB and 92MB, which are 2.428x, 3.023x, 3.678x and 4.392x the min heap size w.r.t. G1 in OpenJDK, with 10 invocations for each heap size (running `runbms -i 10 ... 8 3 4 5 6`). I included both the MarkSweep and Immix plans.
Here are the line plots (plotty link). Each line is a build-plan tuple. The x axis is the heap factor (N/1000 min heap size). The "geomean" near 0 should be "2428", and that's probably a bug in plotty.
We can see a tiny amount of difference between the builds for MarkSweep, but almost no difference for Immix.
The number of GCs overflowed and got stuck at 2048 at the smallest heap factor.
The `time.other` is much lower at the smallest heap factor for MarkSweep. If the number is correct, it means mutators run faster when the heap is small. This is reproducible on my laptop, too. I'll investigate whether that's another performance anomaly of MarkSweep.
Here is a zoomed-in line plot for `time.stw`.
And the histogram of `time.stw` (only MarkSweep) normalized to `masterfix`. (plotty link)
When the heap is small, the noise is large; but as the heap grows larger, this PR has smaller and smaller STW time. That's because the larger the heap is, the more blocks there are to sweep and the more blocks there are to release. The lock contention problem with `FreeListPageResource::release_pages` becomes more serious, while the advantage of the lock-free method `BlockPageResource::release_block` becomes greater.
FYI, the STW time when running Immix has no differences beyond noise between the three builds. Here is a histogram. (plotty link)
I'll try running the `pmd` benchmark, which has a much larger min heap size than `lusearch`.
The improvement looks marginal. I wonder if it is because you were using only 8 GC threads and there wasn't much contention in the release phase. Did you check the pathological case you found in https://github.com/mmtk/mmtk-core/issues/1145? Does this PR improve it? Nonetheless, I think switching to `BlockPageResource` is the right thing to do.
It was 16 GC threads. I used `taskset -c 0-15` because each big core on lynx.moma has two SMT threads, appearing to the OS as two processors.
The pathological case in #1145 was using 100M heap size, so the difference is greater in that run.
master 72M (ran on my laptop):
master 100M (you can see the time to release mutators became much longer than 72M):
And yes. This PR improves the speed of releasing mutators at 100M, too.
I ran the `pmd` benchmark with 7 different heap sizes ranging from 358M to 1134M, which are 1.892x to 6.000x the min heap size w.r.t. G1 in OpenJDK. It ran 20 invocations per heap size instead of 10.
Line plots: (plotty link)
Histogram (MarkSweep-only), normalized to masterfix
The result isn't as exciting as I expected. At 6x min heap size, this PR only has a 6%-7% STW time reduction. It may be because `pmd` doesn't have as high a heap turnaround ratio as `lusearch`. In other words, it tends to retain more blocks rather than recycling them.
We observe a small increase in STW time from 1.8x to 3x min heap sizes, as a result of an increased number of GCs. It may be a result of `BlockPageResource` not unmapping memory, making the heap tighter for other spaces (such as LOS), and therefore triggering GC more often. There is a comment in `BlockPageResource::flush_all`:
```rust
pub fn flush_all(&self) {
    self.block_queue.flush_all()
    // TODO: For 32-bit space, we may want to free some contiguous chunks.
}
```
I ran many other benchmarks in DaCapo Chopin (excluding some that are known to produce inconsistent results as we increase the number of iterations, likely due to the lack of reference processing). I ran with two builds (`masterfix` and `pr`), the MarkSweep plan, and three heap factors (3.023x, 3.678x and 4.392x).
There are too many plots, so I only post Plotty links.
masterfix
Most benchmarks have reduced STW time with this PR.
The benchmark with the highest STW time reduction is graphchi, with more than 60% reduction at 4.392x min heap size (w.r.t. G1 in OpenJDK). The eBPF timeline shows that graphchi spent a lot of time releasing mutators, likely because it simply has too many blocks to release in each GC, and the effect is amplified by the lock contention in `masterfix`.
.
masterfix:
pr:
The time to compute the transitive closure is also reduced, and it is likely to be a result of BlockPageResource reusing existing blocks, but I am not certain about it.
Some benchmarks have increased STW time instead: cassandra, fop, jython and spring. Their error bars are large, and I cannot reproduce their slowdown on my laptop. The number of GCs is small for both cassandra and fop (see this plotty link for the number of GCs), and the GCs concentrate near the end of the benchmark execution. Such benchmarks are more susceptible to random effects. For fop, some GCs may spend 50% of the time tracing descendants of finalizable objects, so the variation in STW time is related more to when and how much finalization happened than to the time spent executing `ReleaseMutator`. See the timeline below.
The effect on total time is observable, but not that dramatic.
Overall, I think this PR is effective in reducing the time of ReleaseMutator, especially in the pathological cases where the heap is large and the number of blocks to be released is high.
The performance looks good. We see up to 10% improvement of STW time in geomean across all the measured benchmarks in 3x/3.6x/4.3x heap size, about 1% improvement in total time. You can merge the PR at any time.
Let `MarkSweepSpace` use the `BlockPageResource` instead of the raw `FreeListPageResource`. This solves a problem where multiple GC workers contend for the mutex `FreeListPageResource::sync` when calling `FreeListPageResource::release_pages` concurrently during the `Release` stage.

We flush the `BlockPageResource` in `MarkSweepSpace::end_of_gc`. To monitor the performance, we added a pair of USDT trace points before and after calling `Plan.end_of_gc`. We also count the invocation of `end_of_gc` into the GC time.

We changed `BlockQueue` so that it uses `MaybeUninit` to hold uninitialized elements instead of relying on `B::from_aligned_address(Address::ZERO)`, because the `Block` type used by `MarkSweepSpace` is based on `NonZeroUsize` and cannot be zero.

Fixes: https://github.com/mmtk/mmtk-core/issues/1145
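The `MaybeUninit` change can be illustrated with a small sketch. This is not MMTk's actual `BlockQueue`; `UninitStack` and its methods are hypothetical names, and the point is only that a fixed-size buffer of a `NonZeroUsize`-backed type cannot be pre-filled with a zero sentinel, so the empty slots must be left uninitialized.

```rust
use std::mem::MaybeUninit;
use std::num::NonZeroUsize;

/// Fixed-capacity LIFO buffer of block addresses. Because NonZeroUsize has
/// no valid zero value, unused slots hold MaybeUninit instead of a dummy
/// zeroed element.
struct UninitStack<const N: usize> {
    slots: [MaybeUninit<NonZeroUsize>; N],
    len: usize,
}

impl<const N: usize> UninitStack<N> {
    fn new() -> Self {
        // Safety: an array of MaybeUninit does not require initialization.
        let slots = unsafe { MaybeUninit::<[MaybeUninit<NonZeroUsize>; N]>::uninit().assume_init() };
        Self { slots, len: 0 }
    }

    /// Returns false when full (the caller would then spill/flush).
    fn push(&mut self, block: NonZeroUsize) -> bool {
        if self.len == N {
            return false;
        }
        self.slots[self.len].write(block);
        self.len += 1;
        true
    }

    fn pop(&mut self) -> Option<NonZeroUsize> {
        if self.len == 0 {
            return None;
        }
        self.len -= 1;
        // Safety: slots[0..len] were initialized by push.
        Some(unsafe { self.slots[self.len].assume_init() })
    }
}

fn main() {
    let mut q: UninitStack<4> = UninitStack::new();
    assert!(q.push(NonZeroUsize::new(0x1000).unwrap()));
    assert!(q.push(NonZeroUsize::new(0x2000).unwrap()));
    assert_eq!(q.pop().unwrap().get(), 0x2000);
    assert_eq!(q.pop().unwrap().get(), 0x1000);
    assert!(q.pop().is_none());
}
```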