Closed wks closed 4 months ago
I ran lusearch from DaCapo Chopin. There are four builds: master and this PR, with lazy and eager sweeping for each of them. Three heap factors (3.023x, 3.678x, 4.392x w.r.t. G1 in OpenJDK), 10 invocations each. (plotty link)
The STW time of lazy sweeping is not obviously changed.
The mutator time is not obviously changed.
The mutator time of master-eager is significantly lower at the lowest heap factor. That is a result of too frequent GCs (as seen in the number of GC). Some mutator time is counted into STW time after one mutator triggers GC but before all mutators come to a stop, as seen in the timeline below. This behavior is only reproducible if I run multiple iterations, and the timeline below is collected with the -n5
cmdline option of DaCapo. It is normal with -n1
. It may be a result of some existing bugs that caused MarkSweep not to release some memory. When running pr-eager, -n10
still gives a reasonable result.
As a result, the difference in STW time is mainly a result of increased number of GC and the memory leak bug. The STW time per GC is also unreasonably high for master-eager.
The following timeline is from master-eager, lusearch. Although there are many mutators and ReleaseMutator
can be parallelized, the Release
work packet is busy releasing blocks in the abandoned list, blocking the ReleaseMutator
work packets.
This is pr-eager running lusearch. Using SweepChunk
is not faster than ReleaseMutator
. It is not that load-balanced if the number of chunks is small and some chunks contain unallocated blocks. And the sum of execution time of all SweepChunk
packets is greater than that of all ReleaseMutator
work packets.
This is master running Liquid using mmtk-ruby (must be eager). Because there is only one mutator, the ReleaseMutator
cannot be parallelized, and it spent most of the time doing sweeping.
This is this PR running Liquid. Sweeping is still slow because of naive_brute_force_sweep
, it is much more load-balanced.
SweepChunk
is not necessarily faster than doing sweeping in ReleaseMutator
(depending on work load), it avoids the pathological case of single-threaded applications.From the CI tests of the mmtk-ruby repo, this PR does not accelerate the test with MarkSweep. See the execution time of the "Build and test (release, MarkSweep)" of the following two PRs:
So the execution speed difference is more likely due to other factors, including the mutator speed (lack of allocation fast path) and GC speed (the time needed to run a whole GC), and the impact of the MemBalancer.
This PR re-introduce the
SweepChunk
work packet to the nativeMarkSweepSpace
when using eager sweeping. The main intention of this PS is fixing a pathological case where there is only one mutator and the singleReleaseMutator
work packet cannot be parallelized.The algorithm is unchanged for lazy sweeping.
When doing eager sweeping,
MarkSweepSpace::abandoned
and each mutator.SweepChunk
work packets.In step (1), we release blocks for each mutator in
ReleaseMutator
. Releasing blocks is very fast, so parallelism is not a bottleneck. During step (2), all blocks are either unallocated or marked, so we don't need to move any block out of linked lists, avoiding the data race we observed in https://github.com/mmtk/mmtk-core/issues/1103. Step (3) is done by one thread, but it is fast enough not to be a performance bottleneck.We also introduced a work packet
ReleaseMarkSweepSpace
which does whatMarkSweepSpace::release
did, but is executed concurrently withReleaseMutator
work packets. In this way, we don't do too much work in theRelease
work packet, which is a problem we discussed in https://github.com/mmtk/mmtk-core/issues/1147.We removed the constant
ABANDON_BLOCKS_IN_RESET
because it is obviously necessary to do so. Otherwise one mutator will keep too many blocks locally, preventing other mutators to get available blocks to use.We added an USDT tracepoint in
SweepChunk
in bothMarkSweepSpace
andImmixSpace
so that we can see the number of allocated blocks aSweepChunk
work packet visited in the timeline generated by eBPF tracing tools.We also changed
immixspace::SweepChunk
so that it now asserts that the chunk is allocated rather than skipping unallocated chunks.ChunkMap::generate_tasks
already filters out unallocated chunks.Fixes: https://github.com/mmtk/mmtk-core/issues/1146