mmtk / mmtk-core

Memory Management ToolKit
https://www.mmtk.io

Add SweepChunk to native MarkSweepSpace #1158

Closed wks closed 4 months ago

wks commented 4 months ago

This PR re-introduces the SweepChunk work packet to the native MarkSweepSpace when using eager sweeping. The main intention of this PR is to fix a pathological case where there is only one mutator and the single ReleaseMutator work packet cannot be parallelized.

The algorithm is unchanged for lazy sweeping.

When doing eager sweeping,

  1. We first release all unmarked blocks in MarkSweepSpace::abandoned and each mutator.
  2. Then we sweep blocks in parallel using SweepChunk work packets.
  3. We then go through all "consumed" blocks and move them into "available" lists if they have any free cells.

In step (1), we release blocks for each mutator in ReleaseMutator. Releasing blocks is very fast, so the limited parallelism of this step is not a bottleneck. During step (2), all blocks are either unallocated or marked, so we don't need to move any block out of linked lists, avoiding the data race we observed in https://github.com/mmtk/mmtk-core/issues/1103. Step (3) is done by a single thread, but it is fast enough not to be a performance bottleneck.
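To make the three phases concrete, here is a minimal, self-contained sketch. All type and function names (BlockState, BlockLists, sweep_chunk, and so on) are illustrative assumptions, not the actual mmtk-core types:

```rust
// Illustrative sketch of the three eager-sweeping phases above.
// All names here are hypothetical; this is not the mmtk-core code.

#[derive(Clone, Copy, PartialEq)]
enum BlockState {
    Unallocated,
    Unmarked, // no live object was found during tracing
    Marked,   // contains at least one live object
}

struct Block {
    state: BlockState,
    free_cells: usize,
}

#[derive(Default)]
struct BlockLists {
    available: Vec<Block>,
    consumed: Vec<Block>,
}

impl BlockLists {
    // Phase 1 (done per mutator in ReleaseMutator, and likewise for
    // MarkSweepSpace::abandoned): return unmarked blocks. This is cheap,
    // so the limited parallelism of this phase is not a bottleneck.
    fn release_unmarked(&mut self) {
        self.available.retain(|b| b.state != BlockState::Unmarked);
        self.consumed.retain(|b| b.state != BlockState::Unmarked);
    }
}

// Phase 2 (SweepChunk, one packet per chunk, run in parallel): every
// surviving block is unallocated or marked, so no block is unlinked
// from any list here, and the data race of #1103 cannot occur.
fn sweep_chunk(blocks: &mut [Block]) {
    for b in blocks.iter_mut() {
        if b.state == BlockState::Marked {
            let free = sweep_cells(b); // reclaim dead cells
            b.free_cells = free;
        }
    }
}

// Phase 3 (single-threaded, but cheap): consumed blocks that regained
// free cells become available again.
fn move_consumed_to_available(lists: &mut BlockLists) {
    let mut still_consumed = Vec::new();
    for b in lists.consumed.drain(..) {
        if b.free_cells > 0 {
            lists.available.push(b);
        } else {
            still_consumed.push(b);
        }
    }
    lists.consumed = still_consumed;
}

fn sweep_cells(_b: &Block) -> usize {
    0 // placeholder for the actual cell-by-cell sweep
}
```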

We also introduced a work packet, ReleaseMarkSweepSpace, which does what MarkSweepSpace::release used to do, but is executed concurrently with the ReleaseMutator work packets. This way, we don't do too much work in the Release work packet, a problem we discussed in https://github.com/mmtk/mmtk-core/issues/1147.
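The intended packet split looks roughly like the following sketch. The WorkPacket trait and schedule_release function are simplified stand-ins for mmtk-core's actual GCWork/scheduler interfaces:

```rust
// Simplified stand-in for mmtk-core's work-packet machinery, showing the
// intended split: one ReleaseMutator packet per mutator plus one
// ReleaseMarkSweepSpace packet, all eligible to run concurrently,
// instead of doing everything inside the single Release packet.
trait WorkPacket: Send {
    fn do_work(&mut self);
}

struct ReleaseMutator {
    mutator_id: usize,
}

impl WorkPacket for ReleaseMutator {
    fn do_work(&mut self) {
        // Phase 1 for this mutator's local block lists.
        println!("releasing blocks of mutator {}", self.mutator_id);
    }
}

struct ReleaseMarkSweepSpace;

impl WorkPacket for ReleaseMarkSweepSpace {
    fn do_work(&mut self) {
        // What MarkSweepSpace::release used to do: phase 1 for the
        // global abandoned lists, then schedule SweepChunk packets.
        println!("releasing abandoned blocks; scheduling SweepChunk");
    }
}

fn schedule_release(num_mutators: usize) -> Vec<Box<dyn WorkPacket>> {
    let mut packets: Vec<Box<dyn WorkPacket>> = (0..num_mutators)
        .map(|id| Box::new(ReleaseMutator { mutator_id: id }) as Box<dyn WorkPacket>)
        .collect();
    packets.push(Box::new(ReleaseMarkSweepSpace));
    packets // a real scheduler would hand these to parallel GC workers
}
```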

We removed the constant ABANDON_BLOCKS_IN_RESET because abandoning blocks on reset is obviously necessary: otherwise one mutator would keep too many blocks locally, preventing other mutators from getting available blocks to use.
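The following hypothetical sketch shows why the abandonment matters: on reset, each mutator hands its local block lists back to the space's global abandoned lists so other mutators can acquire those blocks (names and structure are illustrative only):

```rust
// Hypothetical sketch of abandoning local blocks on allocator reset.
use std::sync::Mutex;

#[derive(Default)]
struct BlockLists {
    available: Vec<usize>, // block addresses, simplified
    consumed: Vec<usize>,
}

#[derive(Default)]
struct MarkSweepSpace {
    abandoned: Mutex<BlockLists>,
}

fn reset_allocator(local: &mut BlockLists, space: &MarkSweepSpace) {
    let mut global = space.abandoned.lock().unwrap();
    // Without this, one mutator would hoard blocks it no longer needs,
    // starving the other mutators of available blocks.
    global.available.append(&mut local.available);
    global.consumed.append(&mut local.consumed);
}
```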

We added a USDT tracepoint to SweepChunk in both MarkSweepSpace and ImmixSpace so that we can see the number of allocated blocks a SweepChunk work packet visited in the timeline generated by eBPF tracing tools.
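mmtk-core emits USDT tracepoints via the probe crate's probe! macro. The sketch below shows what emitting such a counter looks like; the provider, name, and argument (mmtk, sweep_chunk, allocated_blocks) are assumptions based on the description above, not necessarily the exact ones in this PR, and it assumes the probe crate as a dependency:

```rust
// Sketch of emitting a USDT tracepoint with the `probe` crate.
use probe::probe;

fn sweep_chunk(blocks: &[Block]) {
    let mut allocated_blocks: usize = 0;
    for block in blocks {
        if block.is_allocated() {
            allocated_blocks += 1;
            // ... sweep the block ...
        }
    }
    // Visible to bpftrace/eBPF tools as a usdt probe named mmtk:sweep_chunk.
    probe!(mmtk, sweep_chunk, allocated_blocks);
}

struct Block;
impl Block {
    fn is_allocated(&self) -> bool {
        true // placeholder
    }
}
```

With such a probe compiled in, an eBPF tool can attach to it and record the argument for each packet, which is how the per-packet block counts appear in the timelines below.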

We also changed immixspace::SweepChunk so that it now asserts that the chunk is allocated instead of skipping unallocated chunks, since ChunkMap::generate_tasks already filters out unallocated chunks.
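Schematically, the change replaces a silent skip with an assertion (simplified, hypothetical code):

```rust
// Simplified, hypothetical shape of the change in immixspace::SweepChunk.
fn sweep_chunk_immix(chunk_is_allocated: bool) {
    // Before: if !chunk_is_allocated { return; } -- silently skip.
    // After: an unallocated chunk must never reach this packet, because
    // ChunkMap::generate_tasks only creates packets for allocated chunks.
    debug_assert!(chunk_is_allocated, "SweepChunk got an unallocated chunk");
    // ... sweep the chunk ...
}
```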

Fixes: https://github.com/mmtk/mmtk-core/issues/1146

wks commented 4 months ago

DaCapo

I ran lusearch from DaCapo Chopin. There are four builds: master and this PR, each with lazy and eager sweeping. Three heap factors (3.023x, 3.678x, 4.392x w.r.t. G1 in OpenJDK), 10 invocations each. (plotty link)

[image: benchmark plots of STW time, mutator time, and number of GCs]

The STW time of lazy sweeping is not obviously changed.

The mutator time is not obviously changed, either.

The mutator time of master-eager is significantly lower at the lowest heap factor. That is a result of overly frequent GCs (as seen in the number of GCs). Some mutator time is counted into STW time after one mutator triggers a GC but before all mutators come to a stop, as seen in the timeline below. This behavior is only reproducible when running multiple iterations: the timeline below was collected with the -n5 command-line option of DaCapo, while -n1 behaves normally. It may be the result of some existing bug that causes MarkSweep not to release some memory. When running pr-eager, -n10 still gives a reasonable result.

[image: eBPF timeline collected with -n5]

As a result, the difference in STW time is mainly a result of the increased number of GCs and the memory-leak bug. The STW time per GC is also unreasonably high for master-eager.

eBPF

The following timeline is from master-eager, lusearch. Although there are many mutators and ReleaseMutator can be parallelized, the Release work packet is busy releasing blocks in the abandoned list, blocking the ReleaseMutator work packets.

[image: eBPF timeline of master-eager running lusearch]

This is pr-eager running lusearch. Using SweepChunk is not faster than ReleaseMutator: it is not as load-balanced when the number of chunks is small and some chunks contain unallocated blocks, and the total execution time of all SweepChunk packets is greater than that of all ReleaseMutator work packets.

[image: eBPF timeline of pr-eager running lusearch]

This is master running Liquid on mmtk-ruby (which must use eager sweeping). Because there is only one mutator, the single ReleaseMutator packet cannot be parallelized, and it spends most of its time sweeping.

[image: eBPF timeline of master running Liquid]

This is this PR running Liquid. Although sweeping is still slow because of naive_brute_force_sweep, it is much more load-balanced.

[image: eBPF timeline of this PR running Liquid]

Conclusion

  1. There may be a bug in the master branch that causes MarkSweepSpace not to make some memory available to the mutators again, making the heap tighter as we increase the number of iterations. This PR solves that problem.
  2. Although SweepChunk is not necessarily faster than sweeping in ReleaseMutator (depending on the workload), it avoids the pathological case of single-threaded applications.
wks commented 4 months ago

From the CI tests of the mmtk-ruby repo, this PR does not accelerate the test with MarkSweep. See the execution time of the "Build and test (release, MarkSweep)" job of the following two PRs:

So the execution speed difference is more likely due to other factors, including mutator speed (the lack of an allocation fast path), GC speed (the time needed to run a whole GC), and the impact of the MemBalancer.