mmtk / mmtk-core

Memory Management ToolKit
https://www.mmtk.io

Performance regression on 20240802 for luindex #1200

Closed. qinsoon closed this issue 1 month ago.

qinsoon commented 1 month ago

There is a 10% performance slowdown observed in our CI tests on 2024-08-02 for luindex with OpenJDK GenImmix and StickyImmix:

The PR for the run should be https://github.com/mmtk/mmtk-core/pull/1182.

wks commented 1 month ago

It is interesting that it affects luindex so much, but not other benchmarks.

wks commented 1 month ago

Here are the eBPF traces of a nursery GC in the version right before https://github.com/mmtk/mmtk-core/pull/1182 was merged (I'll call it "no-assert" from now on) and the version right after it was merged ("yes-assert").

Version "no-assert":

image

Version "yes-assert":

image

The PR https://github.com/mmtk/mmtk-core/pull/1182 changes the way we generate "PrepareMutator" work packets.

```diff
-            for mutator in <C::VM as VMBinding>::VMActivePlan::mutators() {
-                mmtk.scheduler.work_buckets[WorkBucketStage::Prepare]
-                    .add(PrepareMutator::<C::VM>::new(mutator));
-            }
+            let prepare_mutator_packets = <C::VM as VMBinding>::VMActivePlan::mutators()
+                .map(|mutator| Box::new(PrepareMutator::<C::VM>::new(mutator)) as _)
+                .collect::<Vec<_>>();
+            // Just in case the VM binding is inconsistent about the number of mutators and the actual mutator list.
+            debug_assert_eq!(
+                prepare_mutator_packets.len(),
+                <C::VM as VMBinding>::VMActivePlan::number_of_mutators()
+            );
+            mmtk.scheduler.work_buckets[WorkBucketStage::Prepare].bulk_add(prepare_mutator_packets);
```

This change has two effects:

  1. We assert that the number of work packets is equal to the number of mutators reported by the binding.
  2. We add all "PrepareMutator" work packets with one bulk_add instead of many add calls, reducing the number of synchronization operations.

But part (2) has a side effect: it wakes up fewer GC workers, and those workers grab many work packets from the bucket. See the Perfetto plot above. The reason is that in our current GC worker implementation, a worker steals one batch at a time (in BucketQueue::steal_batch_and_pop). That was intended to reduce the amount of synchronization when the bucket has many packets and the load balancing is good. However, because the number of work packets in the "Prepare" stage is small, a few GC workers grabbed most of the work packets, leaving none for the others. You can compare the two plots above.
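To make the batch-stealing behavior concrete, here is a minimal, self-contained sketch using the crossbeam-deque crate (which the work buckets are built on). This is not MMTk code, and the packet contents and counts are made up; it only shows how a single steal_batch_and_pop call moves roughly half of a bucket's packets into one worker's local queue:

```rust
use crossbeam_deque::{Injector, Steal, Worker};

fn main() {
    // Pretend the "Prepare" bucket holds 8 small work packets.
    let bucket: Injector<u32> = Injector::new();
    for packet in 0..8u32 {
        bucket.push(packet);
    }

    // One GC worker wakes up and steals a batch into its local queue.
    let local: Worker<u32> = Worker::new_fifo();
    if let Steal::Success(first) = bucket.steal_batch_and_pop(&local) {
        // Count the popped packet plus everything moved into the local queue.
        let mut grabbed = 1;
        while local.pop().is_some() {
            grabbed += 1;
        }
        println!("one worker grabbed {grabbed} of 8 packets (first packet: {first})");
        println!("{} packets left in the bucket for the other workers", bucket.len());
    }
}
```

With only a handful of packets in the bucket, one such call can take half or more of them at once, which is exactly the situation in the "Prepare" stage described above.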

And one work packet in the "Prepare" stage is very unbalanced: "ScanCodeCacheRoots". It scanned 26845 code cache root slots and dominated the "Prepare" stage. (FYI, a derived "GenNurseryProcessEdges" work packet contained the 26845 slots from the code cache roots and dominated the "Closure" stage.) Before https://github.com/mmtk/mmtk-core/pull/1182, that work packet was usually picked up by a separate GC worker which had just finished its PrepareMutator packet, the only packet added at that time. After that PR, however, there was usually one GC worker that picked up many other work packets together with "ScanCodeCacheRoots". See the circle in the second plot above. That lengthened the critical path of work packets in the "Prepare" stage.

In fact, this affects other plans (including SemiSpace) too, but it is less noticeable because a full-heap GC spends more time in the closure stage, so the relative overhead is smaller. See the plot below.

image

The recently merged PR https://github.com/mmtk/mmtk-openjdk/pull/282 makes code cache roots generational and allows code cache roots to be split into multiple "ProcessEdgesWork" work packets, each having at most 4096 elements. Nursery GCs usually have zero newly added code cache roots, so the cost is completely eliminated. See the plot below.

image
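For illustration, here is a minimal sketch of the splitting idea: cut the collected root slots into chunks of at most 4096 elements and create one packet per chunk. The Slot and packet types here are stand-ins, not the actual mmtk-openjdk types.

```rust
// Hypothetical stand-ins for the real slot and work packet types.
type Slot = usize;

const MAX_SLOTS_PER_PACKET: usize = 4096;

struct ProcessRootSlots {
    slots: Vec<Slot>,
}

fn make_packets(code_cache_root_slots: Vec<Slot>) -> Vec<ProcessRootSlots> {
    code_cache_root_slots
        .chunks(MAX_SLOTS_PER_PACKET)
        .map(|chunk| ProcessRootSlots { slots: chunk.to_vec() })
        .collect()
}

fn main() {
    // The 26845 slots observed in the trace become 7 packets:
    // six packets of 4096 slots and one of 2269.
    let roots: Vec<Slot> = (0..26845).collect();
    let packets = make_packets(roots);
    println!("{} packets, first has {} slots", packets.len(), packets[0].slots.len());
}
```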

For full-heap GCs, it greatly improves the load balance of the "Closure" stage, but it does not help reduce the latency of the "Prepare" stage. See the plot below.

image

And the reason is that ScanCodeCacheRoots is still one work packet that scans all code cache roots. The fact that code cache roots are stored in two HashSet data structures makes it impossible to parallelize the scanning.

In conclusion, the root causes are

  1. the extra-large ScanCodeCacheRoots work packet, and
  2. the "batch-stealing" behavior of the work-stealing library

Both of them contribute to the load imbalance of the "Prepare" stage.

eBPF traces

Here are some eBPF trace logs for those who are interested. You can load them into the Perfetto UI.

In the following logs, the plan is GenImmix unless explicitly labelled. They record one of every 10 GCs.

And there are also logs from the master branch, but those record every single GC.

About the added assertion

The assertion has no impact on the nursery GCs we observed. Those nursery GCs usually don't generate any SweepChunk work packets at all, leaving nothing to assert.

qinsoon commented 1 month ago

If I understand you right, our GC workers always steal a batch of packets, and when there aren't many packets, one worker may steal all of them. The expected behavior would be that other workers starve, steal packets from that worker, and eventually the load becomes balanced. Does that stealing actually happen? Is it costly to steal?

qinsoon commented 1 month ago

steal_batch would steal around half of the tasks, as specified in https://docs.rs/crossbeam-deque/latest/crossbeam_deque/struct.Stealer.html#method.steal_batch.

We may be able to use a prioritized queue for ScanCodeCacheRoots so that some workers start working on it from the beginning, while other workers can work on other packets.
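A rough sketch of that idea (this is not MMTk's scheduler API; the packet names other than ScanCodeCacheRoots and the cost numbers are made up): order the packets in a bucket by an estimated cost so that the first worker to wake up picks up the big ScanCodeCacheRoots packet, leaving the small packets to later workers.

```rust
use std::cmp::Ordering;
use std::collections::BinaryHeap;

#[derive(PartialEq, Eq)]
struct PrioritizedPacket {
    // e.g. the number of slots the packet is expected to scan.
    estimated_cost: u64,
    name: &'static str,
}

impl Ord for PrioritizedPacket {
    fn cmp(&self, other: &Self) -> Ordering {
        (self.estimated_cost, self.name).cmp(&(other.estimated_cost, other.name))
    }
}

impl PartialOrd for PrioritizedPacket {
    fn partial_cmp(&self, other: &Self) -> Option<Ordering> {
        Some(self.cmp(other))
    }
}

fn main() {
    let mut bucket = BinaryHeap::new();
    bucket.push(PrioritizedPacket { estimated_cost: 26845, name: "ScanCodeCacheRoots" });
    bucket.push(PrioritizedPacket { estimated_cost: 100, name: "ScanSomeOtherRoots" });
    bucket.push(PrioritizedPacket { estimated_cost: 50, name: "PrepareMutator" });

    // BinaryHeap is a max-heap, so the most expensive packet comes out first.
    while let Some(packet) = bucket.pop() {
        println!("next: {} (estimated cost {})", packet.name, packet.estimated_cost);
    }
}
```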

wks commented 1 month ago

If I understand you right, our GC workers always steal a batch of packets, and when there aren't many packets, one worker may steal all of them. The expected behavior would be that other workers starve, steal packets from that worker, and eventually the load becomes balanced. Does that stealing actually happen? Is it costly to steal?

In theory, they should be able to steal from other workers. But in practice, it seems to take longer for a thread to wake up due to OS scheduling.

image

After the "Prepare" work packet which added the "PrepareCollector" work packets which are "designated work". All workers should wake up and find a "designated work" to do. However, some workers wake up earlier, and some workers wake up later. When a late worker wakes up (the ones that only got a "PrepareCollector" to execute), they wake up so late that other workers have already finished executing other work packets, which are usually small root-scanning packets that can be executed very quickly. That leaves them nothing to steal.

From the eBPF plot, the latency of some PrepareCollector packets is 56 µs. I am not sure whether this is simply the scheduling latency Linux is capable of, but if the system is running other processes, some processors may be busy running other tasks and never get a chance to run a GCWorker.

wks commented 1 month ago

FYI, we discussed the OS scheduling problem in a previous issue: https://github.com/mmtk/mmtk-core/issues/1160

qinsoon commented 1 month ago

There seem to be several different issues:

  1. notify_all in bulk_add vs notify_one in add. When we do notify_all, all the workers that are woken up contend on the same work queue, and we may see worse latency (see the sketch after this list).
  2. Workers starve. When workers are idle and we add a small number of work packets with bulk_add, certain workers just won't get any packets. steal_batch makes the situation worse. We should probably be careful about using bulk_add instead of add when the number of packets is small.
  3. A foreseeably long-running packet may need to be put into a priority queue. An available worker will work on a priority packet first, leaving other packets to other workers.
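On point (1), here is a tiny, self-contained illustration using plain std primitives (not MMTk's scheduler code): with a notify_all-style wake-up, every sleeping worker wakes up and contends on the same lock, even though only a few of them will find a packet to take.

```rust
use std::sync::{Arc, Condvar, Mutex};
use std::thread;
use std::time::Duration;

struct Bucket {
    packets: Vec<&'static str>,
    open: bool,
}

fn main() {
    let shared = Arc::new((Mutex::new(Bucket { packets: Vec::new(), open: false }), Condvar::new()));
    let mut handles = Vec::new();

    for id in 0..4 {
        let shared = Arc::clone(&shared);
        handles.push(thread::spawn(move || {
            let (bucket, cond) = &*shared;
            let mut guard = bucket.lock().unwrap();
            // All four workers block here until the bucket is opened.
            while !guard.open {
                guard = cond.wait(guard).unwrap();
            }
            // After notify_all, all four workers race to reacquire the lock,
            // but there are only two packets to go around.
            match guard.packets.pop() {
                Some(p) => println!("worker {id} got {p}"),
                None => println!("worker {id} woke up and found the bucket empty"),
            }
        }));
    }

    thread::sleep(Duration::from_millis(100));
    {
        let (bucket, cond) = &*shared;
        let mut guard = bucket.lock().unwrap();
        guard.packets.extend(["PrepareMutator #1", "PrepareMutator #2"]);
        guard.open = true;
        cond.notify_all(); // bulk_add-style wake-up: everyone is woken for two packets.
    }

    for h in handles {
        h.join().unwrap();
    }
}
```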
wks commented 1 month ago

I took another look at the regression on 2024-08-02.

Looking at the original log again

The raw execution logs from 2024-08-02 and the one immediately before that can be found in

The luindex benchmark is from DaCapo 2006, with 6x min heap size (22MB * 6 = 132MB), executed with 20 invocations, 2 iterations each. It is not using probes, so it only shows the total execution time of the last iteration, in the form of "===== DaCapo luindex PASSED in xxxxxx msec =====".

From the log, we can find the mean execution times:

| date | mean (ms) | std (ms) |
| --- | --- | --- |
| 2024-07-26 | 2211.8 | 8.87 |
| 2024-08-02 | 2393.0 | 16.91 |

The change in mean execution time is (2393.0 − 2211.8) / 2211.8 ≈ +8.19%.

Re-running on groundhog.moma

I re-ran luindex on groundhog.moma (the same CPU as hammer.moma and wrench.moma). It used GenImmix, 132MB heap size, 20 invocations, 2 iterations each, just like on hammer and wrench.

The "build3" is the commit for https://github.com/mmtk/mmtk-core/pull/1182. "build2" and "build1" are the two commits before that, and "build4" is the commit after that. If "build3" caused the regression, we should see the total time of build3 and build4 being similar, but obviously higher than build1 and build2.

I tried it three times, but the results were inconsistent.

The first attempt (20 invocations)

image

The second attempt (another 20 invocations, same configuration)

image

The third attempt (another 20 invocations, same configuration)

image

Plotty links:

The results do not show whether "build3" regressed.

wks commented 1 month ago

This time, I ran luindex for 40 invocations (instead of 20), with 5 iterations each (instead of 2). The result is:

image

Plotty: https://squirrel.anu.edu.au/plotty/wks/noproject/#0|groundhog-2024-09-23-Mon-120221&benchmark^build^invocation^iteration^mmtk_gc&GC^time^time.other^time.stw&|10&iteration^1^4|20&1^invocation|41&Histogram%20(with%20CI)^build^benchmark&

A result like this shows that https://github.com/mmtk/mmtk-core/pull/1182 didn't cause the regression. It is more likely that other factors (environment changes) impacted the performance.

caizixian commented 1 month ago

@wks if you take a look at the log file, is there anything different in terms of kernel versions, CPU frequencies, etc before and after the regression?

qinsoon commented 1 month ago

Then it was probably caused by the "Microcode update for Zenbleed" (https://github.com/anupli/momas/issues/14), which was applied to our machines on 2 Aug.

wks commented 1 month ago

@wks if you take a look at the log file, is there anything different in terms of kernel versions, CPU frequencies, etc before and after the regression?

The kernel versions are the same, and the CPU frequencies only differ by 0.8% (the frequency is actually faster after the regression). There is no microcode version logged, though. Maybe we should print /proc/cpuinfo (which contains the microcode version and the CPU frequencies) into the log, too.
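As a minimal sketch of that suggestion (assuming an x86 Linux machine; the field names are the standard /proc/cpuinfo keys), the harness could dump the relevant lines into the log like this:

```rust
use std::fs;
use std::io;

fn main() -> io::Result<()> {
    // Print the model name, microcode revision and current frequency of every CPU.
    let cpuinfo = fs::read_to_string("/proc/cpuinfo")?;
    for line in cpuinfo.lines() {
        if line.starts_with("model name") || line.starts_with("microcode") || line.starts_with("cpu MHz") {
            println!("{line}");
        }
    }
    Ok(())
}
```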

qinsoon commented 1 month ago

I assume we can close this. It is not a performance regression, but a CI environment change.