Hmm, taking a closer look, the profile samples from the "Reference Handler" thread in both recordings appear only in the first few seconds after attach, so they might be an artifact of Flight Recorder starting up.
Furthermore, @vjovanov's benchmark is doing full GC before each run, so it probably doesn't matter for this test whether we drain the queue manually or not. Does it ever matter? Or am I getting confused with finalizers?
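To untangle the two: finalizers run automatically on the JVM's Finalizer thread, whereas weak references are merely enqueued by the collector and sit on their `ReferenceQueue` until application code polls them. A minimal sketch of the latter (plain Java weak references, not scalac's actual map):

```scala
import java.lang.ref.{ReferenceQueue, WeakReference}

// Finalizers run automatically on the JVM's Finalizer thread, but weak
// references are only *enqueued* by the GC; someone must poll the queue.
object WeakQueueDemo {
  def main(args: Array[String]): Unit = {
    val queue = new ReferenceQueue[AnyRef]
    var referent: AnyRef = new AnyRef
    val ref = new WeakReference(referent, queue)

    referent = null // drop the last strong reference
    System.gc()     // hint only: may clear the referent and enqueue `ref`
    Thread.sleep(100) // give the Reference Handler thread a moment

    // ...but the stale entry stays queued until we drain it ourselves.
    println(s"enqueued: ${queue.poll() eq ref}")
  }
}
```

So a full GC makes the entries *collectable*, but a weak map that never drains its queue still carries the stale entries until its own code polls them.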
As expected, JIT is basically inactive in both recordings, according to the compilation events in the profiles.
I think the 5x vs 5.4x core utilization could only be coming from the GC, and the bulk of that should be from the explicit `System.gc` calls that are outside of the measurements above.
Puzzling stuff.
Other random ideas:
- Try `-XX:+UseSerialGC`. That should take the core utilization down to 1 (apart from the Reference Queue thread, I guess), both during the workload and in the explicit `System.gc` calls in between.
- Use `async-profiler` rather than Flight Recorder to see what's happening on non-Java threads.
- Measure `scalac` time and some other fixed workload (e.g. some SPEC benchmark) in each iteration. Do both slow down, or just `scalac`? (See the sketch below.)
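For that last idea, something like this minimal sketch could time a fixed, CPU-bound "canary" next to the real workload each iteration (the compile hook here is hypothetical):

```scala
// Sketch of the paired-workload idea: if the canary slows down along with
// scalac, suspect the machine (frequency, temperature, scheduling); if only
// scalac does, suspect the JVM/process.
object CanaryWorkload {
  @volatile private var blackhole = 0L // defeat dead-code elimination

  // A cheap, deterministic arithmetic loop as the fixed workload.
  private def burn(iterations: Long): Long = {
    var acc = 1L
    var i = 0L
    while (i < iterations) {
      acc = acc * 6364136223846793005L + 1442695040888963407L
      i += 1
    }
    acc
  }

  def timeMs[A](body: => A): Double = {
    val start = System.nanoTime()
    body
    (System.nanoTime() - start) / 1e6
  }

  def main(args: Array[String]): Unit = {
    for (iteration <- 1 to 20) {
      val canaryMs = timeMs { blackhole = burn(500000000L) }
      // val scalacMs = timeMs(compileFixedSources()) // hypothetical hook
      println(f"iter $iteration%2d: canary $canaryMs%.1f ms")
    }
  }
}
```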
Here's a zoomed-in look at a single instance of the GC and compile cycle:
It spends 550ms doing the explicit, parallel GC, reporting a machine utilization of 13.8% (if all 8/64 pinned cores are participating, that would be 12.5%). Then the workload drops to a single thread (the next sample of the machine CPU is 1.4%, about 1/64).
So, if something causes the serial scalac to get slower, it is natural that perf will report reduced core utilization for a recording, as proportionally longer time is spent single threaded.
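As a quick sanity check on those percentages, assuming a 64-core machine with the benchmark pinned to 8 cores:

```scala
// REPL-style arithmetic for the utilization figures above.
val parallelGcShare   = 8.0 / 64 // 0.125 → 12.5% if all 8 pinned cores run the GC
val singleThreadShare = 1.0 / 64 // ≈ 0.0156 → ~1.6%, near the observed 1.4% sample
```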
I think we can exclude CPU throttling. The 6 runs (of which two slowed down) ran continuously for 6+ hours. So (admittedly I only have 2 slowdowns recorded), if it were throttling I would expect to see much more random slowdowns rather than a step change near the end of two runs.
Just reread your comment and saw "the perf stats don't seem to support this theory, though" - d'oh.
It couldn't hurt to log the processor frequency and/or temperature between iterations with something like https://unix.stackexchange.com/questions/264632/what-is-the-correct-way-to-view-your-cpu-speed-on-linux
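For instance, a minimal sketch (Linux-only, parsing /proc/cpuinfo as the linked answer suggests):

```scala
import scala.io.Source

// Log each core's current frequency between benchmark iterations by
// reading the "cpu MHz" lines of /proc/cpuinfo (Linux only).
object CpuFreqLogger {
  def currentMhz(): Seq[Double] = {
    val src = Source.fromFile("/proc/cpuinfo")
    try src.getLines().collect {
      case line if line.startsWith("cpu MHz") =>
        line.split(':')(1).trim.toDouble
    }.toSeq
    finally src.close()
  }

  def main(args: Array[String]): Unit = {
    val mhz = currentMhz()
    println(f"cpu MHz: min ${mhz.min}%.0f, max ${mhz.max}%.0f")
  }
}
```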
Good plan.
OK, I have:
1) Tried other benchmarks (`dotty` and collection operations) and the issue does not appear.
2) Ran with `-XX:+UseSerialGC`, both with the pre-iteration GC and without, and it still happens. Results can be found here:
https://drive.google.com/open?id=1x8LUvIT1KDOMKvG_xqN9IlgIsJymfO_9
The only thing I can see is that when the slowdown happens we have fewer instructions per cycle. Now, at least the JFR logs and `perf stat` are not polluted by the parallel GC. What could cause the lower IPC when the cache hit ratio is the same?
3) Frequency scaling is disabled, and several different 72-core machines are unlikely to overheat when only one core is running. Perf also measures the frequency, and it is practically the same.
I will try it on SVM as well.
Reproducing the relevant `perf` output here to save folks a few clicks:
Fast:
```
Performance counter stats for process id '47798':

    41958.315519  task-clock (msec)   #    1.013 CPUs utilized          (100.00%)
   2,011,054,294  cache-references    #   47.930 M/sec                  (85.74%)
     183,346,591  cache-misses        #    9.117 % of all cache refs    (57.19%)
  95,220,084,700  cycles              #    2.269 GHz                    (71.49%)
 125,964,611,325  instructions        #    1.32  insns per cycle        (85.80%)
  24,605,487,877  branches            #  586.427 M/sec                  (85.79%)
     408,773,005  branch-misses       #    1.66% of all branches        (85.66%)
   4,145,778,472  bus-cycles          #   98.807 M/sec                  (85.62%)
          55,564  faults              #    0.001 M/sec                  (100.00%)
              16  migrations          #    0.000 K/sec                  (100.00%)
           3,818  context-switches    #    0.091 K/sec

    41.419963778 seconds time elapsed
```
Slow:
```
Performance counter stats for process id '47798':

    50301.128765  task-clock (msec)   #    1.019 CPUs utilized          (100.00%)
   2,082,679,231  cache-references    #   41.404 M/sec                  (85.75%)
     193,405,386  cache-misses        #    9.286 % of all cache refs    (57.11%)
 114,133,639,188  cycles              #    2.269 GHz                    (71.51%)
 135,359,551,402  instructions        #    1.19  insns per cycle        (85.75%)
  26,384,438,882  branches            #  524.530 M/sec                  (85.74%)
     441,713,725  branch-misses       #    1.67% of all branches        (85.69%)
   4,970,158,143  bus-cycles          #   98.808 M/sec                  (85.72%)
          11,708  faults              #    0.233 K/sec                  (100.00%)
              27  migrations          #    0.001 K/sec                  (100.00%)
           4,102  context-switches    #    0.082 K/sec

    49.361826802 seconds time elapsed
```
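A quick back-of-the-envelope comparison of the two runs (values copied from the output above): the slow run executes only ~7.5% more instructions but burns ~19.9% more cycles, so the regression is dominated by the IPC drop rather than by extra work:

```scala
// Ratios derived from the two `perf stat` runs quoted above.
object PerfDelta {
  def main(args: Array[String]): Unit = {
    val (fastCycles, fastInsns) = (95220084700L, 125964611325L)
    val (slowCycles, slowInsns) = (114133639188L, 135359551402L)
    println(f"fast IPC: ${fastInsns.toDouble / fastCycles}%.2f") // ~1.32
    println(f"slow IPC: ${slowInsns.toDouble / slowCycles}%.2f") // ~1.19
    println(f"cycles:       +${(slowCycles.toDouble / fastCycles - 1) * 100}%.1f%%") // ~+19.9%
    println(f"instructions: +${(slowInsns.toDouble / fastInsns - 1) * 100}%.1f%%")   // ~+7.5%
  }
}
```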
Following the Top Down Analysis Technique with VTune or Oracle Performance Studio could probably shine a light on what's changed.
If you have either of those tools at hand and could record the fast/slow profiles, either we'll be able to figure out the problem or nerd-snipe someone into helping out.
We could first try to get a broader (full?) set of top-level hardware counters with perf. Maybe something is hiding behind the apparently-unchanged cache-misses ratio. E.g., what if the instruction cache hit rate suffers? That would cause a big IPC change without making a discernible change in the overall cache-misses stat.
First, I will see if the same problem happens in sbt when builds slow down after a while. Then I will try VTune. This will take me a few days.
The graph that @rorygraves produced was from sbt runs. @vjovanov, most OSes will move processes around to avoid overhead.
I am not sure that we are looking at the same issue: @rorygraves saw the throughput roughly halve between runs.
@retronym where do we stand here? should this remain open?
It remains curious. I'll close on the assumption the investigations have petered out.
@jvican, @vjovanov, and @rorygraves have independently reported that in long-running benchmark runs, `scalac` can abruptly slow down and settle into a plateau of slower compile times. I'm collecting the evidence and analysis here.
@rorygraves plotted:
@vjovanov writes:
Analysis
Scalac is serial by default. Recent builds feature an option to run the backend in parallel. So the other threads in play come from the VM or benchmarking infrastructure. Flight Recorder only shows Java threads, so we don't see GC or JIT activity. I sometimes use async-profiler to see VM threads as well.
The Flight Recorder profiles you provided do include a few samples from other threads.
Fast:
Slow:
The Reference Handler thread might be relevant. `scalac` employs a weak hash set to hash-cons all the `Type`s it creates. This could be a source of inter-run performance effects.
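For context, hash-consing with a weak set looks roughly like this (a minimal sketch, not scalac's actual `WeakHashSet`):

```scala
import java.lang.ref.WeakReference
import java.util.WeakHashMap

// Minimal interner sketch: structurally-equal values are mapped to one
// canonical instance, held only weakly so unused entries can be collected.
final class Interner[T <: AnyRef] {
  // Keys are held weakly by WeakHashMap; the value must not strongly
  // reference the key, hence the WeakReference wrapper.
  private val pool = new WeakHashMap[T, WeakReference[T]]

  def unique(value: T): T = pool.synchronized {
    val existing = pool.get(value)
    val canonical = if (existing == null) null else existing.get
    if (canonical != null) canonical
    else {
      pool.put(value, new WeakReference(value))
      value
    }
  }
}
```

Every `Type` allocated during a run funnels through something like `unique`, so stale, not-yet-collected entries lengthen the probe chains, which is one way reference processing could bleed into compile times from run to run.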
We don't eagerly drain the reference queue when we discard the `Global` or the `Run`. There are two difficulties here. First, `Global` and `Run` don't have `close()` methods, so we would need to add them while still dealing with old callers who won't call them. Second, when I once tried to register this map for clearing at the start of the next run, I hit a test failure.
Now, this might all turn out to be a red herring, but it's somewhere to start: add a `close()` method to `Global` (and maybe `Run`) that drains the uniques queue and clears any other per-run caches.
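A hypothetical shape for that cleanup (names like `uniquesQueue` and `perRunCaches` are illustrative, not scalac's actual API):

```scala
import java.lang.ref.{Reference, ReferenceQueue}
import scala.collection.mutable

// Sketch of the proposed close(): drain the uniques reference queue and
// clear registered per-run caches when a Global/Run is discarded.
final class RunCleanup[T <: AnyRef](
    uniquesQueue: ReferenceQueue[T],
    perRunCaches: Seq[mutable.Set[_]]
) {
  def close(): Unit = {
    // Remove stale entries enqueued by the collector since the last drain.
    var drained = 0
    var stale: Reference[_ <: T] = uniquesQueue.poll()
    while (stale != null) {
      drained += 1
      stale = uniquesQueue.poll()
    }
    perRunCaches.foreach(_.clear())
    println(s"close(): drained $drained stale reference(s)")
  }
}
```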