scala / scala-dev

Scala 2 team issues. Not for user-facing bugs or directly actionable user-facing improvements. For build/test/infra and for longer-term planning and idea tracking. Our bug tracker is at https://github.com/scala/bug/issues

The curious case of the abrupt compiler slowdown #518

Closed: retronym closed this issue 5 years ago

retronym commented 6 years ago

@jvican and @vjovanov and @rorygraves have independently reported that in long-running benchmark runs, scalac can abruptly slow down and settle into a plateau of slower compile times.

I'm collecting the evidence and analysis here.

@rorygraves plotted:

image

@vjovanov writes:

Running benchmarks on scalac 2.12.6 now. I have noticed a strange thing: after many iterations (>1000) the benchmark performance drops significantly (~10%) and stays there. No JIT compilations happen at that time. The same happens with GraalVM, only after 1600 iterations. Have you experienced this? Any ideas what could cause it?

I am writing because this could happen to people who keep an SBT session open for a long time, which I expect is a typical use case.

Here is an excerpt from the benchmark output (-XX:+PrintCompilation is enabled). All input/output is in memory, turbo is disabled, and the process is pinned to the core where the memory file system is allocated.

====== Scalac212 (scalac), iteration 1125 completed (1292.786 ms) ======
====== Scalac212 (scalac), iteration 1126 started ======
====== Scalac212 (scalac), iteration 1126 completed (1351.291 ms) ======
====== Scalac212 (scalac), iteration 1127 started ======
====== Scalac212 (scalac), iteration 1127 completed (1381.79 ms) ======
====== Scalac212 (scalac), iteration 1128 started ======
====== Scalac212 (scalac), iteration 1128 completed (1390.495 ms) ======
====== Scalac212 (scalac), iteration 1129 started ======
====== Scalac212 (scalac), iteration 1129 completed (1380.719 ms) ======
====== Scalac212 (scalac), iteration 1130 started ======
====== Scalac212 (scalac), iteration 1130 completed (1387.678 ms) ======
====== Scalac212 (scalac), iteration 1131 started ======
====== Scalac212 (scalac), iteration 1131 completed (1382.225 ms) ======
====== Scalac212 (scalac), iteration 1132 started ======
====== Scalac212 (scalac), iteration 1132 completed (1487.32 ms) ======
====== Scalac212 (scalac), iteration 1133 started ======
====== Scalac212 (scalac), iteration 1133 completed (1507.681 ms) ======
====== Scalac212 (scalac), iteration 1134 started ======
====== Scalac212 (scalac), iteration 1134 completed (1502.993 ms) ======
====== Scalac212 (scalac), iteration 1135 started ======
====== Scalac212 (scalac), iteration 1135 completed (1546.353 ms) ======
====== Scalac212 (scalac), iteration 1136 started ======
====== Scalac212 (scalac), iteration 1136 completed (1503.672 ms) ======

Here is the setup I use. Download the tar from the following link https://drive.google.com/open?id=1yCRKqeI3oeynpiG7dKJho0CNwXytLw7j

After untarring, the bundle folder contains a shell script named renaissance. You can modify the script to add -XX:+PrintGC -XX:+PrintCompilation, as well as -Dbenchmark.outdir= and -Dbenchmark.dir= if you want it running on a RAM-based disk. To run:

  ./renaissance Scalac212 --pre-iteration-gc -r 2000
The machine is:
 Two sockets with Intel(R) Xeon(R) CPU E5-2699 v3 @ 2.30GHz
 378 GB of RAM
 Java(TM) SE Runtime Environment (build 1.8.0_60-b27)
 Kernel version is 4.1.12-37.5.1.el6uek.x86_64

I run all with

numactl --cpunodebind=0 --localalloc

and on node 0 I also mount the RAM disk, although the issue also happens on an SSD.

The benchmark file is attached; it uses a modified version of compiler-benchmark that compiles scalap with scalac 2.12:

https://github.com/scala/compiler-benchmark/pull/65

On one occasion the issue did not show up at all.


The following zip contains JFR logs and perf stat results for both the slowdown period and normal compilation:

It seems that the slowdown comes from a reduced level of parallelism. Can I disable parallelism in scalac somehow to verify this? If that is possible, we should also use it for all benchmarks.

Command: perf stat -e task-clock,cache-references,cache-misses,cycles,instructions,branches,branch-misses,bus-cycles,faults,migrations,context-switches
Fast:
   Performance counter stats for process id '59489':

     521463.920880      task-clock (msec)         #    5.540 CPUs utilized            (100.00%)
    11,887,705,092      cache-references          #   22.797 M/sec                    (85.71%)
       327,366,547      cache-misses              #    2.754 % of all cache refs      (57.15%)
 1,184,417,117,705      cycles                    #    2.271 GHz                      (71.42%)
   662,768,984,755      instructions              #    0.56  insns per cycle          (85.71%)
   111,637,012,930      branches                  #  214.084 M/sec                    (85.71%)
     2,116,802,307      branch-misses             #    1.90% of all branches          (85.71%)
    51,542,545,359      bus-cycles                #   98.842 M/sec                    (85.72%)
           139,273      faults                    #    0.267 K/sec                    (100.00%)
             1,005      migrations                #    0.002 K/sec                    (100.00%)
            27,726      context-switches          #    0.053 K/sec                  

      94.127725605 seconds time elapsed

Slow:
 Performance counter stats for process id '41280':

     312731.344629      task-clock (msec)         #    5.002 CPUs utilized            (100.00%)
     6,795,034,858      cache-references          #   21.728 M/sec                    (85.71%)
       183,152,774      cache-misses              #    2.695 % of all cache refs      (57.16%)
   710,429,380,282      cycles                    #    2.272 GHz                      (71.44%)
   385,386,357,573      instructions              #    0.54  insns per cycle          (85.72%)
    64,975,594,069      branches                  #  207.768 M/sec                    (85.71%)
     1,225,295,662      branch-misses             #    1.89% of all branches          (85.73%)
    30,918,428,952      bus-cycles                #   98.866 M/sec                    (85.70%)
            16,167      faults                    #    0.052 K/sec                    (100.00%)
               595      migrations                #    0.002 K/sec                    (100.00%)
            13,220      context-switches          #    0.042 K/sec                  

      62.524185630 seconds time elapsed

Analysis

Scalac is serial by default. Recent builds feature an option to run the backend in parallel. So the other threads in play come from the VM or benchmarking infrastructure. Flight Recorder only shows Java threads, so we don't see GC or JIT activity. I sometimes use async-profiler to see VM threads as well.
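
(Aside, to answer the question above about forcing serial compilation: the parallel backend is opt-in. A hedged sketch in sbt terms, assuming the 2.12.x flag is -Ybackend-parallelism <n>; the default already keeps the backend on one thread.)

  // Hedged sketch: pin the backend to a single thread for benchmarking.
  // Assumes the flag name -Ybackend-parallelism <n>; 1 (serial) is the default.
  scalacOptions ++= Seq("-Ybackend-parallelism", "1")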

The flight recorder profiles you provided do include a few samples from other threads.

Fast:

Thread              Thread Group   Profiling Samples   Total Allocation
main                main           6214                2.8756192848E10 B
JFR request timer   main           13                  4.2949776E7 B
Reference Handler   system         7
(The I/O, blocked-time, class-loading, throwable, and thread start/end columns were empty in both recordings and are omitted.)

Slow:

Thread              Thread Group   Profiling Samples   Total Allocation
main                main           6253                2.585350976E10 B
Reference Handler   system         26
JFR request timer   main           7                   4.2949776E7 B

The Reference Handler thread might be relevant. scalac employs a weak hash set to hash-cons all the Types it creates. This could be a source of inter-run performance effects.

We don't eagerly drain the reference queue when we discard the Global or the Run. There are two difficulties here. First, Global and Run don't have close() methods, so we would need to add them while still dealing with old callers who won't call them. Second, when I once tried registering this map to be cleared at the start of the next run, I hit a test failure.
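
To make the hash-consing idea concrete, here is a minimal sketch (not scalac's actual WeakHashSet; all names here are invented) of interning via weak references plus a ReferenceQueue. It also shows what draining buys: the GC clears the referents regardless, but undrained stale entries accumulate and every lookup has to skip over them.

  import java.lang.ref.{ReferenceQueue, WeakReference}
  import scala.collection.mutable

  // Minimal hash-consing sketch: intern returns a canonical instance for
  // structurally equal values; cleared entries are removed by polling the queue.
  final class WeakInterner[A <: AnyRef] {
    private final class Entry(value: A, q: ReferenceQueue[A])
        extends WeakReference[A](value, q) {
      val hash: Int = value.hashCode
    }

    private val queue   = new ReferenceQueue[A]
    private val buckets = mutable.HashMap.empty[Int, List[Entry]]

    def intern(value: A): A = {
      drain() // "draining the reference queue": evict entries whose referents are gone
      val h = value.hashCode
      buckets.getOrElse(h, Nil).iterator.map(_.get).find(value == _) match {
        case Some(existing) => existing // reuse the canonical instance
        case None =>
          buckets(h) = new Entry(value, queue) :: buckets.getOrElse(h, Nil)
          value
      }
    }

    // Without draining, the referents are still collected (WeakReference semantics),
    // but stale Entry wrappers pile up and lookups scan ever-longer bucket lists.
    private def drain(): Unit = {
      var ref = queue.poll()
      while (ref != null) {
        val entry = ref.asInstanceOf[Entry]
        buckets(entry.hash) = buckets.getOrElse(entry.hash, Nil).filterNot(_ eq entry)
        ref = queue.poll()
      }
    }
  }

This is only an illustration of the mechanism; the real implementation in scala.reflect differs in structure and in when (if ever) the queue is polled.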

Now, this might all turn out to be a red herring, but it's somewhere to start.

retronym commented 6 years ago

Hmm, taking a closer look, the profile samples from the "Reference Handler" thread in both recordings are only in the first few seconds after attach, so they might be an artifact of flight recorder turning on.

Furthermore, @vjovanov's benchmark is doing full GC before each run, so it probably doesn't matter for this test whether we drain the queue manually or not. Does it ever matter? Or am I getting confused with finalizers?

retronym commented 6 years ago

As expected, JIT is basically inactive in both recordings, according to the compilation events in the profiles.

I think the 5x vs 5.4x core utilization could only be coming from the GC, and the bulk of that should be from the explicit System.gc calls that are outside of the measurements above.

Puzzling stuff.

retronym commented 6 years ago

Other random ideas:

retronym commented 6 years ago

Here's a zoomed in look at a single instance of the GC and compile cycle:

image

It spends 550 ms doing the explicit, parallel GC, reporting a machine utilization of 13.8% (if 8 of the machine's 64 cores are participating, that would be 12.5%). Then the workload drops to a single thread (the next sample of machine CPU is 1.4%, about 1/64).

So, if something causes the serial scalac to get slower, it is natural that perf will report reduced core utilization for a recording, as proportionally more time is spent single-threaded.
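
As a toy illustration of that point (rounded numbers taken loosely from the figures above, not a new measurement): model an iteration as a fixed parallel-GC phase followed by a single-threaded compile phase, and stretch only the compile phase.

  // Toy model: average "CPUs utilized" over one iteration, assuming a GC phase on
  // gcThreads cores followed by a single-threaded compile phase. Numbers are rough.
  def avgUtilization(gcSec: Double, gcThreads: Int, compileSec: Double): Double =
    (gcSec * gcThreads + compileSec * 1.0) / (gcSec + compileSec)

  println(f"${avgUtilization(0.55, 8, 1.29)}%.2f") // ~3.09 with the faster compile times
  println(f"${avgUtilization(0.55, 8, 1.39)}%.2f") // ~2.98 with the slower ones: lower, with no change in parallelism

The absolute values don't match the perf figures (perf also sees other VM threads), but the direction of the effect is the point.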

rorygraves commented 6 years ago

I think we can exclude CPU throttling. The 6 runs (of which two slowed down) ran continuously for ~6+ hours. So (admittedly I only have 2 slowdowns recorded), if it were throttling I would expect to see much more random slowdowns rather than a step change near the end of two runs.

rorygraves commented 6 years ago

Just reread your comment and saw "the perf stats don't seem to support this theory, though" - d'oh.

retronym commented 6 years ago

It couldn't hurt to log the processor frequency and/or temperature between iterations with something like https://unix.stackexchange.com/questions/264632/what-is-the-correct-way-to-view-your-cpu-speed-on-linux

rorygraves commented 6 years ago

Good plan.


vjovanov commented 6 years ago

OK, I have: 1) Tried other benchmarks (dotty and collection operations), and the issue does not appear. 2) Ran with -XX:+UseSerialGC, both with pre-iteration GC and without, and it still happens. Results can be found here:

https://drive.google.com/open?id=1x8LUvIT1KDOMKvG_xqN9IlgIsJymfO_9

The only thing I can see is that when the slowdown happens we have fewer instructions per cycle. Now, at least the JFR logs and `perf stat` are not polluted with the parallel GC. What could cause the lower IPC when the cache hit ratio is the same?

3) Frequency scaling is disabled, and several different 72-core machines are unlikely to overheat when only one core is running. perf also measures the frequency, and it is practically the same.

I will try it on SVM as well.

retronym commented 6 years ago

Reproducing the relevant perf output here to save folks a few clicks:

Fast:
 Performance counter stats for process id '47798':

      41958.315519      task-clock (msec)         #    1.013 CPUs utilized            (100.00%)
     2,011,054,294      cache-references          #   47.930 M/sec                    (85.74%)
       183,346,591      cache-misses              #    9.117 % of all cache refs      (57.19%)
    95,220,084,700      cycles                    #    2.269 GHz                      (71.49%)
   125,964,611,325      instructions              #    1.32  insns per cycle          (85.80%)
    24,605,487,877      branches                  #  586.427 M/sec                    (85.79%)
       408,773,005      branch-misses             #    1.66% of all branches          (85.66%)
     4,145,778,472      bus-cycles                #   98.807 M/sec                    (85.62%)
            55,564      faults                    #    0.001 M/sec                    (100.00%)
                16      migrations                #    0.000 K/sec                    (100.00%)
             3,818      context-switches          #    0.091 K/sec                  

      41.419963778 seconds time elapsed

Slow:
 Performance counter stats for process id '47798':

      50301.128765      task-clock (msec)         #    1.019 CPUs utilized            (100.00%)
     2,082,679,231      cache-references          #   41.404 M/sec                    (85.75%)
       193,405,386      cache-misses              #    9.286 % of all cache refs      (57.11%)
   114,133,639,188      cycles                    #    2.269 GHz                      (71.51%)
   135,359,551,402      instructions              #    1.19  insns per cycle          (85.75%)
    26,384,438,882      branches                  #  524.530 M/sec                    (85.74%)
       441,713,725      branch-misses             #    1.67% of all branches          (85.69%)
     4,970,158,143      bus-cycles                #   98.808 M/sec                    (85.72%)
            11,708      faults                    #    0.233 K/sec                    (100.00%)
                27      migrations                #    0.001 K/sec                    (100.00%)
             4,102      context-switches          #    0.082 K/sec                  

      49.361826802 seconds time elapsed
retronym commented 6 years ago

Following the Top Down Analysis Technique with VTune or Oracle Performance Studio could probably shine a light on what's changed.

If you have either of those tools at hand and could record the fast/slow profiles, we'll either be able to figure out the problem or nerd-snipe someone into helping out.

We could first try to get a broader (full?) set of top-level hardware counters with perf. Maybe something is hiding behind the apparently unchanged "cache-misses" ratio. E.g., what if the instruction cache hit rate suffers, causing a big IPC change without a discernible change in the overall cache-misses stat?

vjovanov commented 6 years ago

First, I will see if the same problem happens in SBT when the builds slow down after a while. Then, I will try with VTune. This will take me a few days.

mkeskells commented 6 years ago

The graph that @rorygraves produced was from sbt runs. @vjovanov, most OSes will move processes around to avoid overhead.

I am not sure that we are looking at the same issue. @rorygraves saw the throughput roughly halve between runs.

SethTisue commented 5 years ago

@retronym where do we stand here? should this remain open?

retronym commented 5 years ago

It remains curious. I'll close on the assumption the investigations have petered out.