renaissance-benchmarks / renaissance

The Renaissance Benchmark Suite
https://renaissance.dev
GNU General Public License v3.0
312 stars 60 forks source link

large variance in exec time for scala-doku #175

Open albertnetymk opened 5 years ago

albertnetymk commented 5 years ago

The following are two separate runs of java -jar renaissance-mit-0.10.0.jar scala-doku. It seems that for run B, JIT made a wrong turn in the second iteration, and never managed to get back on track.

run A

====== scala-doku (scala-sat), iteration 0 started ======
====== scala-doku (scala-sat), iteration 0 completed (6099.883 ms) ======
====== scala-doku (scala-sat), iteration 1 started ======
====== scala-doku (scala-sat), iteration 1 completed (5517.723 ms) ======
====== scala-doku (scala-sat), iteration 2 started ======
====== scala-doku (scala-sat), iteration 2 completed (5452.386 ms) ======
====== scala-doku (scala-sat), iteration 3 started ======
====== scala-doku (scala-sat), iteration 3 completed (5258.975 ms) ======
====== scala-doku (scala-sat), iteration 4 started ======
====== scala-doku (scala-sat), iteration 4 completed (5272.757 ms) ======
====== scala-doku (scala-sat), iteration 5 started ======
====== scala-doku (scala-sat), iteration 5 completed (5216.844 ms) ======
====== scala-doku (scala-sat), iteration 6 started ======
====== scala-doku (scala-sat), iteration 6 completed (5195.208 ms) ======
====== scala-doku (scala-sat), iteration 7 started ======
====== scala-doku (scala-sat), iteration 7 completed (5644.733 ms) ======
====== scala-doku (scala-sat), iteration 8 started ======
====== scala-doku (scala-sat), iteration 8 completed (5643.653 ms) ======
====== scala-doku (scala-sat), iteration 9 started ======
====== scala-doku (scala-sat), iteration 9 completed (5458.51 ms) ======
====== scala-doku (scala-sat), iteration 10 started ======
====== scala-doku (scala-sat), iteration 10 completed (5551.477 ms) ======
====== scala-doku (scala-sat), iteration 11 started ======
====== scala-doku (scala-sat), iteration 11 completed (5798.539 ms) ======
====== scala-doku (scala-sat), iteration 12 started ======
====== scala-doku (scala-sat), iteration 12 completed (5871.424 ms) ======
====== scala-doku (scala-sat), iteration 13 started ======
====== scala-doku (scala-sat), iteration 13 completed (5874.387 ms) ======
====== scala-doku (scala-sat), iteration 14 started ======
====== scala-doku (scala-sat), iteration 14 completed (5880.167 ms) ======
====== scala-doku (scala-sat), iteration 15 started ======
====== scala-doku (scala-sat), iteration 15 completed (5858.208 ms) ======
====== scala-doku (scala-sat), iteration 16 started ======
====== scala-doku (scala-sat), iteration 16 completed (5885.238 ms) ======
====== scala-doku (scala-sat), iteration 17 started ======
====== scala-doku (scala-sat), iteration 17 completed (5909.086 ms) ======
====== scala-doku (scala-sat), iteration 18 started ======
====== scala-doku (scala-sat), iteration 18 completed (5838.492 ms) ======
====== scala-doku (scala-sat), final iteration started ======
====== scala-doku (scala-sat), final iteration completed (5891.406 ms) ======

run B

====== scala-doku (scala-sat), iteration 0 started ======
====== scala-doku (scala-sat), iteration 0 completed (6757.863 ms) ======
====== scala-doku (scala-sat), iteration 1 started ======
====== scala-doku (scala-sat), iteration 1 completed (9237.812 ms) ======
====== scala-doku (scala-sat), iteration 2 started ======
====== scala-doku (scala-sat), iteration 2 completed (8455.805 ms) ======
====== scala-doku (scala-sat), iteration 3 started ======
====== scala-doku (scala-sat), iteration 3 completed (8400.774 ms) ======
====== scala-doku (scala-sat), iteration 4 started ======
====== scala-doku (scala-sat), iteration 4 completed (8465.861 ms) ======
====== scala-doku (scala-sat), iteration 5 started ======
====== scala-doku (scala-sat), iteration 5 completed (8416.82 ms) ======
====== scala-doku (scala-sat), iteration 6 started ======
====== scala-doku (scala-sat), iteration 6 completed (9099.18 ms) ======
====== scala-doku (scala-sat), iteration 7 started ======
====== scala-doku (scala-sat), iteration 7 completed (8914.351 ms) ======
====== scala-doku (scala-sat), iteration 8 started ======
====== scala-doku (scala-sat), iteration 8 completed (8955.402 ms) ======
====== scala-doku (scala-sat), iteration 9 started ======
====== scala-doku (scala-sat), iteration 9 completed (8827.685 ms) ======
====== scala-doku (scala-sat), iteration 10 started ======
====== scala-doku (scala-sat), iteration 10 completed (8884.836 ms) ======
====== scala-doku (scala-sat), iteration 11 started ======
====== scala-doku (scala-sat), iteration 11 completed (8911.605 ms) ======
====== scala-doku (scala-sat), iteration 12 started ======
====== scala-doku (scala-sat), iteration 12 completed (8890.457 ms) ======
====== scala-doku (scala-sat), iteration 13 started ======
====== scala-doku (scala-sat), iteration 13 completed (8959.936 ms) ======
====== scala-doku (scala-sat), iteration 14 started ======
====== scala-doku (scala-sat), iteration 14 completed (8922.854 ms) ======
====== scala-doku (scala-sat), iteration 15 started ======
====== scala-doku (scala-sat), iteration 15 completed (8954.935 ms) ======
====== scala-doku (scala-sat), iteration 16 started ======
====== scala-doku (scala-sat), iteration 16 completed (8920.92 ms) ======
====== scala-doku (scala-sat), iteration 17 started ======
====== scala-doku (scala-sat), iteration 17 completed (8881.392 ms) ======
====== scala-doku (scala-sat), iteration 18 started ======
====== scala-doku (scala-sat), iteration 18 completed (8909.44 ms) ======
====== scala-doku (scala-sat), final iteration started ======
====== scala-doku (scala-sat), final iteration completed (9045.501 ms) ======

ENV:

openjdk 12.0.1 2019-04-16 Debian GNU/Linux 10 (buster) 2 core (2 hyperthreading/core) Mem: 11G

$ lscpu
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
Address sizes:       39 bits physical, 48 bits virtual
CPU(s):              4
On-line CPU(s) list: 0-3
Thread(s) per core:  2
Core(s) per socket:  2
Socket(s):           1
NUMA node(s):        1
Vendor ID:           GenuineIntel
CPU family:          6
Model:               69
Model name:          Intel(R) Core(TM) i7-4600U CPU @ 2.10GHz
Stepping:            1
CPU MHz:             1380.740
CPU max MHz:         3300.0000
CPU min MHz:         800.0000
BogoMIPS:            5387.92
Virtualization:      VT-x
L1d cache:           32K
L1i cache:           32K
L2 cache:            256K
L3 cache:            4096K
NUMA node0 CPU(s):   0-3
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand lahf_lm abm cpuid_fault epb invpcid_single pti tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm xsaveopt dtherm ida arat pln pts
davleopo commented 5 years ago

Hi @albertnetymk ,

thanks for sharing your results. It is interesting to see that on your machine you encounter bi-modal performance for this benchmark.

I did not witness this behavior when adding the benchmark originally ( https://github.com/renaissance-benchmarks/renaissance/pull/173) , on my machine. As you suggested this may be related to a sub-optimal decision made by the JIT compiler. If it is a JIT decision causing this behavior, I would not consider this a bug but rather a nice side-effect of the benchmark exhibiting the need to adapt/tune the heuristic in the compiler causing these extreme performance cliffs.

Have you tried running more executions of the entire benchmark, like 10-20? Would be interesting to see what the probability of the 2 performance levels is. Additionally, have you tried minimizing external influences of your machine/setup in order to stabilize results? This may include heap sizes, the usage of ram discs, reducing influences from other processes running on your machine, frequency scaling, turbo boost, etc?

Thank you! david

albertnetymk commented 5 years ago

Listing the final iteration exec time for 14 consecutive invocations of java -jar renaissance-mit-0.10.0.jar scala-doku:

====== scala-doku (scala-sat), final iteration completed (5891.406 ms) ======
====== scala-doku (scala-sat), final iteration completed (5971.683 ms) ======
====== scala-doku (scala-sat), final iteration completed (8892.878 ms) ======
====== scala-doku (scala-sat), final iteration completed (5899.568 ms) ======
====== scala-doku (scala-sat), final iteration completed (5934.718 ms) ======
====== scala-doku (scala-sat), final iteration completed (4971.463 ms) ======
====== scala-doku (scala-sat), final iteration completed (9045.501 ms) ======
====== scala-doku (scala-sat), final iteration completed (5865.651 ms) ======
====== scala-doku (scala-sat), final iteration completed (9253.71 ms) ======
====== scala-doku (scala-sat), final iteration completed (8939.641 ms) ======
====== scala-doku (scala-sat), final iteration completed (5860.301 ms) ======
====== scala-doku (scala-sat), final iteration completed (8831.876 ms) ======
====== scala-doku (scala-sat), final iteration completed (5891.929 ms) ======
====== scala-doku (scala-sat), final iteration completed (8726.487 ms) ======

The JVM flags used (by default) are made explicit via -XX:+PrintCommandLineFlags, and they are: -XX:G1ConcRefinementThreads=4 -XX:GCDrainStackTargetSize=64 -XX:InitialHeapSize=194809728 -XX:MaxHeapSize=3116955648 -XX:+PrintCommandLineFlags -XX:ReservedCodeCacheSize=251658240 -XX:+SegmentedCodeCache -XX:+UseCompressedClassPointers -XX:+UseCompressedOops -XX:+UseG1GC

Benchmarking is done on my laptop, so not very rigorous testing env... I have left it idle while running the benchmark at least, and turbo boost is on.

roookeee commented 5 years ago

Running the benchmarks with anything but a fixed cpu speed isn't great, running it on a laptop that throttles aggressively which probably uses and Intel CPU which has it's max turbo boost timebound (instead of temperature) is worse - please redo it on a properly cooled desktop pc and if possible with a fixed frequency that doesn't cause throttling. A variance from 10-40% is to be expected (given the benchmark conditions)

Theodor commented 5 years ago

I do see this kind of bimodal behavior for 'scala-doku' on dedicated bare-metal performance machines (with all kinds of possible preparation done). On a variety of JITs it shows "stably bi-modal" behavior, dropping into bad score on 1-2 iteration. Have not yet investigated the exact compilation that causes that though ...