krausest / js-framework-benchmark

A comparison of the performance of a few popular javascript frameworks
https://krausest.github.io/js-framework-benchmark/
Apache License 2.0
6.58k stars 814 forks

benchmark driver with chrome 83 causes very large variance #741

Closed krausest closed 4 years ago

krausest commented 4 years ago

Chrome 83 causes a massive issue for benchmark 07 (create 10,000 rows). Compare the std deviation for Chrome 81 with Chrome 83: some frameworks (but not all!) show a massive increase for create 10,000 rows.

Vanillajs is not affected; react is badly hit:

vanillajs:

| duration | upper bound |
| --- | --- |
| 1012.33 | 1089.351 |
| 971.895 | 1018.114 |
| 1021.275 | 1123.55 |
| 1004.614 | 1144.641 |
| 1083.901 | 1282.512 |
| 1073.189 | 1201.911 |
| 1063.792 | 1267.276 |
| 1000.341 | 1051.175 |
| 1026.629 | 1139.84 |
| 1014.036 | 1148.692 |

react:

| duration | upper bound |
| --- | --- |
| 1996.347 | 2069.631 |
| 2178.687 | 2449.581 |
| 2793.722 | 2950.74 |
| 3199.166 | 3393.684 |
| 3422.698 | 3550.979 |
| 9042.936 | 9357.742 |
| 5520.162 | 5850.739 |
| 5799.582 | 5953.318 |
| 13869.262 | 14293.366 |
| 15228.996 | 15726.095 |

When looking at the timeline in chrome I often see durations like 2.3 secs, but never values as slow as above.

Often, but not always, the append rows to large table benchmark is also affected (e.g. not for crui, elm or hyperapp).

krausest commented 4 years ago

When I don't re-use the chrome driver for the CPU benchmarks, the std deviation goes down for react and react-djinn completes again. This of course means that running all benchmarks will (presumably) take much longer. I think I have to go this way since all other attempts to find the culprit failed. Removing all chrome options or the performance logging didn't remove the issue, so I have no idea how to find another workaround. A possible option would be to reuse the chrome driver for all benchmarks but 7 and 8, but that doesn't feel right either.

krausest commented 4 years ago

Here's a comparison between the ways to handle chromedriver for three frameworks.

On the left is the old approach: for each CPU benchmark one chrome driver is created and all benchmark iterations are performed in that instance. On the right is an approach where each benchmark iteration forks a new subprocess that creates a new chromedriver. (The forking appears to be necessary to eliminate a strange bug: sometimes chromedriver reports the error "Server terminated early with status 1". I added a retry loop for that situation, but it turned out that this error can't be recovered from; it occurs again even when a new chromedriver instance is created. I haven't seen this happen when forking the process that creates the chromedriver, so maybe forking is a fix for that too.)

  1. It's clear that create 10,000 rows is affected by this bug, but only for react and angular.
  2. What's interesting is that react is much slower when forking for select rows (but please note that the columns are in a different order). Why does that happen? It turns out that the first run for select rows is particularly bad for react (example: 51.545, 36.878, 40.385, 34.208, 35.38, 32.396, 36.072, 32.969, 32.185, 36.256). When forking you'll always get numbers like the first one.

The other benchmarks look similar in both modes, though some have a rather high variance. On https://krausest.github.io/js-framework-benchmark/2020/table_chrome_83.html some frameworks like ember have the issue for the append 1,000 to large table benchmark too.

I'm looking for opinions (@leeoniya @ryansolid ...): where to go from here? I could switch to forking for all benchmarks, but this makes the benchmark run much longer. Even on linux I think I could only run the keyed section during the night. I don't even want to think about what this means for running on windows. The other option would be to fork only those benchmarks that have this issue for any framework.

krausest commented 4 years ago

The react finding looks like a warmup effect that's exposed when forking.

| Warmup runs | mean | median | stdev |
| --- | --- | --- | --- |
| 1 | 64.4 | 64.0 | 2.8 |
| 4 | 47.1 | 44.6 | 5.2 |
| 6 | 55.9 | 54.0 | 6.0 |
| 7 | 35.1 | 33.2 | 4.2 |
| 8 | 32.5 | 32.0 | 3.2 |
| 10 | 35.2 | 33.4 | 4.5 |
| 20 | 31.2 | 31.0 | 3.9 |
| 40 | 33.5 | 33.2 | 2.9 |

Of course the default so far was ... 6 :joy: I'll take a look at other frameworks and see if they share the warmup effect.

In addition to react the following frameworks show that effect: angular-ng, neverland, raw-act, rax, reaml, sifrr and vue-next
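The warmup scheme itself is simple; here's a local stand-in (the real driver runs each iteration against Chrome via selenium, this just shows the discard-then-measure structure):

```javascript
// Run a number of unmeasured warmup iterations before the measured ones,
// so JIT/caching effects don't pollute the reported durations.
function benchmarkWithWarmup(fn, { warmupRuns = 6, measuredRuns = 10 } = {}) {
  for (let i = 0; i < warmupRuns; i++) fn(); // results discarded
  const durationsMs = [];
  for (let i = 0; i < measuredRuns; i++) {
    const t0 = process.hrtime.bigint();
    fn();
    durationsMs.push(Number(process.hrtime.bigint() - t0) / 1e6);
  }
  return durationsMs;
}
```

The table above is essentially a sweep over `warmupRuns` for react's select rows benchmark, showing that 7+ warmups is where the mean settles.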

brainkim commented 4 years ago

One thing I’ve discovered while messing around with the benchmarking code: based on experimentation with my framework, I suspected the variance in the rowlots test in Chrome 83 was caused by GC pressure when tests weren’t run in isolation. I found that for some reason we force a GC before each memory test according to these lines of code: https://github.com/krausest/js-framework-benchmark/blob/a349f948ecba530f9197f36c0bbdef76779093b5/webdriver-ts/src/forkedBenchmarkRunner.ts#L323-L325

but not before CPU tests.

By removing the condition and forcing a GC before all benchmarks, I was able to get React to behave like it did in Chrome 81.
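Structurally, the change amounts to something like this (function names are illustrative, not the repo's actual API; in the browser the driver triggers GC through devtools, while in plain Node the equivalent is `global.gc()`, available only under `--expose-gc`):

```javascript
// Force a full GC before every iteration, including CPU benchmarks —
// previously only the memory benchmarks did this.
function forceGC() {
  if (typeof global.gc === "function") {
    global.gc();
    return true;
  }
  return false; // not exposed; run node with --expose-gc to enable
}

async function runCpuBenchmark(iterationFn, runs) {
  const durationsMs = [];
  for (let i = 0; i < runs; i++) {
    forceGC(); // start each measured iteration from a clean heap
    const t0 = Date.now();
    await iterationFn();
    durationsMs.push(Date.now() - t0);
  }
  return durationsMs;
}
```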

Before: (screenshot, 2020-05-25 2:26 PM) After: (screenshot, 2020-05-25 2:27 PM)

leeoniya commented 4 years ago

Where to go from here? I could either switch to forking for all benchmarks, but this makes the benchmark run much longer. Even on linux I think I can only run the keyed section during night. I don't even want to think about what this means to running on windows. The other option would to fork only benchmarks that have this issue for any framework.

man, this & the warmups thing really opens up a new can of worms.

i'm now thinking it makes logical sense to fork and disable warmups completely for initial "create" & startup-like metrics, since no one will be loading an app with a warmed up JIT/cache. even 1 warmup round is misleading, let alone some magical 7. this will certainly impact the variance negatively, especially for frameworks which already don't do well here, but that's just reality, which the current strategy is not.

krausest commented 4 years ago

Here are the results for the keyed section when forking the chromedriver, but still using warmup: https://krausest.github.io/js-framework-benchmark/2020/table_chrome_83_forked_only_keyed.html Any comments on that? (I used only 1 iteration for the memory benchmarks. I should update the table and remove the std error for those results.)

This night is reserved for a run without warmups...

krausest commented 4 years ago

@brainkim Thanks - that's an excellent catch. Here's my analysis for react-hooks and benchmark 7 (create 10,000 rows):

| mode | mean | std dev |
| --- | --- | --- |
| Chrome 81 (no forking) | 1657.51 | 74.3 |
| Chrome 83 (no forking) | 4450.6 | 3377.4 |
| Chrome 83 (no forking, force GC) | 1464.1 | 105.1 |
| Chrome 83 (forking) | 1652.3 | 13.2 |
| Chrome 83 (no forking, wait 2 seconds) | 1504.5 | 79.4 |

So indeed forcing GC brings us close (or maybe even equal) to the old behaviour. What's even more surprising is that just waiting 2 seconds in each iteration (without forking and without forcing GC) also gets us close to the old results. It seems the benchmark runner pauses too little between runs and thus creates too much pressure on the GC. This would also explain why I couldn't see that behaviour in the timeline when testing manually. I'll see if I can trace GC times and confirm the results.

Any thoughts on that?
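For reference, the "wait 2 seconds" variant is just idle time inserted before each measured iteration, so V8 can do its GC work outside the measured window (a sketch with illustrative names):

```javascript
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Give the engine unmeasured idle time before each iteration instead of
// forcing a GC or forking a fresh browser.
async function runWithPause(iterationFn, runs, pauseMs = 2000) {
  const durationsMs = [];
  for (let i = 0; i < runs; i++) {
    await sleep(pauseMs); // idle gap; GC can run here, outside the timer
    const t0 = Date.now();
    await iterationFn();
    durationsMs.push(Date.now() - t0);
  }
  return durationsMs;
}
```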

krausest commented 4 years ago

@brainkim ForceGC was intentionally not called for the CPU benchmarks. I wanted to include GC in those measurements (and in the end punish implementations with high GC pressure). Since the memory benchmarks work by comparing the heap size, it's necessary to call it for them.

brainkim commented 4 years ago

@krausest

I wanted to include GC in those measurements (and in the end punish implementations with high GC pressure)

Yes, I assumed you had good reasons for excluding forced GCs from the CPU tests. I didn’t realize that initBenchmark ran for each run of each benchmark test; I thought it ran once between benchmark tests, so I am unclear about what you should do. In a sense, you are doing exactly what you intended and punishing frameworks with high GC pressure with the benchmarks as they currently stand, but maybe the problem is that the punishment is too harsh?

All of this performance stuff is new to me, but one thing I’ve learned is that CPU and memory are inextricably linked; in actuality, the execution time of your code matters less than allocating less memory, because past a certain threshold you pay for the excess allocations with random minor GCs. I would therefore argue that most of the frameworks which are at a 1.5+ geometric slowdown in CPU are already being punished for GC pressure, and doing a GC between runs only prevents huge major GCs from happening while the benchmark is running, which shouldn’t ever really happen in any actual application (maybe during page navigation between two big views?)

I’m not sure what the best solution is.

krausest commented 4 years ago

Just for the record, here are the timeline entries for a fast and a very slow run. It seems like nothing is traced that would confirm it's a GC issue. Maybe I'll have to play with the traceCategories...


one iteration with 1641 msecs
{"method":"Tracing.dataCollected","params":{"args":{"data":{"type":"click"}},"cat":"devtools.timeline","dur":752701,"name":"EventDispatch","ph":"X","pid":2452101,"tdur":752019,"tid":1,"ts":175708667988,"tts":1951381}}
{"method":"Tracing.dataCollected","params":{"args":{"data":{"columnNumber":52644,"frame":"F24A992E3DB781CA978E666125B65C79","functionName":"mr","lineNumber":1,"scriptId":"8","url":"http://localhost:8080/frameworks/keyed/react-hooks/dist/main.js"}},"cat":"devtools.timeline","dur":9,"name":"FunctionCall","ph":"X","pid":2452101,"tdur":9,"tid":1,"ts":175708668233,"tts":1951626}}
{"method":"Tracing.dataCollected","params":{"args":{"data":{"columnNumber":43935,"frame":"F24A992E3DB781CA978E666125B65C79","functionName":"Pn","lineNumber":1,"scriptId":"8","url":"http://localhost:8080/frameworks/keyed/react-hooks/dist/main.js"}},"cat":"devtools.timeline","dur":752405,"name":"FunctionCall","ph":"X","pid":2452101,"tdur":751725,"tid":1,"ts":175708668252,"tts":1951645}}
{"method":"Tracing.dataCollected","params":{"args":{"type":"allocation failure","usedHeapSizeAfter":31665596,"usedHeapSizeBefore":33840316},"cat":"devtools.timeline,v8","dur":3184,"name":"MinorGC","ph":"X","pid":2452101,"tdur":3038,"tid":1,"ts":175708737687,"tts":2021076}}
{"method":"Tracing.dataCollected","params":{"args":{"type":"allocation failure","usedHeapSizeAfter":33454884,"usedHeapSizeBefore":34322436},"cat":"devtools.timeline,v8","dur":5065,"name":"MinorGC","ph":"X","pid":2452101,"tdur":4651,"tid":1,"ts":175708775855,"tts":2059093}}
{"method":"Tracing.dataCollected","params":{"args":{"type":"allocation failure","usedHeapSizeAfter":42120704,"usedHeapSizeBefore":42340088},"cat":"devtools.timeline,v8","dur":6433,"name":"MinorGC","ph":"X","pid":2452101,"tdur":6382,"tid":1,"ts":175708879028,"tts":2161854}}
{"method":"Tracing.dataCollected","params":{"args":{"type":"allocation failure","usedHeapSizeAfter":44301652,"usedHeapSizeBefore":45027452},"cat":"devtools.timeline,v8","dur":6339,"name":"MinorGC","ph":"X","pid":2452101,"tdur":6276,"tid":1,"ts":175708916981,"tts":2199757}}
{"method":"Tracing.dataCollected","params":{"args":{"data":{"type":"DOMActivate"}},"cat":"devtools.timeline","dur":5,"name":"EventDispatch","ph":"X","pid":2452101,"tdur":5,"tid":1,"ts":175709420681,"tts":2703392}}
{"method":"Tracing.dataCollected","params":{"args":{"beginData":{"frame":"F24A992E3DB781CA978E666125B65C79"},"elementCount":90001},"cat":"blink,devtools.timeline","dur":319624,"name":"UpdateLayoutTree","ph":"X","pid":2452101,"tdur":319625,"tid":1,"ts":175709420793,"tts":2703504}}
{"method":"Tracing.dataCollected","params":{"args":{"beginData":{"dirtyObjects":120007,"frame":"F24A992E3DB781CA978E666125B65C79","partialLayout":false,"totalObjects":120071},"endData":{"root":[0,0,2384,0,2384,1430,0,1430],"rootNode":15}},"cat":"devtools.timeline","dur":432684,"name":"Layout","ph":"X","pid":2452101,"tdur":432693,"tid":1,"ts":175709743317,"tts":3026028}}
{"method":"Tracing.dataCollected","params":{"args":{"data":{"frame":"F24A992E3DB781CA978E666125B65C79"}},"cat":"devtools.timeline","dur":129279,"name":"UpdateLayerTree","ph":"X","pid":2452101,"tdur":129282,"tid":1,"ts":175710176025,"tts":3458743}}
{"method":"Tracing.dataCollected","params":{"args":{"data":{"clip":[0,0,2384,0,2384,1430,0,1430],"frame":"F24A992E3DB781CA978E666125B65C79","layerId":9,"nodeId":15}},"cat":"devtools.timeline,rail","dur":14,"name":"Paint","ph":"X","pid":2452101,"tdur":14,"tid":1,"ts":175710305331,"tts":3588051}}
{"method":"Tracing.dataCollected","params":{"args":{"data":{"clip":[0,0,2354,0,2354,5430,0,5430],"frame":"F24A992E3DB781CA978E666125B65C79","layerId":11,"nodeId":15}},"cat":"devtools.timeline,rail","dur":4576,"name":"Paint","ph":"X","pid":2452101,"tdur":4577,"tid":1,"ts":175710305354,"tts":3588074}}

total duration = ts last paint + tdur - ts click = 175710305354 + 4577 - 175708667988 = 1641.943 msecs
time to update = 175709420793 - 175708667988 = 752.805 msecs

dur_GC = 3184 + 5065 + 6433 + 6339 = 21.021 msecs

one iteration with 9272 msecs

{"method":"Tracing.dataCollected","params":{"args":{"data":{"type":"click"}},"cat":"devtools.timeline","dur":8388739,"name":"EventDispatch","ph":"X","pid":2452101,"tdur":8387571,"tid":1,"ts":175737925316,"tts":31017883}}
{"method":"Tracing.dataCollected","params":{"args":{"data":{"columnNumber":52644,"frame":"F24A992E3DB781CA978E666125B65C79","functionName":"mr","lineNumber":1,"scriptId":"8","url":"http://localhost:8080/frameworks/keyed/react-hooks/dist/main.js"}},"cat":"devtools.timeline","dur":7,"name":"FunctionCall","ph":"X","pid":2452101,"tdur":7,"tid":1,"ts":175737925420,"tts":31017987}}
{"method":"Tracing.dataCollected","params":{"args":{"data":{"columnNumber":43935,"frame":"F24A992E3DB781CA978E666125B65C79","functionName":"Pn","lineNumber":1,"scriptId":"8","url":"http://localhost:8080/frameworks/keyed/react-hooks/dist/main.js"}},"cat":"devtools.timeline","dur":8388585,"name":"FunctionCall","ph":"X","pid":2452101,"tdur":8387418,"tid":1,"ts":175737925436,"tts":31018003}}
{"method":"Tracing.dataCollected","params":{"args":{"type":"allocation failure","usedHeapSizeAfter":194932976,"usedHeapSizeBefore":197014736},"cat":"devtools.timeline,v8","dur":6829,"name":"MinorGC","ph":"X","pid":2452101,"tdur":6395,"tid":1,"ts":175738473327,"tts":31565893}}
{"method":"Tracing.dataCollected","params":{"args":{"type":"allocation failure","usedHeapSizeAfter":199534964,"usedHeapSizeBefore":200270600},"cat":"devtools.timeline,v8","dur":7144,"name":"MinorGC","ph":"X","pid":2452101,"tdur":7101,"tid":1,"ts":175740148866,"tts":33240932}}
{"method":"Tracing.dataCollected","params":{"args":{"type":"allocation failure","usedHeapSizeAfter":204488224,"usedHeapSizeBefore":205061120},"cat":"devtools.timeline,v8","dur":7973,"name":"MinorGC","ph":"X","pid":2452101,"tdur":7686,"tid":1,"ts":175741897353,"tts":34989347}}
{"method":"Tracing.dataCollected","params":{"args":{"type":"allocation failure","usedHeapSizeAfter":210313232,"usedHeapSizeBefore":210692012},"cat":"devtools.timeline,v8","dur":8028,"name":"MinorGC","ph":"X","pid":2452101,"tdur":7913,"tid":1,"ts":175743871037,"tts":36962712}}
{"method":"Tracing.dataCollected","params":{"args":{"type":"allocation failure","usedHeapSizeAfter":215287136,"usedHeapSizeBefore":215856808},"cat":"devtools.timeline,v8","dur":7666,"name":"MinorGC","ph":"X","pid":2452101,"tdur":7621,"tid":1,"ts":175745612031,"tts":38703501}}
{"method":"Tracing.dataCollected","params":{"args":{"data":{"type":"DOMActivate"}},"cat":"devtools.timeline","dur":8,"name":"EventDispatch","ph":"X","pid":2452101,"tdur":7,"tid":1,"ts":175746314044,"tts":39405443}}
{"method":"Tracing.dataCollected","params":{"args":{"beginData":{"frame":"F24A992E3DB781CA978E666125B65C79"},"elementCount":90001},"cat":"blink,devtools.timeline","dur":307108,"name":"UpdateLayoutTree","ph":"X","pid":2452101,"tdur":307109,"tid":1,"ts":175746314292,"tts":39405682}}
{"method":"Tracing.dataCollected","params":{"args":{"beginData":{"dirtyObjects":120007,"frame":"F24A992E3DB781CA978E666125B65C79","partialLayout":false,"totalObjects":120071},"endData":{"root":[0,0,2384,0,2384,1430,0,1430],"rootNode":113}},"cat":"devtools.timeline","dur":434776,"name":"Layout","ph":"X","pid":2452101,"tdur":434778,"tid":1,"ts":175746624367,"tts":39715759}}
{"method":"Tracing.dataCollected","params":{"args":{"data":{"frame":"F24A992E3DB781CA978E666125B65C79"}},"cat":"devtools.timeline","dur":133791,"name":"UpdateLayerTree","ph":"X","pid":2452101,"tdur":133788,"tid":1,"ts":175747059168,"tts":40150560}}
{"method":"Tracing.dataCollected","params":{"args":{"data":{"clip":[0,0,2384,0,2384,1430,0,1430],"frame":"F24A992E3DB781CA978E666125B65C79","layerId":58,"nodeId":113}},"cat":"devtools.timeline,rail","dur":15,"name":"Paint","ph":"X","pid":2452101,"tdur":15,"tid":1,"ts":175747192985,"tts":40284373}}
{"method":"Tracing.dataCollected","params":{"args":{"data":{"clip":[0,0,2354,0,2354,5430,0,5430],"frame":"F24A992E3DB781CA978E666125B65C79","layerId":60,"nodeId":113}},"cat":"devtools.timeline,rail","dur":4740,"name":"Paint","ph":"X","pid":2452101,"tdur":4741,"tid":1,"ts":175747193009,"tts":40284397}}

total = 175747193009 + 4741 - 175737925316 = 9272.434 msecs
time to update = 175746314292 - 175737925316 = 8388.976 msecs

dur_GC = 6829 + 7144 + 7973 + 8028 + 7666 = 37.640 msecs

=> duration of reported GC doesn't explain the slow duration
=> the delay occurs before UpdateLayoutTree starts; updating layout and painting are similarly fast.
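The arithmetic above can be reproduced mechanically from the raw trace events. A sketch (field names follow the `Tracing.dataCollected` entries shown above; all trace values are in microseconds and are converted to milliseconds here):

```javascript
// total duration = (ts + dur of the last Paint) - ts of the click;
// dur_GC = sum of dur over all MinorGC events.
function analyzeTrace(events) {
  const click = events.find(
    (e) => e.params.name === "EventDispatch" && e.params.args.data?.type === "click"
  );
  const paints = events.filter((e) => e.params.name === "Paint");
  const lastPaint = paints[paints.length - 1];
  const gcMicros = events
    .filter((e) => e.params.name === "MinorGC")
    .reduce((sum, e) => sum + e.params.dur, 0);
  return {
    totalMs: (lastPaint.params.ts + lastPaint.params.dur - click.params.ts) / 1000,
    gcMs: gcMicros / 1000,
  };
}
```

(Note the hand calculations above use `tdur` for the final paint; this sketch uses `dur`, which differs by a microsecond or two and doesn't change the conclusion.)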
ryansolid commented 4 years ago

The forked results look clean. The tests taking longer to run is the difficult part.

The variance of removing warmup runs is really only going to affect 2 classes of libraries: the ones which don't do particularly well here, and the ones so tightly packed around the front that every run could mean jumping around 10 or so positions. Neither of these cases should matter much, because that is the reality of things. But issues like https://github.com/krausest/js-framework-benchmark/issues/728 are exactly the sort of mentality we find here. I know I am doing that mental math every time something updates, putting more value in the median than the mean results, etc. Many people would prefer this consistency over correctness, viewing the list as a sort of ranking. I think we should probably ignore those people (myself included) since that isn't how this works. It's just the fiction we like to spin.

I don't think removing some of the warmups saves us much execution time compared to what is lost by forking the chrome driver, but maybe it will produce the best results.

krausest commented 4 years ago

Now that it looks more like a Chrome issue than an issue with my benchmark driver, I updated the benchmark driver to allow batching (performing multiple runs in one chromedriver instance, just as it used to). Batching is used for all CPU benchmarks except those with large tables. The select row benchmark now runs without warmup iterations. The results for chrome 83 can be found here: https://krausest.github.io/js-framework-benchmark/2020/table_chrome_83_fixed.html
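A sketch of that batching rule (the benchmark ids and `type` field here are hypothetical placeholders, not necessarily the repo's real identifiers):

```javascript
// Large-table CPU benchmarks fork a fresh chromedriver per run; all other
// CPU benchmarks batch multiple runs in one chromedriver instance. The
// select rows benchmark additionally runs with zero warmup iterations.
const FORKED_BENCHMARKS = new Set(["07_create10k", "08_append1k-to-10k"]);

function useBatching(benchmark) {
  return benchmark.type === "cpu" && !FORKED_BENCHMARKS.has(benchmark.id);
}

function warmupCount(benchmark) {
  return benchmark.id === "04_select1k" ? 0 : 6; // the default was 6 warmups
}
```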

Any comments on those results? (Besides dojo not working)

leeoniya commented 4 years ago

Any comments on those results? (Besides dojo not working)

this is nothing new, though while we're on the topic of warmups and other bench-specific artificial stuff...

i still think it's odd to have a 16ms (1 frame) threshold for green-ness while also having some benchmarks running with 16x CPU slowdown (even my 2015 phone runs at ~4x slowdown). i feel like we need to better differentiate the sea of green without resorting to unrealistic cpu slowdowns to eke out the differences. i don't really have a solution, but figured i'd complain about it again :D

i've suggested before to simply subtract the fastest global numbers from all the metrics and do something with that diff rather than relying on a % over baseline...which i think forces the artificial slowdowns.
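A hedged sketch of what that alternative scoring could look like (nothing from the repo, just the arithmetic of the suggestion):

```javascript
// Report each framework's absolute overhead over the fastest result
// (diffMs) instead of only a ratio over the baseline — ratios are what
// motivate the artificial CPU slowdowns to spread out the field.
function scores(resultsMs) {
  const fastest = Math.min(...Object.values(resultsMs));
  const out = {};
  for (const [name, ms] of Object.entries(resultsMs)) {
    out[name] = { ratio: ms / fastest, diffMs: ms - fastest };
  }
  return out;
}
```

With `scores({ vanillajs: 100, react: 160 })`, react shows a 1.6x ratio but also a concrete 60 ms of overhead, which stays meaningful without any CPU throttling.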

krausest commented 4 years ago

Opened a chrome bug https://bugs.chromium.org/p/chromium/issues/detail?id=1087523 I'm actually afraid it could get as little attention as the other chrome bug report.

brainkim commented 4 years ago

I noticed this problem went away during manual testing, and looking through the linked chromium issue, it looks like they managed to find the root cause of the problem, and it’s fixed in both Chrome 83 and 85. I can’t reproduce the wild variance on 83.0.4103.106 (previously I was on 83.0.4103.97).

krausest commented 4 years ago

I can confirm that variance is back to normal values since chrome 83.0.4103.106.