krausest / js-framework-benchmark

A comparison of the performance of a few popular javascript frameworks
https://krausest.github.io/js-framework-benchmark/
Apache License 2.0

Number of warm-up runs should not be the same every time #1565

Open twharmon opened 10 months ago

twharmon commented 10 months ago

I found an irregularity in a handful of benchmarks where the duration was 10 times the typical duration, just because the assigned number of warm-up runs caused garbage collection to fall on the timed run (or something like that).

In my example, select row took ~6ms when the number of warm-up runs was 0, 10, or 20. But it took 60ms with 5 warm-up runs.

I was using TinyGo to compile to WASM, so it could be that specific allocator/runtime that leads to this issue.

The difference is easy to see in the screenshot below. goui-alt is the same app compiled with the standard Go compiler. TinyGo performs better than Go for the most part, but in a few cases the warm-up runs cause TinyGo's runtime to be slow in a consistent way.

[screenshot: results]

Just changing the number of warm-up runs to 8 completely changed the results:

[screenshot: 8warmups]

Suggestion

Instead of 5 warm-ups for everything, why don't we make it rand(5), rand(10), or nth-run % 5?
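The nth-run % 5 idea could be sketched as follows. This is a hypothetical helper (`warmupCountForRun` is not part of the benchmark runner), shown only to illustrate how a deterministic, reproducible warm-up schedule would avoid every iteration landing on the same allocator state:

```typescript
// Hypothetical: vary the warm-up count across benchmark iterations
// instead of using a fixed 5, so no framework is consistently
// (un)lucky with where GC lands relative to the timed run.
// Deterministic variant of "nth-run % 5": cycles 0..maxWarmups.
function warmupCountForRun(runIndex: number, maxWarmups = 5): number {
  return runIndex % (maxWarmups + 1);
}
```

A deterministic cycle (rather than `rand()`) keeps results reproducible between benchmark runs while still spreading iterations across different warm-up counts.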

krausest commented 10 months ago

Originally I wanted to include GC in the results (and thus punish high GC pressure), but it turned out that this conflicted with the goal of reducing variance. So some time ago I decided to run GC before each benchmark iteration: https://github.com/krausest/js-framework-benchmark/blob/master/webdriver-ts/src/forkedBenchmarkRunnerPuppeteer.ts#L127

So I think the problem you describe occurs only for WASM frameworks, since they manage memory on their own. I'm not sure what the best way to handle that is.
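To make the distinction concrete, here is a minimal sketch of forcing a full GC through the Chrome DevTools Protocol before a timed iteration, roughly in the spirit of the runner linked above (the `MinimalCDPSession` interface is an assumption narrowed to the one method used; real code would use a Puppeteer `CDPSession`):

```typescript
// Narrowed stand-in for a Puppeteer CDPSession (assumption for
// this sketch; real code gets one via page.createCDPSession()).
interface MinimalCDPSession {
  send(method: string): Promise<unknown>;
}

// "HeapProfiler.collectGarbage" triggers a V8 garbage collection.
// This frees JS objects and detached DOM nodes, but it cannot
// compact or collect a WASM module's linear-memory heap, which
// the module's own allocator/runtime manages.
async function forceFullGC(client: MinimalCDPSession): Promise<void> {
  await client.send("HeapProfiler.collectGarbage");
}
```

This is why the per-iteration GC evens things out for JS frameworks but leaves WASM frameworks exposed to whatever internal cleanup their own runtime schedules.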

I think I'll check over the next few days how big the impact is when GC is included again. Maybe then randomizing the warm-up count will make sense.

Erithax commented 4 months ago

Dioxus shows wild variation (7.1 to 13.8, +94%) in select row between the Chrome 124 and Chrome 126 benchmarks, which use the same Dioxus source. Perhaps that is due to this issue. Clear rows also varied by ca. +45%.

krausest commented 4 months ago

Chrome 124 was a really challenging release; it forced me to rework the full GC performed before running the benchmark. Nevertheless, it was necessary for producing results with low variance.

I never gave much thought to what the impact of that GC is for WASM implementations. I guess it might help with DOM nodes, but of course not with the WASM heap. All in all it should help JS implementations far more than WASM implementations (at least those that must perform a memory cleanup during the benchmark, which isn't triggered by a Chrome GC). Does anyone know whether such an expensive cleanup of the WASM heap (compaction / "internal" GC) happens in any of the implementations?