tinylibs / tinybench

🔎 A simple, tiny and lightweight benchmarking library!

The first run is significantly faster even if .warmup() is used #46

Closed · yy0931 closed this issue 11 months ago

yy0931 commented 1 year ago

I've been comparing the performance of two versions of the same function, and I've noticed that the first one is always significantly faster, possibly due to JIT.

bench.mjs

import { Bench } from "tinybench"

(async () => {
    const bench = new Bench()

    for (const [k, v] of Object.entries({
        "a": (i) => i, // The simpler the compared function, the greater the difference in ops/sec.
        "b": (i) => i,
        "c": (i) => i, // Comparing three functions to illustrate that only the first one is fast.
    })) {
        bench.add(k, () => {
            for (let i = 0; i < 1000; i++) {
                v(i)
            }
        })
    }

    await bench.run()
    console.table(bench.table())
})()
$ node bench.mjs
┌─────────┬───────────┬───────────┬────────────────────┬──────────┬─────────┐
│ (index) │ Task Name │  ops/sec  │ Average Time (ns)  │  Margin  │ Samples │
├─────────┼───────────┼───────────┼────────────────────┼──────────┼─────────┤
│    0    │    'a'    │ '847,777' │ 1179.5548020515819 │ '±1.01%' │ 423889  │
│    1    │    'b'    │ '76,335'  │ 13100.02999778449  │ '±0.41%' │  38168  │
│    2    │    'c'    │ '80,586'  │ 12409.009305943227 │ '±0.32%' │  40294  │
└─────────┴───────────┴───────────┴────────────────────┴──────────┴─────────┘

I've also tried increasing time and using .warmup(), but it didn't help much.

import { Bench } from "tinybench"

(async () => {
    const bench = new Bench({ time: 5000, warmupTime: 5000, warmupIterations: 100 })

    for (const [k, v] of Object.entries({
        "a": (i) => i,
        "b": (i) => i,
        "c": (i) => i,
    })) {
        bench.add(k, () => {
            for (let i = 0; i < 1000; i++) {
                v(i)
            }
        })
    }

    await bench.warmup()
    await bench.run()
    console.table(bench.table())
})()
$ node bench.mjs
┌─────────┬───────────┬─────────────┬────────────────────┬──────────┬─────────┐
│ (index) │ Task Name │   ops/sec   │ Average Time (ns)  │  Margin  │ Samples │
├─────────┼───────────┼─────────────┼────────────────────┼──────────┼─────────┤
│    0    │    'a'    │ '1,098,539' │ 910.2996506649747  │ '±0.16%' │ 5492697 │
│    1    │    'b'    │  '96,670'   │ 10344.420296798786 │ '±0.33%' │ 483353  │
│    2    │    'c'    │  '94,520'   │ 10579.704557811108 │ '±0.32%' │ 472603  │
└─────────┴───────────┴─────────────┴────────────────────┴──────────┴─────────┘

Am I missing something? Or could there be an option to run benchmarks in the sequence of a, b, c, a, b, c, ...?


Versions:
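
For what it's worth, a hand-rolled sketch of the interleaved a, b, c, a, b, c, ... ordering asked about above; the round-robin scheduling and the bookkeeping here are purely illustrative and not an existing tinybench option:

```js
// Round-robin over the tasks so no single task gets the "first run" JIT boost.
const tasks = [["a", (i) => i], ["b", (i) => i], ["c", (i) => i]];
const totals = new Map(tasks.map(([name]) => [name, { ms: 0, batches: 0 }]));

for (let round = 0; round < 100_000; round++) {
  for (const [name, fn] of tasks) {
    const start = performance.now();
    for (let i = 0; i < 1000; i++) fn(i);
    const t = totals.get(name);
    t.ms += performance.now() - start;
    t.batches += 1; // one batch = 1000 calls, matching the tasks above
  }
}

for (const [name, { ms, batches }] of totals) {
  console.log(name, `${Math.round((batches / ms) * 1000)} ops/sec`);
}
```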

Aslemammad commented 1 year ago

Have you tried adding a noop function, bench.add('noop', () => {}), as the first task? Let me know the results.

yy0931 commented 1 year ago

Adding the noop task resolves the issue only when .warmup() is also used. So, is it an issue with .warmup()?

import { Bench } from "tinybench"

(async () => {
    const bench = new Bench()

    bench.add("noop", () => { })
    for (const [k, v] of Object.entries({
        "a": (i) => i,
        "b": (i) => i,
        "c": (i) => i,
    })) {
        bench.add(k, () => {
            for (let i = 0; i < 1000; i++) {
                v(i)
            }
        })
    }

    await bench.warmup()
    await bench.run()
    console.table(bench.table())
})()
$ node bench.mjs  # with noop
┌─────────┬───────────┬─────────────┬────────────────────┬──────────┬─────────┐
│ (index) │ Task Name │   ops/sec   │ Average Time (ns)  │  Margin  │ Samples │
├─────────┼───────────┼─────────────┼────────────────────┼──────────┼─────────┤
│    0    │  'noop'   │ '3,936,461' │ 254.03525639294233 │ '±1.55%' │ 1968231 │
│    1    │    'a'    │  '106,091'  │ 9425.814974822304  │ '±0.13%' │  53046  │
│    2    │    'b'    │  '106,435'  │ 9395.372655685504  │ '±0.11%' │  53218  │
│    3    │    'c'    │  '108,210'  │ 9241.250181165182  │ '±0.09%' │  54106  │
└─────────┴───────────┴─────────────┴────────────────────┴──────────┴─────────┘
$ node bench.mjs  # without noop
┌─────────┬───────────┬───────────┬────────────────────┬──────────┬─────────┐
│ (index) │ Task Name │  ops/sec  │ Average Time (ns)  │  Margin  │ Samples │
├─────────┼───────────┼───────────┼────────────────────┼──────────┼─────────┤
│    0    │    'a'    │ '469,971' │ 2127.7896772307904 │ '±0.62%' │ 234986  │
│    1    │    'b'    │ '88,462'  │ 11304.176861550312 │ '±0.17%' │  44232  │
│    2    │    'c'    │ '107,052' │ 9341.195247464535  │ '±0.12%' │  53527  │
└─────────┴───────────┴───────────┴────────────────────┴──────────┴─────────┘
$ node bench.mjs  # with noop but without warmup
┌─────────┬───────────┬─────────────┬───────────────────┬──────────┬─────────┐
│ (index) │ Task Name │   ops/sec   │ Average Time (ns) │  Margin  │ Samples │
├─────────┼───────────┼─────────────┼───────────────────┼──────────┼─────────┤
│    0    │  'noop'   │ '4,277,324' │ 233.7910108129855 │ '±0.40%' │ 2138663 │
│    1    │    'a'    │ '1,156,223' │ 864.8847551251565 │ '±0.25%' │ 578112  │
│    2    │    'b'    │  '107,419'  │ 9309.331318347942 │ '±0.29%' │  53710  │
│    3    │    'c'    │  '107,332'  │ 9316.871354411698 │ '±0.34%' │  53667  │
└─────────┴───────────┴─────────────┴───────────────────┴──────────┴─────────┘
Aslemammad commented 1 year ago

This is a trait many benchmarking tools share because it's related to the engine and JIT rather than to the library, so I don't currently have a solution other than the warmup + noop combination. I need to look into it; meanwhile, let me know if you find anything. Thank you for the issue.

weedz commented 1 year ago

This is interesting. Part of it seems to be something with for (const [k, v] of Object.entries()). When I run with

for (const [k, v] of Object.entries({
    "a": (i) => i,
    "b": (i) => i,
    "c": (i) => i,
})) {
    bench.add(k, () => {
        for (let i = 0; i < 1000; i++) {
            v(i)
        }
    })
}

I get the following result:

┌─────────┬───────────┬─────────────┬────────────────────┬──────────┬─────────┐
│ (index) │ Task Name │   ops/sec   │ Average Time (ns)  │  Margin  │ Samples │
├─────────┼───────────┼─────────────┼────────────────────┼──────────┼─────────┤
│    0    │    'a'    │ '3,371,257' │ 296.6252305640348  │ '±0.36%' │ 1685629 │
│    1    │    'b'    │  '282,824'  │ 3535.7620041612818 │ '±0.10%' │ 141413  │
│    2    │    'c'    │  '276,717'  │ 3613.790052803212  │ '±0.06%' │ 138359  │
└─────────┴───────────┴─────────────┴────────────────────┴──────────┴─────────┘

which corresponds with your results. Somewhat surprisingly, we get a similar result, although not as 'extreme', with the following:

function a(i: number) {
    return i + 1;
}

for (const [k, v] of Object.entries({
    a: a,
    b: a,
    c: a,
})) {
    bench.add(k, () => {
        for (let i = 0; i < 1000; i++) {
            v(i)
        }
    })
}

and the results:

┌─────────┬───────────┬─────────────┬────────────────────┬──────────┬─────────┐
│ (index) │ Task Name │   ops/sec   │ Average Time (ns)  │  Margin  │ Samples │
├─────────┼───────────┼─────────────┼────────────────────┼──────────┼─────────┤
│    0    │    'a'    │ '3,435,444' │ 291.0831710293736  │ '±0.31%' │ 1717723 │
│    1    │    'b'    │ '1,996,490' │ 500.8789203915326  │ '±0.27%' │ 998246  │
│    2    │    'c'    │ '1,995,535' │ 501.11870726961496 │ '±0.04%' │ 997768  │
└─────────┴───────────┴─────────────┴────────────────────┴──────────┴─────────┘

(I would assume a(i) could be optimized in the same way for every entry, so it does not make sense that the first entry is the fastest.)

This is consistent: if you change the order, the first task is always way faster.

But if we instead do the following:

bench.add("a", () => {
    for (let i = 0; i < 1000; i++) {
        a(i);
    }
});
bench.add("b", () => {
    for (let i = 0; i < 1000; i++) {
        a(i);
    }
});
bench.add("c", () => {
    for (let i = 0; i < 1000; i++) {
        a(i);
    }
});

the results change:

┌─────────┬───────────┬─────────────┬───────────────────┬──────────┬─────────┐
│ (index) │ Task Name │   ops/sec   │ Average Time (ns) │  Margin  │ Samples │
├─────────┼───────────┼─────────────┼───────────────────┼──────────┼─────────┤
│    0    │    'a'    │ '1,962,552' │ 509.5404296036191 │ '±0.31%' │ 981277  │
│    1    │    'b'    │ '3,395,278' │ 294.5266832733146 │ '±0.08%' │ 1697640 │
│    2    │    'c'    │ '3,406,314' │ 293.5724529795157 │ '±0.17%' │ 1703158 │
└─────────┴───────────┴─────────────┴───────────────────┴──────────┴─────────┘

And when we add a "noop" task (bench.add("noop", ()=>{})) the results are the following:

┌─────────┬───────────┬──────────────┬───────────────────┬──────────┬─────────┐
│ (index) │ Task Name │   ops/sec    │ Average Time (ns) │  Margin  │ Samples │
├─────────┼───────────┼──────────────┼───────────────────┼──────────┼─────────┤
│    0    │  'noop'   │ '15,179,119' │ 65.87997525602914 │ '±0.34%' │ 7589560 │
│    1    │    'a'    │ '3,373,602'  │  296.41907343958  │ '±0.14%' │ 1686802 │
│    2    │    'b'    │ '3,377,632'  │ 296.0653358008033 │ '±0.51%' │ 1688818 │
│    3    │    'c'    │ '3,409,823'  │ 293.2702997713696 │ '±0.19%' │ 1704912 │
└─────────┴───────────┴──────────────┴───────────────────┴──────────┴─────────┘

This seems to be consistent (even without await bench.warmup()).

So it might be enough to just run a "noop" operation for one second (or 'a while') in every call to bench.run(), to allow the engine to "JIT" the code?

(all tests were run on Linux 6.3.9 and Node v20.3.1)
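
To illustrate what that could look like today from user code, here is a minimal sketch of the noop + warmup workaround with the sacrificial task stripped from the output; the sentinel name and the filtering are made up, and it assumes table() rows expose a 'Task Name' field, as the console.table output above suggests:

```js
import { Bench } from "tinybench";

const bench = new Bench();
const NOOP = "__noop__"; // hypothetical sentinel name

// Sacrificial first task: it absorbs the "first task is fastest" effect.
bench.add(NOOP, () => {});
bench.add("a", () => { /* real work */ });
bench.add("b", () => { /* real work */ });

await bench.warmup();
await bench.run();

// Report everything except the sacrificial task.
console.table(bench.table().filter((row) => row?.["Task Name"] !== NOOP));
```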

Aslemammad commented 1 year ago

That could be a good solution to inject into the library and then strip from the results!

jcbhmr commented 1 year ago

I encountered this too. I "solved" my problem by doing a bunch of warmup runs before running the actual runs. This shim seemed to work rather well for sync-only stuff:

// warmup: call f repeatedly for ~1 second (1e9 ns) without recording anything
for (x = now(), y = x + 1000000000n; x < y; x = now()) {
  f();
}
// measured phase: another ~1 second, timing each call individually
for (x = now(), y = x + 1000000000n; x < y; x = now()) {
  start = now();
  f();
  end = now();
  total += end - start;
  count++;
}
```js
const bench = {
  tasks: [],
  results: {},
  add(n, f) {
    this.tasks.push([n, f]);
  },
  run() {
    let start;
    let end;
    const now = process.hrtime.bigint;
    let n;
    let f;
    let x;
    let y;
    let count = 0n;
    let total = 0n;
    for (let i = 0; i < this.tasks.length; i++) {
      n = this.tasks[i][0];
      f = this.tasks[i][1];
      for (x = now(), y = x + 1000000000n; x < y; x = now()) {
        f();
      }
      for (x = now(), y = x + 1000000000n; x < y; x = now()) {
        start = now();
        f();
        end = now();
        total += end - start;
        count++;
      }
      this.results[n] = total / count;
    }
  },
  table() {
    return this.results;
  },
};

function a(i) {
  return i + 1;
}

for (const [k, v] of Object.entries({
  a: a,
  b: a,
  c: a,
})) {
  bench.add(k, () => {
    for (let i = 0; i < 1000; i++) {
      v(i);
    }
  });
}

await bench.run();
console.table(bench.table());
```

(attached screenshot omitted)

Aslemammad commented 1 year ago

I think we should somehow add a noop warmup to benchmarking, so users won't have to do it themselves.

jcbhmr commented 1 year ago

I think this might be fixable by reworking the split between warmup() and run() and doing something like this:

// default to TRUE
await bench.run()
await bench.run({ warmup: true })
await bench.run({ warmup: false })

That way you let the JIT optimize per run instead of all upfront, where the optimizations then get destroyed once you're done with the first task. Basically this:

// WARMUP PART
for (x = now(), y = x + 1000000000n; x < y; x = now()) {
  f();
}

// ACTUAL BENCHMARK
for (x = now(), y = x + 1000000000n; x < y; x = now()) {
  start = now();
  f();
  end = now();
}

👆 at least that kind of strategy worked for me 🤷‍♂️ might be worth a try

Aslemammad commented 1 year ago

@weedz @jcbhmr @yy0931 Hey folks, please try #50 and let me know.

yy0931 commented 1 year ago

> @weedz @jcbhmr @yy0931 Hey folks, please try #50 and let me know.

I've tested it on Node v16-20 and it's working for me.

weedz commented 1 year ago

> @weedz @jcbhmr @yy0931 Hey folks, please try #50 and let me know.

I can also confirm it seems to be working fine now. Much more consistent results :+1: (only tested on Node 20.5)

etki commented 3 weeks ago

Hello, a Java passerby here.

First things first: if you deoptimize the code, there's little sense in benchmarking it. While I don't have full insight into Node, at this point you're either benchmarking a low-quality JIT pass or, even worse, an interpreter. Benchmarking tools are supposed to gather data about real-life scenarios, and in those we expect functions to be optimized. There is no guarantee that if F1 is faster than F2 when deoptimized, the relation is preserved under optimization. Furthermore, benchmarking exists not to produce fancy numbers but to find root causes and explanations (and, of course, there is an explanation for this behavior). I also found it a bit surprising that the issue was ignored rather than researched: benchmarking tools have to be very certain about the results they provide, and pursuing merely stable output doesn't guarantee it's correct, so whenever something like this appears in the background, it has to be researched; otherwise it's possible that every result obtained by the tool is wrong.

Now to the gory explanatory stuff. It turned out to be pretty simple; I've observed the same behavior in toy scripts.

The first clue is the fact that only the first benchmark exhibits strange behavior, no matter how you shuffle them. Moreover, it is not some kind of outlier: no matter how long you run it, the numbers and their relations stay the same. The problem here is systematic and infrastructure-level, and the first question is whether it's the framework or something deeper. Since, as I mentioned, the behavior reproduces in toy scripts, we can be certain it's rooted in Node itself. To continue poking at the problem, note that while the functions are identical, they are not the same object. What if all three benchmarks used the same function? This result

┌─────────┬───────────┬─────────────┬───────────────────┬──────────┬─────────┐
│ (index) │ Task Name │   ops/sec   │ Average Time (ns) │  Margin  │ Samples │
├─────────┼───────────┼─────────────┼───────────────────┼──────────┼─────────┤
│    0    │    'a'    │ '2,971,961' │ 336.4780581840262 │ '±2.02%' │ 1485981 │
│    1    │    'b'    │  '201,559'  │ 4961.308762490288 │ '±0.66%' │ 100780  │
│    2    │    'c'    │  '180,084'  │ 5552.938490784653 │ '±0.21%' │  90043  │
└─────────┴───────────┴─────────────┴───────────────────┴──────────┴─────────┘

transforms into

┌─────────┬───────────┬─────────────┬───────────────────┬──────────┬─────────┐
│ (index) │ Task Name │   ops/sec   │ Average Time (ns) │  Margin  │ Samples │
├─────────┼───────────┼─────────────┼───────────────────┼──────────┼─────────┤
│    0    │    'a'    │ '3,036,416' │ 329.3356284034027 │ '±0.38%' │ 1518209 │
│    1    │    'b'    │ '1,760,917' │ 567.8857256123154 │ '±0.24%' │ 880459  │
│    2    │    'c'    │ '1,729,899' │ 578.0681242101545 │ '±0.37%' │ 864950  │
└─────────┴───────────┴─────────────┴───────────────────┴──────────┴─────────┘

So we're on the right track here: there is an optimization that relies on identical input being used. Are there any additional differences between the invocations of the first and the second benchmark? There is the name, k in the example above. I struggled to come up with a further example demonstrating that eliminating it makes the issue fully go away, but thankfully at this point I simply know what's happening, so I can stop the investigation here and proceed with something that actually converts guesses into ground truth. I restored the original benchmark, converted the arrow functions into named ones, and renamed Task.run to Task.runTask for filtering:

node --print-bytecode --print-bytecode-filter=run --print-opt-code --print-opt-code-filter=runTask src/reproduction.js  | tee dist/dissect

This will produce an unpleasant few thousand lines of output, but the answer is there. Ctrl+F for 'runTask' ->

--- Optimized code ---
optimization_id = 12
source_position = 3260
kind = TURBOFAN
name = runTask
...
optimization_id = 16
name = runTask
...
optimization_id = 17
name = runTask

Task.run itself is being recompiled. If there is only one benchmark, there is a single compilation; if there are two or more, there are three compilations. I don't have time to spend on finding out why the third one appears (if you check it, there's a fun fact), but let's look at the first two:

Inlined functions (count = 5)
 0x084663c1f3b9 <SharedFunctionInfo g>
 0x084663c1c539 <SharedFunctionInfo>
 0x084663c1c3f9 <SharedFunctionInfo a>
 0x11773271baf1 <SharedFunctionInfo now>
 0x24c7eecd08e1 <SharedFunctionInfo hrtime>
...
Inlined functions (count = 4)
 0x084663c1f3b9 <SharedFunctionInfo g>
 0x084663c1c539 <SharedFunctionInfo>
 0x11773271baf1 <SharedFunctionInfo now>
 0x24c7eecd08e1 <SharedFunctionInfo hrtime>

The benchmark function isn't inlined anymore because, well, this run doesn't include it, and TurboFan has to recompile everything it had previously assembled into a giant blob, keeping in mind that everything except a stays the same. This is not just the benchmark code; it is a fusion of the benchmark code and the framework: the benchmark function leaks outside. Let's do one last thing to keep the fun rolling and push things to the extreme:

(async () => {
  const bench = new Bench();

  bench.add('a', (i?: number) => i);

  await bench.warmup();
  await bench.run();
  console.table(bench.table());
})();

results:

┌─────────┬───────────┬──────────────┬───────────────────┬──────────┬─────────┐
│ (index) │ Task Name │   ops/sec    │ Average Time (ns) │  Margin  │ Samples │
├─────────┼───────────┼──────────────┼───────────────────┼──────────┼─────────┤
│    0    │    'a'    │ '16,953,858' │ 58.98362307122657 │ '±1.01%' │ 8476930 │
└─────────┴───────────┴──────────────┴───────────────────┴──────────┴─────────┘

We don't get much suspense here yet. However, as we now know, the function is inlined. While anything can be expected from an interpreted language (and from others too, just not as often), including excessive machine code in some specific cases, 60 ns translates to 240 cycles at 4 GHz, which is roughly what my machine should be running at. For an inlined function that does absolutely nothing, that is far too much; it should take literally zero instructions.
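
As a quick sanity check on that arithmetic (the 4 GHz clock speed is an assumption):

```js
// 4 GHz = 4 cycles per nanosecond (assumed clock speed).
const cyclesPerNs = 4;
const nsPerCall = 58.98; // "Average Time (ns)" from the table above
console.log(nsPerCall * cyclesPerNs); // ≈ 236 cycles for a call that does nothing
```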

Let's fiddle with Task.run once more:

        } else {
          // Batch 131072 calls per pair of timer reads so the timer's own
          // cost is amortized instead of being charged to each call.
          const taskStart = this.bench.now();
          for (let i = 0; i < 131072; i++) {
            this.fn();
          }
          taskTime = this.bench.now() - taskStart;
        }

        samples.push(taskTime / 131072);
        this.runs += 131072;
        totalTime += taskTime;

And then we get

┌─────────┬───────────┬─────────────────┬────────────────────┬───────────┬─────────┐
│ (index) │ Task Name │     ops/sec     │ Average Time (ns)  │  Margin   │ Samples │
├─────────┼───────────┼─────────────────┼────────────────────┼───────────┼─────────┤
│    0    │    'a'    │ '3,378,147,010' │ 0.2960202729157641 │ '±23.18%' │  12887  │
└─────────┴───────────┴─────────────────┴────────────────────┴───────────┴─────────┘

That's one CPU cycle per call, which is likely the loop increment (it's not quite that simple, though: a CPU may execute more than one instruction per cycle). Technically, the JIT should have dropped the loop altogether, because this code does virtually nothing, but I guess we're not there yet. Why am I bringing this up? If this had been the result from the very beginning, guessing the inlining issue would have been trivial.

Of course, this deserves a bit of explanation itself. The timer calls are not free; they are relatively expensive (55 ns, as we see), so this benchmark actually measures the timer rather than the passed code. What's happening is that the timer function is preamble ... actual time snapshot ... aftertaste, and when there are two calls with a benchmark in between, the difference of the snapshots is aftertaste + benchmark + preamble rather than just the benchmark.
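
A minimal sketch of how one might estimate that per-read cost directly, assuming process.hrtime.bigint as the clock (as in the shim earlier in this thread):

```js
// Time N back-to-back clock reads and average them; samples close to this
// value mostly measure the timer itself, not the benchmarked code.
const now = process.hrtime.bigint;
const N = 1_000_000;

const start = now();
for (let i = 0; i < N; i++) {
  now();
}
const nsPerRead = Number(now() - start) / N;

console.log(`~${nsPerRead.toFixed(1)} ns per hrtime read`);
```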

Dear god. How can this be fixed!?

In Java, we give the compiler some instructions about what can and can't be inlined. We're a bit blessed in that regard; Node doesn't seem to have function-specific knobs, only global ones. However, there is an escape hatch (and we use it too, for the same reasons): run each benchmark in a separate process (and those processes must not run in parallel). In that case, all optimizations are restricted to whatever code the user provides, and it's their responsibility to supply homogeneous or heterogeneous input that reflects their real workload. This is still far from perfect - the JIT may decide to inline one thing and not the other - but it at least gives a decent chance of comparing apples to apples.
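
For illustration, a minimal sketch of that per-process approach, assuming one hypothetical file per benchmark (bench-a.mjs and so on); this is not a tinybench feature:

```js
// Run each benchmark file in its own Node process, one after another, so JIT
// state (inlining decisions, type feedback) cannot leak between tasks.
import { execFileSync } from "node:child_process";

const files = ["./bench-a.mjs", "./bench-b.mjs", "./bench-c.mjs"]; // hypothetical

for (const file of files) {
  // execFileSync blocks until the child exits, so runs never overlap.
  const output = execFileSync(process.execPath, [file], { encoding: "utf8" });
  process.stdout.write(output);
}
```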