luwes opened this issue 4 years ago
I'd also create a single shuffled set used by all libraries in the randomized benchmark to make it easier to compare the number of operations each does on the same data. (Or fix and reset the generator seed.)
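For illustration, a minimal sketch of the fixed-seed approach, assuming a mulberry32-style PRNG and a plain Fisher-Yates shuffle (the names and data shape here are made up, not the benchmark's actual code):

```js
// Sketch only: a tiny seeded PRNG (mulberry32) plus a Fisher-Yates shuffle.
// Resetting the seed before each library means every library receives the
// same "random" data, so operation counts are directly comparable.
function mulberry32(seed) {
  return function () {
    let t = (seed += 0x6D2B79F5);
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

function shuffle(items, rand) {
  const out = items.slice();
  for (let i = out.length - 1; i > 0; i--) {
    const j = Math.floor(rand() * (i + 1));
    [out[i], out[j]] = [out[j], out[i]];
  }
  return out;
}

// Same seed, same shuffled set for every library under test.
const SEED = 0xC0FFEE;
const rows = Array.from({ length: 1000 }, (_, i) => i); // placeholder dataset
const shuffledRows = shuffle(rows, mulberry32(SEED));
```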
you guys may find this part of domvm's tests useful:
https://github.com/domvm/domvm/blob/master/test/src/flat-list-keyed-fuzz.js
it fuzzes a bunch of lists with various amounts of adds, moves & deletes.
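Not domvm's actual code, but the idea is roughly this: take a keyed list, apply a random mix of deletes, moves, and adds, and hand the resulting "after" list to the library to reconcile.

```js
// Rough sketch of a keyed-list fuzzer: `list` is an array of numeric keys,
// `rand` is a 0..1 random function (e.g. the seeded one above).
function fuzzList(list, rand, { adds = 5, moves = 5, deletes = 5 } = {}) {
  const next = list.slice();
  let nextKey = list.length ? Math.max(...list) + 1 : 0;

  for (let i = 0; i < deletes && next.length > 0; i++) {
    next.splice(Math.floor(rand() * next.length), 1);
  }
  for (let i = 0; i < moves && next.length > 1; i++) {
    const from = Math.floor(rand() * next.length);
    const [moved] = next.splice(from, 1);
    next.splice(Math.floor(rand() * (next.length + 1)), 0, moved);
  }
  for (let i = 0; i < adds; i++) {
    next.splice(Math.floor(rand() * (next.length + 1)), 0, nextKey++);
  }
  return next;
}

// e.g. render(before), then render(fuzzList(before, rand)) and count DOM operations
```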
after running this benchmark on various machines with different CPUs, I've noticed that consistent results are highly unlikely here, due to the following scenarios:
Accordingly, we should change the way we measure each library as follows:
This means there's a lot of work to do to split tests, assertions, and warmup apart per test, but that's likely the best way to get meaningful results on both the synthetic benchmark and the live one.
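A rough sketch of what that split could look like (the `driver` interface and field names here are invented for illustration, not the benchmark's actual API):

```js
// Hypothetical shape only: each test owns its warmup and its assertion,
// instead of sharing a single warmup phase across the whole suite.
const tests = [
  {
    name: 'create 1,000 rows',
    warmup(driver) {
      // warm the JIT / caches with the same kind of work the test does
      for (let i = 0; i < 5; i++) { driver.createRows(1000); driver.clearRows(); }
    },
    run(driver) { driver.createRows(1000); },
    assert(driver) { return driver.rowCount() === 1000; },
  },
  {
    name: 'swap rows',
    warmup(driver) { driver.createRows(1000); },
    run(driver) { driver.swapRows(1, 998); },
    assert(driver) { return driver.rowCount() === 1000; },
  },
];
```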
At the moment results could be skewed quite a bit from run to run.
Ideally each test should run a minimum of 3 times, randomize library order at the very least, and take some kind of average.
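Something along these lines (a sketch, reusing the `shuffle` helper from above; `runTest` stands in for however a single timed run is actually performed):

```js
// Sketch: run every (library, test) pair `runs` times, shuffling library
// order on each pass, then report the mean per library/test.
async function runSuite(libraries, tests, runTest, runs = 3) {
  const samples = {}; // samples[lib][testName] = [ms, ms, ms]
  for (let pass = 0; pass < runs; pass++) {
    for (const lib of shuffle(libraries, Math.random)) {
      for (const test of tests) {
        const ms = await runTest(lib, test);
        ((samples[lib] ??= {})[test.name] ??= []).push(ms);
      }
    }
  }
  const averages = {};
  for (const [lib, byTest] of Object.entries(samples)) {
    averages[lib] = {};
    for (const [name, runsMs] of Object.entries(byTest)) {
      averages[lib][name] = runsMs.reduce((a, b) => a + b, 0) / runsMs.length;
    }
  }
  return averages;
}
```

A median (or a trimmed mean) would arguably be more robust than a plain average against one-off outliers, but the basic point is the same: more runs, randomized order, and a single aggregated number per library/test.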