krausest / js-framework-benchmark

A comparison of the performance of a few popular javascript frameworks
https://krausest.github.io/js-framework-benchmark/
Apache License 2.0

Suggestion: repeat short, high-variability tests more than long, low-variability tests to improve estimates #424

Closed · adamhaile closed this issue 5 years ago

adamhaile commented 5 years ago

I'd been meaning to post this idea for a while, and since I think you're prepping a new round I figured now was as good a time as any :).

In the results, there's a relationship between test duration and variance (measured as a percentage). Long tests tend to have low variance (2-5%) while short tests tend to have high variance (10-30%); the poster child for this is swap rows. This means that the short tests contribute a disproportionate amount of variance to the overall geometric mean.

The fix would be to run the short tests for more repeats. Since they're short, this doesn't greatly extend the overall test duration. In fact, some of the very long running, low-variance tests (create many rows, append rows) could use fewer repeats, resulting in an overall run that is both faster and more accurate.
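A rough sketch of what this could look like, assuming per-test estimates of the mean duration and standard deviation are available (the function, target SE and time budget below are invented for illustration, not taken from the benchmark driver):

```js
// Choose a repeat count per test from its observed spread and
// per-iteration duration (illustrative numbers only).
function repeatsFor({ meanMs, sdMs }, targetSeMs = 0.5, budgetMs = 60000, minRuns = 10) {
  // SE = SD / sqrt(n)  =>  n = (SD / targetSE)^2
  const nForPrecision = Math.ceil((sdMs / targetSeMs) ** 2);
  const nForBudget = Math.floor(budgetMs / meanMs);
  return Math.max(minRuns, Math.min(nForPrecision, nForBudget));
}

// A short, noisy test gets many repeats but stays cheap overall...
repeatsFor({ meanMs: 30, sdMs: 6 });    // 144 runs, ~4.3 s of measurement
// ...while a long, stable test is capped by the time budget.
repeatsFor({ meanMs: 2500, sdMs: 60 }); // 24 runs, ~60 s of measurement
```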

leeoniya commented 5 years ago

i've kinda been trying to get this point across as well. this would also help with the clamping situation (#335).

i want to experiment with just doing a straight sum of 20 iterations (no warmup) and getting rid of all 10k tests (they're there mostly to eke out meaningful differences in "cheap" tests like update mod 10, which could easily be made into mod 5 or mod 2).

i think there's a lot of room for improvement in decreasing test runtime and also in not having to deal with the statistics needed for low sample counts (10 runs) and often high variance.

krausest commented 5 years ago

Sounds good. After finishing renaming the directories I'll consider it.

ryansolid commented 5 years ago

The one thing I like about the 10k rows is the partial updates test. The idea isn't that we are updating every row or every other row; it's that libraries which depend on dirty-checking-like methods to determine whether something has changed pay a higher cost by having to iterate over 10k rows every cycle. It's the closest this benchmark comes to showing what happens when partial changes happen over time. Sure, in the real-world case we mostly have only a few hundred rows, but a scenario that updates a row or a few at a time, which over the course of half a second would result in multiple traversals over the data set, is closer to the sparse 10k test. That being said, most libraries handle it well; where they don't, I think it is a pretty significant fault, and one of the areas that is already generally underrepresented in benchmarks of this nature. It's definitely something I consider a key performance indicator when looking at frameworks.

krausest commented 5 years ago

As promised, I took a look at it. Here are 10, 15, 20 and 30 runs for create, update, select and swap for vanillajs and surplus.

[Screenshots: results for 10, 15, 20 and 30 runs]

I believe I see a rather clear trend towards lower variance for create rows, and maybe a slight tendency for update rows, but it doesn't look like increasing the iteration count reduces variance for the really short-running benchmarks. Your take?

adamhaile commented 5 years ago

I may be wrong here, but I thought the +/- numbers were the standard deviation of the samples, not the standard error of the estimate? So we wouldn't expect the SD to fall with more iterations, but we would expect the standard error to improve.

krausest commented 5 years ago

It is the standard deviation. I suppose I have to learn more about this: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1255808/ Which would be more interesting to you: SD or SE?

leeoniya commented 5 years ago

i'd say SE is actually more intuitive, as it's a proxy for how confident you are that the given number is the true mean. giving an SD just tells someone that 68% of the samples fall within the +/- figure, which is an indication of spread but not what a layman would expect it to represent (an overwhelming majority of samples falling within this range, which 68% is not).
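For reference, a minimal sketch of the SD vs. SE distinction being discussed (not code from the benchmark itself):

```js
function stats(samplesMs) {
  const n = samplesMs.length;
  const mean = samplesMs.reduce((a, b) => a + b, 0) / n;
  // sample standard deviation: the spread of the individual runs,
  // roughly constant no matter how many runs are taken
  const sd = Math.sqrt(samplesMs.reduce((a, x) => a + (x - mean) ** 2, 0) / (n - 1));
  // standard error of the mean: how uncertain the reported mean is,
  // shrinks with the square root of the run count
  const se = sd / Math.sqrt(n);
  return { mean, sd, se };
}
```

That square-root relationship is also why adding more iterations tightens the estimate without reducing the visible spread of the samples.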

adamhaile commented 5 years ago

Personally, I'd prefer SD on the actual times and add SE to the scaled times, like "1.06 +/- 0.035". SD on the actual measurements to characterize how the tests went, but SE on the scaled time to show the variation in the relative performance estimate.

leeoniya commented 5 years ago

my background is in engineering, where +/- indicates tolerance. often it is used to represent measurement error / confidence. it would be unusual but still logical if it represented the range where the vast majority of the measurements fell. but having it represent 1 SD is neither here nor there; it's some odd middle 68%. if it represented 2 SD (95%) that would be much better. FWIW, i don't think i've ever encountered a case where 1 SD was the implied unit of measure after a +/-. i found it very surprising.

https://en.wikipedia.org/wiki/Plus-minus_sign#In_statistics

adamhaile commented 5 years ago

I seem to see +/- representing 1 SD as more common in scientific literature. Anyway, I'm good with just using SE at a 95% CI. I would like to see it on the scaled values too, as it gives a clearer picture of how much noise each contributes to the geometric mean. For that matter, it would be cool to add a derived SE to the geometric mean as well.
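One possible way to derive such a figure (a sketch of the idea only, not the benchmark's implementation) is to propagate each benchmark's standard error through the log transform that underlies the geometric mean:

```js
// Each entry: a framework's slowdown factor for one benchmark plus the SE of
// that factor, e.g. { factor: 1.06, se: 0.035 }.
function geoMeanWithSE(results) {
  const k = results.length;
  const logMean = results.reduce((a, r) => a + Math.log(r.factor), 0) / k;
  // delta method: Var(log x) ~ (se / x)^2; independent errors add in quadrature
  const logVar = results.reduce((a, r) => a + (r.se / r.factor) ** 2, 0) / (k * k);
  const geoMean = Math.exp(logMean);
  // approximate SE of the geometric mean itself
  return { geoMean, se: geoMean * Math.sqrt(logVar) };
}
```

Seen this way, a noisy short benchmark with a large SE visibly widens the uncertainty of the overall score, which is exactly the effect under discussion.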

ryansolid commented 5 years ago

I was thinking again today about partial updates and how the test almost isn't sparse enough to demonstrate what it should. Maybe if it was 1 in 50 it would better contrast the approaches that use subscriptions with those that reiterate over every item each time. But of course at that point it approaches the single-frame issue.
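For context, the sparseness knob being discussed is roughly the following (an illustration assuming the usual row objects with a label field, not the spec implementation):

```js
// Partial update: touch every Nth row. The benchmark uses N = 10;
// N = 50 would make the update sparser relative to the traversal cost.
function partialUpdate(rows, every = 10) {
  for (let i = 0; i < rows.length; i += every) {
    rows[i].label += ' !!!';
  }
}
```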

So I had a weird thought: has the idea of nested arrays ever come up? Instead of 10,000 rows, have 10 tables of 1,000 rows (or some other similar split). I know swapping more rows at a time has come up before, and we didn't want to drastically change the solutions for swap/selection etc., but it might be easier to view each table as a sub data set (a component), each with its own data and selection state but controlled by the same buttons. The Remove and Select Row inputs are a bit awkward in that a human couldn't trigger them simultaneously across multiple tables, but I imagine they could be driven programmatically.

This could make those fast tests do 10x (or more) as much work to slow them down, and offer interesting ways to play with sparseness and nested concerns. I think the fact that no scenario in the benchmark would have someone write something like React's shouldComponentUpdate cuts out a whole slew of variation in how change propagation is managed; exposing that could be as illuminating as changing swapRows from 5-10 to 2-999 was. I'm not sure this suggestion would actually result in that, but I'm just trying to think of other ways to do more with the shorter tests.

leeoniya commented 5 years ago

"could be as illuminating as changing swapRows from 5-10 to 2-999 was"

the reason this had such a drastic effect is that i found some libs were actually reordering/recreating all rows between the 2 swapped ones. it was just cheap enough for 7 rows that it got lost in the noise.

this situation (edit: for the update 10k bench) is different in that most (all?) libs do near-optimal dom ops already, which means the difference is purely in JS efficiency. and this is the root of the problem with both evaluating the perf as a sum of all costs and artificially stressing the dom to get that sum to diverge in a statistically meaningful way with low variance.

i think we'd be better off with a 2-stage bench. we know that much of the time, the DOM/layout/render is the bottleneck. this carries with it the assumption that if you can't do dom ops optimally, you cannot make up that difference in JS efficiency. so we can separate the libs (or more granularly, lib + metric) into those that perform optimal or near-optimal dom ops and those that don't.

for those that do optimal dom ops, we can measure JS/GC time only, and in something like ops/s (collected from a sum of a few dozen runs) rather than ms/op collected from a high-variance / low-count sample. this will reveal a much wider spread between all the libs that currently show green, and would likely translate to better perf and battery life on mobile devices.
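A sketch of the ops/s idea (an illustration only, not an existing harness): run the operation back to back for a fixed wall-clock window and report completed operations per second.

```js
async function opsPerSecond(runOnce, windowMs = 2000) {
  const start = performance.now();
  let ops = 0;
  while (performance.now() - start < windowMs) {
    await runOnce(); // e.g. perform the partial update, then restore state
    ops++;
  }
  return ops / ((performance.now() - start) / 1000);
}

// Summing many cheap iterations into one figure sidesteps the high-variance,
// low-count statistics of timing each run individually.
```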

the dom needs to be just big enough that inefficiencies in it cannot be made up for by better js. beyond that, bloating the dom size does not get us any closer to getting meaningful numbers (imo).

krausest commented 5 years ago

I've been experimenting a bit and here's an update.

  1. The result table now prints the margin of error by default (for a confidence interval of the mean at the 95% level).

  2. Increased the iteration count for select and swap rows.

Here's a picture of independent runs for several counts.

[Screenshot: select and swap rows results for several iteration counts]

I'm looking for your interpretation, but to me it seems that increasing the count does not help reduce the noise, though the margin of error shrinks (but I guess that's no surprise to you). For reference, here is a picture of the corresponding margin of error (half of the confidence interval length):

[Screenshot: corresponding margins of error]

So I stopped pursuing that.

  3. Using the devtools to emulate CPU slowdown.

It was hard to find information on how to achieve this, but once you find it the change is trivial. I've used a 4x CPU slowdown for swap rows and a 16x CPU slowdown for select (admittedly pretty high). I'm curious to get your feedback on whether it actually helps. To me it seems like the stability of the results is lower than for the other benchmarks, but overall OK (at least to me).
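For reference, a minimal sketch of driving that throttling through the Chrome DevTools Protocol, here via Puppeteer (the benchmark's own webdriver setup may differ, and the URL is a placeholder):

```js
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  const client = await page.target().createCDPSession();
  // Emulation.setCPUThrottlingRate backs the devtools "CPU throttling"
  // dropdown; rate: 4 means a 4x slowdown.
  await client.send('Emulation.setCPUThrottlingRate', { rate: 4 });
  await page.goto('http://localhost:8080/'); // placeholder benchmark URL
  // ... run the swap rows scenario and collect timings here ...
  await client.send('Emulation.setCPUThrottlingRate', { rate: 1 }); // reset
  await browser.close();
})();
```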

Seems like domc has a weak spot in select row 😄

[Screenshot: select row results with 16x CPU slowdown]

The significance for swap rows is a bit worse, but elm and attodom can pretty consistently be identified as significantly slower than vanillajs-keyed:

[Screenshot: swap rows results with 4x CPU slowdown]

Experimental results are here.

krausest commented 5 years ago

Here are boxplots for some frameworks. Swap rows:

[Screenshot: swap rows boxplots]

and for select rows:

[Screenshot: select rows boxplots]

hville commented 5 years ago

For info, attodom minimises the use of cached values and iterates through all nodes for all operations. The (untested) idea was that this would keep the memory footprint small and minimise GC, even if a little slower overall (at least on fast machines).

Is there an overall correlation between the result variance and the run memory outcomes?

localvoid commented 5 years ago

@hville keep in mind that in this benchmark almost all memory allocations will be done in the nursery (super cheap) and objects won't be promoted to the old generation. In real applications everything will be completely different, especially with pretenuring.

leeoniya commented 5 years ago

@krausest have you considered/tried running everything with a 16x slowdown (to simulate mobile) and reducing the dom size to something closer to real life (i.e. just dropping the 10k tests)?

this may have the benefit of both being more real-world and showing greater separation, since js would account for a bigger % of the total time.

also, now that everything is > 16ms, the clamping code can be removed 👍

krausest commented 5 years ago

@leeoniya I'm not considering running all benchmarks with slowdown. I'm not confident enough that the throttling is realistic.

It might be a good idea to use it for all benchmarks that use 10k tables, if it makes the differences between the frameworks more pronounced.

Here are the results: https://krausest.github.io/js-framework-benchmark/2018/t181028.html Partial update, append rows and clear rows are included in the original form and with CPU throttling.

Looking at the results I'm pretty undecided. E.g. the comparison against solid for 10k rows:

[Screenshot: partial update, 10k rows, comparison against solid]

And here for 1k rows with 16x throttling:

[Screenshot: partial update, 1k rows with 16x throttling]

Append for 10k rows with comparison against surplus:

[Screenshot: append, 10k rows, comparison against surplus]

And for 1k rows:

[Screenshot: append, 1k rows]

Clear rows for 10k rows with comparison against solid:

[Screenshot: clear rows, 10k rows, comparison against solid]

And for 1k rows:

[Screenshot: clear rows, 1k rows]

Seems like CPU throttling helps just a bit. @leeoniya @adamhaile @ryansolid What would you prefer to see in future?

leeoniya commented 5 years ago

looks pretty good to me. i think the same slowdown (16x or 8x) could be used across the board rather than tweaking it per metric? this way the "mobile simulation" caveat can be added for the whole table rather than per row.

What would you prefer to see in future?

in the future, the ability to compare cumulative js/gc time vs layout/render. https://github.com/speedracer/speedracer looks like it has the necessary perf timeline filters to collect these things.
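As a rough illustration of that split, a Chrome trace can be bucketed into script/GC vs. layout/render time by event name (the names below come from the devtools.timeline category and may vary between Chrome versions; this is not speedracer's API):

```js
const SCRIPT = new Set(['FunctionCall', 'EvaluateScript', 'MajorGC', 'MinorGC']);
const RENDER = new Set(['Layout', 'UpdateLayoutTree', 'Paint', 'CompositeLayers']);

function splitTrace(traceEvents) {
  const totals = { scriptMs: 0, renderMs: 0 };
  for (const e of traceEvents) {
    if (e.ph !== 'X' || !e.dur) continue; // complete events with a duration
    const ms = e.dur / 1000;              // trace durations are in microseconds
    if (SCRIPT.has(e.name)) totals.scriptMs += ms;
    else if (RENDER.has(e.name)) totals.renderMs += ms;
  }
  return totals;
}
```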

ryansolid commented 5 years ago

I do see how the universal slowdown mode could give a whole additional way to review results. But it seems slowdown is at minimum necessary on some of the base tests. Replacing the 10k tests is interesting; from that perspective Create 10k should also be in the mix.

I also find it interesting that vanilla-js-1 is clearly weaker in some areas under slowdown where it is quicker without it. Things clearly don't scale the same way. So while this is arguably more valuable than the 10k tests, it's perhaps less so for everything.

krausest commented 5 years ago

So I decided to switch clear, append and update to use 1k rows and CPU throttling. I'll keep create many rows as the only benchmark to use 10k rows. I decided against using the same multiplier for throttling; instead I'm dynamically adjusting the throttling to reach sensible durations (I don't think there's one reasonable multiplier, since ARM CPUs can be pretty close to my i7). I removed bootstrap and the es5 polyfills from the new result table (@leeoniya I hope for some applause here) and added box plots as another result view, which works best if only a few frameworks are chosen. I added plotly as a dependency for that (@leeoniya I expect the applause to stop here). Here's an example:

[Screenshot: box plot result view]
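A purely hypothetical sketch of what "dynamically adjusting the throttling to reach sensible durations" could look like (the repository's actual logic is not shown here; runScenarioOnce is a made-up helper that returns one measured duration):

```js
async function pickThrottleRate(client, runScenarioOnce, targetMs = 100, maxRate = 16) {
  for (let rate = 1; rate <= maxRate; rate *= 2) {
    await client.send('Emulation.setCPUThrottlingRate', { rate });
    const durationMs = await runScenarioOnce();
    if (durationMs >= targetMs) return rate; // slow enough to measure reliably
  }
  return maxRate;
}
```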

I think it's time to close this issue.