krausest / js-framework-benchmark

A comparison of the performance of a few popular javascript frameworks
https://krausest.github.io/js-framework-benchmark/
Apache License 2.0

Is an imperative vanillajs a reasonable baseline for a data-driven bench? #316

Closed leeoniya closed 7 years ago

leeoniya commented 7 years ago

@krausest

This issue originates from https://github.com/krausest/js-framework-benchmark/issues/305#issuecomment-346257613.

There are some cases - possibly many cases - where even efficient reconciliation algorithms account for 99% of the perf measurement, whereas vanillajs is able to avoid all reconciliation costs (as it does in swap rows, and possibly others). This baseline is highly unrealistic if the aim is to show the fastest possible real-world data-driven implementation.

Perhaps a controversial view, but i think the baseline should be determined by the fastest data-driven implementation. There is now a sufficient quantity of extremely fast libs that handily beat vanillajs. Rather than trying to adapt vanillajs to be unrealistically faster, it may be worth dropping it altogether?

If we keep adding edge cases and implementing just the dom ops in vanilla, the metrics will quickly become misleading.

/random thoughts

cc @localvoid @thysultan @hville @adamhaile @yelouafi @Bobris

Bobris commented 7 years ago

I'm at peace with any Vanillajs implementation. Actually, I think it is perfectly OK to "cheat" as much as possible so we can see the overhead of our libs relative to ideal DOM manipulation. I hope there are no devs who would refuse to use a lib just because they think they could do better without one.

leeoniya commented 7 years ago

the issue i have is that there would never be a situation in real life that explicitly dictates how things need to be reordered, as in "take the first and the second-to-last and append them to the end". a vanillajs implementation of exactly these actions is therefore meaningless, because no one would write such a manipulation in real code. this means that the overhead introduced is also meaningless except when compared to other general reconciliation algos.

Bobris commented 7 years ago

I can imagine there could exist an observable framework (think MobX blended with Svelte) where swapping data in the model would automatically swap the rows in the DOM. So this particular operation could have minimal overhead in script. Of course, it would probably be very difficult to implement and would be slower in other operations, but who knows...

eavichay commented 7 years ago

I think that the purpose of this benchmark is to focus on rendering / memory usage / startup time. The "overhead" of state management X or data management Y is not part of the game. If a framework is bundled with state management, it either helps or it doesn't.

hville commented 7 years ago

Dropping vanillaJS and using the fastest implementation as the baseline reduces the maintenance effort (code quantity, setup and benchmark times).

I am assuming that the purpose of this benchmark is to help developers explore and test ideas (because creating 1000 rows on a macbook is not really a good criterion for selecting a framework), and to that end the vanillaJS implementation does not add much insight.

The only drawback I can think of is losing the fixed reference when experimenting with different frameworks or on different machines, where baseline values would otherwise be available without running the whole set. Then again, I guess most people compare implementations against popular frameworks instead of the vanillaJS results.

krausest commented 7 years ago

I don't consider dropping VanillaJS. The new result table makes it easy to hide vanillaJS if one wishes. I believe there's value in also measuring the cost of frameworks. In this case it shows that the cost is low. (I leave it up to you whether that translates to your real-world app.)

leeoniya commented 7 years ago

in that case we need to get vanilla working faster; it should not be possible to beat it.

adamhaile commented 7 years ago

My 2c:

There are only three tests with statistically significant wins over vanilla:

Ivi's win is probably real, but bobril and hyperapp's are almost certainly measurement errors. I ran bobril with STARTUP_DURATION_FROM_EVENTLOG set to true, and it fails sanity checks on every run. Startup times generally scale with bundle size, and bobril has a 44k bundle. A quick performance profile says that bobril is putting much of its initial rendering behind a setTimeout() call, and chrome is firing DOMContentLoaded before that setTimeout() triggers. (In case it's not clear, I'm not at all knocking bobril or hyperapp, just something about their startup defeats the measurements.)
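
For illustration, a minimal sketch of the pattern I mean (hypothetical code, not bobril's actual source): if the first render is scheduled behind a timer, DOMContentLoaded fires against a still-empty document, and any startup metric keyed to that event under-reports the real time-to-content.

```ts
// Hypothetical illustration of an async first render (not bobril's actual code).
function renderInitialTable(): void {
  // stand-in for the framework's initial render
  document.body.innerHTML = "<table><tr><td>row 1</td></tr></table>";
}

document.addEventListener("DOMContentLoaded", () => {
  // Fires before the timer below runs, so the table is still empty here.
  console.log("rows at DOMContentLoaded:", document.querySelectorAll("tr").length); // 0
});

// First render deferred behind a timer, as the performance profile suggests.
setTimeout(renderInitialTable, 0);
```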

I would propose:

To put it another way, vanillajs/baseline should be the rabbit we greyhounds are chasing, NOT the fastest greyhound. It can be a different kind of thing.

adamhaile commented 7 years ago

Small followup: all the keyed frameworks with faster startups than vanillajs in Stefan's tests fail the sanity check: bobril, hyperapp, elm and vidom.

leeoniya commented 7 years ago

nice finds!

once the measurements are fixed and vanillajs scores 1.0 across the board, a compromise could be to provide a checkbox in the results table to exclude vanillajs/baseline. i would be happy with that.

krausest commented 7 years ago

Oh no - that means I must get the original startup benchmark running on Leon's machine :scream: @adamhaile What OS are you running? Have you ever seen issues like in #286?

leeoniya commented 7 years ago

feel free to set the default of STARTUP_DURATION_FROM_EVENTLOG to true. or use OS-sniffing at runtime to set it to false on Windows. i'll double check to see if Bobril's startup makes more sense with false for me.

krausest commented 7 years ago

Nice. Please update to 25dc718b32dfb57098989ecce0b6da472595fb63 before checking. I accidentally committed a condition to always force the situation you ran into.

Bobris commented 7 years ago

Yes, Bobril does not create content synchronously; it waits for the first timer. Additionally, to work around a Chrome bug that displays content before dynamically created CSS styles are applied, the content is hidden with body.style.opacity=0 for 150-200ms (this test does not use that feature, so I would not need to apply the workaround so eagerly). But I am not even sure whether that is the problem you found, or whether the first-timer delay is the problem - which is good behaviour IMHO, though I understand it makes measuring difficult.
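
To make the workaround concrete, a rough reconstruction (my own sketch, not Bobril's actual source):

```ts
// Sketch of the described workaround, not Bobril's actual code.
// Hide content so Chrome cannot paint it before dynamically created CSS applies.
document.body.style.opacity = "0";

const style = document.createElement("style");
style.textContent = ".row { height: 26px; }"; // dynamically created styles
document.head.appendChild(style);

// Reveal once the styles have had time to apply (the 150-200ms mentioned above).
setTimeout(() => {
  document.body.style.opacity = "";
}, 200);
```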

leeoniya commented 7 years ago

the slowest startup is 163.4ms. instead of figuring out how to get it perfect, why not simply wait 400ms before taking a measurement of the full timeline?

krausest commented 7 years ago

Not really sure if it helps, but I added a sleep in 7c341e72254708aea5ffe6af595ba6f998f2ecc3 and switched back to measuring startup from the timeline. Looking forward to hearing how it works out. Please note that "WARN: soundness check failed. reported duration is much bigger than JS comparison" is really just a warning that will be removed once I'm confident enough that the metric is correct.

adamhaile commented 7 years ago

@krausest I'm on a new MBP2015 running MacOS 10.12.6. I can bootcamp to windows to try the benchmark too if you want.

I never saw the #286 error about no paint events, though I used to see an error about too many listeners? I think you fixed that.

In case it's useful, this is the kind of sanity check failure I see with bobril. (I had removed the true === true part of the condition too, fyi :)).

```
*** duration 58.262 upper bound  69.814 durationJS 32
WARN: soundness check failed. reported duration is much bigger than JS comparison
{"type":"runBenchmark","ts":598032703815,"dur":0,"end":598032703815}
{"type":"navigationStart","ts":598032715367,"dur":0,"end":598032715367}
{"type":"paint","ts":598032748373,"dur":32,"end":598032748405,"evt":"{\"method\":\"Tracing.dataCollected\",\"params\":{\"args\":{\"data\":{\"clip\":[0,0,1200,0,1200,726,0,726],\"frame\":\"0x14f8be7c1df8\",\"layerId\":28,\"nodeId\":7}},\"cat\":\"devtools.timeline,rail\",\"dur\":32,\"name\":\"Paint\",\"ph\":\"X\",\"pid\":95122,\"tdur\":32,\"tid\":775,\"ts\":598032748373,\"tts\":309661}}"}
{"type":"gc","ts":598032773181,"end":598032773181,"mem":3.9166107177734375}
{"type":"paint","ts":598032773443,"dur":186,"end":598032773629,"evt":"{\"method\":\"Tracing.dataCollected\",\"params\":{\"args\":{\"data\":{\"clip\":[0,0,1200,0,1200,726,0,726],\"frame\":\"0x14f8be7c1df8\",\"layerId\":28,\"nodeId\":7}},\"cat\":\"devtools.timeline,rail\",\"dur\":186,\"name\":\"Paint\",\"ph\":\"X\",\"pid\":95122,\"tdur\":184,\"tid\":775,\"ts\":598032773443,\"tts\":334321}}"}
```

leeoniya commented 7 years ago

ok, i'm back with some numbers, observations and attachments. i benchmarked vanilla, ivi, bobril, dio, domvm, surplus.

good news: no crashes this time and vanillajs is now in first place.

observations:

i had to re-run some benchmarks with npm run selenium -- --benchmark startup --count 20 --framework ivi bobril surplus-v0.5.0-non-keyed to get them to make some sense.

from the symptoms, i think this bench does the wrong thing when soundness checks fail. rather than those runs getting thrown out, they start contributing to the standard deviation and inflating the means. for bobril, this spells death on my machine because not a single run happens without a soundness warning.
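
for illustration, something like this is what i'd expect the post-processing to do (a hypothetical helper, not the actual driver code): drop the runs that failed the soundness check before computing the mean/stddev, rather than letting them inflate both.

```ts
// Hypothetical post-processing sketch; the real driver lives in webdriver-ts.
interface Run {
  duration: number;     // navigationStart -> last paint, from the event log
  soundnessOk: boolean; // whether the duration vs durationJS comparison passed
}

function summarize(runs: Run[]) {
  const kept = runs.filter(r => r.soundnessOk).map(r => r.duration);
  const mean = kept.reduce((a, b) => a + b, 0) / kept.length;
  const stddev = Math.sqrt(
    kept.reduce((a, b) => a + (b - mean) ** 2, 0) / kept.length
  );
  return { kept: kept.length, discarded: runs.length - kept.length, mean, stddev };
}
```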

attached are the results from the initial run, the followup 20 count correction run, and the log from the correction run.

results-log.zip

leeoniya commented 7 years ago

i did another round of testing, on my laptop this time, which seems to be less susceptible to soundness-check failures, though not immune.

if it weren't for Bobril (and maybe untested others) beating vanilla, the ideal setting for Windows appears to be STARTUP_DURATION_FROM_EVENTLOG: false and STARTUP_DURATION_FROM_EVENTLOG: 0. This gives the smallest stddev across all libs, close-ish means/medians, expected ordering and no need to re-run frameworks due to excessive soundness check failures.

[screenshot: startup benchmark results]

Bobris commented 7 years ago

I don't really want to impose so much additional work on you. Let me send a pull request later today that will render the first frame synchronously. Hopefully that will make it better.

leeoniya commented 7 years ago

@Bobris i'm gonna check the others that @adamhaile mentioned. if they all behave predictably on Windows with STARTUP_DURATION_FROM_EVENTLOG: false, then a PR like this would be very welcome 👍.

given the frequency of soundness fails with eventlog/true (and the apparent non-discarding of those results), i'll just use false to avoid having to re-run the benches to get tight stddevs.

eavichay commented 7 years ago

Tests can be cheated. For example: the vanilla implementation of the update benchmark (replace data in every 10th row) "knows" what was updated and modifies only every 10th row (it skips checking for changes in rows 0..8 of each partition). I don't really mind how vanilla performs, as it is sort of a hook point to unicorn performance. To rule out possible cheats, choosing 100 random rows out of the 1000 and updating those would make it impossible to predict.
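
Something along these lines (just a sketch of the suggestion, not an existing test) would be unpredictable to the implementation:

```ts
// Sketch of the suggested "update 100 random rows out of 1000" step.
function pickRandomIndices(count: number, total: number): number[] {
  const indices = new Set<number>();
  while (indices.size < count) {
    indices.add(Math.floor(Math.random() * total));
  }
  return [...indices];
}

function updateRandomRows(data: { id: number; label: string }[]): void {
  for (const i of pickRandomIndices(100, data.length)) {
    data[i] = { ...data[i], label: data[i].label + " !!!" };
  }
  // The implementation can no longer hard-code "every 10th row"; it has to
  // diff (or observe) to find out which rows actually changed.
}
```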

leeoniya commented 7 years ago

@eavichay this was why i opened this issue. however, most implementations that get submitted get a sniff test. it's pretty easy to spot "cheating" by simply looking for anomalies in the individual metrics. we have enough data at this point to know what to expect and anything that does sufficiently better than expectations is caught and scrutinized pretty quickly.

krausest commented 7 years ago

@leeoniya I consider it a good sign if the test driver reports the sanity check warning for frameworks that perform async initialisation. It simply compares the W3C timing information to the duration from the navigation start event until the last paint happens. It's clear that this sanity check cannot hold for async initialisation, or if visibility is toggled by a timer. (OTOH, the latter metric could maybe be tricked if all the initial HTML were static and the JavaScript loading were async.)

What the startup benchmark really should do is accumulate all V8 compile and parse durations. But I don't really have a clue how to extract that data from the timeline. Any pointers would be welcome.
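
Something in this direction is what I have in mind (just a sketch over the trace events the driver already collects; the event names are a guess on my part and may differ by Chrome version):

```ts
// Hypothetical aggregation; event names like "v8.compile", "ParseHTML" and
// "EvaluateScript" are assumptions, not verified against the actual trace.
interface TimelineEvent {
  name: string;
  dur?: number; // microseconds in the raw trace
}

const COMPILE_PARSE_EVENTS = new Set(["v8.compile", "ParseHTML", "EvaluateScript"]);

function compileAndParseMs(events: TimelineEvent[]): number {
  const totalMicros = events
    .filter(e => COMPILE_PARSE_EVENTS.has(e.name))
    .reduce((sum, e) => sum + (e.dur ?? 0), 0);
  return totalMicros / 1000;
}
```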

leeoniya commented 7 years ago

without really digging into the resources [1], [2] or the bench code, what i'm seeing is that the final Composite Layers event from the devtools' event log is a pretty good measure of the "consistently interactive" metric. if there was a way to capture just this final event (after a predefined timeout of maybe 500ms), then it would likely work well.

[screenshot: devtools timeline showing the final Composite Layers event]

[1] https://gist.github.com/axemclion/079f8bf1a997e9cfe9f0#paints-styles-layouts-and-other-events [2] https://sites.google.com/a/chromium.org/chromedriver/logging/performance-log
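
roughly what i have in mind (just a sketch over the same kind of parsed events shown in the logs above; the "compositelayers" event type and the ~500ms settle delay are my own assumptions):

```ts
// Sketch: after letting the page settle (~500ms), take the last
// CompositeLayers/Paint event as the end of startup.
interface TraceEvent {
  type: string; // e.g. "navigationStart", "paint", "compositelayers" (assumed)
  end: number;  // microseconds, as in the logs above
}

function startupDurationMs(events: TraceEvent[]): number {
  const navStart = events.find(e => e.type === "navigationStart")!;
  const lastPaintLike = events
    .filter(e => e.type === "paint" || e.type === "compositelayers")
    .reduce((last, e) => (e.end > last.end ? e : last));
  return (lastPaintLike.end - navStart.end) / 1000;
}
```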

krausest commented 7 years ago

The code is already almost there - you can choose which events should be considered paint events here: https://github.com/krausest/js-framework-benchmark/blob/7c341e72254708aea5ffe6af595ba6f998f2ecc3/webdriver-ts/src/benchmarkRunner.ts#L38-L42 The benchmark driver later computes the duration from navigationStart to the last paint event (i.e. whatever is active in the conditions above).

I had played a bit with which events to extract and found Paint to be most useful.

[screenshot: timeline paint events, 2017-11-28]

Please note that the last paint event is very close to the last composite event. But we could certainly consider both "Paint" and "CompositeLayers" in the future.

When I use composite layers instead of paint I get:

```
result bobril-v8.0.1-keyed_30_startup.json min 38.97 max 61.284 mean 48.938599999999994 median 49.061499999999995 stddev 7.149782992511031
```

Whilst for using paint:

```
result bobril-v8.0.1-keyed_30_startup.json min 40.819 max 61.184 mean 48.67530000000001 median 46.964 stddev 6.435412730353819
```

leeoniya commented 7 years ago

i'm slightly confused then. the log i attached shows the soundness check failing on supposedly excessive duration values (~60ms+) vs a durationJS of ~30ms. however, when i manually check the timelines, it's actually the bigger value that always makes sense and aligns with the last paint/composite events. in fact, it's the large duration values which appear to be the correct ones, not the much lower durationJS.

is it possible that in cases when duration approximates durationJS, you're just somehow picking up the wrong paint event (not the last one)?

krausest commented 7 years ago

durationJS is just computed as window.performance.timing.loadEventEnd - window.performance.timing.navigationStart. Maybe we should forget about this one, remove the sanity check, and focus on measuring navigationStart to the last paint event, if we can all agree that this is a reasonable startup metric.

I hope and think I do not choose the wrong paint event: https://github.com/krausest/js-framework-benchmark/blob/7c341e72254708aea5ffe6af595ba6f998f2ecc3/webdriver-ts/src/benchmarkRunner.ts#L197-L202

Can you try to uncomment all paint events and check whether results are more stable on your machine? https://github.com/krausest/js-framework-benchmark/blob/7c341e72254708aea5ffe6af595ba6f998f2ecc3/webdriver-ts/src/benchmarkRunner.ts#L38-L47

leeoniya commented 7 years ago

no luck. still seeing a bimodal distribution where a lot of results have duration ~ durationJS (both too low) and the rest have duration > durationJS, with duration being close to the truth (compared against manually examined event logs for the last paint/composite events).

i wonder, if we simply throw out the results where duration ~ durationJS, whether we get the expected results and a low stddev from the high durations alone. i suspect yes. log & results from 20 runs:

results-log-2.zip

this is from my work desktop, which emits a lot of soundness errors. i'll test later on my laptop and see if most of the soundness failures disappear, which would make this strategy unreliable.

adamhaile commented 7 years ago

@leeoniya - random thought: any chance the late paint events could be caused by your mouse being over the window? Lots of :hover css rules ...

probably not, but I figured worth asking.

leeoniya commented 7 years ago

unlikely. i just let it run and walk away. i'm not doing stuff while a bench is running.

adamhaile commented 7 years ago

@leeoniya I've seen differences from my mouse being left over the area the chrome window is popping up in, as it still triggers the :hover rules. Like I say, it's not the likely cause, but worth mentioning.

Can you get Time To Interactive from chromedriver? Would that help here?

leeoniya commented 7 years ago

Can you get Time To Interactive from chromedriver? Would that help here?

that's my ultimate goal really. i don't know if this is directly available via the api.

TBH, i don't know what numbers to compare anymore. VanillaJS in manual testing has 10ms-12ms scripting time, while durationJS in the console says 49-52ms. it's not just a little off. could it be the result of some excessive tracing enabled in chromedriver in this bench? ¯\_(ツ)_/¯

i'll probably make a minimal testcase and start barfing the full unfiltered eventlog to see if anything makes sense.
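
for reference, something along these lines should dump the raw log (assuming the chromedriver session was started with performance logging / tracing enabled, as the benchmark driver already does):

```ts
import { logging, WebDriver } from "selenium-webdriver";

// Dump every performance-log entry without any filtering, so the full
// event stream can be eyeballed for anomalies.
async function dumpEventLog(driver: WebDriver): Promise<void> {
  const entries = await driver.manage().logs().get(logging.Type.PERFORMANCE);
  for (const entry of entries) {
    // entry.message is a JSON string like the Tracing.dataCollected
    // payloads quoted earlier in this thread.
    console.log(entry.message);
  }
}
```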

krausest commented 7 years ago

alexfmpe recently submitted a PR to run chrome headless (--headless option). That should at least eliminate all mouse-over issues. It doesn't reduce stddev a lot on my machine, but maybe it helps on yours.

not headless:

```
result bobril-v8.0.1-keyed_30_startup.json min 41.75 max 61.794 mean 52.89880000000001 median 53.9165 stddev 7.735270917039687
result domvm-v3.2.4-keyed_30_startup.json min 38.939 max 44.37 mean 41.591 median 41.409499999999994 stddev 1.7720421552547776
```

headless:

```
result bobril-v8.0.1-keyed_30_startup.json min 32.959 max 48.019 mean 41.5656 median 44.283 stddev 5.458740700930938
result domvm-v3.2.4-keyed_30_startup.json min 31.384 max 37.7 mean 33.8732 median 32.942499999999995 stddev 2.2012529977265216
```

leeoniya commented 7 years ago

headless made no difference :(

at the end of the day, i'm not necessarily interested in getting everything to reproduce on my machine, so long as the official results have low stddev, measure the right thing and produce the results we expect. if i can reliably know that vanilla really is the fastest startup impl, i can then test against that locally (which is reproducible) instead of against bobril's, hyperapp's or some other lower number that's faux.

i wonder if @paulirish or @addyosmani can provide any guidance on reliably measuring TTI via chromedriver.

paulirish commented 7 years ago

@leeoniya well I guess you nerdsniped me. :) See https://github.com/krausest/js-framework-benchmark/pull/326 .. it's a PR to introduce Lighthouse to the benchmark to calculate additional metrics.

I don't know that this will fix your stddev issues, but at least you can try things with our (fairly mature) metrics. They are all still imperfect, but summarizing main thread impact on the user is a tough one. :)

leeoniya commented 7 years ago

@paulirish huge thanks for the PR. i figured you might have something to say about the accuracy of a juicy benchmark :)

i'll wait for @krausest to validate & merge everything on macos/linux before trying it out on my windows machines 🤞 . (some of us do need a break from endless benchmarking :D)

krausest commented 7 years ago

Please use #330 to discuss which Lighthouse metrics could be used to get a more useful startup metric. The discussion here drifted a bit from its original topic (though it was a good one for sure).