krausest / js-framework-benchmark

A comparison of the performance of a few popular javascript frameworks
https://krausest.github.io/js-framework-benchmark/
Apache License 2.0

chrome 124: weird results #1655

Open krausest opened 2 weeks ago

krausest commented 2 weeks ago

Management Summary:

Details:

A first run for the keyed implementations shows results that don't look good. The fastest frameworks now score 1.05; historically this has been a sign that something might be off with the measurement. The fastest implementation for clear rows was imba: Screenshot 2024-04-20 at 19 48 18 Imba uses RAF (requestAnimationFrame) to schedule the rendering. In the screenshot above we'd compute 14.95 msecs for the duration (from the click event to the end of the first commit event). Please note that there's a separate hit test and commit event later in the trace that doesn't count for our computation, at least at the moment.

Vanillajs has only one commit event and looks more straightforward: Screenshot 2024-04-20 at 19 47 55 So here we'd take 17.38 msecs for vanillajs.
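The click-to-commit computation described above can be sketched as follows. This is a simplified illustration, not the benchmark's actual code; the event names and object shapes are assumptions modeled on Chrome's trace JSON, where timestamps (`ts`) and durations (`dur`) are in microseconds:

```javascript
function computeDuration(events) {
  // The click's EventDispatch entry marks the start of the measurement.
  const click = events.find(
    (e) => e.name === "EventDispatch" && e.args?.data?.type === "click"
  );
  if (!click) throw new Error("no click event in trace");
  // Take the first commit at or after the click; a later hit test and
  // commit pair in the trace is ignored, as described above.
  const commit = events.find((e) => e.name === "Commit" && e.ts >= click.ts);
  if (!commit) throw new Error("no commit event after click");
  // Duration from click dispatch to the end of the first commit, in msecs.
  return (commit.ts + (commit.dur ?? 0) - click.ts) / 1000;
}
```
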

For chrome 123 we measured 15.3 for vanillajs and 16.7 for imba. With chrome 124 the results look different:

The boxplot looked like that for chrome 123:

krausest commented 2 weeks ago

If I add a RAF call to vanillajs I get results pretty close to imba, and I get a second commit event. Screenshot 2024-04-20 at 20 20 59 Maybe it was the wrong decision to count up to the first commit and ignore the second commit in this case. It seems that the second commit is a consequence of the RAF call (and not something spurious) and should be included in the measurement.

krausest commented 1 week ago

Ran the benchmark for the keyed frameworks again last night. At least results are very consistent ☹️ Here's a screenshot: Screenshot 2024-04-22 at 20 41 21 or here if you want to look at the table. It'll take some time for further analysis.

krausest commented 1 week ago

Here's a look at create rows: A rather normal vanillajs trace: Screenshot 2024-04-22 at 21 13 07 The fastest ivi trace: Screenshot 2024-04-22 at 21 12 56

This wasn't the case with chrome 123:

With manual testing I haven't managed to reproduce the faster results for ivi; it looks more like chrome 123.

Using puppeteer as the test runner reports 34.1 msecs for ivi and 37.3 msecs for vanillajs. Webdrivercdp reports 33.9 msecs for ivi and 33.8 for vanillajs, which is more consistent with manual testing.

Other tests (so far it was the same browser window, but a new tab per run):

- puppeteer with a new browser window per run: 33.4 for ivi and 36.2 for vanillajs
- puppeteer running the create bench in one tab: 33.6 for ivi and 34.3 for vanillajs

Manual testing gives me 38.5 msecs for ivi and 36.8 for vanillajs. Something is wrong here...

antonmak1 commented 1 week ago

@krausest Maybe this has something to do with the HTML and DOM changes in version 124? Specifically, "Document render-blocking" or "setHTMLUnsafe and parseHTMLUnsafe" or something like that?

krausest commented 1 week ago

I took a closer look at the create 1k rows issue - but I have no good news: I ran a manual test for create 1k rows with 8 runs for ivi, solid, doohtml and vanillajs. This is what a boxplot looks like for this manual test. I'd consider that to be the ground truth:

Please note that there's a suspicious outlier for vanillajs at 34.29 msecs. I kept a screenshot of that trace. There's a quite clear ordering among the other three: doohtml < ivi < solid

What's interesting is that the chart above is remarkably similar to the puppeteer chrome 123 results:

However the results for all testdrivers are disappointing. I performed three runs for each of them:

  1. Puppeteer: "clearly ivi is fastest"

puppeteer124 At least the results are repeatable, though not similar to the manual results.

  2. Playwright: "solid is fastest"

playwright124

  3. Webdrivercdp: "where is my mind?" webdrivercdp124 The first run fits "doo < ivi < solidjs", though vanillajs is closer to the outlier from the manual testing above, but runs two and three just look random.

Currently I'm out of ideas. I don't see how I could publish chrome 124 results soon.

antonmak1 commented 1 week ago

@krausest Perhaps, if the old code somehow works incorrectly, it should first be tested alongside new code written against version 124. Clearly something nonsensical is going on: all the results now start from 1.05, and frameworks and libraries that used to sit in roughly the same place are now 10 positions ahead or 10 positions behind. How can 7+ releases be fine and then everything break with the new one? There must be a new error somewhere in the code that wasn't an error before, or wasn't considered one.

mksunny1 commented 1 week ago

I'll put my mind to this and see if I can come up with something. I'm not too familiar with this yet.

krausest commented 1 week ago

There's not much code that was updated: https://github.com/krausest/js-framework-benchmark/commit/30c247a897cfe147a94ff82edafd8d066f4ffb36 I also tried updating all the dependencies, but it didn't change anything, so I rolled them back. So I really think the factor that makes the difference is chrome 124.

I tested whether windows reports better results, but it doesn't look like it does.

krausest commented 1 week ago

Just when I had finished my plan for what I'd do after quitting this project, I tried one last thing 😄 What if I add a sleep call of one second before each interaction? It turns out this produces the following charts for three independent runs with puppeteer: slow_runs Those charts look much closer to manual testing (and, as old as I'm feeling, that second between clicks seems realistic).

Now I have to find out:

  1. How does the sleep impact the chart of the other testdrivers?
  2. Where do we need the sleep calls? (hopefully not at many places)
  3. What's the minimum duration for the sleep?
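The delay between interactions can be sketched as a promise-wrapped setTimeout. This is a sketch only; `clickWithDelay` is a hypothetical helper, not the benchmark driver's actual code:

```javascript
// Promise-wrapped setTimeout, as described in the comments above.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Hypothetical helper: pause before performing an interaction,
// e.g. before a puppeteer click on a benchmark button.
async function clickWithDelay(clickFn, delayMs) {
  await sleep(delayMs);
  return clickFn();
}
```
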
krausest commented 1 week ago

1: I don't see that the delay helps the other test drivers. Playwright with a delay of 1 second: playwright_slow It looks different from the chart without delays, but not close to the manual testing result.

Nor does webdrivercdp:

webdrivercdp_slow

So we'll stick with puppeteer; the other drivers do not report values closer to manual testing.

krausest commented 1 week ago
  2. Chrome 124 is bizarre. I first tried adding sleeps between all interactions, then tried to filter out where they are actually needed.

I got that chart with puppeteer: Screenshot 2024-04-27 at 14 41 33

Look how bad those numbers are: > 50 msecs. That's ridiculous. Here's a trace for one such bad run: Screenshot 2024-04-27 at 14 45 47 And here's the trace file: doohtml-keyed_01_run1k_5.json The trace looks right, the computation of the duration is OK, there's no RAF, no GC, nothing suspicious. But 5 msecs of scripting and 48 msecs of rendering is just incredibly bad.

It turns out that a single sleep call causes that bad performance: before runBenchmark we call forceGC. If we sleep after that, we get the bad performance: Screenshot 2024-04-27 at 14 55 06 Uncommenting the line (which just sleeps for a second via setTimeout in a promise) causes the bad performance. Without that sleep: Screenshot 2024-04-27 at 14 57 51 Just 37 msecs. doohtml-keyed_01_run1k_2.json The script duration is now 2 msecs and the rendering duration 37 msecs (!).

I found out something: this can be resolved by adding the trace category "disabled-by-default-v8.cpu_profiler". With it enabled, results were below 40 msecs even with the sleep after forceGC().
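A sketch of what adding that category could look like when starting a trace with puppeteer. Apart from "disabled-by-default-v8.cpu_profiler", the category list here is illustrative and not the benchmark's actual configuration:

```javascript
// Illustrative trace-category list; "disabled-by-default-v8.cpu_profiler"
// is the category that avoided the slow renders described above.
const categories = [
  "devtools.timeline",
  "disabled-by-default-devtools.timeline",
  "disabled-by-default-v8.cpu_profiler",
];

// Sketch, assuming puppeteer's tracing API:
// await page.tracing.start({ path: "trace.json", categories });
```
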

krausest commented 6 days ago
  3. The disaster is complete chrome_124_disaster On the right are the chrome 123 results. The four charts on the left are chrome 124 with varying sleep calls after every click during initBenchmark and before and after forceGC. Chrome 123 used no sleeps and thus should correspond to "no sleep". I've painted a baseline since the axis differs between the charts. The 1 second sleep creates the chart most similar to chrome 123, but is slower than the others and than chrome 123. Basically I'd say that the fastest results should be considered most correct (Vmax here means t_min, no matter how it's achieved).

Not sure how to get out of here. Currently I see no chance that chrome 124 allows me to measure with any confidence.

Thanks to https://github.com/GoogleChromeLabs/chrome-for-testing#json-api-endpoints I was able to download chrome 123 and check whether it's really chrome 124 that causes this effect. So here's the same chart with three different delays for chrome 123: chrome123_delays This looks fine (the order is preserved no matter what sleep is used; the 1 sec sleep is a bit faster than no sleep, and 100 msecs for create many rows is a little odd, but nowhere near what chrome 124 does), but it doesn't help us with chrome 124.

titoBouzout commented 5 days ago

May I suggest trying chrome 126 (chrome dev channel) to see if the problem persists? I'm not sure if it's worth it, but you could also try the beta and canary channels. Maybe they already fixed something? What about experiments? Last time I read about experiments, you cannot turn them off.

krausest commented 5 days ago

This comment serves just as a summary and will be linked from the chrome issue report. It contains no new information to the above.

With chrome 124 I got strange results for the benchmark that can be seen in the chart below: chrome_124_disaster The results for chrome 123 on the right should be seen as a baseline. The "no sleep" chart uses the same code as chrome 123 and should be identical, but it is far off! I actually performed the same benchmark with chrome 124 manually, by clicking and extracting the duration from the timeline in chrome, and it gives results close to chrome 123:

I tested whether delaying the actions from the benchmark driver helps, and indeed adding a sleep of one second gives an order similar to chrome 123, but at a different speed. Varying the sleep duration makes the ranking of the frameworks arbitrary: ivi can be fastest, slowest, or third. There's a chart above showing that the chrome 123 results are stable across those sleep durations.

The duration for the benchmark is measured via traces, from the click to the paint commit event: Screenshot 2024-04-29 at 21 00 55 For chrome 124 with a 500 msecs delay we get something like 38.81 msecs for one benchmark run. Without sleep it takes only 34.78 msecs. Screenshot 2024-04-29 at 21 01 35 Please note that the difference comes from the rendering duration, though rendering shouldn't behave differently when sleep delays are added before the click event.

krausest commented 5 days ago

Chrome issue report is here: https://issues.chromium.org/issues/337900449

krausest commented 5 days ago

@titoBouzout I tried that before, but didn't keep the result. Anyway, I repeated the run. This is what it looks like without sleep:

Just as bad as chrome 124. @localvoid Any idea why ivi is most impacted?

localvoid commented 4 days ago

@krausest I am not sure about other libraries, but I think the only difference (in DOM operations) between ivi and vanillajs is that when the table is cleared, ivi removes the rows by replacing the <tbody> DOM node with a new one instead of removing them with textContent=""
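The two clearing strategies can be illustrated like this. This is an illustrative sketch using plain DOM calls, not the actual ivi or vanillajs code:

```javascript
// vanillajs-style: keep the existing <tbody> and drop its contents.
function clearByTextContent(tbody) {
  tbody.textContent = "";
}

// ivi-style: swap the whole <tbody> node for a fresh, empty one.
function clearByReplacement(table, tbody, createTbody) {
  const fresh = createTbody();
  table.replaceChild(fresh, tbody);
  return fresh;
}
```
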

syduki commented 4 days ago

Very exciting "issue" indeed :smile:. But seriously, this was bound to happen sooner or later; it is doubtful that this is a Chrome 124 issue. It is naive to expect benchmark consistency between browser releases when there is no separation between "Olympics" and "Paralympics" and they run together in the same marathon, and when benchmarking two different kinds of "manufacturing processes", like "stamping" and "handcrafting", which is exactly the case when some frameworks use innerHTML to build the DOM tree and others createElement.

To me, the results look pretty predictable, considering the recent effort put into reviving HTML and thus into HTML optimizations. It is logical that innerHTML should be faster than createElement, as it once was. As for the case with "sleep", it can be explained from the point of view of HTML parser cache optimization, where no-sleep would benefit from a cache hit and, conversely, a sleep would cause a cache miss.

It seems to me that we will see the same "issue" in future versions of Chrome, so my suggestion is to publish the results as they are, for the sake of history.

krausest commented 4 days ago

@syduki I don't agree at all. I didn't expect chrome to perform equally across versions (and that wasn't the case in the past). But when performance changes between versions, there should be a reason we understand (like a new layout engine etc.).

Performance within one chrome version must be consistent, i.e. one must be able to validate results from automated testing with manual tests. This is not the case for chrome 124. The automated test reports that ivi is fastest, but manual testing shows no evidence that it really is (and manual testing yields pretty much the same results as chrome 123). And of course adding some delays in the warm-up and before running the actual benchmark must not influence the measured duration.

syduki commented 4 days ago

@krausest Well, I can only say that we have very different perceptions of benchmark consistency. I consider consistency relevant only in a narrow field, depending on the specific framework/scenario/environment, not something that should be universally comparable between any arbitrary mix of those. It is obvious that optimizations for different methods of layout creation differ; thus, I am not expecting inconsistencies only on major changes in the browser, i.e. breaking changes that affect all frameworks (layout engine), while disregarding changes/optimizations in a specific subroutine, which may not necessarily be a breaking change nor a subject for a public report, but may affect only a subset of frameworks. As I already alluded, I think this very issue is somehow related to the latest changes/optimizations in markup handling. It could be that the recent addition of setHTMLUnsafe somehow affected the innerHTML handling, as they share the same underlying code. I didn't dig deeper into that code, but this source code makes me believe that such optimizations are now used more aggressively. Regarding the manual/automated tests, I agree that they should exhibit the same behavioral results, but I don't get why one should expect the same performance results/consistency when these are actually different test environments.

krausest commented 2 days ago

I'm really stuck because I have no confidence in the chrome 124+ results, but I have a proposal to resolve that: if we manage to create a new vanillajs implementation that performs create 1,000 rows, replace rows and create 10,000 rows faster than (or as fast as) ivi in the no-delay scenario above, I'll regain confidence. If we fail, I claim that chrome reports incorrect values. @syduki @trueadm @localvoid or anyone else: any chance of submitting a new vanillajs version? (One more rule: just cloning ivi and using it as a lib wouldn't count 😄 ) I opened an issue for that: https://github.com/krausest/js-framework-benchmark/issues/1661