vobyjs / voby

A high-performance framework with fine-grained observable-based reactivity for building rich applications.
https://voby.dev
MIT License

Sierpinski triangle demo is missing per-node expensive computation #13

Closed danielweck closed 2 years ago

danielweck commented 2 years ago

Hello, just a thought about the rendering performance demo:

https://codesandbox.io/s/voby-demo-triangle-l837v0?file=/src/index.tsx:222-385

```tsx
const useSeconds = (): Observable<number> => {
  const seconds = $(0);

  useInterval ( () => seconds ( ( seconds () % 9 ) + 1 ), 1000 );

  return seconds;
};
```

I think that useSeconds() should be "wrapped" into an intermediary observable that triggers at the same time as the seconds signal, which is scheduled to reliably fire on every one-second tick. The goal of this "wrapper" observable is to introduce a computationally-taxing operation per node of the triangular structure (just as in the original React demo).

This is specifically what makes this stress-test relevant, as remarked by Ryan Carniato:

https://github.com/ryansolid/solid-sierpinski-triangle-demo#solid-sierpinski-triangle-demo

See:

https://github.com/ryansolid/solid-sierpinski-triangle-demo/blob/889db98d7697bd39d8fe3ca0e0134fb2788b4447/src/main.jsx#L46-L51

    var e = performance.now() + 0.8;
    // Artificially long execution time.
    while (performance.now() < e) {}
};
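For illustration only, here is a hedged sketch of how that per-node cost could be layered on top of the useSeconds observable shown above, mirroring the Solid busy-wait. The wrapper name (readSecondsExpensively) and its placement are assumptions for illustration, not part of the actual demo or of Voby's API:

```tsx
// Hedged sketch (not the actual demo code): each node reads the shared seconds
// observable through a wrapper that burns ~0.8ms, so the per-node expensive
// computation happens on every tick, as in the original React/Solid demos.
const readSecondsExpensively = ( seconds: Observable<number> ) => {
  return (): number => {
    const end = performance.now () + 0.8; // artificially long execution time
    while ( performance.now () < end ) {}
    return seconds ();
  };
};
```

Each blob would then read the seconds value through this wrapper inside its reactive rendering expression, instead of reading seconds directly.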

(sorry, I don't know enough about Voby to contribute a PR)

I used the same technique in the stress-test for my fine-grain reactivity / observable lib experiment (I use uHTML for DOM rendering). Otherwise, rendering at 200 FPS is way too easy :)

fabiospampinato commented 2 years ago

Done, thanks 👍 To be fair, I don't really understand the point of that demo as a benchmark; an artificial delay of 0.8ms every 1000ms is just irrelevant, I think. The demo just looks cool 😂

danielweck commented 2 years ago

Hello, yeah, the CPU spike during each second tick didn't in fact make much difference in my stress-testing of uHTML + bring-my-own-reactive-lib ... however, this triangles demo revealed a bug in my observer propagation algorithm (a runaway memory leak which I hadn't caught with unit tests). And as you say, the demo "looks cool" ;)

PS: After several design/implementation iterations, I've reached a point where my micro-lib is slightly faster than Solid but still slightly slower than Kairo (benchmarked in Node, Firefox, Safari and Chrome). It uses a synchronous propagation pattern, unlike CellX, which defers to a microtask (rather unpredictable, and it skips updates since they are batched by default).

Anyway, this is just a spare-time fun experiment; the API surface is super small (only the fundamental primitives, no sugar), so it is probably not that useful for any real-world DOM fine-grain reactivity use (I use it to refresh Preact components ... so strictly speaking, reactivity at the VDOM component level, not direct DOM expressions). I am also playing with uHTML, which does a complete DOM render diff but uses cache/memo to optimise the render pass.

The reactivity lib currently stands at just under 1KB (gzipped). Kairo is massive by comparison, and super feature-rich (and so damn fast!!). Solid's reactivity core is great of course, and fine-tuned for its DOM use.

I haven't looked at Oby yet. I should add it to my Solid benchmark fork, not only to test performance but also consistency (propagation completeness and order).

Cheers, Dan

fabiospampinato commented 2 years ago

That's cool, we need more exploration in this space. I've been trying to reach Solid's performance in the cellx benchmark but I can't quite get there, it seems. Solid does some things I wouldn't be comfortable with though, like not deduplicating registrations of signals with their owners: you can just run out of memory by reading a single signal in a loop, and in general you will consume more memory in normal code because of that detail.
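As a hedged, standalone illustration of that memory detail (a simplified sketch, not Solid's or Oby's actual internals):

```ts
// Simulate one computed ("observer") that reads the same signal one million times.
type Observer = { dependencies: Set<object> };

const signal = {};
const observer: Observer = { dependencies: new Set () };

for ( let i = 0; i < 1_000_000; i++ ) {
  // Deduplicated registration: repeated reads of the same signal keep a single link.
  observer.dependencies.add ( signal );
}

console.log ( observer.dependencies.size ); // 1

// Registering blindly into an array instead would record 1,000,000 entries here,
// which is the "run out of memory by reading a single signal in a loop" scenario.
```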

I haven't looked into Kairo yet, it'd be nice if somehow Oby could be sped up some more.

danielweck commented 2 years ago

Thanks for your reply.

The CellX benchmark is of very limited interest due to how the "layers of reactivity" are structured ( https://github.com/Riim/cellx/blob/11ba12c4317534fce1c797dbad17006d7a72d82d/perf/perf.html#L159 ). The timing measurements only become meaningful at around 3000-5000 layers, but at that point many implementations reach call-stack limits, so the whole exercise becomes rather pointless :) (Plus, the call-stack limitations are sometimes unpredictably alleviated in subsequent runs due to JS engine optimisations / function inlining.)

That being said, just for fun I continue to run CellX's benchmark, mainly to stress-test my algorithm + implementation with this particular "layer" pattern. Note that CellX's "win" is largely due to deferring batched updates to a microtask, which causes undesirable, unpredictable behaviour in real-world applications (notably DOM / frontend).
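For context, here is a hedged sketch (from memory, shape only) of that "layer" pattern; the real benchmark builds each layer out of reactive cells / computeds, whereas plain functions are used below just to show the wiring:

```ts
type Layer = { a: () => number; b: () => number; c: () => number; d: () => number };

// Each layer derives four values from the previous one, so a change to the four
// source cells has to propagate through every single layer.
const makeLayer = ( prev: Layer ): Layer => ({
  a: () => prev.b (),
  b: () => prev.a () - prev.c (),
  c: () => prev.b () + prev.d (),
  d: () => prev.c (),
});

let layer: Layer = { a: () => 1, b: () => 2, c: () => 3, d: () => 4 };
for ( let i = 0; i < 5000; i++ ) layer = makeLayer ( layer );

// Reading layer.a () here would recurse through all 5000 layers, which is exactly
// where the call-stack limits mentioned above start to bite.
```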

Crucially, I use Solid's own benchmark ( https://github.com/solidjs/solid/blob/main/packages/solid/bench/bench.cjs ) which more realistically represents real-world usage: isolation of creation vs. propagation, using a thoughtfully-crafted set of reactive graph "shapes" (i.e. mass-initialisation of observable units + computed derivations, and separate cost measurement for synchronous updates inside the reactive structure).

PS: there are significant discrepancies between JS runtimes (Node vs. Firefox vs. Safari vs. Chrome) ... so I always look at the results of synthetic / micro benchmarks with some scepticism ... including my own stats ;) Kairo is particularly fast in Safari / JavaScriptCore, notably in the creation phase ... I don't think that other runtimes de-opt its code paths, so maybe there are some unintentional JSCore-friendly micro-optimisations in Kairo! :)

By the way, I modified Solid's bench.cjs to verify / assert the order of reactive updates, and I discovered that Solid, Kairo, vuerx and rval-mod are consistent, whilst s-mod, sinuous-mod and sinuous are not.

I am also using Solid's unit tests (specifically: https://github.com/solidjs/solid/blob/main/packages/solid/test/signals.spec.ts and https://github.com/solidjs/solid/blob/main/packages/solid/test/signals.memo.spec.ts ) to verify that my code performs in a manner that is "compatible" with fine-grain DOM reactivity (i.e. order of updates, propagation consistency, etc.). My algorithm "disagrees" with some of Solid's reactive update paths, but I am taking this outcome as "my implementation is very likely incorrect, and it will break in some not-yet-tested edge-case" ;o)

To be clear: I have an open-source playground / kitchen sink on GitHub, but I have no intention of publishing yet another NPM package. I take satisfaction in fine-tuning both the algorithm and its corresponding TS/JS code, whilst keeping the JS bundle size tiny (mostly for educational purposes, in my spare time, over food breaks). My implementation of the core reactivity primitives seems to have reached a level of memory/CPU performance close to Kairo and slightly better than Solid ... but in the real world, the perf bottleneck probably narrows down to the DOM / VDOM / diff / updates (which is a problem space I have absolutely zero desire to explore ... I am quite happy to rely on the extensive prior art produced by devs smarter than me :)

Saving a few milliseconds of code execution in the reactivity logic is "nice to have", but it's by no means a game-changing factor in anything other than very niche application domains. I quickly realised that it is a futile pursuit; better to focus on consistency / reliability / predictability. For example, frustratingly, Terser minification introduces a few milliseconds of performance loss compared with the raw JS code (I presume that somehow Terser's var/loop jiggling triggers de-opts in some code paths). I also observe significant jumps / variations when executing benchmark runs in V8 (not so much in other engines), so the median and mean average figures only tell part of the story. According to Chrome's profiler, garbage collection is largely responsible for the random perf gaps in the flame graph, so I have been focusing on memory allocations and disposal patterns.

Please let me know if you are aware of particularly gruelling "reactive data" performance tests, like massive spreadsheet resolver benchmarks :)

fabiospampinato commented 2 years ago

Thanks for the pointers, I think I agree with everything you said.

I didn't know that cellx's performance was basically due to batching updates; that makes me feel much better about my implementation then 😂. Mine is indeed meant to be used in a different context, where batching everything may not be desirable.

I don't know of a better benchmark unfortunately; in fact, I was kind of looking for a better one myself. I think I'll just run Solid's too and see what happens. Cellx's benchmark was useful to me, though, to get a sense of the numbers I should be seeing, and to find some re-executions that could have been avoided, but it's very narrow in scope and not really representative of real-world scenarios.

Testing-wise I'm running Sinuous' test suite plus about 170 custom tests at the moment. It might be interesting to run Solid's test suite too, just to be even more sure that things are working correctly.

DOM-wise, if you change your mind, potentially you could just wire your implementation with dom-expressions, or you could fork voby, delete everything that you don't need, and provide a shim for the remaining functions that are needed from oby here; it could be fun.

danielweck commented 2 years ago

Thanks for sharing your thoughts.

The other day I was actually having a look at Ryan's @vue/reactivity wrapper for dom-expressions: https://github.com/ryansolid/vuerx-jsx/blob/master/src/lib.ts

Besides providing a useful example of an API shim, this integrates a "modified version of mapSample from S-array[https://github.com/adamhaile/S-array] by Adam Haile", just as in Solid itself ( https://github.com/solidjs/solid/blob/main/packages/solid/src/reactive/array.ts ) ... another piece of the puzzle which was solved a while ago with good performance and reliability, so thankfully no need to "reinvent the wheel" here :)

danielweck commented 2 years ago

Re. CellX microtask updates batching:

FWIW, I was quite surprised to observe that my humble synchronous algo was faster than CellX's async updates:

5000 "layers" (beforeChange: [2,4,-1,-6], afterChange: [-2,1,-4,-4]) (2020 MacBookPro Intel i5)

(CellX JS bundle from CDN: https://unpkg.com/cellx@1.10.26/dist/cellx.umd.js )

Chrome:

CellX:

MIN: 9.800000000745058,
MAX: 12.300000000745058,
AVERAGE: 11.020000000111759,
MEDIAN: 10.550000000745058

Mine:

MIN: 4.799999998882413,
MAX: 15,
AVERAGE: 9.129999999701976,
MEDIAN: 7.349999998696148

Safari:

CellX:

MIN: 9,
MAX: 12,
AVERAGE: 10.100000000000183,
MEDIAN: 10

Mine:

MIN: 6.000000000003638,
MAX: 12,
AVERAGE: 7.900000000001091,
MEDIAN: 7.000000000003638

Firefox:

CellX:

MIN: 20,
MAX: 26,
AVERAGE: 23.2,
MEDIAN: 23

Mine:

MIN: 7,
MAX: 17,
AVERAGE: 13.6,
MEDIAN: 14.5

fabiospampinato commented 2 years ago

Even with no batching yours is faster? That's pretty awesome! Maybe I'm really missing some optimization in my implementation 🤔

danielweck commented 2 years ago

More likely, your code triggers de-opts in JS engines.

In my experience, the most obvious kinds of perf-killing JS are holey arrays (writing to an index beyond length - 1, or Array(N) pre-allocation) and non-monomorphic objects (shapes with varying property order and value types) ... both of which cause costly opt/de-opt cycles.
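Standalone illustrations of those two patterns (generic examples, not taken from any of the libraries discussed):

```ts
// Holey array: writing past length - 1 (or pre-allocating with `new Array(n)`)
// creates holes, pushing the engine onto slower, hole-checking element access.
const holey: number[] = [];
holey[9] = 1; // indices 0..8 are now holes

const packed: number[] = [];
for ( let i = 0; i < 10; i++ ) packed.push ( i ); // stays densely packed

// Non-monomorphic objects: the same call site sees objects with different shapes
// (property order / value types), so its inline cache goes polymorphic/megamorphic.
const nodeA = { value: 0, observers: [] as unknown[] };
const nodeB = { observers: [] as unknown[], value: 'x' }; // different order + value type

const readValue = ( node: { value: unknown } ): unknown => node.value;
readValue ( nodeA );
readValue ( nodeB ); // second shape seen at the same call site

console.log ( holey.length, packed.length ); // 10 10
```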

On top of that, there are many micro-optimisations, such as: "for of" iteration vs. old-school index loops (reverse vs. increasing index makes no difference anymore in modern JS engines), resetting arrays to undefined/null vs. an empty [], Object.assign() vs. spread syntax, array push vs. concat allocations depending on the anticipated total capacity, usage of slice() array views followed by unforeseen/unpredicted allocation due to changes in the underlying array, Object.freeze/seal or Object.create(null) which seem useful on paper but in practice totally kill perf, etc. :)

PS: my original sync implementation was even faster (consistent 4-6ms runs), but I really wanted the consistency / predictability of update propagation, so I accepted the trade-off. In other words, I had managed to optimise extremely for CellX's particular benchmark, but ended up breaking other important creation+propagation scenarios.

fabiospampinato commented 2 years ago

I think maybe having the code organized into classes and split into many functions may be the main issue there. Though I can't really write it differently without tradeoffs, so I think I'm happy with that.

danielweck commented 2 years ago

Solid, Kairo, Cellx, etc. all use object-oriented programming style, yet they achieve good/excellent performance. So I think the performance optimisation lies elsewhere in your code. Besides the aforementioned potential de-opts, maybe your algorithm (or perhaps just its implementation) causes too many unnecessary array mutations and/or graph traversals? As you know, the "difficult" part is to dynamically manage the link relationships between nodes in the reactive graph (child dependencies and parent reactions of atomic observable units + computed derivations) in a memory- and CPU-efficient manner. My algorithm is really quite simple (hence the small number of total lines of code for the core routines), but its implementation is a little tricky due to micro-optimisations like array re-use, index tracking, etc. (which make the code harder to read too! :)

danielweck commented 2 years ago

Going back to the "Sierpinski triangle" (this issue) 😅 ...

This really is a useful stress test ... when implemented correctly!

Normally with these kinds of "graphical" benchmarks (like your SVG clock 😃 ), I observe 200 FPS (5ms time budget per rendering frame) when the frame rate is not artificially capped by requestAnimationFrame() (i.e. 60 FPS / 16ms per frame, on a laptop display).

That's with uHTML doing a full top-down render pass at every frame, which (if I understand correctly) involves a cache of DOM "expressions" when processing tagged template literals, and a highly-optimised diff implementation.

The "triangles" demo adds an additional layer of stress-testing over mere DOM rendering, useful for asserting CPU and memory efficiency in a reactivity lib. I observe between 130-150 FPS under stress, instead of the usual 200 FPS. Note that visual stuttering can be caused by rendering frames not aligning perfectly with the optimal requestAnimationFrame() ... but in my tests on Firefox, Chrome and Safari, dropped frames were usually caused by inefficient update propagation in the reactive graph. This extreme use-case therefore helped me fine-tune my implementation.

In a nutshell, there are 3 sources of stress in this demo:

1) There is a single top-level observable number that records the current timestamp at each frame. This is used to calculate the effective rendering frame rate in real-time (it also causes a render refresh to cycle the overall 2D scaling, which is a DOM mutation, not just a CSS property/var update). This primitive reactive unit fires at a high rate, but it is located at the root of the dependency tree. In this "triangles" demo there is in fact only one computed / derived observable (i.e. the "root" itself), so note that this is a very limited test-case for a reactivity / DAG graph algorithm. In the SolidJS benchmark suite, this shallow tree is equivalent to instantiating many non-nested / flat children dependencies.

2) There are 729 "blobs" inside the recursive triangle structure. Each blob has its own observable primitive unit for mouse-over (boolean on/off), but once again the layout is very shallow. The key aspect here is that the user can introduce additional stress by hovering the mouse rapidly over blobs, thereby triggering DOM mutations in quick succession ("layered" over the fast refresh caused by (1), so depending on how renders are scheduled by the underlying DOM/VDOM engine, this might cause visible stuttering).

3) There is a single observable which "ticks" every second to update the number rendered inside each "blob" (also a DOM mutation). This tick event also introduces artificial CPU load by running a no-op loop.

It is the cumulative effect of all 3 points that may cause visible slow-down / stuttering with some frontend "implementations" (by which I mean the combination of a rendering lib / scheduling technique + "state management" code).
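As a hedged summary of that topology in code (a deliberately tiny stand-in "observable" closure is used below; none of these names come from the actual demo):

```ts
const observable = <T> ( initial: T ) => {
  let value = initial;
  return ( next?: T ): T => ( next === undefined ? value : ( value = next ) );
};

// (1) One root observable written on every animation frame (drives the FPS counter
//     and the cyclic 2D scaling of the whole structure).
const frameTimestamp = observable ( performance.now () );

// (2) One shallow hover flag per blob: 729 independent observables toggled by mouse-over.
const hoverFlags = Array.from ({ length: 729 }, () => observable ( false ));

// (3) One observable ticking every second, whose update also runs the artificial
//     busy-wait discussed earlier in this thread.
const seconds = observable ( 0 );

console.log ( frameTimestamp (), hoverFlags.length, seconds () );
```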

Codesandbox demo link: https://cnmehi.csb.app

Codesandbox IDE link: https://codesandbox.io/s/uhtml-obs-demo-triangle-forked-cnmehi?file=/public/obs.js

fabiospampinato commented 2 years ago

> Solid, Kairo, Cellx, etc. all use object-oriented programming style, yet they achieve good/excellent performance.

Solid is more on the functional spectrum than Voby is, in my opinion. Things are organized into classes in Voby, but I don't see a single class in Solid. Methods written like that may be harder to optimize; I'm not sure they can be inlined (or inlined as much).

> So I think the performance optimisation lies elsewhere in your code.

I think using sets for deduplicating is partly responsible for this; Solid just doesn't deduplicate registrations. Other than that, though, I really have no clue; there's basically nothing that I haven't tried optimizing already. Maybe the codebase needs a fresh pair of eyes to look at it.

> maybe your algorithm (or perhaps just its implementation) causes too many unnecessary array mutations and/or graph traversals?

In cellx's benchmark those seem optimal 🤔 Well, there's only one thing I spotted that isn't optimized: if a cell in the benchmark depends on a and b, and a = 3 and b = 5, then given the particular implementation of the cell it shouldn't re-render if suddenly a = 5 and b = 3, because the result will be the same. But that's basically impossible for the library to detect, and calling the computed should be very cheap anyway; it won't cause its children to be refreshed if the output value is the same.
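A standalone sketch of that "same output, no downstream refresh" point (a hand-rolled equality check, not the library's actual implementation):

```ts
let a = 3;
let b = 5;
let lastSum = a + b; // 8
let childRefreshes = 0;

const recompute = (): void => {
  const next = a + b;        // the computed re-executes because its inputs changed...
  if ( next !== lastSum ) {  // ...but the value-equality check stops propagation
    lastSum = next;
    childRefreshes++;        // downstream observers would only re-run here
  }
};

a = 5;
b = 3;
recompute ();

console.log ( lastSum, childRefreshes ); // 8, 0 — output unchanged, children not refreshed
```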

I think the "issue" is largely using sets, deduplicating, having the code split into classes and calling many functions for anything.

> its implementation is a little tricky due to micro-optimisations like array re-use, index tracking, etc. (which make the code harder to read too! :)

I don't think I'm doing either of those. Are you keeping track of indexes like Solid is doing, and also not doing any deduplication on the registration of signals with their parents? I'm not sure where I could reuse arrays 🤔

danielweck commented 2 years ago

Hello, in my implementation there are no duplicate registrations (parent reactions and/or child dependencies). I use array.includes() instead of array.indexOf(), but performance is roughly identical. I don't use Set, notably because iterators are expensive compared with old-school looping over array indices. The array of child dependencies is re-used in order to avoid unnecessary costly allocations during the execution of a computed derivation (in other words, when the signals read inside a computation are the same as in the previously-executed derivation, the array remains identical, so there is no need to reset+reallocate eagerly; might as well do it lazily, only if needed).
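A hedged, much-simplified sketch of that dependency-array re-use (illustrative only, not the actual implementation being discussed):

```ts
type Dependency = object;

let deps: Dependency[] = []; // survives across re-executions of the computed
let cursor = 0;              // slots confirmed so far during the current run

const beginRun = (): void => { cursor = 0; };

const onRead = ( dep: Dependency ): void => {
  if ( deps[cursor] === dep ) {
    cursor++;                // same dependency at the same position: keep the slot
  } else {
    deps.length = cursor;    // reads diverged: truncate lazily, only from here on
    deps.push ( dep );
    cursor++;
  }
};

const endRun = (): void => { deps.length = cursor; }; // drop stale trailing entries

// If a re-run reads exactly the same dependencies in the same order, the array is
// never truncated or reallocated.
const sigA = {};
const sigB = {};
beginRun (); onRead ( sigA ); onRead ( sigB ); endRun ();
beginRun (); onRead ( sigA ); onRead ( sigB ); endRun ();
console.log ( deps.length ); // 2
```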

fabiospampinato commented 2 years ago

> Hello, in my implementation there are no duplicate registrations (parent reactions and/or child dependencies). I use array.includes() instead of array.indexOf(), but performance is roughly identical.

Calling indexOf/includes on an array of arbitrary length can backfire performance-wise though 🤔 With a set, this check happens in constant time. Maybe in real-world scenarios indexOf/includes is not usually a problem, but I'm not sure how I feel about it in general.
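As a standalone illustration of the two lookup strategies (a generic micro-benchmark, not Oby's actual code; absolute numbers will vary by engine):

```ts
const N = 10_000;
const asArray: number[] = [];
const asSet = new Set<number> ();

for ( let i = 0; i < N; i++ ) { asArray.push ( i ); asSet.add ( i ); }

// Array membership check: an O(n) scan per call, which can backfire as the
// dependency list grows...
console.time ( 'array.includes' );
for ( let i = 0; i < N; i++ ) asArray.includes ( i );
console.timeEnd ( 'array.includes' );

// ...whereas a Set lookup is (amortised) constant time per call.
console.time ( 'set.has' );
for ( let i = 0; i < N; i++ ) asSet.has ( i );
console.timeEnd ( 'set.has' );
```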

> The array of child dependencies is re-used in order to avoid unnecessary costly allocations during the execution of a computed derivation (in other words, when the signals read inside a computation are the same as in the previously-executed derivation, the array remains identical, so there is no need to reset+reallocate eagerly; might as well do it lazily, only if needed).

That's interesting. I'm doing that too, but only if the observer had only one dependency prior to being re-executed; maybe I should just do it always. I'm trying to avoid calling indexOf/includes as much as possible though, as those can get out of hand in edge cases.

fabiospampinato commented 2 years ago

@danielweck Do you have a link to your project? I'd like to take a look at it, but I can't find it in your GitHub profile.

danielweck commented 2 years ago

My implementation is a single TypeScript file (design + code in constant flux):

https://github.com/danielweck/preact-wmr-twind-zero/blob/main/packages/preact-things/src/observant/core/index.ts

... which is located inside my GitHub "kitchen sink" (side project, frontend performance experimentations).

Pre-compiled/transpiled Javascript is available in my Codesandbox:

Sierpinski triangle:

https://codesandbox.io/s/uhtml-obs-demo-triangle-forked-cnmehi?file=/public/obs.js

SVG clock ;)

https://codesandbox.io/s/uhtml-obs-demo-clock-forked-41if2u?file=/obs.js

CellX benchmark (Oby fork):

https://codesandbox.io/s/oby-bench-cellx-forked-1jyvch?file=/observant.js

... That being said, Codesandbox is convenient, but performance metrics are somewhat impacted by the IDE integration. For more reliable / consistent measurements, and for my own regression tests, I prefer to use this simple webpage (it shows the CellX bench median / mean average, and runs the Solid bench as well):

https://danielweck.github.io/preact-wmr-twind-zero/perf-online.html

Finally, I regularly execute the NodeJS bench for regression testing:

https://github.com/danielweck/preact-wmr-twind-zero/blob/f01b4bd3b84f9aac8629eb7aa1addfd1ecbcc753/packages/preact-things/package.json#L109-L112

https://github.com/danielweck/preact-wmr-twind-zero/blob/main/packages/preact-things/perf.cjs

(this requires a local build, so you'll need to run: pnpm i && pnpm lint && pnpm test && cd packages/preact-things && pnpm bundle && pnpm perf :)

fabiospampinato commented 2 years ago

Thanks for the pointers! I switched to always diffing dependencies rather than only when the previous dependencies contained one item, and it seems ~5% faster in cellx's benchmark, though I had already traded that performance for cleaner code elsewhere. The main bottleneck for that benchmark seems to be just propagating stale/ready events, possibly largely due to my use of Sets.

fabiospampinato commented 2 years ago

I might have missed an optimization: maybe if an observable is being listened to by only a single observer I don't need to send stale/ready messages across the tree? It seems like that should be safe 🤔