noctjs / ecs-benchmark

ECS benchmark comparison

Rethink what to benchmark #18

Open · ooflorent opened 3 years ago

ooflorent commented 3 years ago

Currently the suite is far from perfect because it does not highlight the strengths and weaknesses of each library. I tried to list what could affect performance:

In the end, what should be measured? It is not realistic to benchmark 50 components, 50 systems, 5K entities, 50% of entities mutated, and 200 creations/removals per tick. I'm not even discussing the data types (bitecs cannot store closures or external objects within components) or what happens inside the systems. For example, perform-ecs does very few object accesses (~1 per entity) and would be considerably faster than an implementation doing 2 accesses per entity; however, what happens inside the systems could be significantly slower than the query itself.

I would like to gather feedback from various ECS authors to tailor a fair benchmark. The goal is not to compete but to help users pick the right implementation for their needs.

cc @NateTheGreatt (bitecs) @ddmills (geotic) @stasm (goodluck) @fireveined (perform-ecs) @ayebear (picoes)

NateTheGreatt commented 3 years ago

Some background/insight: I built bitecs specifically to be used inside of a multiplayer and multithreaded game engine, although the performance it gives also enables single-player games with massive entity counts / iterations. Iteration performance is the absolute focus. Number of components, number of queries, and entity distribution should ideally have minimal effects on the overall performance of bitecs, though I will investigate why performance is hindered in a few of the newly added scenarios.

Pros:

Cons:

The essential trade-off when choosing whether to use bitecs comes down to the fact that all game state is stored in TypedArrays, which boosts iteration speed and enables multithreading but changes the traditional semantics of entity/component data access within a system. This also makes it a little more difficult to integrate third-party libraries into system logic. Personally, I use very few third-party libs in my system logic, so this is not as big of an issue for me.
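To make the trade-off concrete, here is a minimal sketch contrasting structure-of-arrays (SoA) storage in TypedArrays with traditional per-entity objects. This illustrates the storage pattern only; it is not the actual bitecs API:

```ts
// Sketch of SoA storage in TypedArrays (the pattern bitecs uses),
// versus traditional array-of-structures (AoS) object components.
// Illustrative only; not the bitecs API.

const MAX_ENTITIES = 100_000;

// SoA: each field is a flat TypedArray indexed by entity id.
// Iteration is cache-friendly, and the underlying buffers can be
// transferred to or shared with worker threads.
const Position = {
  x: new Float32Array(MAX_ENTITIES),
  y: new Float32Array(MAX_ENTITIES),
};

function moveSystemSoA(entities: Uint32Array, dx: number, dy: number): void {
  for (let i = 0; i < entities.length; i++) {
    const eid = entities[i];
    Position.x[eid] += dx;
    Position.y[eid] += dy;
  }
}

// AoS: one object per entity. Reads naturally and mixes easily with
// third-party libraries, but costs pointer-chasing and GC pressure,
// and plain objects cannot be shared across threads.
interface PositionComponent { x: number; y: number; }
const positions: PositionComponent[] = [];

function moveSystemAoS(dx: number, dy: number): void {
  for (const p of positions) {
    p.x += dx;
    p.y += dy;
  }
}
```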

If you want to build a multiplayer game, or a game with high entity counts / iterations, I would encourage using bitecs over other ECSes due to the performance gains (we achieved 1k+ CCU at 30 ticks per second with our custom network engine built on this ECS, single-threaded). If you don't need the performance and are looking for cleaner syntax, easier integrations, and more traditional data access, then another ECS may be the better choice.

As far as what sorts of benchmarks would be valuable: how about a simple game as a benchmark? We could create something akin to TodoMVC, the famous to-do app that everyone rebuilds on new tech stacks, and then bench different aspects of this hypothetical game.

ooflorent commented 3 years ago

> How about a simple game as a benchmark?

I completely agree with this, but the task is titanic! I think someone should propose a PR with a reference implementation so we can discuss it concretely. Then we would try to port it to every other library.

stasm commented 3 years ago

In my experience, it's hard to agree on what a "simple" game should look like. If it's too simple, we risk not testing a real-life scenario; if it's too complex, implementing the benchmark for every library will be a lot of work and will also involve a lot of code not directly related to the benchmark (e.g. rendering and efficient collision detection for large entity counts).

Games can also vary in performance depending on the user's input. Ideally, the benchmark would be input-agnostic and only consist of deterministic AI-driven logic. My first thought is to try to implement a tower-defense-like scene where both the attackers and the defenders are controlled by the AI. I'd also advocate for making it grid-based to avoid costly collision detection at larger entity counts, and to skip rendering (but maybe emulate it to some extent, e.g. by calculating light positions).

Some libraries are designed to perform well when used to create simple games (goodluck is one of them) while others can efficiently scale to larger projects, at the cost of being slower in the simplest scenarios. There could be two workloads in the benchmark (small and large), or perhaps a single workload which would attempt to represent a typical scene in a medium-sized game fairly.

I can try to prepare a "spec" of the typical load of a scene from one of the games I created in goodluck. It would include things like:

stasm commented 3 years ago

As for the benchmark, I'd also like to include at least one system which depends on the order of iteration. In my recent goodluck games, there's sys_transform, which updates entities by traversing the tree of parent-child relationships rather than by iterating over the entities in their order of insertion into the world.
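To make that concrete, here is a minimal sketch (not goodluck's actual sys_transform) of a system whose correctness depends on visiting parents before their children:

```ts
// Hypothetical hierarchy-ordered transform system: each entity's world
// position depends on its parent's already-updated world position, so
// insertion-order iteration would produce stale results.

interface Transform {
  parent: number | null; // parent entity id, or null for roots
  localX: number;
  localY: number;
  worldX: number;
  worldY: number;
}

const transforms = new Map<number, Transform>();

function updateTransform(eid: number): void {
  const t = transforms.get(eid)!;
  if (t.parent !== null) {
    const p = transforms.get(t.parent)!;
    t.worldX = p.worldX + t.localX;
    t.worldY = p.worldY + t.localY;
  } else {
    t.worldX = t.localX;
    t.worldY = t.localY;
  }
}

// Depth-first traversal from the roots guarantees parents are visited
// before their children.
function sysTransform(children: Map<number, number[]>, roots: number[]): void {
  const stack = [...roots];
  while (stack.length > 0) {
    const eid = stack.pop()!;
    updateTransform(eid);
    for (const child of children.get(eid) ?? []) stack.push(child);
  }
}
```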

ooflorent commented 3 years ago

> Games can also vary in performance depending on the user's input. Ideally, the benchmark would be input-agnostic and only consist of deterministic AI-driven logic.

That's the reason why a game isn't a fully viable solution; something closer to a simulation would be better. A potential issue is the number of game ticks: at 60 ticks per second, 3600 ticks represent a minute of playtime, and the fastest implementations would complete the whole simulation in ~10ms, which would bias the results (since the last iterations would update almost nothing). Including the setup inside the benchmark would also be unfair, since many implementations are slow to initialize. An alternative would be to reset the simulation after a fixed number of ticks.
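As a rough illustration of that shape, where setup happens once outside the timed region and the world is reset every fixed number of ticks (setupWorld, tick, and resetWorld are placeholder names, not any library's API):

```ts
// Hypothetical harness shape: initialization cost is excluded from the
// timed region, and the simulation is reset periodically so fast
// libraries keep updating a live scene instead of a dead one.

type World = { tickCount: number };
function setupWorld(): World { return { tickCount: 0 }; }          // placeholder
function tick(world: World): void { world.tickCount++; }           // placeholder
function resetWorld(world: World): void { world.tickCount = 0; }   // placeholder

const TICKS_PER_RUN = 600; // reset well before the scene goes quiet
const RUNS = 10;

const world = setupWorld(); // setup is not measured

const start = performance.now();
for (let run = 0; run < RUNS; run++) {
  for (let i = 0; i < TICKS_PER_RUN; i++) tick(world);
  resetWorld(world); // reset cost could also be reported separately
}
const elapsed = performance.now() - start;
console.log(`${(elapsed / (RUNS * TICKS_PER_RUN)).toFixed(4)} ms/tick`);
```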

> updates entities by traversing the tree of parent-child relationships, rather than by iterating over the entities in their order of insertion into the world.

That would be interesting but I'm not sure it is relevant. I don't expect any implementation to deliver entities in order. So the benchmark would be something like this:

  1. Collect entities
  2. Build the hierarchy
  3. Traverse the tree

Only ① would be implementation-dependent; ② and ③ would be the same every time.
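A sketch of that split, where only step ① would be swapped out per library (queryEntitiesWithParent is a hypothetical placeholder, not an existing API):

```ts
// ① Collect entities: the only implementation-dependent step; each
// library's query/view mechanism would be plugged in here.
interface Row { eid: number; parent: number | null; }
function queryEntitiesWithParent(): Row[] {
  return []; // stub, replaced per library under test
}

// ② Build the hierarchy: identical shared code for every library.
function buildHierarchy(rows: Row[]): Map<number | null, number[]> {
  const children = new Map<number | null, number[]>();
  for (const { eid, parent } of rows) {
    const siblings = children.get(parent) ?? [];
    siblings.push(eid);
    children.set(parent, siblings);
  }
  return children;
}

// ③ Traverse the tree: also identical shared code.
function traverse(children: Map<number | null, number[]>, visit: (eid: number) => void): void {
  const stack = [...(children.get(null) ?? [])]; // roots have parent null
  while (stack.length > 0) {
    const eid = stack.pop()!;
    visit(eid);
    for (const child of children.get(eid) ?? []) stack.push(child);
  }
}
```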

ayebear commented 3 years ago

TLDR: Plot graphs from simulations

Whether it's a game, a simulation, or a synthetic benchmark, the knobs end up being the same: the number of entities, components, and systems. You also don't want to spend cycles outside of running ECS code, because then you aren't benchmarking the library's speed, and results will be skewed by how much time is spent in the simulation itself. That said, I think adding real-world tests would be great, just not as a replacement for the benchmarks.

I think some genuinely useful data would come from benchmarking many different combinations of entity, component, and system counts. You could then plot graphs of this data and show the asymptotic behavior of different operations across the frameworks, so people could choose the most suitable one for their project.
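Something like this hypothetical sweep harness could produce the data to plot; runWorkload is a placeholder for a per-library adapter:

```ts
// Hypothetical sweep: time a fixed workload across a grid of entity,
// component, and system counts, emitting rows suitable for plotting
// (e.g. entities on the x-axis, one series per component/system combo).

function runWorkload(entities: number, components: number, systems: number): void {
  // stub: set up a world with the given shape and run N ticks
}

const entityCounts = [10, 100, 1_000, 10_000];
const results: { entities: number; components: number; systems: number; ms: number }[] = [];

for (const entities of entityCounts) {
  for (const components of [1, 4, 16]) {
    for (const systems of [1, 4, 16]) {
      const start = performance.now();
      runWorkload(entities, components, systems);
      results.push({ entities, components, systems, ms: performance.now() - start });
    }
  }
}

// Emit CSV for plotting.
console.log("entities,components,systems,ms");
for (const r of results) {
  console.log(`${r.entities},${r.components},${r.systems},${r.ms}`);
}
```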

stasm commented 3 years ago

A month ago I created a toy simulation called PewPew. The code is at https://github.com/stasm/pewpew. You can see it in action at https://stasm.github.io/pewpew/.

[PewPew screenshot]

My goal was to create a scene with ca. 15 components and 15 systems, which would imitate actual game logic and which can run both in the browser and in Node.js. PewPew is based on Goodluck and clocks in at around 1,000 LOC.

I'm not 100% happy with the result and I'm posting this comment to gather feedback on how to design the benchmark better.

  1. PewPew isn't currently deterministic, as it uses Math.random() in a few places. This should be easy to fix; there's already a custom RNG built into it (see the seeded PRNG sketch after this list).
  2. PewPew isn't stable over time: the entity count starts small and grows with time. This means that FPS will be higher at the beginning of the benchmark run than at the end, and that a simple average of observed FPS might not be a good measure of performance.
  3. In order to imitate some game logic, I implemented a collision detection system. I'm not sure this is the right approach; it complicates the code and takes up to 20% of the frame time (drawing excluded).
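For point 1, a small public-domain seeded PRNG such as mulberry32 could stand in for Math.random() and make every run reproducible across libraries:

```ts
// mulberry32: a tiny public-domain seeded PRNG. The same seed yields
// the same sequence, so entity behavior is identical in every run and
// in every library's implementation of the benchmark.
function mulberry32(seed: number): () => number {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) | 0;
    let t = Math.imul(a ^ (a >>> 15), 1 | a);
    t = (t + Math.imul(t ^ (t >>> 7), 61 | t)) ^ t;
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

const rand = mulberry32(42);
console.log(rand(), rand()); // same two numbers on every run
```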

Perhaps a better approach would be to:

  1. Start with a given count of entities and only create new ones when old ones are destroyed.
  2. Remove interactions between entities (via collisions) and instead have them react to:
    • current position on the map,
    • timer events,
    • RNG events.

One of the challenges here is to design a benchmark with more than just a few systems (because we want to imitate a real game) while at the same time keeping it simple (because we don't want to benchmark non-ECS code). How can we reconcile these two constraints?