Why SuperTuxKart struggles with 10k scene nodes in a scene

MatthewsSam commented 5 years ago

Today I had the amazing opportunity to spend 1h with https://github.com/bfierz a professional 3D engine designer.

Preface

Together we had an overview of the rendering loop and we focused on why SuperTuxKart is so slow when it has almost nothing to render but a lot of invisible objects in a scene manager. We came up with a couple of solutions which could help alleviate most of the problems we have currently. It also became clear to me why the engine struggles with a lot of nodes and why all of our previous attempts failed.

Why it is important

Before I start going in depth, I would like to answer to why it matters what happens with invisible objects and why we need to fix it.

SuperTuxKart has no scene with 10000 objects. So why do we even bother?
- The reason why there is no scene with 10k objects is that we always knew that such a scene would slow the engine down (because of the enormous number of objects). So artists have been merging the scenes partially in Blender, which isn't optimal from the point of view of track creation.
- Things like grass/forest could be very easily done if we had an overall faster loop.
We are talking about invisible objects that aren't even rendered anyways, so why should we care?
- Yes, we should care. Having a scene with many invisible objects slows the engine down, which highlights a major issue with our scene manager. Therefore, any attempt to make the scene manager faster (by occlusion culling, etc) is pointless, because even hidden objects are slow.
- This opens up new possibilities of optimization, such as occlusion culling.
- It would also speed the game on lower end computers. We should take into consideration that not everyone has a GTX 970 / Core I9. Therefore, such optimization would help lower end machines.

Proposals

In the benchmark with 10k hidden objects, the majority of the time is spent in: DrawCalls::parseSceneManager. The proposal listed below won't have alone massive improvements. However, they will prepare the field to a much better / faster scene manager/render loop.

0 - To use a lightweight SceneNode to reduce its size in memory.

This is the core problem: The loop is slow because when the engine renders a scene, it goes through the list of every node, and each node is a big object containing multiple pointers to other things (like position, scale, ID, rotation, transformation, parentnode, children, animator, absolute transformation, some debug data, etc).

The situation is even worst when the engine deals with custom things or a specific setup, like the camera node (which adds more things as well). So, whenever we try to go through 10k nodes, the scene manager has to load too many things, which, in return, causes two destructive behaviors:

it saturates the bandwidth between the RAM and the CPU, and
it destroys the cache (because the amount of data is too big).

We should have a very lightweight node object containing only the minimal information to go through this loop. Then, the additional data can be packed in a structure. Such data can be queried only on a per need basis. Therefore, whenever we go through for rendering, we can have all the rendering things in one component. Whenever we update the animation, we can have the animated data related in another component, etc. This could allow to efficiently go through the list without loading useless things, similar to an index in a database.

1 - To batch common operations together in sequence.

Before:

for node in nodes:
 node.computevisibility()
 node.computecollision()

After:

for lightweightnode in lightweightnodes:
 lightweightnode.get_component_render().computevisibility()

for lightweightnode in lightweightnodes:
 lightweightnode.get_component_physic().computecollision()

While it might seem counter intuitive, this could avoid to load big objects into the memory with unused data, thus, saturating the bandwidth.

2 - To use a vector array instead of a chained list.

Currently, we store the scene nodes in a chained list. They are generally slower to get through. For better efficiency, we would need to switch it to an array. There is no reason why we should keep the chained list.

3 - To implement a very basic space division system.

Using the new very lightweight structure, we can then divide the terrain into 8x8 chunks (or more depending on your scenario). Each chunk should contain a list of children lightnode only, and whether the chunk itself is visible or not.

Next, we can put all the non-moving objects into those chunks and compute the visibility for each chunk. Then you just have to do the same for lightweight nodes who are in the respective chunks.

This way, the heavy computation is only applied on objects which are visible, and you don't have to go through the 10k objects for each frame.

4 - To avoid doing dynamic casting.

In the render loop, there are several dynamic cast, eg: (STKParticle node = dynamic_cast<STKParticle>(*I)). Those should be avoided at all cost, especially in something as intensive as the render loop. If they are unavoidable, try putting them in the heavy computation part and keeping the lightweight node as basic as possible.

Conclusion

While it's true that the SceneManager architecture is a bit outdated, these pieces of advice are probably the lowest hanging fruits regarding the scene manager. The potential in performance improvement is quite huge especially with item 0 and 3.

However, it's important to understand that those are things that, taken alone, might not make any difference, but still need to be done and kept, since they are steps forward towards the right direction.

I would also like to say, in the name of the whole team, a big thank you to my coworker, who took the time to review our scene manager.

Sam

Smartninja365 commented 2 years ago

Thank you for considering people on older machines such as myself :)

Benau commented 1 year ago

Should be a lot faster with vulkan engine

qwertychouskie commented 1 year ago

LOD test track went from ~68 FPS to ~116 FPS with Vulkan on my system, much improved for sure

qwertychouskie commented 1 year ago

Also did a quick comparison with STK 1.3:

1.3 Flatpak (OpenGL): ~53 FPS
Git (OpenGL): ~85 FPS
Git (Vulkan): ~116 FPS

Alayan-stk-2 commented 2 months ago

https://github.com/supertuxkart/stk-code/issues/3780#issuecomment-2085246279

Testing the Lodtest track while making everything invisible by returning automatically "-1" (don't display) when checking the LoD level.

Compiled in STKRelease mode, using OpenGL.

1.0 : ~100FPS Master with the invisible objects : ~230FPS Master with the objects removed through the "very low" setting : ~750FPS

While there is a very clear improvement, the perf cost of invisible objects is still way too high.

supertuxkart / stk-code