godotengine / godot

Godot Engine – Multi-platform 2D and 3D game engine
https://godotengine.org
MIT License
91.47k stars 21.26k forks source link

Rocket Bot Royale - Post-mortem discussion and engine diffs. #62291

Open jordo opened 2 years ago

jordo commented 2 years ago

Godot version

3.4

System information

HTML5, Windows, Linux, MacOS, iOS, Android, Linux Server

Issue description

Post mortem for Rocket Bot Royale. A cross-platform, multiplayer, server-authoritative, battle-royale game. https://rocketbotroyale.winterpixel.io

Steps to reproduce

N/A - see first comment.

Minimal reproduction project

N/A - see first comment.

jordo commented 2 years ago

I've put up our core engine diff (excluding some custom networking modules) here: https://github.com/winterpixelgames/godot/commit/55344ea7a7b6784a574b9f375e236f07c40ba767 The last upstream merge we've brought into our fork was in 3.4.1.rc1, which was https://github.com/godotengine/godot/commit/489b49c6b89f68b9f3cc3f8a9579c6ed1e2fdf82 (for context).

The format of this issue isn't for anything specific per se. It's intention is more to provide the community & godot project team a port-mortem analysis for this project. As well, hopefully to describe & discuss the engine limitations we've run into during development and how and where we've had to modify the engine to address any issues that we've run into. Perhaps someone as well may find any of the changes in the diffs below useful.

General Impressions & Project Scope:

In short, we've generally enjoyed using the engine for this project. I would consider the 'lean-ness' of the engine a competitive advantage in the market right now with alternative engines being quite bulky and bloaty. Even though godot is relatively lean, we didn't really find a core 'feature' that was missing from the engine that would have been considered a show-stopper for our project. Generally speaking, anything that was missing was just implemented or worked around. The larger items were probably the networking & physics systems. We found out relatively quickly that the architecture in godot regarding those two components would not work for the client-predicted/rollback network model we wished to achieve. We worked around these limitations by using a Box2D native module for physics, and building our own networked native module. But for the most part, hurdles working with the engine were mainly related to a few areas: Engine ergonomics & architecture, performance, robustness, and interoperability with existing tooling. Particularly wrt debugging, instrumentation, and profiling.

Diff Notes:

Collection comparator of ==operator by value Godot exposed (not sure if this is the case anymore) ==operator. We pulled in a collection compare by value PR, which is standard in all other c++ template libraries. Godot's Dictionary and Array evaluated to not equal which was very odd. Changes related to Array and Dictionary equality operator have since been merged in by https://github.com/godotengine/godot/pull/35816

Memory allocation instrumentation & monitor We found memory operation overhead to be one of the main issues with the engine's performance. We added an allocation monitor to track number of allocations per frame, and time spent allocating and freeing memory per frame (time spent computation only compiled in debug builds as the clock sampling itself it is a performance hit).

Screen Shot 2022-06-21 at 4 24 14 PM

Added different keyboard types on iOS | Keyboard types Additional support for different keyboard types on iOS.

Bitarray packing encoded types and compression levels These piggybacked on the work by @AndreaCatania's scene replicator PR, in which we efficiently encode different types with 5 different bit-level compression settings. From 1-bit encoded types up to 64-bit. We added these as gdscript/property annotations to easily express how many bits to use for value encoding or properties in our network module.

DNS resolver runs out of queries There's a bug in the DNS resolver and it currently runs out of queries and can get hung. We're just working around this right now by bumping this number and punting the issue further down the session.

The godot Socket API for multiple messages does superfluous memcpy's Both POSIX and win32 have an API for sending multiple buffers at one. sendmsg() version 3, which takes a pointer to buffers and a len. Godot currently doesn't use this, and instead for each outgoing message does an additional memcopy for each message into a unified buffer before calling sendmsg. We did submit a PR for this optimization but it wasn't accepted. I believe due to DTLS peer issues. In a battle royale, our gameservers send a large amount of messages and this was a nice optimization.

Potential infinite loop/deadlock fix We experienced a deadlock here, this one was quite tough to find and debug.

Avoid deadlocks in HTML5 threaded exports HTML5 threaded exports are susceptible to emscripten deadlocks when not using PROXY_TO_PTHREAD (godot). Locking on the main browser thread, and delivering inputs or other events can cause a emscripten deadlock (red warning when calling libc block funtion on main thread). Without PROXY_TO_PTHREAD, to implement locking without emscripten deadlocks, we chose to spin instead of calling wait.

Thread changes to eliminate data race condition These changes in thread intialization eliminates a data race condition that was found using TSAN.

Exposing thread current_id to workaround overused singletons in the codebase This can be explained later on with a dummy visual server implementation, and ability to bypass a singleton imple by thread id.

Disables screen texture mipmap generation This was needed to shave 1.5ms off of rendering on iOS devices, when the effects were were using only needed to sample the screen texture, not the generate the whole screen mipmap chain. This disabled that generation and saved 10-15% render time on iOS mobile test devices.

Thread safe script language server Made this thread safe.

Stringname has a double-free vuln Unref check is not atomic, leads to a double free.

Debugging gdscript errors in release is HARD, and there is not enough information We added a variety of checks in the gdscript runtime to print out more helpful runtime errors in gdscript in release builds. The macro WINTERPIXEL_ERR_CHECKS which adds in some minor checks, with the advantage that gdscript errors are much more meaningful and helpful in production release builds. Most of the additional checks are in https://github.com/winterpixelgames/godot/commit/55344ea7a7b6784a574b9f375e236f07c40ba767?diff=unified#diff-52701eeaae348d23ecd9fa39375996c62dabe96e805b8fc64f1172a8b7015a07R37-R1550

Textures left bound break WebGL in weird ways We fixed a number of errors with resources still being bound and WebGL complaining about things and throwing errors. Particularly, there were issues with particles. This change unbinds these resources and eliminates some of those issues on that platform.

Enet changes for proper unreliable transport This eventually was stripped out from another PR and merged in https://github.com/godotengine/godot/pull/53129

Modified gdscript parser to annotate network synced properties in gdscript More of a custom use for our particular use case of gdscript.

Decode once audio streams This ended up being a REALLY important optimization. To keep binary size down, we're using ogg/vorbis compression on all sfx and music. During performance testing (especially on web), we saw a ton of time spent decoding audio files. Not only that, but scenes that were instantiated with audio resources, (and not ever used), paid a huge performance penalty because a bunch of memory buffer's are allocated on instantiation if an audio stream player is instantiated with any decoded audio resource. Say something like this:

Screen Shot 2022-06-21 at 5 19 33 PM

When profiling the instantiation of the scene above, almost all of the cpu instantiation time is spent in the ogg decoder of each of those streams, reading and setting up some initial decode buffers for those streams (even if they are never played). This ended up being a really big hiccup. Not only that, but we were paying the cpu decode cost of decoding these formats everytime they were played, which when we profiled was really significant.

To address this, we implemented a 'decode once' audio resource import parameter, in which an encoded audio resource is only ever decoded once to PCM. Subsequent uses of the audio resource then stream the simple decoded PCM format, which uses much less cpu resource. The trade off here is runtime cpu for memory footprint:

Screen Shot 2022-06-21 at 5 17 57 PM

This ended up being an almost 3X saving in audio processing cpu time on web, and also eliminated many scene instantiation hiccups where the scene included audio stream nodes which were using encoded audio formats.

Fixes blocking on the main thread when using websockets Websockets end up being more like httprequest (non blocking on main godot thread).

Flags for debug symbols included in Android builds pre gradle strip Android/cmake/gradle does a strip in a cmake build artifact later on in the android build process. Compiling with these symbols included allows us to inspect the pre-strip artifact folder which can be uploaded to google play for symbolicated native stack traces. Likewise on iOS

Various changes to emscripten settings Various emscripten tweaks here based off of performance testing in a variety of browsers. Actually, many javascript changes and improvements in the javascript changes there and subsequent files.

Cross Codesign macOS Most of these changes were just pulling in @bruvzg 's improvements to code-signing macOS and iOS on different platforms.

'Relative' particles 2D mode, in addition to existing 'local' and 'world' This is a common particles pattern where the particle positions are relative to their parent. (Not just local or world options).

ScrollContainer sensitivity and damping A lot of issues between mouse, scrollwheel, and trackpads here.

Lastly, we added a dummy visual server. visual_server_dummy.h and visual_server_dummy.cpp and a way for a thread to register itself to use the dummy server instead of the actual visual_server singleton. I think the singleton pattern is overused in the codebase, and this is a good example of what we had to do hack around the pattern (very gross). We modified the singleton accessor in in visual_server.cpp, to be able to register specific threads that when accessing the singleton would receive the dummy server instead of the actual server (thread id check. If registered dummy thread, return dummy visual server, if not, return actual visual server).

The reason for the above is the visual server is not thread safe, and we run our netcode rollback (reset to a past state, and replay the game in the background to catch up to realtime) in worker threads. That means resetting scene trees, state, properties, nodes, etc, and running all the logic that has run in the past in a seperate simulation which is then fast-forward simulated back to the current predicted state. For game logic to rerun, that means instantiating scenes, firing the same signals, literally running the same codepaths. For that we use a shadow SceneTree, and post resimulation jobs to separate worker threads. All this to say, is that if a SceneTree object depends on a reference to a visual server, it should be injected as a dependency/reference into a SceneTree instead of being a singleton. It would make doing this like the above scenario very clean.

reduz commented 2 years ago

Thanks hugely for the write up and taking the time and work of putting up the diff. I am personally happy that Godot was useful for developing and publishing Rocket Bot Royale and I'm sure that a lot of the changes you had to do will be invaluable in improving the code quality!

I think some of these issues have been significantly improved in master branch, but I am super curious to see what your solutions (and problems found) were in order to keep improving the codebase.

Calinou commented 2 years ago

Thanks for the detailed writeup, it's very informative :slightly_smiling_face:

I remember the memory instrumentation profiler is something @lawnjelly was interested in having. Maybe he could help get it merged?

This ended up being a REALLY important optimization. To keep binary size down, we're using ogg/vorbis compression on all sfx and music. During performance testing (especially on web), we saw a ton of time spent decoding audio files. Not only that, but scenes that were instantiated with audio resources, (and not ever used), paid a huge performance penalty because a bunch of memory buffer's are allocated on instantiation if an audio stream player is instantiated with any decoded audio resource.

For reference, similar functionality was requested in https://github.com/godotengine/godot-proposals/issues/1144.

That said, I wonder if you could sidestep the problem by instancing AudioStreamPlayer2D nodes only when needed, and free them once they're done playing. This can be done with a single node for multiple sounds in 4.0 thanks to polyphony support.

jitspoe commented 1 year ago

I randomly stumbled upon this when searching for another issue, and I just wanted to say the decode once for ogg vorbis is a brilliant idea. Would love to see that in the core engine. I wondered if anybody created a proposal for that and then I see Calinou linked to the one I created, which I completely forgot about. Boy, my memory is terrible. 😅