Using the slowest data structure almost every time.

godotengine / godot

Godot Engine – Multi-platform 2D and 3D game engine

https://godotengine.org

MIT License

89.88k stars 21.03k forks source link

Using the slowest data structure almost every time. #23998

Closed vblanco20-1 closed 5 years ago

vblanco20-1 commented 5 years ago

MODERATOR NOTE: This issue is over five years old and discusses a reality that no longer matches the status quo of the codebase. For more information, scroll to this comment.

I was reviewing the code of godot, and i`ve found complete disregard for any kind of cpu side performance, down to the core data structures. Its using List everywhere, when a linked list is the slowest data structure you can use on modern PCs. Almost any other data structure would be faster, specially on small sized structures. Its specially egregious on things like the light list or reflection captures in renderables, where you are storing 3 pointers every time, instead of just giving it a stack array of 8 max lights (for example) + an extra for the cases where it would be more than that.

Same thing with RID_Owner being a tree where it could be a hash map, or a slotmap. Also the Octree implementation for culling has the exact same issue.

I want to ask about the design intention behind that total and absolute overuse of linked lists and dreadful performance pointer heavy data structures everywhere in the code. Is that for a specific reason? In most of the cases a "chunked" linked list, where you do a linked list of arrays would automatically bring performance gains on the same code.

The use of these data structures also prevents any kind of "easy" parallelism in a good chunk of the code and completely trashes the cache.

I am working on making a proof of concept implementation of refactoring some of the internals to use better data structures. At this moment i am working on a rewrite of the culling code which bypasses the current octree and just uses a flat array with optional parallelism. Ill use the tps-demo as benchmark and come back with results, which in fact i have benchmarked to go 25 levels deep on that octree...

On another more happy note, i am very impressed with the code style quality, everything is easy to follow and understand, and quite well commented.

reduz commented 5 years ago

I'm closing this thread, because I don't think it's the right way to contribute or help.

@vblanco20-1 Again, I really appreciate that you have good intentions, but you don't understand any of the inner workings of the engine, or the philosophy behind it. Most of your comments are towards things you don't really understand how they are used, how critical they are to performance, or what their priorities are in general.

Your tone is also unnecessarily aggressive, too. Instead of asking what you don't understand, or why something is done in a certain way, you just go all out arrogant. This is not the right attitude for this community.

Most of the things you are mentioning regarding optimization are only a very small surface of the amount of items that I planned for optimization in Godot 4.0 (there are plenty more things in my list). I already told you this would get rewritten several times since several months ago. If you don't believe me, it's fine, but I feel you are wasting your own time with this and confusing a lot of people for no obvious reason.

I obviously welcome very much your feedback once I'm done with my work, but all you are doing now is beating a dead horse, so just chill for a while.

reduz commented 5 years ago

Again, when an alpha of Godot 4.0 is rolled out (hopefully before the end of this year), you are welcome to profile all the new code and give feedback on how further optimize it. I'm sure that we will all benefit. For now, there is not much of a point in discussing anything else here since existing code will go away in 4.0, and nothing will get merged in the 3.x branch, where at this point stability is more important than optimizations.

Since many may be curious about the technical details:

All the spatial indexing code (what vblanco rewrote) will be replaced by a linear algorithm for culling with multiple threads, combined with an octree for hierarchial occlusion culling and a SAP for overlapping checks, which are most likely the best all rounder algorithms that warrant good performance on any kind of game. Allocation for these structures will be in the same vein as the new RID_Allocator, which is O(1) ). I discussed this with @vblanco20-1 before and explained that his approach does not scale well for all types of games, as it requires user to have a certain degree of expertise to tweak which commonly the typical Godot user is not expected to have. It was also not a good approach for adding occlusion culling.
I won't use arrays when lists can be used because lists do small temporal allocations with zero risk of fragentation. In some cases, I prefer to allocate a sectioned array (aligned to pages, so they cause 0 fragmentation) that always grow and never shrink (like in RID_Allocator or the new CanvasItem in the 2D engine Vulkan branch, which now allows you to redraw items with a lot of commands very efficiently), but there has to be a performance reason for this. When lists are used in Godot it's because small allocations are preferred over performance (and actually they make the code intent more clear for others to read).
PoolVector is intended for very large allocations with consecutive memory pages. Until Godot 2.1 it used a preallocated memory pool, but this was removed in 3.x and right now the current behavior is wrong. For 4.0 it will be replaced with virtual memory, it's on the list of things to do.
Comparing Godot's Vector<> to std::vector is pointless because they have different use cases. We use it mostly to pass data around and we lock it for fast access (via ptr() or ptrw() methods). If we used std::vector, Godot would be much slower due to unnecessary copying. We also take a lot of advantage of the copy on write mechanic for many different uses.
Map<> is just simpler and friendlier for large amount of elements and no need to worry about hastable growing/shinking creating fragmentation. When performance is required, HashMap is used instead (though it's true, should probably use OAHashMap more, but it's too new and never had time to do that). As a general philosophy, when performance is not priority, small allocations are always preferred over large ones because it's easier for the memory allocator from your operating system to find tiny holes to put them in (which is pretty much what it is designed to do), effectively consuming less heap.

Again, you can ask about plans and design anytime you want instead of going rogue and complain, this is not the best way to help the project.

I'm also sure that many reading this thread are wondering why the spatial indexer was slow to begin with. The reason is that, maybe you are new to Godot, but until very recently the 3D engine was more than 10 years old and extremely outdated. Work was done to modernize it in OpenGL ES 3.0, but we had to desist due to problems we found in OpenGL and the fact it was deprecated for Vulkan (and Apple abandoned it).

Added to this, Godot used to run on devices like the PSP not too long ago (which had only 24mb of memory available for both engine and game, so a lot of the core code is very conservative regarding memory allocation). As hardware is very different now, this is being changed for code that is more optimal and uses more memory, but when it's not needed doing this work is pointless and it's fine that you'll see lists used in a lot of places where performance does not matter.

Also, many optimizations we wanted to do (moving many of the mutex code to atomics for better performance) had to be put on hold until we could move Godot to C++11 (which has much better support for inlined atomics and it does not requiere you to include windows headers, which pollute the whole namespace), which was not something we could do in a stable branch. The move to C++11 will take place after Godot 3.2 is branched and feature freezed, otherwise keeping Vulkan an Master branches in sync would be a huge pain. There is not much of a hurry, as currently focus is now on Vulkan itself.

Sorry, things take time, but I prefer they are done properly rather than rushing them. It pays off better in the long term. Right now all the performance optimizations are being worked on and should be ready soon (if you tested the Vulkan branch, the 2D engine is way faster than it used to be already).

Windfisch commented 5 years ago

Hi reduz, while I mostly see that your points are valid, I'd like to comment on two on which I disagree:

I won't use arrays when lists can be used because lists do small temporal allocations with zero risk of fragmentation. In some cases, I prefer to allocate a sectioned array (aligned to pages, so they cause 0 fragmentation) that always grow and never shrink (like in RID_Allocator or the new CanvasItem in the 2D engine Vulkan branch, which now allows you to redraw items with a lot of commands very efficiently), but there has to be a performance reason for this. When lists are used in Godot it's because small allocations are preferred over performance (and actually they make the code intent more clear for others to read).

I highly doubt that a linked list will have better overall performance, let it be speed or memory efficiency, than a dynamic array with exponential growth. The latter is guaranteed to occupy at most twice as much space as it actually uses, while a List<some pointer> uses exactly three times as much storage (the actual content, the next and the prev pointer). For a sectioned array, things look even better.

When properly wrapped (and as much I can tell from what I've seen already of the godot code, they are), they look pretty much the same to the programmer, so I don't understand what you mean with "they [Lists] make the code intent more clear".

IMHO, Lists are valid exactly in two conditions:

You need to frequently erase/insert elements at the middle of the container
or you need constant-time (and not just amortized constant-time) insertion/deletion of elements. I.e., in real-time contexts where a time-consuming allocation of memory isn't possible, they're fine.

Map<> is just simpler and friendlier for large amount of elements and no need to worry about hastable growing/shinking creating fragmentation. When performance is required, HashMap is used instead (though it's true, should probably use OAHashMap more, but it's too new and never had time to do that). As a general philosophy, when performance is not priority, small allocations are always preferred over large ones because it's easier for the memory allocator from your operating system to find tiny holes to put them in (which is pretty much what it is designed to do), effectively consuming less heap.

libc-allocators are usually quite smart, and since an OAHashMap (or a std::unordered_map) reallocate their storages from time to time (amortized constant-time), the allocator usually manages to keep its memory blocks compact. I firmly believe that an OAHashMap does not effectively consume more heap than a plain binary tree map like Map. Instead, I'm quite sure that the huge pointer overhead in each element of Map actually consumes more memory than any heap fragmentation of OAHashmap (or std::unordered_map).

After all, I think the best way to sort out these kinds of arguments is to benchmark it. Surely, this is of much greater use for Godot 4.0, since -- as you said -- lots of performance optimizations will happen there and there is not much use of improving code paths that may well be completely rewritten in 4.0 anyway.

But @reduz, what do you think of benchmarking all these changes that @vblanco20-1 suggested (maybe even now, in 3.1). If @vblanco20-1 (or anyone else) is willing to invest the time of writing such a benchmarking suite and to evaluate Godot3.1's performance (both in terms of speed and "heap consumption considering fragmentation") against vblanco's changes? It might yield valuable hints for the actual 4.0 changes.

I think that such a methodology fits your "[things] are done properly rather than rushing them" well.

@vblanco20-1: In fact, I appreciate your work. Would you be motivated to create such benchmarks, so we can actually measure whether your changes are actual performance improvements? I would be very interested.

reduz commented 5 years ago

@Windfisch I suggest you re-read my post above, as you have misread or misunderstood a few things. I'll clarify them for you.

Lists are used exactly for the use case you describe, and I never claimed that they have better performance. They are more efficient than arrays for reusing heap simply because they are made-up of small allocations. On a scale (when you use them a lot for their intended use case), this really makes a difference. When performance is required, other containers which are faster, more packed or have better cache coherency are used already. For good or bad, Victor mostly focused in one of the older engine areas (if not the oldest actually) which was never optimized since the times of it being an in-house engine used to publish games for PSP. This had a rewrite pending since a long time, but there were other priorities. His main optimization was the CPU based voxel cone tracing I added recently, which to be honest, I did a bad job with that because it was too rushed near the 3.0 release, but the proper fix for this is a different algorithm entirely, and not adding parallel processing like he did.
I never argued about the performance of @vblanco20-1's work and I frankly don't care about it (so you don't need to make him waste time doing benchmarks). The reasons for not merging his work is because 1) The algorithms he uses needs manual tweaking depending on average size of objects in the game, something most Godot users will need to do. I tend to favor algorithms that may be a bit slower but scale better, without the need for tweaks. 2) The algorithm he uses is not good for occlusion culling (simple octree is better due to the hierarchical nature). 3) The algorithm he uses is not good for pairing (SAP is often better). 4) He uses C++17 and libraries I'm not interested in supporting, or lambdas which I believe are unnecessary 5) I'm already working on optimizing this for 4.0, and the 3.x branch has stability as priority and we intend to release 3.2 as soon as possible, so this won't be modified or worked on there. Existing code may be slower, but it's very stable and tested. If this gets merged ant there are bug reports, regressions, etc. No one will have time to work on it or help Victor because we are already mostly busy with 4.0 branch. This is all explained above, so I suggest you re-read the post.

In any case, I promised Victor that indexing code can be pluggable, so eventually different algorithms for different types of games can be implemented, too.

Godot is open source and so is his fork. We are all open and sharing here, nothing should stop you from using his work if you need it.

Ranoller commented 5 years ago

Since that optimizations seems not to affect gdscript, render or the "stuttering" problems, and there are the things that people complains (me includes), maybe with acomplish that optimizations people will be happy (me includes)... doesn´t need lua jit velocity... Works in "copy on write" was a very big performance optimization in a plugin of mine (from 25 seconds in a script parse to only 1 second in a script of 7000 lines)... i feel that this kind of optimizations are that we need, in gdscript, in render and accomplish the stuttering problem... that´s all.

Windfisch commented 5 years ago

Thanks @reduz for your clarification. It indeed made your points more clear than the previous postings.

It's good that the spatial indexing code will be pluggable, because indeed that one fell on my feet before when handling lots of objects at vastly different scales. Looking forward to 4.0.

reduz commented 5 years ago

I was also thinking about it and I think it may be a good idea to put some shared document with ideas on optimizing spatial indexing, so more contributors can learn about this and also throw ideas or do implementations. I have very good idea about what needs to be done, but I'm sure there is a lot of room here for doing further optimizations and coming up with interesting algorithms.

We can put very clear requirements that we know algorithms have to meet (as in, not requiring user tweaking if possible for average elements in the world size, not brute-forcing stuff with threads if possible -they are not free, other parts of the engine may need them too, like physics or animation and they consume more battery on mobile-, compatibility with re-projection based occlusion culling -so, some form of hierarchy may be desired, but should be tested against brute force too-, be smart about shadow buffer updates -don't update if nothing changed-, explore optimizations such as re-projection based occlusion culling for directional shadows, etc). We can also discuss creation of some benchmark tests (I don't think the TPS demo is a good benchmark because it doesn't have as many objects or occlusion). @vblanco20-1 if you are willing to follow our coding/language style and philosophy you are of course more than welcome to lend a hand.

The rest of the rendering code (actual rendering using RenderingDevice) is more or less straightforward and there's not many ways to do it, but indexing seems like a more interesting problem to solve for optimization.

vblanco20-1 commented 5 years ago

@reduz for reference on the spatial indexing. The original tile based culling was removed and replaced by an Octree about halfway through this thread. The octree i got there is WIP (missing some re-fit functionality), but the results are quite good for the prototype. Its code is not that good from its prototype nature, so its only useful to check how this kind of octree would perform in a complex scene such as tps-demo. It is inspired by how the unreal engine octree works, but with some modifications like the possibility of a flat iterate.

The main idea is that only the leafs on the octree hold objects, and those objects are held on an array of size 64 (compile time size, can be different). An octree leaf will only split once it "overflows" into 65 elements. When you remove objects, if each of the leafs of the parent node fit into the 64 size array, then we merge the leafs back into their parent node, which becomes a leaf.

By doing that, we can minimize the time testing on the nodes, as the octree wont end up being too deep.

Another good thing i was doing is that the leaf nodes are also stored in a flat array, which allows a parallel-for in the culling. This way the hierarchical cull can be used when doing cull for point shadows, or other "small" operations, and the flat parallel-for culling can be used for the main view. Of course could just use hierarchical for everything, but could be slower and cant be parallelized.

The block structure will waste a bit of memory, but even in worst case scenarios i dont think it will waste a lot of memory as the nodes will merge if they drop below an amount. It also allows using a pool allocator given that both the nodes and the leafs will have a constant size.

I also have multiple octrees in my fork, which is used to optimize some things. For example i have an octree only for shadowcaster objects, which allows skipping all the logic related to "can cast shadows" when culling for shadowmaps.

On my concerns for Vector and others on the render API, this issue explains what i was worrying about. https://github.com/godotengine/godot/issues/24731

On the library and C++17 stuff of the fork.. thats unnecesary. The fork overuses a lot of libraries because i needed some parts of them. The only thing thats really needed, and that i think godot needs, is a industry-strenght parallel queue, i used moodycamel queue for it.

On the lambdas, their usage is mainly for culling, and its used to save a significant amount of memory, as you only need to save the objects that pass your checks into the output array. The alternative is to do iterator objects (thats what unreal engine does with their octree), but it ends up being worse code and a lot harder to write.

My intention was not to "go rogue", but to answer frequent comments on "you are free to make a fork to demonstrate", which is exactly what ive done. The first message on the thread is a bit alarmist and not very correct. Sorry if i appeared rude to you, as my only goal is to help the most popular open source game engine.

reduz commented 5 years ago

@vblanco20-1 That sounds great, and the way you mention the octree works makes a lot of sense. I honestly don't mind implementing a parallel queue (seems simple enough for not needing an external dependency, and need C++11 to do it properly, so guess that will happen only on the Vulkan branch). If you have this commited, I will definitely want to take a look at it, so I can use it as a basis for the indexer rewrite in the Vulkan branch (I'm just rewriting most stuff there, so it needs to be rewritten anyway). Of course if you want to aid with the implementation (probably when there are more things in place with the new API), it's super welcome.

The use of a parallel version with flattened arrays is interesting, but my point of view on it is that it won't be that useful for occlusion culling, and it would be interesting to measure how much of an improvement over regular octree culling is, considering the extra amount of cores used. Keep in mind there are many other areas that may be using multiple threads that may make more efficient (less brute force) use of it (like physics), so if even if with this approach we get a relatively marginal improvement, it is not always in the best of interests, as it may starve other areas from using CPU. Maybe it could be optional.

I also find Bullet 3's implementation of dbvt pretty interesting, which does incremental self balancing and linear allocation, it's one of the things I wanted to research more (I've seen this approach implemented in proprietary engines I can't mention :P ), as algorithm wise, a balanced binary tree is by design just a lot less redundant than an octree, and may work better in both pairing and occlusion tests, so a pluggable broadphase/culling aproach may be actually a good idea, given we make proper useful benchmark scenes.

In any case, you are exceptionally smart and it would be great if we can work together on this, but I will just ask you to please understand and respect many of the existing design philosophies we have, even if they may not always suit your taste. The project is huge, with a lot of inertial force, and changing things for the sake of tastes is just not a good idea. User friendliness and "just works" is always first on the list regarding design, and simplicity and readability of code usually takes priority over efficiency (that is, when efficiency is not a goal because an area is not critical). Last time we discussed, you wanted to change pretty much the whole thing and did not want to listen to any reasoning or focus on actual problems, and that's not how we usually work with the community and contributors, which is why my advice to chill was well intended.

BlueCannonBall commented 4 years ago

Who needs performance? :trollface:

Performance and source model is what keeps me with Godot.

Anyeos commented 4 years ago

Edit: Sorry, maybe I am some offtopic but I wanted to clarify the benefits of open source.

@mixedCase I would be extremely hesitant to support any kind of fork of Godot. Forking and fragmentation is often the death of open source projects.

I don't think so. The nature of open source is just that you can use and reuse the code freely as you want. So that will not be an end of some project but more options for the users. What you say is a monopoly, and that is not what defend the open source. Don't get cheated by companies that pretend to show you that there is better to have full control of something. That is a lie, at least in the open source world, because if someone other have the code you too have the code. What they need to do is to take care of how to handle a community or how to be OK with that.

Anyway the original developers can just merge improvements from the fork. They are free to do it always. That is another nature of the open source world.

And in the worst of the cases the fork will be better than the original and a lot of contributer will go there. Nobody lost, all win. Oh, sorry, if a company is behind the original maybe they lost (or they can merge the improvements from the fork too).

Koalamana9 commented 1 year ago

Writing to you from the future in case someone stumbles upon this commit and reads all the nonsense that Juan wrote here. Godot 4 in 2023 is worse in performance than Unity 2.5 (from 2009).

Saul2022 commented 1 year ago

Godot 4 in 2023 is worse in performance than Unity 2.5 (from 2009).

Source?

swarnimarun commented 1 year ago

Could we get this thread locked and moved to a dedicated and well formed conversation about specific performance improvements to RFC(Godot Proposals). I don't think this thread has anything meaningful add to the conversation around performance issues in Godot any longer(though I do believe some performance improvements would be nice but now this is just noise), I think something more directly actionable should be preferred.

cc: @Calinou @akien-mga ?

Venryx commented 1 year ago

If locking of the thread is imminent, it might be nice to have some links at the end referencing the more actionable versions of the proposals raised in this thread. So if anyone is aware of such proposals/links, please share them now. 🔗

YuriSizov commented 1 year ago

@swarnimarun We don't lock threads unless there is a spam issue or other form of bad actor activity. One random reply after 4 years is not a valid reason for it.

We prefer issues are open for discussion at all times, because some may find even the old threads helpful.

reduz commented 1 year ago

Just to note on my side. This is a very old issue, almost five years old. But somehow it keeps resurfacing in social media, so I think its a good time to add some clarifications.

Everything mentioned originally has been 100% corrected (and I want to thank @vblanco20-1 who provided many of the original suggestions):

Linked Lists no longer used in hot paths
RID Owner uses O(1) allocation (array)
We have a superb implementation of HashMap and HashSet in Godot 4, we even took care to ensure the hashing functions for each Godot datatype generate the best entropy, and hence have the best distribution possible.
We use worst case memory allocators (despite the name this is a good thing, means memory grows to accommodate the game needs on demand and stays there) and pools almost everywhere performance matters.
Data oriented structures are used now extensively where it matters.
Original issues raised regarding culling have both been resolved in both Godot 3 and Godot 4.
Threading now is used extensively.
The list of improvements since this issue was opened is far too long. Five years is a long time.

Even though what is mentioned on this issue has already been taken care of, there still is work pending to do optimization wise in Godot 4:

Multithreading in many areas still needs work (mainly on the scene side, this is very new).
Some locking in a few areas could be better.
Godot still does not support texture/mesh streaming, which should help a lot with I/O performance.
And really many other things.

But the situation now is day and night compared to 2018. There has been literally hundreds of pull requests geared to optimize Godot in the meantime, to the point that Godot 4 is pretty much all new code rewritten from scratch.

There is also a lot of continued work being done to optimize different areas and this has not stopped.

Zireael07 commented 1 year ago

IMHO I'd close this issue since it's so horribly outdated apparently, and open a new one for the pending things (if needed at all)

YuriSizov commented 1 year ago

@Zireael07 This issue is closed. We can just... stop replying in it 🙃

Zireael07 commented 1 year ago

Oops. Didn't even notice it was closed, then the best solution would be to lock ;)

reptofrog commented 1 year ago

I think it's a good idea to lock the issue to make sure Juan's comment here is seen by people and is not lost among other comments that may come here after. Like mine right now.

akien-mga commented 1 year ago

I'm locking temporarily to stop the noise from people who keep asking to lock this. Please stop commenting about this.

I'll reopen it then so people who actually have something to say on the topics of this issue can do it.

Edit: Unlocked.

vblanco20-1 commented 1 year ago

Ill comment as i see people still pinging me about this issue. Modern godot has fixed most of the issues detailed here, and the architecture has changed enough that the 2 versions are not comparable. Even the tps demo used as the benchmark has changed significantly. This fork performance upgrades as detailed in posts above were done over the base of godot 3.3, and things have really changed since then. Also the fork is a experimental thing i created for fun so it goes into absurd levels just because i felt like it, like running shadow casting render logic in parallel threads or heavily parallelizing most of the renderer beyond what makes practical sense. It only really works for the TPS demo and its near guaranteed to be bugged for anything else.