Implement group-based threaded scene processing

reduz commented 1 year ago

Describe the project you are working on

Godot

Describe the problem or limitation you are having in your project

CPUs are getting more, and more and more hardware threads. So it's obvious that, to scale, Godot needs to be able to make use of this.

On the server level, Godot is strongly multi threaded for things like rendering and physics. Accessing the Godot servers is also safe from multiple threads.

But, on the scene level, there is no threading at all. Godot is single threaded. Would it not make sense that, if there are 50 enemies in your scene and your CPU has 32 threads, every enemy is processed each in a thread? (so basically 32 threads are processing the 50 enemies in a queue fashion) This would give a massive performance improvement.

But well, this is not possible.

Describe the feature / enhancement and how it helps to overcome the problem or limitation

I previously opened another proposal (#4962) with a different solution to this, which is more akin to a traditional job scheduler.

The problem is, I don't like this. Its not easy to use, it's not easy to mark dependencies. You are still free to make mistakes and putting mutexes all across the scene system and hoping this avoids users from making mistakes is wishful thinking (It would probably deadlock in a heartbeat).

So, thinking about how the use case works, and following Godot design and usage, one could argue that most games could be divided into "functional units". As an example, enemies are functional units, you can have dozens of them. Same for npcs, items, vfx, etc.

On the Godot sense, I feel that the simplest way to make it possible for users to scale their games easily with threads is to "mark" which scenes are their own thing and work on their own, and have this happen automatically.

Describe how your proposal will work, with code, pseudo-code, mock-ups, and/or diagrams

This would be achieved by a single extra property in the Node class: thread_group, which would have the following values:

Inherit: Process in the same thread group as the parent node.
Group: Create a new thread group
Disabled: Does not process on threads.

If Group Is used, the processing of this node and all the children nodes will happen in a separate thread, in parallel to other nodes. And that's it, very simple.

Now comes the big question: What happens if this node or any of the children is processing on a thread and it attempts to call a function or property on another node that does not belong to this group?

The answer is you can't. The function will error and return immediately. If you want to interact with other nodes, you will be forced to use some sort of IPC. Such as:

foreignnode.some_function.call_deferred(arguments) # goes to main thread
foreignnode.some_function.call_thread_group(arguments) # gets called before next time group is processed

What if you want to read a value from a node? then we could have a way to intern processing results that are read-only and can be read from any other thread.

How do you debug this mess?

GDScript thread debugging needs to work before this can be used efficiently, but it should be very easy to have a toggle to run the game in single threaded mode too to make it easier to debug complex errors.

To make sure users are making the best use of threads available, the idea is to have a profiler that shows how threads were utilized every frame. This would allow users to see how well this is working.

How is this implemented efficiently?

This relies on thread local storage. When a group is processed, the group _threadid is set, which is stored where it belongs.

So, this check can be added to Node:

_FORCE_INLINE_ bool Node::is_thread_group_invalid() const {
        return thread_group_processing && data.inside_tree && current_thread_group_id != data.thred_group_id;
}

And, in most scene functions (specially the core ones like Node, Node2D, Nod3D, Control, etc), there would be checks like this:

void Node2D::set_position(const Vector2& p_posiition) {
   ERR_FAIL_COND( is_group_invalid() );
   // rest of code
}

or maybe simply adding a:

ERR_FAIL_THREAD_GROUP and ERR_FAIL_COND_THREAD_GROUP_V, which would print the proper message.

And that should be pretty much it. One question that arises is whether all function calls from all nodes would need this. I think for the most part they should probably don't, and it's enough that this is added only to most of the core nodes and some nodes that may make sense.

What is the takeaway from this?

The positives of this approach are:

Easy to use
Very safe (practically impossible to mess up threading)
Handholding (get an error if you do something you are not supposed to).
Reasonably efficient (Job scheduler is more efficient, but infinitely easier to mess up).

If this enhancement will not be used often, can it be worked around with a few lines of script?

N/A

Is there a reason why this should be core and not an add-on in the asset library?

N/A

vpellen commented 1 year ago

Now comes the big question: What happens if this node or any of the children is processing on a thread and it attempts to call a function or property on another node that does not belong to this group?

The answer is you can't. The function will error and return immediately.

My instincts tell me that this is going to be the thing that'll make your life hell. How do you even begin to verify that kind of thread safety? Do you check every single read? Every function call? Every operation that could, even potentially, have an external dependency? Even if it can be pulled off, how much overhead is it going to cost? And how many projects are going to be able to reasonably benefit from this kind of optimization?

Don't get me wrong, I really like the concept, but I'm concerned that this is going to come down to a pick-two scenario between safe simple and fast.

reduz commented 1 year ago

@vpellen As mentioned above, its just a single check in most functions that matter or that have potential to mess things up (core scene nodes, maybe a few others). The check has literally no cost as it comes down to basically an unlikely() condition (hence no branch predictor fails) and GDScript can implement this automatically in the VM when it detects inheritance from Node. So, for the vast majority of Godot users, this is entirely transparent (another big benefit of having our own scripting language).

If you are developing CSharp or C++, then you will need to get used to add a call to this check yourself if you want to be safe, but you are assumed to understand better what you are doing anyway.

bitbrain commented 1 year ago

Really good proposal. Is there a way to control the number of threads (thread pool size)? Also, what happens if the CPU has too many threads to handle? Would certain threads just get ticked fewer times per second?

Mikeysax commented 1 year ago

Is this work needed before work on the other "Swarm Node" is done? Will this be a dependency on that other proposal?

Mickeon commented 1 year ago

Groups, groups, groups ... Would this be possible to incorporate with the already existing definition of Node Groups in the SceneTree? This would make certain functionality even more powerful (such as call_group), while still being easy to access, without bloating the Node class too much. Some users already group their Nodes by these groups. It's quite simple and familiar. On some occasions having the Nodes multi-thread may be as simple as adding them to a specific scene group.

reduz commented 1 year ago

@bitbrain the threads used are the ones for the Godot4 global threadpool, so it should be efficient already.

reduz commented 1 year ago

@Mickeon that is far more confusing to me. Name could be changed.

reduz commented 1 year ago

@Mikeysax I think these are unrelated

justinbarrett commented 1 year ago

EDIT: I think a lot of what I said has already been stated...I linked to this before there were as many comments. I think this is a good common sense goal for the engine, and makes logical sense on the surface....I worry about thread coherence, several threads could be finished and spitting out errors while they are waiting for other threads to finish and push their data...

a good thread scheduler would be paramount....should probably use smart system info and only do this if > 6 cores/12 threads.... 4 cores would probably be actually slower or negligible, since the cache hits might be an issue while waiting for jobs to finish..

reduz commented 1 year ago

@justinbarret Godot scene layer is mostly high level complex constructs for the most part, so while its not very cache efficient, it does not use a lot of memory instances actively either.

In other words, It will definitely be memory bound but this is not serious.

The main benefitted here will be certain operations that some nodes do that are a lot more memory efficient and cpu bound, such as animation playback, kinematic characters, etc. Which will now run in parallel.

justinbarrett commented 1 year ago

@justinbarret Godot scene layer is mostly high level complex constructs for the most part, so while its not very cache efficient, it does not use a lot of memory instances actively either.

In other words, It will definitely be memory bound but this is not serious.

The main benefitted here will be certain operations that some nodes do that are a lot more memory efficient and cpu bound, such as animation playback, kinematic characters, etc. Which will now run in parallel.

I think that would be an amazing starting point, just moving the animation and kinematic would, for me, already provide a nice speed bump...

reduz commented 1 year ago

@justinbarrett The problem is that, you could thread animation or physics by themselves (and this is my previous proposal I mentioned in this one), and gives you more control and it's more efficient, but my feeling is that these nodes work in the context of a scene and interact with other nodes, so synchronization is going to be hard.

If, instead, you run the whole scene in a thread and force to use messaging for synchronization, I think usability will be far better because the chance of messing up is far lower.

Galomortal47 commented 1 year ago

you could look at the GO language for a example of really user friendly multi threading.

justinbarrett commented 1 year ago

@justinbarrett The problem is that, you could thread animation or physics by themselves (and this is my previous proposal I mentioned in this one), and gives you more control and it's more efficient, but my feeling is that these nodes work in the context of a scene and interact with other nodes, so synchronization is going to be hard.

If, instead, you run the whole scene in a thread and force to use messaging for synchronization, I think usability will be far better because the chance of messing up is far lower.

iif you thread it by scene, you negate the performance increase in large open world games, where there may be fewer scenes..I, for instance, have nearly everything in one large world scene( I understand that is bad practice, but this is how I am also able to tell my limitations), if I am understanding correctly... I honestly don't think I understand fully..I'll let this rest..I trust you guys...you've gotten us this far.

alekh-ai commented 1 year ago

@reduz Does this impact on the lower end devices? If the game is built using this runs on older devices with a few CPU core?

Deozaan commented 1 year ago

What happens if the developer makes more thread groups than the player's CPU can handle? Do developers need to make different builds for different CPU threading capabilities, or does Godot handle splitting up the work as appropriate for the CPU automatically? I think maybe this is essentially the same question which was asked by @bitbrain but I'm not sure it was answered.

Calinou commented 1 year ago

What happens if the developer makes more thread groups than the player's CPU can handle? Do developers need to make different builds for different CPU threading capabilities, or does Godot handle splitting up the work as appropriate for the CPU automatically? I think maybe this is essentially the same question which was asked by @bitbrain but I'm not sure it was answered.

WorkerThreadPool handles the creation of the threads, whose parameters are set here.

dsge commented 1 year ago

Do we still always wait for every single one of the Nodes (inside every single one of these Groups) to finish updating before we start the next update cycle? Even if one of them stalls?

reduz commented 1 year ago

@justinbarrett

if you thread it by scene, you negate the performance increase in large open world games, where there may be fewer scenes

It should not matter, scenes can have thread groups inside of thread groups, its not limited to one nesting level.

@myaaaaaaaaa

If you generate a dependency graph by analyzing which GDScript functions call which nodes1, you can automatically run any nodes that don't depend on each other in parallel, which is how Makefile/scons multithreading works. A miniaturized version of this is how CPUs achieve instruction level parallelism.

This sounds fantastic in theory, but its extremely unpredictable in practice, plus you can't analyze what goes into C++ and its reentrancy. Additionally, it will most likely not net you much win given that at the end of the day everything is memory bound, so what parallelizes the best are cache efficient operations. In the case of Godot these are carried as specific tasks like animation, audio, pathfinding, etc. which with this proposed method will parallelize fine.

@Deozaan

What happens if the developer makes more thread groups than the player's CPU can handle?

This is not a problem, the threads are preallocated and just fetch tasks to solve whenever they become free. A group is just a task here.

RandomShaper commented 1 year ago

In order to robustly control which calls are safe from GDScript, I think we would benefit from a two levels approach. Level 1 is a whitelist of the classes that are aware of multithreading (most node types, most resource types, probably all builtin singletons, -user singletons excluded?-, etc.). Level 2 is each of the whitelisted classes doing their thread group checks.
How will the access to resources be made in this approach? Maybe it's a matter of making every resource read-only for the duration of the multi-threaded processing.
Some nodes may need to return some information from the multi-threaded processing. That can be done, for instance, via a user singleton. But that would force GDScript to provide annotations about the thread-safety of user singletons, that would make the thing a bit more complex than ideally. The alternative would be something provided by the engine that the connodes call safely call during threaded processing for later retrieval (something similar to the paged arrays the engine uses internally, but user-friendlier).
Some nodes may need to get non-thread-safe information before entering their group processing, which leads to similar thoughts to the previous point.
In order to give users predictability and control, there should be a well documented set of callbacks; for instance:
```
# These run first and are opportunities of collecting information for the multi-threaded part.
_idle_process()
_physics_process()

_threaded_process()

_post_threaded_process() # Similar to _idle_process(); a good time to act on information generated during threaded processing.
```
Now I think of it, GDScript will indeed need to be extended with annotations about thread-safety of the methods in user scripts. Or is this all meant only for internal processing? In any case, an AnimationPlayer can just call user scripts, which poses the same requirement after all.

Frontrider commented 1 year ago

This would be achieved by a single extra property in the Node class: thread_group, which would have the following values:
Inherit: Process in the same thread group as the parent node.
Group: Create a new thread group
Disabled: Does not process on threads.

Something that might help, but still requires user input. Add a bool to mark the node "pure". This tells godot that it only does operations on itself when _process is called. All of those can just be shoved into a queue and be done with on however many threads are available after all "impure" nodes finished. It only mutates itself, literally nothing matters. Make it false by default to avoid surprises.

Nodes must interact in one way or another, but it could enforce more "self contained" nodes, that are already a good practice.

call_deferred could also be used to queue up "mutations" for these pure nodes, and then we also get free threading support for interactions.

Mantissa-23 commented 4 months ago

...Has this just been silently implemented as of 4.2.x?

There is zero mention of these thread groups in the Using multiple threads docs page.

If processing groups work as advertised, I feel like we should minimally update the docs with an example. It's much more ergonomic than the original threads API, particularly for C# where the state of threading is somewhat unclear in the docs.

Happy to take that on if someone points me at the PR that added these.

Calinou commented 4 months ago

This has indeed been technically implemented by https://github.com/godotengine/godot/pull/75901 since 4.1, but it's currently incomplete and many nodes don't support threading when they should. This is most noticeably seen in AnimationTree (which is a common CPU bottleneck in scenes with hundreds or thousands of those active at once).

It's still worth documenting this, but you won't see massive performance gains in most cases just yet. I suggest creating a dedicated documentation page for it as opposed to extending Using multiple threads, as the use cases are pretty different (this approach is higher-level).

RandomShaper commented 4 months ago

I suggest creating a dedicated documentation page for it as opposed to extending Using multiple threads, as the use cases are pretty different (this approach is higher-level).

I agree. However, maybe a quick note with a link wouldn't harm, to cover the case that the reader is looking for ways to parallelize work precisely for performance reasons. Just an idea.

Deozaan commented 4 months ago

I also find it a little strange that the "Using multiple threads" docs page doesn't even mention the WorkerThreadPool class.

godotengine / godot-proposals