Add threading support to Scene Nodes

reduz commented 2 years ago

Describe the project you are working on

Godot

Describe the problem or limitation you are having in your project

Godot 4 does a lot of improvements on the threading side for performance, but this happens mostly in the servers. Scene node processing performance is unaffected and all processing is still single threaded.

Describe the feature / enhancement and how it helps to overcome the problem or limitation

Allow scene nodes to, optionally, process on threads so processing can happen for many at the same time.

Describe how your proposal will work, with code, pseudo-code, mock-ups, and/or diagrams

Here is a proposed implementation:

C++ API

// Enable thread processing
void Node::set_thread_process(bool p_enable);
void Node::set_thread_physics_process(bool p_enable);

// Used to setup threading
NOTIFICATION_THREAD_PROCESS_PREPARE - called as _thread_process_prepare(delta) 
NOTIFICATION_THREAD_PHYSICS_PROCESS_PREPARE - called as _thread_pphysics_process_prepare(delta) 

// Runs in thread
NOTIFICATION_THREAD_PROCESS - called as _thread_process(delta) 
NOTIFICATION_THREAD_PHYSICS_PROCESS - called as _thread_physics_process(delta) 

// Runs after thread
NOTIFICATION_THREAD_PROCESS_DONE - called as _thread_process_done(delta) 
NOTIFICATION_THREAD_PHYSICS_PROCESS_DONE - called as _thread_physics_process_done(delta) 

// Control priority (as example, you know a skeleton may always need to process before animation)
void Node::set_thread_process_dependency(Node *p_previous);
void Node::set_thread_fixed_process_dependency(Node *p_previous);

void Node::set_thread_process_priority(int p_priority);

// Accessing data locked by thread
void Node::thread_process_lock(); 
void Node::thread_process_unlock();

// could probably have a _THREAD_PROCESS_LOCK_ for node functions. GDScript could have an anntation @node_thread_safe in the function.

// Signals

// Emited from the main thread after processing is happening
thread_physics_process()
thread_process()

// Emited from the main thread after processing is complete.
thread_physics_process_done()
thread_process_done ()

GDScript usage example

extends Node 

func _thread_process_prepare(delta):

     do_something()
     await thread_process
     work_in_thread()     
     await thread_process_done
     read_results()

Nodes that would benefit from processing on threads

Applicable nodes:

AnimationPlayer
AnimationTree
CharacterBody (with some manual work or changes)
CPUParticles*
CSGNodes
Skeleton3D
RayCast*
ShapeCast*
SpringArm3D

Godot 4.0 or 4.1?

This was planned for 4.1, but with the implementation of WorkerThreadPool (that had to be made in order to have proper threaded HTML5 support), implementing this became relatively low hanging fruit and it could be done for 4.0 as part of the optimization effort planned after Beta, which will consist in creating a large amount of benchmarks and optimizing as much as possible before RC. It remains to be discussed and ultimately decided.

FAQ: Q: What is the difference between this an an ECS? A: This approach is not as cache-friendly per se, because in general Godot nodes do complex, higher level processing of not as many nodes. Because of this, they don't benefit as much from memory bandwidth improvement as they benefit from threading. Something more Akin to ECS style improvements is proposed in #2380. Q: What is the difference between this and a job scheduler? A: Entirely job scheduler based engines are far more complex to use, so Godot does some cheats by relying a bit more in the underlying operating system. Physics, Rendering and Scene are able to run in parallel in separate threads, so there is no need to schedule physics, rendering and game logic into a same graph by the user (although they do get scheduled internally efficiently inside WorkerThreadPool). This allows to greatly simplify threading code.

If this enhancement will not be used often, can it be worked around with a few lines of script?

No, this needs to be core.

Is there a reason why this should be core and not an add-on in the asset library?

N/A.

AlexDarigan commented 2 years ago

For another use-case, I've been running different threads under separate branches in the SceneTree when WAT is being run in a multi-threaded mode, so I'm sure this could improve it.

derammo commented 2 years ago

This is very clever. In generic terms, I am going to say await thread_process is explicitly switching thread pool within the body of a coroutine. happens_before then requires a memory barrier, which is what I guess you meant by less cache-friendly. I wouldn't worry about that even a tiny bit. If you are going through the trouble of putting the middle block on other threads, the work must be non-trivial, so who cares about one slightly expensive switch?

My other observation is that I am old. To me, it would be more intuitive to have a separate explicit function for the work and await its function state. Instead of awaiting thread_process, you'd await the result of a function that is annotated as being available for execution in another thread. Yes, that needs language support, but allowing threads into nodes is a big deal. Function style makes it very explicit what is happening and guides the developer to minimize and be aware of the state that is passed across the thread boundary. By making your design as sexy as possible, it makes it a lot less obvious where the thread issues lurk and danger is located.

In summary, you know better if coroutines that skip threads are something that Game folks find intuitive. I'm not representative. Maybe some syntactic sugar could be added to the language that allows you to make the "middle block" that runs in a thread look like some type of scope (Lambda.) Kinda like the scope around an auto lock that tells you where to be careful.

Also, if you do this, I am going to use the living hell out of it :). It's very sexy, and maybe I will just grow to love the style.

derammo commented 2 years ago

Follow-up: Yes, I can just wrap calling a function and awaiting it around this mechanism. My comment is more about what code style would be most intuitive and safe to bring to your user base.

reduz commented 2 years ago

@derammo well, the proper way to do it would be simply overriding the three functions, so this is mostly a hack of the kind GDScript users like to do to keep everything in one function.

derammo commented 2 years ago

Oh :). Yeah that would be explained in the docs :). I do think people in the general purpose programming world need help with the passing of state in and out of threads. The structure of the "normal" way of doing things that they copy from StackOverflow should somehow make it clear which things they need to safely make available to their thread and what they get back and when that result is available. JavaScript async routines still stump people, even though they made it so easy! I guess Game folks can probably handle it, but it does look dangerously like the code they are used to "just working" where you await states in your game and then write straight code and never worry about threads. WIth this, suddenly something that looks exactly the same is threaded. Something to think about maybe.

derammo commented 2 years ago

If I override the three functions, then I lose all support for getting my result back, right? Can I please return user data from prepare, get it passed to me in process, and get it back in done (or whatever is the first thing on main thread after my task is done?)

derammo commented 2 years ago

set_thread_process_dependency et alia could return an error if you can statically determine there is a cycle created by adding this extra link?

derammo commented 2 years ago

_thread_process_prepare and _thread_process names make total sense for what they do in C++ but result in weird looking script code. In your canonical example, _thread_process_prepare is the only virtual you actually implement. If _thread_process_prepare is only called once (i.e. you can't restart it) then it is very much like a threaded version of _process from the point of view of the user. So that's probably what should be called _thread_process, with the "middle function" renamed to something like _thread_work or something like that.

Otherwise, you are "preparing" something you don't even need to know exists. Yes that would mean renaming the signals somehow too.

Zylann commented 2 years ago

It seems this proposal doesn't allow a node to process both in thread and not in thread. It might be an issue that would force users to separate some of their logic to another node for the sake of it, or not do it at all due to the effort required.

If one node does a lot of work, it seems it will only be able to use one thread. Is it then intented to have to create multiple nodes just to subdivide the work?

Dependencies help structuring the flow of what must run in serial and what can run in parallel within a set of threaded tasks, but requiring nodes means that you require access to them in the scene, forcing you to have a monolithic node structure. An alternative is to allow specifying which "processing group" a node belongs to, so it can be depended upon without having to have access to specific node instances. It might be possible to use both.

Mistakes in threaded code are incredibly complex to figure out. Bugs can be wild and random. It matters here because the scene tree is where most Godot users live in. Dependencies help but they dont prevent such mistakes. In Unity the way jobs are defined allows to write annotations to tell what is read-only and write-only, which helps detecting errors, but in Godot a node can read and modify pretty much anything anywhere, which makes tracking mistakes quite hard.

Curious to know why the whole thing needs to be core and not just the worker system. The only reason I can see is wanting to couple to the scene tree and nodes in particular, but if a "threaded task" can be defined as any class with an "execute" function, it becomes much easier to be a more decoupled system, more similar to existing job systems (which is what I mainly use). But I guess you wanted to avoid that because it is complex. Not sure if this is less complex though. It just sounds like the same thing with some of the features being fixed to scene tree features so it might be less code to write, with less flexibility. I have experience in other multithread task systems so I might prefer using something else.

The optimal number of active concurrent threads is a small number, usually twice the number of CPU cores. Beyond that number, performance may be affected negatively. I see this proposal is based on a worker thread pool. I use a thread pool for my voxel engine too, but I cannot switch to using the same pool because it works differently and would force a dependency on Godot. Will there be a way to decide how many threads are dedicated to Godot's worker pool? The problem is if it uses all of them, it might lead to contention as more threads are used than what the CPU can handle. More generally, it's also a good idea to not use all CPU cores because players often have something else running than the game they are playing (music, voicechat, stream...) and in multiplayer games it's not rare to have to run multiple instances of the game on the same machine.

If threaded process becomes common, it asks for Godot to also provide multithreaded debugger and profiler support. On top of that, a timeline/flamegraph view in the profiler becomes even more relevant, because only that can give a visual insight of what happens in parallel or serial over multiple threads (example of one with Tracy). Other features would be to expose lacking RWLocks, ThreadLocal and maybe also thread index to GDScript, to complement synchronization needs.

Minor: would be nice if Godot named its threads. When debugging the engine there are a lot of them with a cryptic default name so it's time consuming to spot which one is which.

reduz commented 2 years ago

@Zylann

It seems this proposal doesn't allow a node to process both in thread and not in thread. It might be an issue that would force users to separate some of their logic to another node for the sake of it, or not do it at all due to the effort required.

No reason why you can't do this with the above proposal.

If one node does a lot of work, it seems it will only be able to use one thread. Is it then intented to have to create multiple nodes just to subdivide the work?

The new WorkerThreadPool lets you spawn worker threads from worker threads, so this is not a problem.

Dependencies help structuring the flow of what must run in serial and what can run in parallel within a set of threaded tasks, but requiring nodes means that you require access to them in the scene, forcing you to have a monolithic node structure. An alternative is to allow specifying which "processing group" a node belongs to, so it can be depended upon without having to have access to specific node instances. It might be possible to use both.

The above proposal is for nodes, to aid on the whole thing being more organized, but nothing avoids you from creating tasks and having them be dependent on your own outside of this.

Mistakes in threaded code are incredibly complex to figure out. Bugs can be wild and random. It matters here because the scene tree is where most Godot users live in. Dependencies help but they dont prevent such mistakes. In Unity the way jobs are defined allows to write annotations to tell what is read-only and write-only, which helps detecting errors, but in Godot a node can read and modify pretty much anything anywhere, which makes tracking mistakes quite hard.

I think this proposal needs to be accompanied with some added thread safety in the base node classes (Node, CanvasItem,Node2D, Node3D, etc). but that said, the whole point of the above proposal is that this happens after (or before) all the regular processing happened, and that you do all the scene access in either the pre or post stages, not the process stage.

Curious to know why the whole thing needs to be core and not just the worker system. The only reason I can see is wanting to couple to the scene tree and nodes in particular, but if a "threaded task" can be defined as any class with an "execute" function, it becomes much easier to be a more decoupled system, more similar to existing job systems (which is what I mainly use).

Thats entirely how it works already, you can organize tasks in WorkerThreadPool on your own if you don't want to use nodes. This proposal is explicitly for optimizing nodes.

The optimal number of active concurrent threads is a small number, usually twice the number of CPU cores. Beyond that number, performance may be affected negatively. I see this proposal is based on a worker thread pool. I use a thread pool for my voxel engine too, but I cannot switch to using the same pool because it works differently and would force a dependency on Godot. Will there be a way to decide how many threads are dedicated to Godot's worker pool?

Yes, when you create a group you can already specify how many elements, solver task and threads for it. You can also specify if you want them to be real-time (expected to be for intra-frame work) or not (its fine if takes more frames).

If threaded process becomes common, it asks for Godot to also provide multithreaded debugger and profiler support. On top of that, a timeline/flamegraph view in the profiler becomes even more relevant, because only that can give a visual insight of what happens in parallel or serial over multiple threads

Minor: would be nice if Godot named its threads. When debugging the engine there are a lot of them with a cryptic default name so it's time consuming to spot which one is which.

You can name your tasks, so eventually the plan is to show a proper timeline with tasks executed over time in the Godot profiler/debugger.

RandomShaper commented 2 years ago

This would look deceively simple. I mean, many not very advanced users would be appealed to it, not aware of the many potential traps.

For games that can really benefit from heavy threading, I'd rather let them use the same low and high level threading building blocks the engine has for itself, including the thread pool.

However, threaded internal processing for some node types for the engine itself doesn't sound bad, being transparent for users.

reduz commented 2 years ago

@RandomShaper To avoid problems this is why I think its important that there are three steps: setup, process and done. Only the middle one runs in a thread and user is not expected to not access anything from it (this can be very well specified in the docs).

this would need to be accompanied to some changes to the base Node classes so their are more thread safe.

Zylann commented 2 years ago

You can name your tasks

@reduz that doesnt apply if you are profiling/debugging the engine though. Naming tasks is great, but I meant naming threads because that's the only thing that appears in a debugger/profiler. It's not essential to this proposal though, just figured I'd add a quick mention of it.

the plan is to show a proper timeline with tasks executed over time in the Godot profiler/debugger.

Would that only be specific to this system? I wish the profiler was able to show the full picture and not just a specific system only.

reduz commented 2 years ago

@Zylann All thread usage in Godot is being moved to this system.

RandomShaper commented 2 years ago

As long as users are made aware how advanced this is and that their fate is in their hands, I'm not against it. Aside, despite this doesn't need a lot of code to be added to the engine, I still wonder how many real use cases will there be to justify the maintenance.

derammo commented 2 years ago

[I broke up my previous questions as a bunch of separate posts to avoid TL;DR, not sure if that worked out. Hopefully when you have some time you can respond to them.]

To RandomShaper's comment regarding it being deceptively simple, I agree 100%. Sometimes shorter code isn't better, like when you are actually doing something that is intrinsically hard to get right. Sometimes it's better to expose that and make people make choices consciously.

Here are some alternate ideas for the "process" phase. Just throwing them out there to see if anyone likes it better.

1) A lambda that you have to pass everything into that you want to access from self. Not even self is available by default. This would be syntactic sugar for a call to a static function in the Node class where you use it.

If you ARE able to mutex things or make sure you don't access them from main during the time you are threaded, then you can make that clear by passing them in. On the way out, you get something passed back to you (the work product) as a result value that you don't have to mutex because the thread is done with it.

2) A mode (or maybe the only mode, I don't know all the use cases) where main cannot call anything on this node while it is active on the thread (in the process phase.) This means you can't parallelize with yourself, only against work in other nodes. Everything in self is implicitly mutexed in this way. If you want to do stuff in parallel, make sure you design a small enough Node for it that it doesn't also have to field signals or calls at the same time. The "spooky thread stuff" boundary is the entire node from the user's perspective, but it would be hard to break anything as long as you don't call out of the node.

[edit] To clarify, this means no calls to _process, and all deferred calls would be queued. Any synchronous calls would block, which we could detect and report as probably bad design.

3) What I already asked for above, passing user data between the three phases you have already defined, so that you can just pass what you need from main to worker and back without mutexing anything.

Zylann commented 2 years ago

All thread usage in Godot is being moved to this system.

@reduz so if I (or a third party library) dont use a Godot worker pool I don't get to see anything in the profiler? If thats the case I feel it's the wrong justification (for profiling)... a profiler should allow instrumenting any code region and not litterally depend on on something that specific

reduz commented 2 years ago

@reduz so if I (or a third party library) dont use a Godot worker pool I don't get to see anything in the profiler?

I think that is to be expected

Zylann commented 2 years ago

@reduz It is not. When I use Tracy, a third-party profiler, I can report scopes to the profiler without Tracy depencing on Godot in the first place. So it follows that I should be able to report functions running in my own threads to Godot's profiler. And so can functions running in worker threads. Hence why I believe that dependency is unnecessary. Or maybe, the new feature is not meant to be what I wish Godot to have: a profiler that can indiscriminately report code sections regardless of their thread or language (in turn helping this proposal), instead of a specialized system only showing a part of what's happening in Godot.

reduz commented 2 years ago

@Zylann Oh sorry, I misunderstood what you meant. Sure, it should be fully possible to add this.

derammo commented 2 years ago

// Emited from the main thread after processing is happening
thread_physics_process()
thread_process()

Emitted from the worker thread, right?

reduz commented 2 years ago

@derammo yeah

MarianoGnu commented 2 years ago

TLDR, will finish reading the proposal and then come back with apropriate feedback. From my part the way i ever imagined threading process was using "call_deferred" or "class_deferred_async" to run the method in a thread. Right now call_deferred would always run the methods on the main thread, but i see no reason why the "job scheduler" could not use a message queue the same way the main thread does to sort, start, and complete small jobs in a separate thread

MarianoGnu commented 2 years ago

I have some more questins not clarified in the QA section ofthis proposal: Q: wht happens if (when) objects emmits signals in a thread that is not the main thread, should they be deferred to the main thread? should execute the code in the separate thread? should the connect method have flags to set what would be the behaviour there? Q: It is cool to have this running in Nodes, but because the life cycle of this time frames (process and pysics process) are not attached to the life cycle of the object i was wondering if we could use this to give scripts extending Resource or even Reference a way to tick (right now you have to connect to SceneTree.idle_frame) Q: This code is confusing, it seems to be doing everything just in prepare, also my understanding is this methid would run the _process_thread Callable every frame, but i don't see a way to dispatching a "done" event other than calling set_thread_process(false); Is this a real case scenario of how this method is meant to be used?

extends Node 

func _thread_process_prepare(delta):

     do_something()
     await thread_process
     work_in_thread()     
     await thread_process_done
     read_results()

Q: this mock says the process method is emmited from the main thread, when i expected it to be running in a job thread, was this a mistake or i am just not understanding the call chain? my understanding was "prepare and done happens on main thread, and process happens on job thread"

// Emited from the main thread after processing is happening
thread_physics_process()
thread_process()

derammo commented 2 years ago

@MarianoGnu for the last question, look up 3 posts. Confirmed typo, just not corrected.

oeleo1 commented 1 year ago

Hi,

All this looks cool except for the inherent Dark Side of threads. One usually doesn't look into threading, unless there is a crying need for performance optimization and concurrent execution. So if we ever engage into threading, we have determined that we are in need of performance optimization. If the multi-threading delivers on its promise, we overcome the Dark Side and we come out as winners. If it doesn't, we remain stuck in questionable optimizations.

Here is a classic Godot use case in need of performance optimization. I would be interested to know how this proposal is going to help in such elementary use case. Take it as a little challenge to elaborate on the proposal.

When we have a BigScene, we have 2 hot spots which are a killer in a mono-threaded environment:

Loading the BigScene
Adding the BigScene to the SceneTree

The optimizations available to us today, wrt the above two points, are:

Background loading of the BigScene
None

With background loading, we use successfully threading techniques to parallelize processing and optimize performance. We load and build the scene while we render something else, entertaining, on the screen, coupled with some nice background music.

With regard to adding the scene to the tree, we are stuck big time. As soon as we do add_child(BigScene) we trigger the _enter_tree() initializer functions top-down per node + the _ready() functions bottom-up which may be quite elaborate. For a BigScene, all this happens within 1 (one) rendering/idle frame, resulting in a big lag or game freeze which can span up to seconds, depending on the complexity of the BigScene. So clearly, we need to optimize this BigScene simplistic approach.

The obvious optimization is to break up the BigScene into smaller ones, and/or serialize some of the BigScene component construction. That's the classic immediate approach. But if you think about it, what we really want is to say add_child(BigScene) in a thread, concurrent to the main thread in charge of rendering so that we don't interrupt the ongoing entertainment. We just need some signal that the BigScene has been added successfully to the SceneTree by the thread, and its initializations performed by that secondary thread. Yes, it is tricky for the core to do so, but that's what the use case is.

So how is this proposal going to help me in this case? While I see fairly well the threading aspect of already initialized nodes, I fail to see how this is going to help with the problem of adding BigScenes to the tree, currently done atomically within the main thread.

Thanks.

Faolan-Rad commented 1 year ago

My engine is completely multi threaded and I use Godot nodes as a rendering system but I had to queue of events that will then run on the main thread at the start of the next frame How I have to pass values to nodes currently

addmix commented 6 months ago

Would this proposal also aid physics performance by parallelizing _integrate_forces()? Or would that require parallelization on the side of the physics server? Would there need to be a distinction between standard single threaded tree functions (_process, _physics_process, _integrate_forces), and their multithreaded counterparts to preserve compatibility?

godotengine / godot-proposals