Improve performance of Godot Physics Server 2D/3D active callback events

Ughuuu commented 2 months ago

Describe the project you are working on

Describe the problem or limitation you are having in your project

The 2 functions that take the most time are:

step
flush_queries

Notably the flush_queries one takes about as much as the step (for large enough number of objects). This is not right. The flow of the calls for flush_queries are as follows:

The RigidBody2D and RigidBody3D (and other nodes) call the PhysicsServer with body_set_state_sync_callback. This sets a callback to be called when this body is active. Then Godot calls back into the body to get some information, such as velocities, positions, sleep state. This happens at Godot side in the _sync_body_state function. This function is internal

Describe the feature / enhancement and how it helps to overcome the problem or limitation

I would propose the flush_queries would not flush the queries, but instead a different approach is taken.

Right now flush_queries calls the callbacks, and then it's the nodes job to get what info they need, decoupling things a lot. This is great for architecture design. However for performance it's not as efficient as it could be.

I propose we do this update in a larger batch, rather than one by one.

Describe how your proposal will work, with code, pseudo-code, mock-ups, and/or diagrams

I propose we have a function PhysicsServer::space_get_active_bodies(space_rid) -> Array[RID]. This function returns the RID of all active objects.

Then, we have another function, PhysicsServer::space_get_bodies_transforms(space_rid, array_bodies_rid) -> Array[Transform]. This function returns an array of transforms for the bodies.

Then, we have another function, PhysicsServer::space_get_bodies_linear_velocities(space_rid, array_bodies_rid) -> PackedVectorArray. This function returns an array of linear velocities for the bodies.

Same for angular velocity and is_sleeping.

The idea here would be we:

get active body RID's
iterate through all RID's and match then with actual object. Check which of these need the transform(eg. if a body is frozen, it might not need transform) and do a call to get these bodies transforms in bulk. Then, update the bodies state (modify the _sync_body_state function to instead of getting pointer to body_state to get the actual data it needs).

For this to work, there needs to be a place where we would store a reference to these nodes as to be able to iterate them quickly (eg. right now the flush_queries is called from main.cpp file. We would need to then make another call to some other service to send these info there(eg. get the nodes and update them)). Thile this might be a bit comlicated to implement, the benefits would be quite big for performance gains.

If this enhancement will not be used often, can it be worked around with a few lines of script?

A workaround for a person extending the physics server would be to offer such a function to get active bodies and a function to get the body state(eg. transform, velocity, etc). and then have them update the nodes manually. This is tricky as right now there is no way to disable node updates from outside afaik.

Eg. in 2d to do:

set_block_transform_notify(true);
set_global_transform(p_state->get_transform());
set_block_transform_notify(false);

Basically update the rigidbody or other such nodes properties without also propagating a physics update event. Another workaround woudl be to disable the physics server and put it in some mode for these updates, but all of this would be complicated for a workaround.

Is there a reason why this should be core and not an add-on in the asset library?

Very hard to do in an addon, needs to be core.

Ughuuu commented 2 months ago

As a means of providing data for this, I am benchmarking using Instruments app on mac to get the 2 largest times, step and flush_queries. I will focus on 3D for this.

Godot: goes up to about 3000 objects

step function 21%
flush_queries 6%

Jolt: goes up to about 7000 objects

step function 10%
flush_queries 10%

am running without dev as there is some issue with latest jolt and latest godot.

Rapier: goes up to about 5000 objects

step function 17%
flush_queries 37%

Note this is written with godot-rust so the bottleneck of the callable and everything is even higher.

Example improvement

I know it doesn't show exactly 1 to 1 comparison of what could be implemented, but I did in Rapier a change where I disabled the active callback: RapierPhysicsServer3D.body_set_state_sync_callback(body.get_rid(), Callable()) and am easily able to get 10k bodies simulated. I also made function that returns transforms of bodies and active bodies, and it's still fast. I know it's not the same as it does not update the bodies, but I'm sure some improvement would be gained for other physics engines too.

Another note, simulating Rapier locally (without godot) yields about 10-12k shapes. So the simulation itself should be a lot higher than 5k. The slowness comes from godot calling every callable 1 by 1.

About web

I know some might think, oh but isn't this enough performance already? No.

When running godot on the browser, divide all these numbers by 5 or so. So if on desktop you get 3000 objects in godot physics, on the browser you get maybe 500 objects. So on the browser if you want to get large numbers, you need to improve performance by a lot. Also, I am running locally on a macbook m2 pro. So I will get large numbers. But old android devices might struggle.

reduz commented 2 months ago

The problem is not the callback, the problem is that the whole scene system is not optimal to work like this. If you have so many objects use the servers directly and bypass the scene.

reduz commented 2 months ago

I still think it could be a good idea to profile what is going on at the scene side, though, there may be some silly bottlenecks there, but I think whathever it is, a single C function pointer callback should never have this impact.

mihe commented 2 months ago

It's also worth keeping in mind that if eliminating the Callable overhead is the goal here, then you also need to consider the overhead of interacting with Godot collections like Packed*Array from GDExtension, since pretty much every interaction with them (including append and operator[]) will have to cross the extension boundary (i.e. invoke a function pointer). I'm not sure how godot-rust implements Callable though, so it might still make a sizeable difference there, but for godot-cpp I wouldn't expect to see as big of a difference between the two approaches, assuming you're still fetching the same state.

iterate through all RID's and match then with actual object [...] For this to work, there needs to be a place where we would store a reference to these nodes

As talked about on Rocket Chat, the way you'd do this right now is to ask the physics server for the body's instance ID (ObjectID) and then fetch the corresponding node with that using ObjectDB::get_instance, which will involve some non-trivial overhead as well.

On top of that you need to somehow filter out the nodes that still need to have their _body_state_changed method invoked, since you still have things like _integrate_forces to invoke and contact signals to emit, for every active RigidBody*D. On the 3D side you also have PhysicalBone3D which similarly needs to be filtered out and have its specific _body_state_changed method invoked. The naive approach to this would be to use Object::cast_to I guess, but that involves dynamic_cast, which (last I checked) is quite slow.

Ughuuu commented 2 months ago

I still think it could be a good idea to profile what is going on at the scene side, though, there may be some silly bottlenecks there, but I think whathever it is, a single C function pointer callback should never have this impact.

I see, interesting approach. Will check this also then and see where the bottleneck comes from.

Ughuuu commented 2 months ago

It's also worth keeping in mind that if eliminating the Callable overhead is the goal here, then you also need to consider the overhead of interacting with Godot collections like Packed*Array from GDExtension, since pretty much every interaction with them (including append and operator[]) will have to cross the extension boundary (i.e. invoke a function pointer). I'm not sure how godot-rust implements Callable though, so it might still make a sizeable difference there, but for godot-cpp I wouldn't expect to see as big of a difference between the two approaches, assuming you're still fetching the same state.

That's a good point, maybe the overhead in rust comes from the way the gdextension works: flush_queries -> godot_rust(callables) -> godot -> godot_rust (get_data)

For these things, godot_rust needs to bind mutable self, and there are some things that add up. While if there was just: flush_queries -> godot_rust (get_data) -> do stuff

Would be better in that regard.

Ughuuu commented 2 months ago

As talked about on Rocket Chat, the way you'd do this right now is to ask the physics server for the body's instance ID (ObjectID) and then fetch the corresponding node with that using ObjectDB::get_instance, which will involve some non-trivial overhead as well.

On top of that you need to somehow filter out the nodes that still need to have their _body_state_changed method invoked, since you still have things like _integrate_forces to invoke and contact signals to emit, for every active RigidBody*D. On the 3D side you also have PhysicalBone3D which similarly needs to be filtered out and have its specific _body_state_changed method invoked. The naive approach to this would be to use Object::cast_to I guess, but that involves dynamic_cast, which (last I checked) is quite slow.

I think I would still leave that integrate_forces function, I would just change active_body callback.

Not 100% sure about it, but at least on the callstack (for godot-rapier) I do see some dynamic_casts happening in the investigation anyway.

mihe commented 2 months ago

For these things, godot_rust needs to bind mutable self, and there are some things that add up. While if there was just: flush_queries -> godot_rust (get_data) -> do stuff

Less work is better, of course, but given the Object-agnostic design of the server architecture in Godot I feel like there's always going to be one or two relatively expensive steps added before "do stuff" there.

I think I would still leave that integrate_forces function, I would just change active_body callback.

Right, but _integrate_forces (and the contact signals) are invoked as part of that callback, as seen here. So I would imagine your options are then to either keep invoking the callback from the physics server (somewhat defeating the point of all of this) or to have Godot invoke _body_state_changed (and thus _integrate_forces and the signals) directly on the bodies it gets from this new space_get_active_bodies.

That last option would presumably involve some kind of filtering/downcasting, since you'll have bodies of type BODY_MODE_RIGID, BODY_MODE_RIGID_LINEAR and BODY_MODE_KINEMATIC in that list. Not only that, some of the bodies could have been added directly through the physics server by the end-user, and their respective instance IDs might reference script-defined classes, or might not even be ObjectIDs at all, and instead just be some index into an array somewhere.

I do see some dynamic_casts happening in the investigation anyway

Object::cast_to is very prevalent in Godot.

godotengine / godot-proposals