godot-rust / gdext

Rust bindings for Godot 4
https://mastodon.gamedev.place/@GodotRust
Mozilla Public License 2.0
2.91k stars 178 forks source link

Too many calls to `object_get_instance_binding()` #793

Closed Ughuuu closed 2 weeks ago

Ughuuu commented 1 month ago

In my performance investigation I see a lot of usage of: PhysicsDirectBodyState2D -> get_transform, get_velocity ...

For these, the times are much higher than in the C++ equivalent. I even make the functions be no-op, but the overhead is still there.

What I see is Godot calls inside PhysicsDirectBodyState2D 5 times, with 5 different functions, and in 3D mode about 6 times. Old version of plugin was able to do 8800 circles, so that would be for 2D 8000 * 5 times per second.

The function right now looks like this:

#[godot_api]
impl IPhysicsDirectBodyState2DExtension for RapierDirectBodyState2D {
    fn get_transform(&self) -> Transform2D {
        Transform2D::IDENTITY
    }
}

I would like to have a more efficient way of doing this, if possible.

lilizoey commented 1 month ago

One way to significantly help this issue is with custom user-data wrappers. This would let the user pick a calling convention here that reduces overhead, potentially even makes it a no-op.

Making it possible to pick what receiver you want here would also help. Using Gd<Self> instead for instance should reduce overhead by avoiding bind/bind_mut. But im not entirely sure of a good way to do that, a couple of ways would be:

  1. wait for arbitrary self types, and be generic over Receiver (this will likely take a very long time to stabilize)
  2. add more traits for different receivers (adds a bunch of more traits)
  3. use proc-macro magic (wont work that well with the traits)
  4. use builder-api

In my opinion the builder api is probably the best solution out of these, but it is a lot of work. It might also add a little bit of overhead during start-up, but that's a one-time cost.

Ughuuu commented 1 month ago

Is there one of these I could possibliy do in a hacky way on a fork of this repo? That you know of. I'm not sure many people would care that much about performance in this case, that's why I ask(in case this never gets to see the light of day)

Bromeon commented 1 month ago

Did you evaluate in this specific problem what causes the majority of calls in this particular problem? In your link you have lots of ranked lists, but it's not clear which exactly applies here.

Is it GdRef/GdMut guards? If yes, which operations specifically?

Ughuuu commented 1 month ago

I'm not 100% sure I understand the question or know what GdRef/GdMut usage has to do with my problem, but basically godot is calling into the dll in PhysicsDirectBodyState2D::get_transform, PhysicsDirectBodyState2D::get_linear_velocity, PhysicsDirectBodyState2D::get_angular_velocity, PhysicsDirectBodyState2D::is_sleeping, PhysicsDirectBodyState2D::get_contacts_debug.

I implement these functions like this:

#[godot_api]
impl IPhysicsDirectBodyState2DExtension for RapierDirectBodyState2D {
    fn get_transform(&self) -> Transform2D {
        Transform2D::IDENTITY
    }
    fn get_linear_velocity(&self) -> Vector2 {
        Vector2::default()
    }
...
}

So, as a test, I am doing basically no-op to see what the problem is from. And there is still a very big performance hit. These functions are called for every active object. So in the case I have 8000 objects, the lib gets called from godot 8000 * 5 times (for 3d its more as for 3d there is also inertia_tensor and principal_axis functions that are called).

So the problem seems to be the linking of the &self and &mut_self` of this struct. I still would need a way to match this struct to something though, as internally I would get the object from a hashmap and then get the transform/velocity/other_states.

Even more, if I am looking at the percentage of the function using rust and the one using c++, the time it takes to call this function for the rust one is very big compared to the c++ one. Eg. without this and the bug about Array cloning(tested by disabling this flow completely, not calling some callbacks into godot) I am getting in 3D use case of rapier 8k-9k objects, for 3D getting 10k-11k.

Bromeon commented 1 month ago

I'm not 100% sure I understand the question or know what GdRef/GdMut usage has to do with my problem

The &self in your virtual method needs to come from somewhere. Behind the scenes, there is a Gd<RapierDirectBodyState2D>, on which bind() is called -- bind() returns a guard of type GdRef. Doing this performs a runtime borrow check to ensure there's no concurrent write active.

In your case, you wouldn't need this since you don't access self but instead return constants.


So if you run the code that calls these virtual functions through a profiler, what are the top contenders for the CPU time spent? To me it wasn't clear if this is listed in your analysis linked in the initial post, and if so, where exactly.

Ughuuu commented 1 month ago
  1. I would still need to get the data from somewhere, but at least I would not do that bind thing. I am hoping to get the data in a faster way.
  2. The bottlenecks happen from:
    • The flush_queries function. This function is a function of physics_server_2d and is called from Godot. From this function I call callable.callv and then Godot calls back into those 5 functions I listed above 8000 times.

From investigation, at Main -> Flow1 -> Callable::callv (1.43s) is where the first callable happens, that then calls into godot and gets called back 5 times: image

Then, at Main -> Flow1 -> Callable::callv (1.43s) --> 1. you can see how one of these callbacks looks like: image

Both from this and other one, the most time spent is:

Note:

As can be seen, a big chunk of the code execution comes from RawGd::bind.

Ughuuu commented 1 month ago

After a lot of investigations, it appears the RawGd::bind method calls internally get_instance_binding and that uses a lock. Instead we can cache that value.

Investigation of other cases (godot-cpp with c++, godot-rust with rust). Both are release builds.

5 seconds timeframe, flush_queries

Box2D: 1.12s (22%). - 7000 balls 1-3 fps

Interesting to see here that out of total of 22% from flush_queries, 10% is from body_state_changed, and the get_linear_velocity is at 0.6%.

Godot-Rust without this change: 2.66s (47%). - 7000 balls 4 fps

Godot-Rust with this change: 2.61s (47%). - 7000 balls 8-10 fps

Notice how the total time of the flush_queries function didn't really go down. However this might be because the fps affects this too. Since fps is down, the callv function itself isn't called that many times.

Will update with times from less balls so they all have 60fps, enough time to call the Callable::callv same amount of times.

3000 Balls with godot in release mode

Physics Callable::callv body_state_changed
Box2d 6.3% 5%
Rapier without change 11.4% 8.5%
Rapier with change 11.3% 8.1%
Bromeon commented 2 weeks ago

Closed by #831.