una-auxme / paf

Praktikum Autonomes Fahren - PAF
MIT License

Simulator performance degradation over time #435

Open seitzseb opened 3 weeks ago

seitzseb commented 3 weeks ago

Detailed Description

The simulation performance degrades over time (the rate drops from 0.33 to 0.29), which could delay the agent's real-time reactions and reduce sensor data accuracy.

Definition of Done

Simulation rate remains consistent over extended periods of time. Agent performance is unaffected by simulation slowdown.

Effort Estimate

8

Testability

Run prolonged simulations (e.g., 30 mins) to verify stable performance.
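
A minimal way to instrument such a prolonged run is to compare the simulation time from /clock against wall-clock time and log the resulting real-time factor. The sketch below assumes ROS1 with rospy and that the carla-ros-bridge publishes /clock; the node name and logging interval are arbitrary choices, not part of the existing setup.

```python
#!/usr/bin/env python3
"""Sketch: log the simulation real-time factor during a long run."""
import time

import rospy
from rosgraph_msgs.msg import Clock


class RateMonitor:
    """Compares elapsed simulation time against elapsed wall-clock time."""

    def __init__(self):
        self.start_wall = None
        self.start_sim = None

    def clock_cb(self, msg):
        now_wall = time.monotonic()
        now_sim = msg.clock.to_sec()
        if self.start_wall is None:
            self.start_wall, self.start_sim = now_wall, now_sim
            return
        wall_elapsed = now_wall - self.start_wall
        if wall_elapsed > 0:
            ratio = (now_sim - self.start_sim) / wall_elapsed
            # Throttled so a 30 min run produces a readable log.
            rospy.loginfo_throttle(10.0, "real-time factor: %.3f", ratio)


if __name__ == "__main__":
    rospy.init_node("rate_monitor")
    monitor = RateMonitor()
    rospy.Subscriber("/clock", Clock, monitor.clock_cb)
    rospy.spin()
```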

Dependencies

Simulator hardware and underlying software stability improvements.

vinzenzm commented 3 weeks ago

I opened a new issue about the stuttering issues (#446). Could this be related to this issue as well? Maybe, because of the stuttering, the ROS clock and the CARLA clock drift too far apart, which results in issues?

asamluka commented 3 weeks ago

It seems like the performance is different from machine to machine. If the car is stuck, the performance doesn't decrease. Maybe we should specify which part of the system slows down and what the reason for this is (e.g. the machine itself, the simulation, or the car?). If we know what the issue is, we can discuss how to solve it.

vinzenzm commented 2 weeks ago

> I opened a new issue about the stuttering issues (#446). Could this be related to this issue as well? Maybe, because of the stuttering, the ROS clock and the CARLA clock drift too far apart, which results in issues?

I was able to directly correlate these issues. If many of these WARNINGS (about timestamps) appear, the ratio slowly goes down. CARLA is started with a 20 Hz update rate, and the ratio is essentially determined by how many of these timesteps the warning appears in. This happens because the carla-ros-bridge tries desperately to synchronize the ROS clock with its simulation clock.
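
One way to check this correlation without reading the console is to count the timestamp warnings arriving on /rosout_agg per second and relate them to the 20 ticks CARLA should produce in that second. The sketch below assumes ROS1/rospy; the substring matched against the warning text and the topic name are assumptions and would need to be adapted to the actual log output.

```python
#!/usr/bin/env python3
"""Sketch: relate timestamp warnings per second to the 20 Hz simulation rate."""
import time
from collections import deque

import rospy
from rosgraph_msgs.msg import Log

TICKS_PER_SECOND = 20      # CARLA fixed update rate mentioned above
recent = deque()           # wall-clock times of matching warnings


def rosout_cb(msg):
    # "timestamp" is a guess at the warning text; adjust to the real message.
    if msg.level >= Log.WARN and "timestamp" in msg.msg.lower():
        recent.append(time.monotonic())


def report(_event):
    cutoff = time.monotonic() - 1.0
    while recent and recent[0] < cutoff:
        recent.popleft()
    affected = min(len(recent), TICKS_PER_SECOND)
    rospy.loginfo("timestamp warnings in last second: %d (~%d of %d ticks affected)",
                  len(recent), affected, TICKS_PER_SECOND)


if __name__ == "__main__":
    rospy.init_node("timestamp_warning_counter")
    rospy.Subscriber("/rosout_agg", Log, rosout_cb)
    rospy.Timer(rospy.Duration(1.0), report)
    rospy.spin()
```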

I will add the acting label to this as well because I found different vehicle_controller setups and configurations which seem to run better on my machine. However, I managed to start up a minimal example which only starts one node that sends empty CarlaEgoVehicleControl commands. Even this minimal solution didn't get close to the 1.00 mark, and when starting perception on top of it, the ratio plummets drastically.
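
For reference, a minimal node like the one described could look roughly like the sketch below. The topic name assumes the role name "hero" (the bridge uses /carla/<role_name>/vehicle_control_cmd), so the exact name depends on the launch configuration; this is a sketch, not the exact setup used above.

```python
#!/usr/bin/env python3
"""Sketch: a single node that only publishes empty CarlaEgoVehicleControl commands."""
import rospy
from carla_msgs.msg import CarlaEgoVehicleControl

if __name__ == "__main__":
    rospy.init_node("minimal_control_publisher")
    # Topic name assumes the role name "hero"; adjust to the configured role name.
    pub = rospy.Publisher("/carla/hero/vehicle_control_cmd",
                          CarlaEgoVehicleControl, queue_size=1)
    rate = rospy.Rate(20)  # match the 20 Hz simulation step
    while not rospy.is_shutdown():
        pub.publish(CarlaEgoVehicleControl())  # all fields left at their defaults
        rate.sleep()
```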

In my opinion, fixing this issue completely is out of scope for us. We should of course keep an eye out for possible devastating performance issues, but we won't be able to solve this without completely rethinking our infrastructure.

This is sad because the node concept of the ROS architecture tries to get rid of such cross-correlations between nodes, but the time synchronization in our case brings these dependencies back.

ll7 commented 2 weeks ago

https://github.com/una-auxme/paf/issues/446#issuecomment-2461525134 might be relevant here as well.

vinzenzm commented 2 weeks ago

After quite a bit of time working on #446: there is no way to get rid of the performance issues completely, but improving sub-component performance will improve the situation.

In rqt you can select Plugins > Introspection > Process Monitor. At the moment these 5 nodes always lead the chart:

- kalman_filter_node
- vision_node
- lidar_distance
- pure_pursuit
- stanley_controller

[Screenshot: rqt process monitor showing the resource usage of the nodes listed above]

I will therefore add the Perception group to this issue as well. The mentioned nodes should receive a proper investigation into why they use this many resources. We might be able to improve this.
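
To log the same per-node CPU numbers over time outside of the rqt GUI, the node PIDs can be resolved via the ROS master and sampled with psutil. The sketch below assumes ROS1 (rosnode, rosgraph) plus psutil; the caller id and sampling interval are arbitrary.

```python
#!/usr/bin/env python3
"""Sketch: periodically print the top CPU-consuming ROS nodes."""
import time
from xmlrpc.client import ServerProxy

import psutil
import rosgraph
import rosnode

CALLER_ID = "/node_cpu_logger"  # arbitrary caller id for the master/slave APIs


def node_pids():
    """Resolve node name -> PID via the ROS master and each node's XML-RPC API."""
    master = rosgraph.Master(CALLER_ID)
    pids = {}
    for name in rosnode.get_node_names():
        try:
            uri = master.lookupNode(name)
            pids[name] = ServerProxy(uri).getPid(CALLER_ID)[2]
        except Exception:
            continue  # node unreachable or already gone
    return pids


if __name__ == "__main__":
    procs = {name: psutil.Process(pid) for name, pid in node_pids().items()}
    for proc in procs.values():
        proc.cpu_percent()  # prime the counters; the first call always returns 0.0
    while True:
        time.sleep(5.0)
        usage = []
        for name, proc in procs.items():
            try:
                usage.append((proc.cpu_percent(), name))
            except psutil.NoSuchProcess:
                continue
        for cpu, name in sorted(usage, reverse=True)[:5]:
            print(f"{name}: {cpu:.1f}% CPU")
        print("---")
```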