Mixing Wall time with simulation time in Costmap2DROS::mapUpdateLoop

ros-navigation / navigation2

ROS 2 Navigation Framework and System

https://nav2.org/

Other

2.42k stars 1.24k forks source link

Mixing Wall time with simulation time in Costmap2DROS::mapUpdateLoop #3325

Closed AzaelCicero closed 8 months ago

AzaelCicero commented 1 year ago

Bug report

Required Info:

Operating System:
- All
ROS2 Version:
- All
Version or commit hash:
- All
DDS implementation:
- All

Steps to reproduce issue

Just use costmap_2d node with simulation time, which has very different rate than wall time. You will observe that HZ of map update and map publish is not reaching the expected value and are far off the target.

Expected behavior

Configured rates of update and publish should be achieved in respect to simulation time.

Actual behavior

Map update is reaching expected rate in respect to wall time. Publish rate is bounded to the update rate, as it can only occur during the update, yet it is triggered in respect to simulation time. Depending on the mismatch of the rate between simulation and wall time you will observe different HZ of update and publish.

Additional information

Code should be written in respect to one time source, in this case I think that simulation time is better suited.

https://github.com/ros-planning/navigation2/blob/main/nav2_costmap_2d/src/costmap_2d_ros.cpp#L461-L498

AlexeyMerzlyakov commented 1 year ago

Confirmed: the Costmap2D publisher works on very low rates for me locally:

[leha@leha-PC nav2_ws]$ ros2 topic hz /local_costmap/costmap
average rate: 1.666
    min: 0.600s max: 0.601s std dev: 0.00052s window: 3
average rate: 1.666
    min: 0.600s max: 0.601s std dev: 0.00043s window: 5
average rate: 1.666
    min: 0.600s max: 0.601s std dev: 0.00042s window: 7
average rate: 1.666
    min: 0.600s max: 0.601s std dev: 0.00040s window: 9
average rate: 1.617
    min: 0.600s max: 0.799s std dev: 0.05723s window: 11
average rate: 1.625
    min: 0.600s max: 0.799s std dev: 0.05307s window: 13
average rate: 1.630
    min: 0.600s max: 0.799s std dev: 0.04969s window: 15
average rate: 1.634
    min: 0.600s max: 0.799s std dev: 0.04687s window: 17

<- Average rate should be 5.0

[leha@leha-PC nav2_ws]$ ros2 topic hz /global_costmap/costmap
average rate: 0.500
    min: 2.000s max: 2.002s std dev: 0.00080s window: 2
average rate: 0.500
    min: 2.000s max: 2.002s std dev: 0.00065s window: 3
average rate: 0.500
    min: 2.000s max: 2.002s std dev: 0.00057s window: 4
average rate: 0.500
    min: 1.998s max: 2.002s std dev: 0.00116s window: 5
average rate: 0.500
    min: 1.998s max: 2.002s std dev: 0.00106s window: 6
average rate: 0.500
    min: 1.998s max: 2.002s std dev: 0.00099s window: 7
average rate: 0.500
    min: 1.998s max: 2.002s std dev: 0.00093s window: 8

<- Average rate should be 1.0

The situation when the problem appear became to be updateMap() performance cause. For local costmap, updateMap() execution time took ~1.0 sec, for global costmap - ~1.5 sec. In this case, we should record the time of main loop iteration start, and the time when it was ended. Then calculate remaining sleep time as sleep_time = 1/frequency - (cycle_end - cycle_start). If this time is negative, warn that we have to miss the cycle. In a good case - sleep(sleep_time). Then publish updated costmap. By the commented out warning, it seems that this problem was already met, but for some reasons the solution was not applied.

SteveMacenski commented 1 year ago

@AzaelCicero A PR is always appreciated.

For local costmap, updateMap() execution time took ~1.0 sec, for global costmap - ~1.5 sec.

If the issue is that the actual updating of the costmap is too long, that's not a problem with clocks / timers. That's trying to do too much "stuff" to be faster than real-time algorithmically.

sleep_time = 1/frequency - (cycle_end - cycle_start)

The ROS rate should be doing that for us to keep a regular frequency - unless there's something wrong with that.

AlexeyMerzlyakov commented 1 year ago

If the issue is that the actual updating of the costmap is too long, that's not a problem with clocks / timers. That's trying to do too much "stuff" to be faster than real-time algorithmically.

Yes, that was my oversight. The problem is not related to the performance, as I initially thought.

The ROS rate should be doing that for us to keep a regular frequency - unless there's something wrong with that.

Indeed the rclcpp::WallRate already does the same logic I've supposed in previous message. However, it works in system chrono time instead of simulation time that we need. So, there is a suggestion to write our own nav2_utils::WallRate as a ROS time based replacement with implementation that could be like follows:

class WallRate
{
public:
  WallRate(double rate, rclcpp::Clock::SharedPtr clock)
  : WallRate(rclcpp::Duration::from_seconds(1.0 / rate), clock)
  {}
  WallRate(rclcpp::Duration period, rclcpp::Clock::SharedPtr clock)
  : period_(period), last_point_(clock->now()), clock_(clock)
  {}

  bool sleep()
  {
    rclcpp::Time now = clock_->now();
    rclcpp::Time next_point = last_point_ + period_;

    // Correct time if now was shifted back away 
    if (now < last_point_) {
      next_point = now + period_;
    }

    // Check that now is before the next_point
    if (now >= next_point) {
      return false;
    }

    // Calculate time required to sleep
    const rclcpp::Duration sleep_time = next_point - now;

    // Sleep for necessary time
    while (clock_->now() - now < sleep_time) {
      std::this_thread::sleep_for(10ms);  // Sleep grain is 10ms
    }

    // Update last_point_
    last_point_ += period_;

    return true;
  }

  void reset()
  {
    last_point_ = clock_->now();
  }

  rclcpp::Duration period() const
  {
    return period_;
  }

protected:
  rclcpp::Duration period_;
  rclcpp::Time last_point_;
  rclcpp::Clock::SharedPtr clock_;
};

Locally, it works for me: when changing Gazebo sim rate from 1000 -> down to 200, the r.sleep() in mapUpdateLoop() will actually slowen down in x5 times in real time.

AzaelCicero commented 1 year ago

I have created PR which replaces the WallRate with ROS timer created with the node clock instance. I had to remove the functionality of waiting for the first publish, as now the update is done on the ROS executor loop. I am not sure of the purpose of this functionality, so I cannot predict what impact It may have.

I have tested this functionality in my use case where I am testing some planning algorithms on data replayed from ROS bags. Algorithms are using costmap as input. I have noticed that the frequency of costmap publishing is far off the expected. Then I started analyzing the code and found out that map update is controlled by system time, not simulation time.

With the change proposed by PR, the rate of map updates and map publishing is as expected. I think that solution proposed by @AlexeyMerzlyakov is also viable, with the remark that this WallRate with ROS clock should be used on a separate thread, not on ROS executor thread if SingleThreadedExecutor is in use. Otherwise, we would be waiting for clock progression which would never come as we would block receiving clock updates.

One test is not passing, I am going to fix that if you will give a green light to my PR.

AlexeyMerzlyakov commented 1 year ago

Just for noting: On the days I've got some strange effect for both proposed in this ticket approaches (switching to rclcpp::timer; or making our own Nav2's ROSRate instead of rclcpp::WallRate). The effect did not appear if we are slowing down the simulation in Gazebo: everything is working as expected. But while we are increasing Gazebo Physics -> real time update rate to 1500Hz or higher, both ROS and real measured times between two consequent costmap publications became to be unstable and shown incorrect jumping values. After some investigation, I've found that the reason of this issue is that Gazebo simulation time has some quantum of ROS-time: minimal value of ROS time to be increased on the simulation step. This step in default Gazebo configuration is 100ms. So high value is giving us a instabilities on high simulation frequencies. This ROS-time minimal increase value is placed in the gazebo_ros package and utilized as input frequency for Throttler(const double _hz) constructor. Default value passed to this constructor is being handled by publish_rate ROS-parameter of Gazebo. It is being set by-defalt to 10Hz (100ms), which causes the problem on higher rates.

Adding publish_rate:=100.0 (10ms ROS-time quantum) parameter while calling gzserver drastically improves the situation on high (1500+) gazebo real time update rates:

diff --git a/nav2_bringup/launch/tb3_simulation_launch.py b/nav2_bringup/launch/tb3_simulation_launch.py
index b4dfd3bd..7fdf35e5 100644
--- a/nav2_bringup/launch/tb3_simulation_launch.py
+++ b/nav2_bringup/launch/tb3_simulation_launch.py
@@ -160,7 +160,8 @@ def generate_launch_description():
     start_gazebo_server_cmd = ExecuteProcess(
         condition=IfCondition(use_simulator),
         cmd=['gzserver', '-s', 'libgazebo_ros_init.so',
-             '-s', 'libgazebo_ros_factory.so', world],
+             '-s', 'libgazebo_ros_factory.so', world,
+             '--ros-args', '--param', 'publish_rate:=100.0'],
         cwd=[launch_dir], output='screen')

     start_gazebo_client_cmd = ExecuteProcess(

AlexeyMerzlyakov commented 1 year ago

The above observations makes me think about the accuracies of both solutions from this ticket (switching to rclcpp::timer vs. having our own nav2_utils::ROSTimer). I've made some small investigation by measuring the time intervals (in ROS and real time) passed between two subsequent mapUpdateLoop() cycles. The measurements are summarized by error of average time from expected time, and standard deviation of this time. This was made for ROS time quantum (publish_rate Gazebo ROS-parameter) to be equal to 10ms and 2ms. For each ROS time quantum value, the simulation speed was varied by changing real_time_update_rate (https://classic.gazebosim.org/tutorials?tut=physics_params) in 200Hz..10KHz range. The summarized results are presented in the table below: Issue3325_approaches_accuracy

As follows from the results, both approaches (rclcpp::timer and nav2_utils::ROSTimer) are having comparable errors of real measured times between mapUpdateLoop() cycles on low and high real_time_update_rate frequencies (and thus on different CPU loads); when nav2_utils::ROSTimer is slightly overcomes rclcpp::timer approach.

However, when comparing the time errors of measured ROS-time between mapUpdateLoop() cycles, which we are initially targeting to hit in this issue, the undisputed leading method in this case is rclcpp::timer approach. It shows ~33x-130x lower time errors and notably better deviations (better results stability) comparing to nav2_utils::ROSTimer approach. Here is the most CPU-loaded case (ROS quantum time = 2ms; Gazebo real time update rate = 1KHz which is 10x faster than real world speed) for both approaches, depicted below: 2ms_quantum_10K_rate

Which makes rclcpp::timer approach the best for selection. I'll update #3327 with all review items from there (it seems the PR was abandoned for a long time) and prepare the final patch shortly.

SteveMacenski commented 11 months ago

Yay! rclcpp Rate merged in!

SteveMacenski commented 8 months ago

@AlexeyMerzlyakov what's the story with this ticket now? With the new Rate updates in rclcpp, can we use that now in the costmap thread and close out this ticket?

AlexeyMerzlyakov commented 8 months ago

@SteveMacenski, the scope of related works are reported in https://github.com/ros-planning/navigation2/issues/3303#issuecomment-1833302021. This issue is covered by #3303, as there are already discussing all the places need to be ROS-time respective, incl. new rclcpp::Rate API

SteveMacenski commented 8 months ago

From @AlexeyMerzlyakov translating to this ticket:

More general, by grepping where we are using std::chrono's now() and rclcpp::WallRate (the same problem caused by rclcpp WallRate, described in #3325), we have the following places where ROS timings are being ignored (tests were excluded):

Where	Type	Comment
nav2_behavior_tree/plugins/decorator/rate_controller.cpp	std::chrono
nav2_behavior_tree/src/behavior_tree_engine.cpp	rclcpp::WallRate
nav2_behaviors/include/nav2_behaviors/timed_behavior.hpp	rclcpp::WallRate
nav2_behaviors/plugins/wait.cpp	std::chrono	Discussed by #3303
nav2_controller/src/controller_server.cpp	rclcpp::WallRate
nav2_costmap_2d/src/costmap_2d_ros.cpp	rclcpp::WallRate	Covered by #3325
nav2_smac_planner/src/a_star.cpp	std::chrono
nav2_theta_star_planner/src/theta_star_planner.cpp	std::chrono
nav2_theta_star_planner/src/theta_star_planner.cpp	std::chrono
nav2_waypoint_follower/src/waypoint_follower.cpp	rclcpp::WallRate

SteveMacenski commented 8 months ago

@AlexeyMerzlyakov is https://github.com/ros-planning/navigation2/pull/3404 good to go with the rclcpp merge?

Or, does it make sense to instead just change https://github.com/ros-planning/navigation2/blob/main/nav2_costmap_2d/src/costmap_2d_ros.cpp#L496 WallRate to the new Rate with clock API and be done with the simple change now that we have it? The analysis in https://github.com/ros-planning/navigation2/issues/3325#issuecomment-1424379453 I'm not sure if its still relevant now that we have the new API in rclcpp

AlexeyMerzlyakov commented 8 months ago

@SteveMacenski, please refer to the https://github.com/ros-planning/navigation2/pull/3404#issuecomment-1842776234 which explains the intention

SteveMacenski commented 8 months ago

OK, so I think we're all squared away on this issue. I have decided that things like Costmap's loop rate, waypoint follower's loop rate, and the controller's loop rate should be tied to the system clock. Internal timings of systems are using ROS time when in simulation, but module-level operations should remain as tied to wall time.

This is because systems like the costmap and controller server are not systems that can be simulated at 2x-3x-10x speeds due to their computational complexity. The added variation due to the simulation clock is not worth the advantages and results in a poorer quality and explainability of navigation behavior at those higher speeds since you'll see problems not seen at 1x speeds. I'm not so dead set on this that I'm not open to considering this change in the future, however today I think it is not a worthwhile trade off. It also created a number of serious issues breaking testing behavior which given the above statements (which almost certainly impacts real simulation use-cases as well), after about half a day of debugging I ultimately decided wasn't worth continuing to address.

The behaviors (timed behavior, spin, drive on heading, backup, wait, etc), planners planning time, smoothers smoothing time, and rate controller have all been updated to use simulation time from the clock as internal behavioral matters. Costmap's & waypoint follower's loops; controller will remain tied to walltime since those are system level behaviors where changes in timing are important.

You can see that in the PR https://github.com/ros-planning/navigation2/pull/4000