ros-navigation / navigation2

ROS 2 Navigation Framework and System
https://nav2.org/

[Smac Planner Hybrid A*] Reeds-Shepp motion model not working: TF tree breaks, map->odom transform missing #4111

Closed SimonGiampy closed 9 months ago

SimonGiampy commented 9 months ago

[Smac Planner Hybrid A*] Reeds-Shepp motion model not working

Error description

I'm trying to navigate with a skid-steering robot, using both differential drive and Ackermann approximations for the local planner (controller server). I want the global planner to generate trajectories in both forward and reverse directions, as for an Ackermann vehicle. Everything works smoothly with my configuration until I select the Reeds-Shepp motion model for the Hybrid A* algorithm in the global planner. With the Dubin motion model, navigation works fine but trajectories are always in forward motion.

The trigger is the parameter motion_model_for_search: "REEDS_SHEPP", which is required to enable both reverse and forward motion in the global path planning algorithm. When I use this parameter, the navigation stack stops working. No node appears to crash, and everything seems to load correctly according to the initialization logs. The error consists in the TF breaking, because the link transform map->odom is not published anymore, therefore the localization fails.

Setup description and Parameters used

Required Info and Setup:

My configuration for NAV2:

I uploaded the YAML file renamed as .txt (uploading YAML files is not supported) to avoid cluttering this thread.

Configuration YAML file: simulation_reedssheep_not_working.yaml

This is the specific part of the parameters code that causes the error:


planner_server:
    ros__parameters:
        expected_planner_frequency: 20.0
        planner_plugins: ["GridBased"]
        GridBased:
            plugin: "nav2_smac_planner/SmacPlannerHybrid"
            motion_model_for_search: "REEDS_SHEPP" # Hybrid-A*: DUBIN | REEDS_SHEPP
            ......
            ... # the rest of the parameters for the planner server are the default ones

In particular, with motion_model_for_search: "REEDS_SHEPP", the localization doesn't work anymore. Using DUBIN as motion model, everything works fine.

Steps to reproduce issue

Use the parameters reported above to reproduce the bug. I'm not sure whether that one motion model parameter alone causes the bug, or whether a specific combination of parameters stops the localization from working. I am also not 100% sure whether this situation represents an actual bug or whether I misunderstood the documentation about the Smac Planner for Hybrid A*.

This is the link to my repository containing the code for running NAV2 with both simulation environment and the real robot, in case it may be useful for replicating everything. My code repository

What I've already tried:

Since the error source seems to be in the global planner, I read its documentation thoroughly, as well as the documentation for the other NAV2 nodes, trying to find any possible parameter interfering with the global trajectory planner.

The bug shows up in the log as a missing map->odom transform, which is published by the localization node. I tried both AMCL and SLAM_toolbox for localization in my tests, and the results were the same. So I can confirm that the actual source of the error is not the localization node, even though the error log seems to point to it.

Expected behavior

Navigation working fine with correct TF tree and working localization.

Actual behavior

Below is a portion of the errors that appear immediately after NAV2 finishes initializing all the nodes and the stack is ready for navigation. The TF errors go on indefinitely.

[component_container_isolated-1] [INFO] [1707577757.161279748] [lifecycle_manager_navigation]: Managed nodes are active
[component_container_isolated-1] [INFO] [1707577757.161293460] [lifecycle_manager_navigation]: Creating bond timer...
[rviz2-2] [INFO] [1707577758.222847131] [rviz2]: Message Filter dropping message: frame 'odom' at time 14.960 for reason 'discarding message because the queue is full'
[rviz2-2] [INFO] [1707577758.423586705] [rviz2]: Message Filter dropping message: frame 'odom' at time 15.160 for reason 'discarding message because the queue is full'
[rviz2-2] [INFO] [1707577758.623443423] [rviz2]: Message Filter dropping message: frame 'odom' at time 15.360 for reason 'discarding message because the queue is full'
[component_container_isolated-1] [ERROR] [1707577758.783363905] [transformPoseInTargetFrame]: Extrapolation Error looking up target frame: Lookup would require extrapolation into the past.  Requested time 7.000000 but the earliest data is at time 7.460000, when looking up transform from frame [mobile_robot_base_link] to frame [map]
[component_container_isolated-1] 
[rviz2-2] [INFO] [1707577758.823267045] [rviz2]: Message Filter dropping message: frame 'odom' at time 15.560 for reason 'discarding message because the queue is full'
[rviz2-2] [INFO] [1707577759.023499559] [rviz2]: Message Filter dropping message: frame 'odom' at time 15.740 for reason 'discarding message because the queue is full'
[rviz2-2] [INFO] [1707577759.223200082] [rviz2]: Message Filter dropping message: frame 'odom' at time 15.940 for reason 'discarding message because the queue is full'
[rviz2-2] [INFO] [1707577759.423527860] [rviz2]: Message Filter dropping message: frame 'odom' at time 16.140 for reason 'discarding message because the queue is full'
[rviz2-2] [INFO] [1707577759.623158822] [rviz2]: Message Filter dropping message: frame 'odom' at time 16.320 for reason 'discarding message because the queue is full'
[component_container_isolated-1] [ERROR] [1707577759.783445425] [transformPoseInTargetFrame]: Extrapolation Error looking up target frame: Lookup would require extrapolation into the past.  Requested time 7.000000 but the earliest data is at time 8.440000, when looking up transform from frame [mobile_robot_base_link] to frame [map]
[component_container_isolated-1] 
[rviz2-2] [INFO] [1707577759.823543074] [rviz2]: Message Filter dropping message: frame 'odom' at time 16.520 for reason 'discarding message because the queue is full'

..........
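To make the extrapolation errors in the log easier to compare, here is a minimal Python sketch that pulls the requested time and the earliest buffered time out of each error line. The regular expression is written against the message text as it appears in the log above; the function name and the abbreviated sample lines are only illustrative.

```python
import re

# Matches the extrapolation message text as it appears in the log above.
EXTRAPOLATION = re.compile(
    r"Requested time (?P<requested>\d+\.\d+) "
    r"but the earliest data is at time (?P<earliest>\d+\.\d+), "
    r"when looking up transform from frame \[(?P<source>[^\]]+)\] "
    r"to frame \[(?P<target>[^\]]+)\]"
)


def parse_extrapolation_errors(log_text):
    """Return one dict per extrapolation error found in log_text."""
    errors = []
    for match in EXTRAPOLATION.finditer(log_text):
        requested = float(match["requested"])
        earliest = float(match["earliest"])
        errors.append({
            "source": match["source"],
            "target": match["target"],
            "requested": requested,
            "earliest": earliest,
            # How far the request lies before the oldest buffered transform.
            "gap": round(earliest - requested, 6),
        })
    return errors


# The two error lines from the log above, with the prefix abbreviated.
SAMPLE_LOG = (
    "[ERROR] [transformPoseInTargetFrame]: Extrapolation Error: "
    "Requested time 7.000000 but the earliest data is at time 7.460000, "
    "when looking up transform from frame [mobile_robot_base_link] to frame [map]\n"
    "[ERROR] [transformPoseInTargetFrame]: Extrapolation Error: "
    "Requested time 7.000000 but the earliest data is at time 8.440000, "
    "when looking up transform from frame [mobile_robot_base_link] to frame [map]\n"
)

for err in parse_extrapolation_errors(SAMPLE_LOG):
    print(err["source"], "->", err["target"],
          "requested", err["requested"], "earliest", err["earliest"])
```

In the two lines shown, the requested time stays at 7.0 while the earliest buffered time moves forward. That might point to a pose with a fixed, old timestamp rather than to a transform that was never published, but this is only an observation on two log lines, not a diagnosis.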
SteveMacenski commented 9 months ago

The error consists in the TF breaking, because the link transform map->odom is not published anymore, therefore the localization fails.

Neither the choice of planning algorithm nor any setting within a planning algorithm will change that requirement or the presence of that transform. I believe you have some other problem, and it is simply a coincidence that changing that parameter triggers it (if it does, deterministically?). Clearly something more is wrong than your information shows; neither the planner server nor its internal planning plugins should be interacting in any way with localization.

SteveMacenski commented 9 months ago

Any update?

SimonGiampy commented 9 months ago

Neither the choice of planning algorithm nor any setting within a planning algorithm will change that requirement or the presence of that transform. I believe you have some other problem, and it is simply a coincidence that changing that parameter triggers it (if it does, deterministically?). Clearly something more is wrong than your information shows; neither the planner server nor its internal planning plugins should be interacting in any way with localization.

Working update:

After many failed attempts at finding the wrong parameter configuration causing the problem with the localization, I was finally able to successfully generate reverse motion trajectories with the Smac Hybrid A* planner.

As I mentioned in my first message, I couldn't make the Reeds-Shepp parameter work within the simulated environment. But then I had the idea to test everything on the real robot in a real map, and everything magically worked. I hadn't tried this before because I took for granted that if something doesn't work in simulation, it will never work on the real robot. Today this assumption proved invalid for the first time. I had never seen the "simulation-to-reality" gap reversed, so my bad for not trying it beforehand.

The NAV2 configuration parameters used in the simulated and real environments are practically identical; the only differences are the names of the topics used for the sensors. So I was able to make the configuration work with the real robot without any changes.

Hypotheses

The TF tree breaks at the map->odom link, but the link that breaks first actually seems to be odom->base_link. So it's still not clear to me whether the localization breaks or the odometry breaks. During the tests I've conducted, the TF tree always breaks (so yes, it is deterministic), and since no clear errors arise, it's difficult to tell exactly what went wrong. My guess is that the problem is at the odometry level. The odometry is provided by Ignition Gazebo 6 via the differential drive odometry plugin.
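One way to settle which link goes stale first is to compare the newest timestamp seen for each link in the chain. The sketch below is plain Python and assumes the timestamps have already been collected somehow (for example from recorded /tf messages). The chain map -> odom -> mobile_robot_base_link is assumed from the frame names in this thread, and the function name, stamps, and threshold are made up for illustration.

```python
def first_stale_link(chain, latest_stamp, now, max_age):
    """Return the first (parent, child) link in chain whose newest transform
    is missing or older than max_age seconds, or None if all are fresh."""
    for parent, child in zip(chain, chain[1:]):
        stamp = latest_stamp.get((parent, child))
        if stamp is None or now - stamp > max_age:
            return (parent, child)
    return None


# Assumed frame chain, root first.
CHAIN = ["map", "odom", "mobile_robot_base_link"]

# Made-up stamps (simulation seconds), for illustration only.
stamps = {
    ("map", "odom"): 15.9,
    ("odom", "mobile_robot_base_link"): 8.4,
}

print(first_stale_link(CHAIN, stamps, now=16.0, max_age=0.5))
```

With these invented numbers the odometry link is reported as stale while map->odom is still fresh. With real data the result would show whether the localization or the odometry stops publishing first.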

My final guess:

When loading the motion model "REEDS_SHEPP", NAV2 takes longer to load, and that may be correlated to the problem cause. So when I use that parameter in the simulation, there is something obscure that causes a faulty interaction between NAV2 and the odometry plugin in gazebo (along with the bridge), which in turn breaks the TF tree. If it's not this I don't know what it is.

I've done a lot of research, and I know very well that these modules must not affect each other and should be completely independent, but this is what I found, and I'm totally sure of my findings. I'm perfectly aware that it is very strange that the odometry from Gazebo could cause this problem, directly or indirectly, but this is my best guess with the knowledge that I have.

Conclusions

Summing up:

I will not mark this issue as solved, since I technically didn't solve it and only found a different scenario where the problem doesn't arise. It is also not really important for me to have this issue solved in the simulated environment, because I care more about the real robot.

SteveMacenski commented 9 months ago

When loading the motion model "REEDS_SHEPP", NAV2 takes longer to load, and that may be correlated to the problem cause.

This is true, due to the lookup table calculation on initialization. Your application would be fragile if it is based on timing rather than on events like lifecycle transitions.
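As an illustration of the difference between timing and events, here is a minimal Python sketch that waits on a condition instead of sleeping for a fixed time. It is not Nav2 code: wait_until and FakePlanner are hypothetical names. In a real ROS 2 system the predicate would query the node's lifecycle state (for example through its get_state service) rather than a stand-in object.

```python
import time


def wait_until(predicate, timeout_s=30.0, poll_s=0.1):
    """Poll predicate until it returns True or timeout_s elapses.

    Returns True if the condition was met, False on timeout.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        if predicate():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_s)


class FakePlanner:
    """Hypothetical stand-in for a node that becomes active after a slow start."""

    def __init__(self, polls_until_active):
        self._remaining = polls_until_active

    def is_active(self):
        if self._remaining > 0:
            self._remaining -= 1
            return False
        return True


planner = FakePlanner(polls_until_active=3)
ready = wait_until(planner.is_active, timeout_s=2.0, poll_s=0.01)
print("planner active:", ready)
```

The point of this design is that a slower startup, such as the longer initialization with REEDS_SHEPP, only delays the moment the condition becomes true. It does not invalidate a hard-coded delay.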

Closing since this isn't a bug in Nav2 and it has been shown to work fine. It's potentially rooted in your application software or the simulator (either way, to be taken up with the respective party). If it is something in the simulator, feel free to tag me in that new ticket to track and help over there.