Flaky Planner Tests (part 3)

SteveMacenski commented 4 years ago

@AlexeyMerzlyakov @siddhya looks like this continues to be a problem.

I'd love it if you guys worked together on fixing this one since it seems to escape us individually.

AlexeyMerzlyakov commented 4 years ago

First news: this is not a flaky problem and does not related to closed #1585 ticket. The problem is a test false-positive case uncovered by ee719c9f commit. In this commit the planner tolerance was decreased from 2.0 -> to 0.5.

Using test_planner_costmaps_node test as an example of what's happening in CI. When tolerance was 2.0, the planner produced path as follows:

   point #0 with x: 1 y: 1
   point #1 with x: 1.7 y: 1.7
   point #2 with x: 2.05 y: 2.05
   point #3 with x: 2.4 y: 2.4
   point #4 with x: 2.76 y: 2.76
   point #5 with x: 3.11 y: 3.11
   point #6 with x: 3.46 y: 3.46
   point #7 with x: 3.82 y: 3.82
   point #8 with x: 4.17 y: 4.17
   point #9 with x: 4.53 y: 4.53
   point #10 with x: 4.88 y: 4.88
   point #11 with x: 5.23 y: 5.23
   point #12 with x: 5.59 y: 5.59
   point #13 with x: 5.94 y: 5.94
   point #14 with x: 6.29 y: 6.29
   point #15 with x: 6.65 y: 6.65
   point #16 with x: 7 y: 7
   point #17 with x: 8 y: 8
   point #18 with x: 8 y: 8

After decreasing the tolerance to 0.5 the planner became to produce the another path:

   point #0 with x: 1 y: 1
   point #1 with x: 1.7 y: 1.7
   point #2 with x: 2.05 y: 2.05
   point #3 with x: 2.4 y: 2.4
   point #4 with x: 2.76 y: 2.76
   point #5 with x: 3.11 y: 3.11
   point #6 with x: 3.46 y: 3.46
   point #7 with x: 3.82 y: 3.82
   point #8 with x: 4.17 y: 4.17
   point #9 with x: 4.53 y: 4.53
   point #10 with x: 4.88 y: 4.88
   point #11 with x: 5.23 y: 5.23
   point #12 with x: 5.59 y: 5.59
   point #13 with x: 5.94 y: 5.94
   point #14 with x: 6.29 y: 6.29
   point #15 with x: 6.65 y: 6.65
   point #16 with x: 7 y: 7
   point #17 with x: 7.5 y: 7.5  <- not equal to x: 8.0 y: 8.0

Which causes messages like:

[WARN] [1589538666.451800381] [PlannerTester]: Path deviates from requested start and end points

and failures of testcases.

I think, produced paths in both cases are correct, since the goal places within its tolerance <=0.5 but according to PlannerTester::isWithinTolerance() logic goal should be strictly equal to (x: 8.0 y: 8.0):

 if (
    path_start.pose.position.x == robot_position.x &&
    path_start.pose.position.y == robot_position.y &&
    path_end.pose.position.x == goal.pose.position.x &&
    path_end.pose.position.y == goal.pose.position.y)
  {
    RCLCPP_DEBUG(this->get_logger(), "Path has correct start and end points");
    return true;
  }

It looks like the conditions in PlannerTester::isWithinTolerance() should be changed in order to taking deviationTolerance into account:

    path_start.pose.position.x == robot_position.x &&
    path_start.pose.position.y == robot_position.y &&
    path_end.pose.position.x <= goal.pose.position.x + deviationTolerance &&
    path_end.pose.position.x >= goal.pose.position.x - deviationTolerance &&
    path_end.pose.position.y <= goal.pose.position.y + deviationTolerance &&
    path_end.pose.position.y >= goal.pose.position.y - deviationTolerance)

AlexeyMerzlyakov commented 4 years ago

Fixed problem in #1734

SteveMacenski commented 4 years ago

So my thought on this is that this test is supposed to stress the planner so I think the exact, zero tolerance planning is what we want here. A 0.5m deviation for a 0.05m resolution costmap search is really, really high. Typically that tolerance is set so that if you send the robot to a goal, lets say in a warehouse, and there's a dynamic obstacle in the way like a forklift. You want the robot to get to that area for their job, but clearly the intention isn't to wait 20 minutes to succeed based on the forklift.

So for these tests, and in particular the costmap test, this should pass 100% of the time without any help with a tolerance. Since the costmap tests are hand designed ( I believe) to be possible and just make sure that search is working properly. The random tests could result in failures, but we have the planner test set at some really low bar to pass (something like 10% with 3 attempts) that it can reach 100 random goals.

Does that make sense? Am I missing something? I think that this test actually makes things a little less reliable. I think the only tolerance that would be appropriate is the costmap resolution (e.g. 1 cell) for the case that the planner after reaching its goal doesnt add an extra waypoint at the literal goal location so its last waypoint is just neighboring 1 cell over. However given that things were working at one point with 0 toleranceI believe, that should be possible.

AlexeyMerzlyakov commented 4 years ago

Does that make sense? Am I missing something?

No, that sounds very reasonable. If the test is intended to reach exact goal, does it mean that path planner fails or has some lack of accuracy?

SteveMacenski commented 4 years ago

I'm not sure, you may want to generate a visualization. I can say that from using it in my own purposes, I have never had an issue with the planner generating an exact path when one is available. So this feels like a test issue. I'd start with the costmap test since those are exactly hardcoded and small so those should always be exact matches, except for any costmaps that are specifically designed to make it fail.

AlexeyMerzlyakov commented 4 years ago

After some digging in code, it appears that problem may be in NavFN planner inside NavfnPlanner::makePlan() routine. I am not finally clear, but it looks like calculation of best_pose from goal=(8,8) makes it to be equal to best_pose=(7.5,7.5). Later in algorithms best_pose is going to be used instead of goal. Inside smoothApproachToGoal(best_pose, plan); produced path is being adjusted to inaccurate (7.5, 7.5) final point that results a problem. There are just my intermediate observations. Will continue analysis further.

AlexeyMerzlyakov commented 4 years ago

Update: the best_pose calculating algorithm is briefly presented in the pseudo-code on picture below: issue_analysis

The point p traverses the {goal-tolerance..goal+tolerance, goal-tolerance..goal+tolerance} region with map_resolution step and finds the nearest reachable point to the goal. This causes changing the goal and final point in the path produced by navfn planner from (8.0, 8.0) -> to (7.5, 7.5). It is not finally clear - what does this algorithm intended for? It looks like it is developed for making the goal to be reachable within its tolerance even if the goal is unreachable by itself. If so, why do not just to add goal point itself into this traversing algorithm and check, if it is reachable too?

AlexeyMerzlyakov commented 4 years ago

How about the following fix according to above: https://github.com/ros-planning/navigation2/pull/1734/commits/1648f4ddb72bbc1073029134794fe029e36ecf7c, https://github.com/ros-planning/navigation2/pull/1734/commits/5e212347b83aa671305ec054505f0f2a35afc11d ?

siddhya commented 4 years ago

Does it make sense to have tolerance less than the resolution of the map? When I do the following change the test runs fine. You can suggest a better place to do the change but maybe we can set the tolerance == resolution of the selected map.

diff --git a/nav2_planner/src/planner_server.cpp b/nav2_planner/src/planner_server.cpp
index dbb159d7..db6668bb 100644
--- a/nav2_planner/src/planner_server.cpp
+++ b/nav2_planner/src/planner_server.cpp
@@ -81,6 +81,7 @@ PlannerServer::on_configure(const rclcpp_lifecycle::State & state)

   get_parameter("planner_plugin_ids", plugin_ids_);
   get_parameter("planner_plugin_types", plugin_types_);
+  declare_parameter(plugin_ids_[0] + ".tolerance", rclcpp::ParameterValue(1.0));
   auto node = shared_from_this();

   if (plugin_ids_.size() != plugin_types_.size()) {

siddhya commented 4 years ago

Also can you take a look at this change? initial_pose is needed for the tests to run so there is no point in having a separate retry/timeout logic for it. I don't have access to CI system but I am not getting any timeout issues after this. Let me know what you think.

AlexeyMerzlyakov commented 4 years ago

Does it make sense to have tolerance less than the resolution of the map?

I'd like agree with you that the tolerance should never be less than map_resolution. But this problem does not related to this effect. For example you can take map_resolution=1.0, tolerance=2.5 and goal=(8.0, 8.0) which will give best_goal=(7.5, 7.5) again and lead to accuracy loss. The effect is concerned to the cases where the tolerance is not multiple of map_resolution. Another fix, I've considered - could be round the tolerance during the calculations to be nearest multipe of map_resolution. It works, but I personally do not like this change, because of the navfn_planner operates in the world coordinates which do not have to be multiples of map cell coordinates.

SteveMacenski commented 4 years ago

The best_pose bit is only used in NavFn if an exact path couldn't be found. Its trying to scale back along the path and find the closest point it can within the tolerance. Is that 7.5 and 8.0 where the costmap resolution is 0.5?

Having the tolerance less than resolution means that you want an exact match only, which might be required when running in a true grid-world like one might have to create when working with hundreds of robots in a single facility.

None of this really explains how this was working beforehand and became flaky afterwards. I'm OK with adding exactly a 1 cell resolution tolerance on it for aliasing effects, but any more than that indicates an issue with the test still

AlexeyMerzlyakov commented 4 years ago

The best_pose bit is only used in NavFn if an exact path couldn't be found.

I can not find a clue that NavFn currently uses best_pose only when the exact path did not produced: best_pose searching algorithm is placed in middle part of the NavfnPlanner::makePlan() code, running without any if-condition or having a non-false return before. This mean that best_pose searching algorithm should be executed each time once we entered makePlan(). Later in the same routine calculated best_pose is being used instead of goal to get the plan:

getPlanFromPotential(best_pose, plan);
...
smoothApproachToGoal(best_pose, plan);

From outside: NavfnPlanner::makePlan() is directly called by NavfnPlanner::createPlan(). createPlan() is a plugin API interface method that called by planner server when it requires to make a path. So, it looks like best_pose is calculated and used instead of goal each time when planner server demands a global plan. If so, the issue we are currently disscussing might be a NavFN bug, I belive.

Probably, I've missed something or did you mean that before NavFN planner another planner plugin is being called?

Its trying to scale back along the path and find the closest point it can within the tolerance. Is that 7.5 and 8.0 where the costmap resolution is 0.5?

Yes, there is the place where best_pose is chosen to be equal to (7.5, 7.5) which causes a test failure.

Having the tolerance less than resolution means that you want an exact match only, which might be required when running in a true grid-world like one might have to create when working with hundreds of robots in a single facility.

Ok, got it. This should not change the situation fundamentally, I think.

None of this really explains how this was working beforehand and became flaky afterwards.

I've made some explanation in PR for this ticket in https://github.com/ros-planning/navigation2/pull/1734#issuecomment-630718350.

I'm OK with adding exactly a 1 cell resolution tolerance on it for aliasing effects, but any more than that indicates an issue with the test still

I think, this will just hidden the problem but not resolve it.

SteveMacenski commented 4 years ago


  double resolution = costmap_->getResolution();
  geometry_msgs::msg::Pose p, best_pose;
  p = goal;

  bool found_legal = false;
  double best_sdist = std::numeric_limits<double>::max();

  p.position.y = goal.position.y - tolerance;

  while (p.position.y <= goal.position.y + tolerance) {
    p.position.x = goal.position.x - tolerance;
    while (p.position.x <= goal.position.x + tolerance) {
      double potential = getPointPotential(p.position);
      double sdist = squared_distance(p, goal);
      if (potential < POT_HIGH && sdist < best_sdist) {
        best_sdist = sdist;
        best_pose = p;
        found_legal = true;
      }
      p.position.x += resolution;
    }
    p.position.y += resolution;
  }

Lets look at this block of code. What I see happening is:

We have the goal in real-world coordinates
We subtract the tolerance from it
We then look and see if the goal - tol <= goal + tol which should always be true?
We update the X coordinate to match the subtraction in Y
We then look and see if the goal - tol <= goal + tol which should always be true?
We get the potential and distance from goal, and the closest and non-collision one is updated
We inch forward in increments of resolution to the goal across this goal - tol to goal + tol in both axes.

This means that there could only be exact matches if the tolerance is in increments of resolution, but in all cases, you should hit every cell along the way, whether that's center, corner, etc. So while you might not get the exact world coordinate, you should get within the same resolution sized cell as it.

Also means that if you make tolerance hilariously large, it will go over all of the cells in a hilariously large window, which is unexpected. Good to know. But also means that the reason that 2.0 meters worked is not because it was large, but because it was a multiple of the costmap resolution used in these experiments. One way to solve that is to just use another tolerance so that's true here as well.

This test is to make sure that things are working and we can get exact paths that there's nothing wacky going on, so I think that's a OK workaround. Another option is to test the test based on grid-world and not world-world, less ideal. A third option would be to fix the planner so that exact paths are respected which is what i think you did.

OK, if that's what you did and that description is in line with your thinking, that sounds like a good work around, I can't think of a reason not to do that. I think the only thing would be to look at getPlanFromPotential and smoothApproachToGoal and sanity check yourself that this check in the double while loops didn't do something that without will cause you issues. Nothing immediately jumps out at me as being a problem.

AlexeyMerzlyakov commented 4 years ago

Both getPlanFromPotential() and smoothApproachToGoal() are looks like designed to work with goal in first argument without relying on what the goal really is: actual goal or calculated best_pose passed as first argument.

getPlanFromPotential() converts the first argument (goal) from world -> to map space and asks a planner_ to calculate and get the path (using a potential array previously calculated by A* or Dijkstra algorithms also in map space) down from a goal to the start point in the map space. It could appear that best_goal is placed in another map cell than goal. But if the goal is reachable by itself, this should only strengthen the accuracy of whole path planning algorithm, but not break it since we will go exactly to the same cell as initially willing to.

smoothApproachToGoal() just compares path[-2]<->path[-1] with path[-2]<->best_goal distance. If first one is greater that second one, than set the last point in the path to be equal to best_goal: path[-1] = best_goal. In other words, this method smooths the tail of path in order to avoid twitching caused by worldToMap/mapToWorld digitalized conversions after getPlanFromPotential(). This should not be affected by what is best_goal really is.

So, from perusal it looks like the changing best_goal to goal itself when it reachable should not produce new problems.

siddhya commented 4 years ago

waypoint_follower test seems to be failing now for me. Have you seen this issue? No error message just starts shutting down after activating.

[278.384s] 15: [lifecycle_manager-12] [INFO] [1590036346.356872100] [lifecycle_manager_navigation]: ^[[34m^[[1mActivating bt_navigator^[[0m^[[0m
[278.385s] 15: [bt_navigator-10] [INFO] [1590036346.357934700] [bt_navigator]: Activating
[278.387s] 15: [lifecycle_manager-12] [INFO] [1590036346.359877600] [lifecycle_manager_navigation]: ^[[34m^[[1mActivating waypoint_follower^[[0m^[[0m
[278.388s] 15: [waypoint_follower-11] [INFO] [1590036346.360757000] [waypoint_follower]: Activating
[278.390s] 15: [lifecycle_manager-12] [INFO] [1590036346.362667300] [lifecycle_manager_navigation]: ^[[34m^[[1mManaged nodes are active^[[0m^[[0m
[278.391s] 15: [lifecycle_manager-12] [INFO] [1590036346.363584700] [lifecycle_manager_navigation]: ^[[34m^[[1mShutting down managed nodes...^[[0m^[[0m

SteveMacenski commented 4 years ago

@siddhya I have noticed that on just release builds but not debug builds in the last couple of PRs. That is super weird. There's a bunch of logging in the tester node that your snippet doesn't show either (https://github.com/ros-planning/navigation2/blob/master/nav2_system_tests/src/waypoint_follower/tester.py) as if that's not even being run or something.

AlexeyMerzlyakov commented 4 years ago

Have you seen this issue? No error message just starts shutting down after activating.

This is interesting behavior. Unfortunately, I did not get the same problem, only another one I've mentioned in PR1759 causing a waypoint_follower timeouts. @siddhya, @SteveMacenski I think these problems with test_waypoint_follower deserve a separate ticket for analysis.

SteveMacenski commented 4 years ago

Is there a circleCI job that did it, can you give us a link?

siddhya commented 4 years ago

I do not see the issue @AlexeyMerzlyakov is seeing. For me it again was with the setup and retry logic. This time in waypoint_follower/tester.py. Gazebo takes forever to startup on my machine. Once I upped both the below values all tests passed.

# wait a few seconds to make sure entire stacks are up
time.sleep(10)

retries = 2
while not test.initial_pose_received and retry_count <= retries:

SteveMacenski commented 4 years ago

CI would also agree, I think this can be considered closed unless it comes back again. Thanks for your help!

ros-navigation / navigation2

Flaky Planner Tests (part 3) #1733