ros-navigation / navigation2

ROS 2 Navigation Framework and System
https://nav2.org/
Other
2.52k stars 1.28k forks source link

fix bug mentioned in #3958 && a futhuer suggestion for child-LifecycleNode mechanism design #3971

Closed GoesM closed 11 months ago

GoesM commented 11 months ago

Bug report

Required Info:

Steps to reproduce issue

Expected behavior

expected behavior what the degisner wish should be like

costmap_ros_->deactivate() works during planner/controller->on_deactivate(), as well as costmap_ros_->cleanup() works during planner/controller->on_cleanup()

.../nav2_planner/src/planner_server.cpp#L224 and L257, and similar code in nav2_controller

Actual behavior

As an individual LifecycleNode, costmap_ros_ may react to rcl_preshutdown randomly, and always be closed much earlier than planner/controller->deactivate()

it would lead to many unexpected crashes like bug mentioned in #3958 :

during shutdown_period, costmap_ros_ may react to rcl_preshutdown earlier, so that planner/controller may try to access the costmap_ros_ though it has been a nullptr. As a result it would lead to a crash during shutdown period.

such crashes may lead to bad situation that :

program shut down in an unplanned manner and result in residual processes.


Additional information

costmap is set as a Nav2::util LifecycleNode : ../nav2_costmap_2d/include/nav2_costmap_2d/costmap_2d_ros.hpp#L 73

all LifecycleNode would react to Ctrl+C ../nav2_util/src/lifecycle_node.cpp#L 90

all LifecycleNode would react to rcl_preshutdown randomly, and similar issue has been discussed before : ../ros2/rclcpp/issues/2096

this explanation should have been clear enough: ../nav2_planner/src/planner_server.cpp #L 214-L225

the bug happening in ISSUE: #3958 is caused by this.

because during shutdown_period, costmap_ros_ may react rcl_preshutdown earlier, so that planner/controller may try to access the costmap_ros_ though it has been a nullptr.


our provided solution

it seems that costmap is only used as a child-thread of planner/controller, ../nav2_planner/src/planner_server.cpp#L 89

so can we make it do nothing when reacting to Ctrl+C ?

doing so, its lifecycle would be completely bond to planner/controller's lifecycle, and also going to death in order.

//in file `costmap_2d_ros.hpp`
  /**
   * @brief override on_rcl_preshutdown() as empty
   * [reason] costmap only could be created by its parents like planner/controller_server
   * [reason] costmap may react to ctrl+C earlier than its parents to cause nullptr-accessed
   * so the costmap's reaction must be later than its parent to avoid nullptr-accessed
   * 
   * thus, it's a more perfect way to override on_rcl_preshutdown() as empty function
   * and its deactivate must be joint in planner/controller_server->on_deactivate()
   * and its cleanup must be joint in planner/controller_server->on_cleanup()
   * 
   * !!! a mention !!!
   * if any other place use this class (Costmap2DROS)
   * it's necessary to let costmap->deactivate() joint in parent->deactivate()
   * and let costmap->cleanup() joint in parent->cleanup() 
   */
  void on_rcl_preshutdown() override{
    //do nothing
    return ;
  }

because the costmp_ros_ is just a thread created by planner/controller,

whether program shutdown correctly or not, or even dead unexceptionably, the child-thread could be closed all the time since parent-LifecycleNode sub-process dead.

Thus there's no need to worry whether it would be still alive after whole program exit finally.

in addition

if our solution being adopted, many checks could be removed :

these double checks are also designed to face to bugs in situation that costmap_ros_ may be closed earlier than planner/controller. After our code modification, these checks are useless anymore.

what's more, our solution performs perfectly during our local tests, much more than our past solutions mentioned in #3958



Further Suggestion

We specially name nodes like costmap_ros_ as child-LifecycleNode, which is born from their parent-LifecycleNode like planner_server and controller_server.

our suggestion is:
Could we set up a mechanism specially for such situations like child-LifecycleNode, to make sure they're completely controlled by their parent?

these child-LifecycleNode should have some basic specialist like:

such mechanism could be like:

Perhaps such mechanism could be used for any program in ROS2 ?

GoesM commented 11 months ago

we've created a new PR #3972 to fix the bug.

SteveMacenski commented 11 months ago

Discussing inline in PR