ros-controls / gazebo_ros2_control

Wrappers, tools and additional APIs for using ros2_control with Gazebo Classic

Memory corruption/leaks in episodal training simulation #41

Open · guru-florida opened this issue 3 years ago

guru-florida commented 3 years ago

This is an elusive bug and I could use some advice tracking it down. I wrote a Hyperopt Python script that optimizes my Gazebo friction/surface/joint/etc parameters. It uses spawn_entity/delete_entity service calls repeatedly to run a simulation of my robot, collecting IMU and odometry data to compute an optimization score.
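For illustration, one episode boils down to a spawn_entity call, a simulation run, and then a delete_entity call against the services provided by libgazebo_ros_factory.so. A minimal rclcpp sketch of that cycle (the real script is Python/Hyperopt; the model name, SDF string, and scoring step here are placeholders):

  #include <memory>
  #include <string>

  #include <rclcpp/rclcpp.hpp>
  #include <gazebo_msgs/srv/spawn_entity.hpp>
  #include <gazebo_msgs/srv/delete_entity.hpp>

  // Hypothetical sketch of one training episode: spawn the model, let the
  // simulation run while IMU/odometry is collected elsewhere, then delete it.
  void run_episode(rclcpp::Node::SharedPtr node, const std::string & sdf_xml)
  {
    auto spawn = node->create_client<gazebo_msgs::srv::SpawnEntity>("/spawn_entity");
    auto remove = node->create_client<gazebo_msgs::srv::DeleteEntity>("/delete_entity");
    spawn->wait_for_service();
    remove->wait_for_service();

    auto spawn_req = std::make_shared<gazebo_msgs::srv::SpawnEntity::Request>();
    spawn_req->name = "my_robot";  // placeholder model name
    spawn_req->xml = sdf_xml;      // robot description with the trial's parameters
    rclcpp::spin_until_future_complete(node, spawn->async_send_request(spawn_req));

    // ... let the episode run, collect IMU/odometry, compute the score ...

    auto delete_req = std::make_shared<gazebo_msgs::srv::DeleteEntity::Request>();
    delete_req->name = "my_robot";
    rclcpp::spin_until_future_complete(node, remove->async_send_request(delete_req));
  }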

It worked great! ...but along the way I've discovered a few hiccups with repeatedly cycling a model entity. Something in the tear-down is causing memory corruption that eventually causes a SEGV (in mostly random places, unfortunately), or causes some kind of mutex lock to fail and block forever. I believe the tear-down is complicated by the fact that there are a lot of shared_ptrs that may be keeping some objects alive longer than expected. I've been banging on the problem for a while and have made some progress, measured by how many training episodes I can run before getting a crash.

I am currently pondering the issues described below.

I apologize for the insanely long text. Feel free to skip the First and Second issues and go straight to Unresolved.

I am on Ubuntu 20.04. I have tested this on Foxy (apt binaries) and on the latest master branch of ROS 2 as well.

First Issue | 1-5 episodes
GazeboRosControlPlugin sometimes crashed due to Gazebo events coming in during tear-down. I think we should at least be disconnecting from Gazebo first so we don't get any events coming in while we are destructing. I then explicitly call executor->cancel() and thread_executer_spin.join(). The spin thread was set to exit only if rclcpp::ok() was false, so I added a flag there to exit the loop.
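A minimal sketch of that teardown order; the class and member names (update_connection_, executor_, spin_thread_, stop_) are illustrative rather than the actual gazebo_ros2_control members:

  #include <atomic>
  #include <chrono>
  #include <memory>
  #include <thread>

  #include <gazebo/common/Events.hh>
  #include <rclcpp/rclcpp.hpp>

  // Illustrative only: the real plugin holds similar pieces under other names.
  class PluginTeardownSketch
  {
  public:
    void start()
    {
      executor_ = std::make_shared<rclcpp::executors::MultiThreadedExecutor>();
      // (nodes get added to executor_ elsewhere)
      // The spin loop exits when ROS shuts down *or* when stop_ is set.
      spin_thread_ = std::thread([this]() {
          while (rclcpp::ok() && !stop_) {
            executor_->spin_once(std::chrono::milliseconds(100));
          }
        });
    }

    ~PluginTeardownSketch()
    {
      // 1. Drop the Gazebo world-update connection first so no more update
      //    events call into the plugin while members are being destroyed.
      update_connection_.reset();

      // 2. Ask the executor and spin thread to stop, then join the thread.
      stop_ = true;
      if (executor_) {
        executor_->cancel();
      }
      if (spin_thread_.joinable()) {
        spin_thread_.join();
      }
    }

  private:
    gazebo::event::ConnectionPtr update_connection_;
    std::shared_ptr<rclcpp::executors::MultiThreadedExecutor> executor_;
    std::thread spin_thread_;
    std::atomic<bool> stop_{false};
  };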

Second Issue | ~15 episodes
rclcpp Nodes have many sub-interfaces, and many of those sub-interfaces depend on others, especially the base node interface. These interfaces are therefore constructed in a particular order, but their destruction order is not explicitly controlled. So I modified the rclcpp library to explicitly destruct the sub-interfaces in reverse construction order. This is documented in rclcpp issue 1469.
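The shape of that rclcpp change is roughly the following; a simplified sketch whose member names mirror rclcpp::Node's sub-interfaces, not the exact patch:

  // Release the sub-interface shared_ptrs explicitly, in reverse construction
  // order, so the base interface outlives everything that depends on it.
  Node::~Node()
  {
    // Constructed last, released first...
    node_waitables_.reset();
    node_time_source_.reset();
    node_parameters_.reset();
    node_clock_.reset();
    node_services_.reset();
    node_topics_.reset();
    node_timers_.reset();
    node_logging_.reset();
    node_graph_.reset();
    // ...and the base interface, which everything else depends on, last.
    node_base_.reset();
  }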

Unresolved Issue
I am still getting a crash at around 40-60 episodes. That has been enough to tune my parameters well, but I would like to take the opportunity to resolve this issue anyway. There is a little bug somewhere between the Gazebo plugin and rclcpp just calling out for a fix. :)

Destruction of the Parameter sub-interface after the BaseInterface was destroyed
I sometimes get this in the log output during tear-down:

  [gzserver-1] [ERROR] [1606372728.971798611] [rclcpp]: Error in destruction of rcl service handle imu_sensor/get_parameters: the Node Handle was destructed too early. You will leak memory

This repeats for all the parameter sub-interface services (get_parameter_types, set_parameters, set_parameters_atomically, describe_parameters, list_parameters), and it is not always the Gazebo IMU plugin; it can be the odometry (p3d) plugin or one of the ros2_control instantiated controllers. I tracked this down to the rclcpp Node destruction flow. By the time the destructor of the Parameter sub-interface is finally called, the BaseInterface it needs in order to tear down the parameter service interfaces has already been destroyed. The parameter interface holds only a weak_ptr to the BaseInterface, so it does not keep it alive; the destructor aborts safely, but it logs the error and the service interfaces are not properly torn down. I assume something else must be holding on to the Parameter interface, but not the BaseInterface or the Node itself... but what? And is it even a problem with this repo? I don't know.
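To make the failure mode concrete, here is a self-contained toy of the same shape (not rclcpp code): the parameters object holds only a weak_ptr to the base, so if some third party keeps the parameters object alive after the node and the base are gone, its destructor can no longer lock the base and can only log the error:

  #include <cstdio>
  #include <memory>

  // Toy stand-ins for the base node interface and the parameters interface.
  struct Base
  {
    ~Base() { std::printf("Base destroyed\n"); }
  };

  struct Parameters
  {
    explicit Parameters(std::shared_ptr<Base> base) : base_(base) {}
    ~Parameters()
    {
      // We need the base alive to tear down our services cleanly,
      // but we only hold a weak_ptr to it.
      if (base_.lock()) {
        std::printf("clean teardown of parameter services\n");
      } else {
        std::printf("ERROR: the Node Handle was destructed too early, leaking\n");
      }
    }
    std::weak_ptr<Base> base_;
  };

  int main()
  {
    std::shared_ptr<Parameters> extra_owner;  // e.g. something outside the Node
    {
      auto base = std::make_shared<Base>();
      auto params = std::make_shared<Parameters>(base);
      extra_owner = params;  // keeps Parameters alive, but not Base
    }                        // Base dies here; Parameters survives via extra_owner
    // When extra_owner finally drops, ~Parameters runs after Base is gone and
    // can only log the error, the same pattern as the gzserver output above.
    return 0;
  }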

Stopping controllers via Controller Manager
I figured I could lessen the tear-down contention by stopping all controllers in the GazeboRosControlPlugin destructor using the code below. It runs after I disconnect from Gazebo events, while the executor and spin thread are still running fine.

  // Collect the names of every loaded controller so they can all be stopped.
  std::vector<std::string> start_controllers, stop_controllers;
  auto controllers = impl_->controller_manager_->get_loaded_controllers();
  std::transform(controllers.begin(), controllers.end(),
     std::back_inserter(stop_controllers),
     [](const controller_manager::ControllerSpec& spec) { return spec.info.name; });
  // Start nothing, stop everything: strictness 2 (STRICT), start_asap,
  // and a 10 second timeout.
  impl_->controller_manager_->switch_controller(
      start_controllers, stop_controllers,
      2, true,
      rclcpp::Duration(10, 0));

For some reason this blocks forever inside the switch_controller() call, in the "wait until switch is finished" loop at controller_manager.cpp:406. I stepped through and it does correctly put my two ros2_control controllers into the stop_controllers collection, so I am not sure why it never completes. It doesn't respect the 10 second timeout argument either. Unless I am doing something dumb, this should probably be moved out into a ros2_control issue.

bmagyar commented 3 years ago

If you have a look at how ros_control in ROS 1 interacts with Gazebo, you'll find a couple of similar issues that typically lead to crashes. It's mostly related to the Gazebo back-end being gone while ros_control's actions are still operating on those now-invalid pointers; other times simulations hung because controllers trying to stop had some logging element but time was no longer available, so it all just hung there.

You are on the right track with the order of taking things down; I'm not sure how many repos this involves, though.

guru-florida commented 3 years ago

Thanks @bmagyar. You're right, I did find a ROS 1 report of (apparently) the same symptoms: "Gazebo crashes on model deletion".

I looked into the ROS 2 Gazebo factory plugin code that creates and deletes the model entity, and it uses Gazebo's own pub-sub API to send the delete-entity request over Gazebo's own protocol; pretty simple code. Furthermore, the ROS service interface in use here is the libgazebo_ros_factory.so plugin, which never gets unloaded... so it truly shouldn't be any different from hitting DELETE in gzclient; perhaps I can actually test that. I'm coming around to thinking it's not an issue of the delete-entity service request holding onto a ROS 2 service pointer, but really is in the tear-down. If I could just build a smaller automated test that reproduces this issue, I could keep whittling it down to isolate the problem. I'll give it more thought.
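For reference, sending that delete request over Gazebo Classic's own transport looks roughly like the sketch below. It mirrors what the factory/API plugins do but is not copied from them, and it assumes Gazebo transport is already up (e.g. inside a plugin); no rclcpp objects are involved:

  #include <memory>
  #include <string>

  #include <gazebo/msgs/msgs.hh>
  #include <gazebo/transport/transport.hh>

  // Rough sketch: publish an "entity_delete" request on Gazebo's own pub-sub;
  // the server's world removes the model when it processes the request.
  void request_entity_delete(const std::string & model_name)
  {
    gazebo::transport::NodePtr node(new gazebo::transport::Node());
    node->Init();

    auto request_pub = node->Advertise<gazebo::msgs::Request>("~/request");
    request_pub->WaitForConnection();

    std::unique_ptr<gazebo::msgs::Request> msg(
      gazebo::msgs::CreateRequest("entity_delete", model_name));
    request_pub->Publish(*msg);
  }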