moveit / moveit_ros

THIS REPO HAS MOVED TO https://github.com/ros-planning/moveit

Fix MoveGroup::setStartStateToCurrentState() (issue 442) #671

Closed andreaskoepf closed 8 years ago

andreaskoepf commented 8 years ago

We ran into https://github.com/ros-planning/moveit_ros/issues/442 in our lab too and observed some very threatening and dangerous high-speed motions of our UR5 robotic arm. On a real industrial robot this issue is extremely safety-critical since it can lead to plans that start far away from the current robot joint position even though MoveGroup::setStartStateToCurrentState() has been called. I am personally quite shocked that this was not fixed by somebody else before since #442 is more than two years old.

Some background details about the problem as understood by us:

Calling MoveGroup::setStartStateToCurrentState() internally nulls considered_start_state_ and sends a move-group planning request action goal with an empty start state (the joint-name and position arrays are empty). This makes MoveIt! directly use the RobotState stored inside the planning scene: when using OMPL, for example, ModelBasedPlanningContext::setCompleteInitialState() is called from ompl_interface::PlanningContextManager::getPlanningContext() with a state generated by robot_state::RobotStatePtr start_state = planning_scene->getCurrentStateUpdated(req.start_state);. When req.start_state is 'empty', this effectively means that the current state of the planning scene is used.

Unfortunately, the action server callback MoveGroupMoveAction::executeMoveCallback does not ensure that a robot state newer than the planning request is available. This leads to plans that use an outdated start state. I am not quite sure how far back in time the invalid start state can reach; judging from the motions we saw, it must have been significantly more than 0.1 s. Maybe somebody from the MoveIt! team with more expertise in this specific area can explain why we saw such severe movements.

How we address the problem:

This PR makes sure that a new JointState message is received by CurrentStateMonitor and is indeed written to the RobotState of the PlanningScene before the planning request is processed. In our tests this fixed the problem (which is otherwise quickly reproducible). In this PR the change is made only as an example for MoveGroupMoveAction; other actions might be affected by an outdated current state too.

Since the MoveIt! source base is new to me and the project is quite large, this PR is intended as a basis for discussion on how to best address the problem. For example, the sleep-based wait loop would better be replaced by a boost::condition_variable or some other more direct signalling mechanism.

Special thanks to @alemme for debugging this issue together with me.

simonschmeisser commented 8 years ago

It has surprised me as well a couple of times how badly synchronized things can be in MoveIt!. After some time I somehow manage to forget about all the sleep() calls I had to put into my code.

I like your idea of saving the last update time. However, I think this should be stored in the RobotState itself, ideally as the timestamp at which the position information entered the ROS system (i.e. when the joint position message arrived from the robot controller).

JimmyDaSilva commented 8 years ago

Works like a charm for me. Glad someone finally dug into this :) Thanks!

andreaskoepf commented 8 years ago

@simonschmeisser I would appreciate having timestamps inside the RobotState itself, something similar to the values current_state_time_ and joint_time_ currently stored externally in CurrentStateMonitor, e.g. current_state_monitor.cpp#L312.

130s commented 8 years ago

@andreaskoepf Could you rebase onto the latest source of origin/indigo-devel? Your PR is based on outdated commits, and due to the old .travis.yml Travis would probably never pass.

andreaskoepf commented 8 years ago

@130s Rebasing would be no problem. My hope was that some MoveIt! expert would review the changes and give me some hints on how to fix this in a more general way. E.g. I have not yet checked how planning scene changes (e.g. addition/removal of collision objects) are incorporated into the planning process and whether they face the same problem, i.e. it might not be guaranteed that the most recent changes are considered during planning.

I can put this on my todo list and try to find 1-2 days of time to build up a deeper understanding of the code structure and then come up with a cleaner solution. But I will not be able to do this before June 6th...

davetcoleman commented 8 years ago

I'm not super familiar with the architecture of this part of MoveIt! but overall it seems like a reasonable fix.

alainsanguinetti commented 8 years ago

Maybe we should deprecate this method? Could it be a problem with how people process the callbacks (not fast enough, or maybe the async spinner thread is preempted, etc.)? As @v4hn noted, I think #686 can solve the problem. I think it is better to rely on an explicit synchronisation mechanism than to add some delay to hide that something was not fast enough or did not happen.

v4hn commented 8 years ago

The current state of affairs is that the MoveGroupAction fails to fulfill its specification for the case where request.start_state is empty. In this case the action might accidentally use an outdated state instead of the current state, and I don't see a different solution than the one proposed here. I think a cleaned-up version of this patch should definitely be applied.

@andreaskoepf could you please address the comments by dave and me?

I think the sleep is fine for now to fix this hazardous behavior. If later on someone steps up to replace it with a mechanism using condition_variable et al., that would be great, but it is not mandatory.

alainsanguinetti commented 8 years ago

Hello, I think the title of this PR is misleading: it claims to fix the setStartStateToCurrentState method, while it actually addresses issue #442, which is a much bigger problem residing mainly in the move_group package. I think we can do two things:

I think, as I said before, that it is a problem with how callbacks are processed. To show that, in move_group/move_group.cpp line 152 you can change ros::AsyncSpinner spinner( 1 ); to ros::AsyncSpinner spinner( 0 );, which tells the spinner to use as many threads as there are cores on the computer. In my case, this solved the problem, or at least significantly reduced it. My explanation would be that with only one thread, if the callbacks corresponding to the action are processed by the spinner before the state update callbacks, then it is impossible for the state to be updated before the action is processed.

In any case, changing from 1 to 0 improved the smoothness of the display so I think I'll open a PR just for that.

rhaschke commented 8 years ago

@alainsanguinetti raised an important point: with only a single spinner thread, all traffic is processed strictly in order. There are some blocking services, e.g. MoveGroupExecuteService and maybe more, which take a while to process. During that time, no other ROS messages are processed.

In the case of a waiting MoveGroupExecuteService, this means that the service waits for the trajectory controller to finish, while incoming joint state updates are not processed and pile up in the callback queue. Afterwards, it takes a while to process all of them. For this reason the proposed fix doesn't solve the issue: because updateSceneWithCurrentState stores the current time as last_robot_state_update_time_ but not the (possibly outdated) timestamp of the joint_state message, the waiting loop might stop too early, satisfied with an outdated state.

To fix the problem in a more fundamental fashion, I suggest modifying planning_scene_monitor::LockedPlanningSceneRO such that it waits until all scene updates with a timestamp < ros::Time::now() are processed and only then returns the current planning scene. This would indeed require a condition variable to be efficient.

Another issue might be bad locking behaviour of the PlanningScene in PlanningSceneMonitor. I observed a write-lock time of 0.277 s for a state update. I don't yet fully understand the reason, but possibly the multiple read-only accessors block a write accessor. In that case, shouldn't a pending write accessor block all future read accessors?

Increasing the number of spinner threads is a quick fix. However, we should carefully look for those blocking services and actively move them into their own threads (instead of running them in the spinner thread only).

v4hn commented 8 years ago

On Thu, Jul 14, 2016 at 07:12:37PM -0700, Robert Haschke wrote:

@alainsanguinetti raised an important point: Having only a single spinner thread, all traffic is processed strictly in order. There are some blocking services, e.g. MoveGroupExecuteService and maybe more, which take a while for processing. During that time, no other ROS messages are processed.

Yes, we should work on that front.

This does not solve #442 in theory, though. The problem there is basically the same one previously noticed with the ObjectRecognitionAction: traditionally, the action server does not get to know exactly when it is supposed to start using data, i.e. input images. https://groups.google.com/forum/#!msg/ros-sig-perception/j1JtB5dkkIU/tEF_kOZAU-oJ

We have the same problem with the planning scene / robot state to use for the action request, and I believe we should change that somehow. Cleaning up the threading in MoveIt does not resolve this design flaw, but it will reduce the number of errors quite a lot and make the bug harder to trigger (which does not mean that it will not appear the next time you have to give a demo).

rhaschke commented 8 years ago

This does not solve #442 in theory though. The problem there is basically the same one as previously noticed with the ObjectRecognitionAction: Traditionally the action server does not get to know exactly when it is supposed to start using data, i.e. input images. https://groups.google.com/forum/#!msg/ros-sig-perception/j1JtB5dkkIU/tEF_kOZAU-oJ

As far as I can see, Martin Günther proposes exactly the same concept as I do: make sure that the input data is current (here: all input to the PSM prior to a given timestamp has been processed) w.r.t. the timestamp of the action call. Indeed, there is currently no method to pass this timestamp explicitly. However, in the case of the PSM, I think ros::Time::now() will be sufficient.

Please continue this discussion in #716.

andreaskoepf commented 8 years ago

Please continue this discussion in #716.