moveit / moveit_ros

THIS REPO HAS MOVED TO https://github.com/ros-planning/moveit
69 stars 118 forks source link

Race conditions when updating PlanningScene: fixup #716 #728

Closed rhaschke closed 7 years ago

rhaschke commented 7 years ago

This continues #716 and #724 which were both merged already. I repeat here my comments to #724:

@v4hn Realizing the simplification we agreed on in the phone call turned out to be rather difficult:

  1. simplification of stateUpdateTimerCallback() (removing the additional timestamp check) caused dead locks. Hence, I reverted these changes.
  2. I successfully moved validatePlan() from MoveGroupInterface (client side) to MoveGroup's ExecutionServiceCapability (server side) - as agreed.
  3. Simplification of syncSceneUpdates(): We definitely need both while loops and the timeout. When the robot doesn't move at all, we will never receive a scene update. The first loop is required to check for a recent robot state update directly in CSM, while the second loop (in case there is no CSM) waits - with a timeout - for a direct scene update. There is no way to omit this timeout. The only chance would be to trigger all input channels of PSM to send updates - which is not possible. However, in move_group, this doesn't pose a problem, because there is a CSM instantiated.
  4. Removing syncSceneUpdates() after trajectory execution makes sense. Doing so, led to failure of the test suggested in #442. Now I fixed it. CSM::waitForCurrentState() didn't do what I expected from the name...

Some more thoughts:

davetcoleman commented 7 years ago

I think the integration tests make this PR way too big - should be separate as discussion on call

rhaschke commented 7 years ago

I'm afraid that I cannot handle the comments before next week. Please be patient.

v4hn commented 7 years ago

@rhaschke too bad. Sorry it took me so long :( @130s is it ok, if this will take another week before a new indigo release? Releasing without this fixed-up request is not an option.

davetcoleman commented 7 years ago

Hopefully before August 5th...

v4hn commented 7 years ago

@davetcoleman , @130s, @mikeferguson / whoever feels responsible. This is a bit critical, so please read and comment.

@davetcoleman this is not going to happen before Friday, I've been told @rhaschke is on vacation starting this Monday.

He will be really pissed because of that, but at the moment I'm considering a revert of the whole list of commits for #442 up to now. Maybe that's what we should have done directly after the unreviewed merge... We have at least one open bug with respect to these changes by @dornhege, a list of things that should be improved, ABI breakage w.r.t. the last release and I just noticed a segfault in the RViz plugin in my setup when removing the MotionPlanning display, that gdb blames on the new spinner_ that has been added to the PlanningSceneMonitor. I'm not sure we can have that spinner there, because ros keeps a recursive mutex on spinners, and there must only be one thread that keeps spinners. I'm pretty sure this might have broken API in a totally unintuitive way (definitely not acceptable in indigo), because if you keep a AsyncSpinner around from your main thread and create the monitor from a different thread things probably break.

Overall, this is no state in which we should let people work on "minor" issues and the whole "fix the start state" patch is clearly not ready for a release.

I therefore propose to revert these change-sets and try to add only the part that makes the ExecutionController check whether the trajectory it is about to execute starts somewhere around the current state. This should be possible without most of this code and without breakage. Later on we can focus on getting this to work without this many issues..

davetcoleman commented 7 years ago

Yes, after seeing https://github.com/ros-planning/moveit_ros/issues/736 and possibly related https://github.com/ros-planning/moveit/issues/10 I have been worried about the state of the PlanningSceneMonitor. Considering it was never fully reviewed I do not think we should feel guilty about the revert. +1

dornhege commented 7 years ago

+1 for the revert. Ideally before the merge and unless it is shown that #736 is not an issue in general or independent from this, we cannot release. The PR is still there to be worked on, but this seems to be not trivial to get right.

rhaschke commented 7 years ago

Looks like AsyncSpinner isn't really usable as intended. As I can't work on the issue in the next days, go for the revert as suggested by @v4hn.

davetcoleman commented 7 years ago

@v4hn can you do the revert today before tomorrow's merge?

v4hn commented 7 years ago

I opened a request to revert the previous merge commits in #742 .

I'll close this request. We will have to look through the changes again anyway and rebase/reformat them for the merged repo.

rhaschke commented 7 years ago

@v4hn @dornhege @davetcoleman I'm back from vacation and I'm going to look into the AsyncSpinner issue now.

Obviously any use of an AsyncSpinner is currently not safe: If they are started from different threads, only the very first one will actually be started. Hence it depends on the order of their start, which one will be active, which is not acceptable. As there are multiple AsyncSpinners used throughout the source, #736 reported by @dornhege might be only the tip of the iceberg.

I will try to fix that issue upstream in ros_comm and then analyze, why it failed in #736 but not for me.

v4hn commented 7 years ago

On Mon, Aug 15, 2016 at 12:32:20AM -0700, Robert Haschke wrote:

I will try to fix that issue upstream in ros_comm and then analyze, why it failed in #736 but not for me.

Thanks! This is clearly a bug in ros_comm and the commit that introduced the global mutex there is at fault. The trouble is, it might be hard to remove it again without understanding why it was introduced in the first place.. Maybe it's worth a try to contact the author? It has been a couple of years, but maybe he remembers the original bug (if there was one)?

davetcoleman commented 7 years ago

Welcome back @rhaschke, hope vacation was good! Thanks for continuing to work on this

rhaschke commented 7 years ago

The other usage of AsyncSpinners in moveit_ros source tree is uncritical. They mostly belong to stand-alone programs. I think, the reason that @dornhege observed the bug, but not me, was that he was running the whole perception pipeline too. The perception/mesh_filter instantiates its own PlanningSceneMonitor and thus has thrown the error.

I prepared https://github.com/ros/ros_comm/pull/867 and hope that they will consider it timely.

v4hn commented 7 years ago

Wow, given the scope of your patch @rhaschke I'm pretty sure this will not be merged in indigo, not sure about kinetic. But we should probably fix the safety issue in indigo...