@miguelsdc Can you please provide a complete and reproducible example? The code from the referenced answers.ros.org post seems to be incomplete.
@dirk-thomas see the https://github.com/miguelsdc/deadlockTest repository. It contains a complete ROS project (rosbuild; I haven't yet got the hang of catkin) that will make actionlib deadlock.
I know the code seems too demanding on actionlib and not representative of a real workload (e.g. a high frequency of requests, and the remote task is not really useful). As it turns out, getting one thread to accept an already-existing goal at the same time as the other thread is processing a newly arrived goal doesn't happen easily. The code in the repo triggers the deadlock reliably (particularly with the brutal launch file). Moreover, for some reason, using naoqi+nao_controller.py also triggers the deadlock (which is how I originally came across this issue).
Unfortunately, deadlockTest also deadlocks even after the proposed patch #5 is applied. There are three possible reasons: 1) deadlockTest is badly coded, 2) the patch is badly coded, or 3) there is another potential deadlock in simple_action_server.py.
I will investigate further tomorrow...
As it turns out, deadlockTest uncovered other race conditions. #5 has been updated to fix these race conditions.
As an informal test, roslaunching deadlockTestBrutal results in a deadlock in less than 5 seconds on my machine. After applying the patch, my machine ran happily for 15 minutes...
Hope that helps, Miguel S.
I ran your example and reviewed the changes, and can confirm that it fixes the deadlock.
But I am a bit worried about the removed lock in executeLoop(). I do see why it was necessary to avoid locking in the wrong order, but I think it enables new race conditions (though I have not yet been able to modify your example to expose the cases I suspect exist).
"These actions can be safely performed without locking; at worst, we wait for an iteration to process a new goal"
I am actually not sure I agree with this statement in the patch. I think worse things can happen than just skipping a cycle. One example would result in error messages due to no longer locking around self.is_active().
Okay, I've been re-reading the source code of simple_action_server.py and I still think there's no need to lock that area in executeLoop(). Let me explain why.
There are two functions that can potentially cause race conditions: self.is_active() and self.is_new_goal_available().
Regarding the first one: in polling implementations there's no problem because executeLoop() is never called. In callback implementations I'm assuming all self.set_aborted/preempted/succeeded() calls should be made from self.execute_callback(goal). If the user forgets to set the status, the goal is set to aborted (and the user is warned). I've searched through the code and found no instance of any other function setting the state of the current goal. As far as I can tell, the only calls that change the outcome of is_active() are made by the executeLoop() thread. So no race conditions on the first one, unless I'm missing something?
Similarly, for self.is_new_goal_available() there could be a serious error if self.new_goal were set to False between executeLoop()'s call to self.is_new_goal_available() and its call to self.accept_new_goal(). However, a quick search of the source code reveals that self.new_goal is only set to False by self.accept_new_goal(). Ergo, no race conditions :)
Nevertheless, needing this much text to explain that the code is okay (assuming I haven't missed anything) is clearly far from ideal, and it could cause trouble should some of these assumptions stop being valid in the future.
I see three options going forward:

1. Keep the patch as it is now, with no lock around that area of executeLoop().
2. Re-add the lock using a with self.action_server.lock, self.lock: statement and leave the area as it was before (I originally had it like this, but I was concerned it might keep action_server.py locked for too long).
3. Get rid of self.lock and just use self.action_server.lock instead.

I reckon we should only take option 1 if you don't find any issues with the patch as it is now.
Option 2 is the easiest to implement and, to be honest, I don't think the performance penalty would be that significant.
Option 3 may seem stupid, but as the code stands, whenever you hold self.lock you end up holding self.action_server.lock too (the only case where this is not explicit is in internal_goal_callback() and internal_preempt_callback(), but I'm pretty sure those functions are called from within critical sections of action_server.py anyway). Option 3 has the advantage that deadlocks are no longer possible, and it makes the code more readable.
Which one should we go for? Or do you have any other ideas?
I wrapped my head around it one more time. I still think there are potential race conditions related to is_active and is_new_goal_available being called outside a lock, but the worst that should happen in those cases is a warning/error message on the console.
Therefore I think option 2 or 3 would be over-locking.
I have created an updated pull request which mainly adds comments and removes the useless shall_run variable. Does that still look good to you?
It would be very valuable to have your deadlock test as an actual unit test in actionlib. Could you convert it into a unittest using GTest to ensure that this will not break again in the future?
The updated pull request looks great! Indeed, the comments are more useful and the code in executeLoop() is clearer.
I think it may be a good idea to cherry-pick the patch into the fuerte, groovy and hydro branches (all distros currently share the same simple_action_server.py file, after all). Even more so since fuerte's EOL is fast approaching. What do you think?
I'll convert deadlockTest into an actual unit test (but using Python's unittest, if that's okay) and create a new pull request for it shortly.
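For illustration, a rough sketch of the shape such a test could take. This is not the test actually submitted in #8/#9; the node, namespace, and class names are made up, and it assumes actionlib's Test.action definition plus a rostest environment with a running roscore.

```python
#!/usr/bin/env python
# Sketch only -- not the test submitted in #8/#9.
import unittest

import rospy
import actionlib
from actionlib.msg import TestAction, TestGoal


class DeadlockSketchTest(unittest.TestCase):

    def execute_cb(self, goal):
        # The work itself is irrelevant; high goal turnover is the point.
        self.server.set_succeeded()

    def test_rapid_goal_turnover(self):
        self.server = actionlib.SimpleActionServer(
            'deadlock_sketch', TestAction,
            execute_cb=self.execute_cb, auto_start=False)
        self.server.start()

        client = actionlib.SimpleActionClient('deadlock_sketch', TestAction)
        self.assertTrue(client.wait_for_server(rospy.Duration(10.0)))

        # Hammer the server with goals for a while; if the executeLoop thread
        # wedges, results stop coming back and the wait below times out.
        deadline = rospy.Time.now() + rospy.Duration(15.0)
        while rospy.Time.now() < deadline and not rospy.is_shutdown():
            client.send_goal(TestGoal())
            self.assertTrue(
                client.wait_for_result(rospy.Duration(5.0)),
                'server stopped answering -- possible deadlock')


if __name__ == '__main__':
    import rostest
    rospy.init_node('deadlock_sketch_node')
    rostest.rosrun('actionlib', 'deadlock_sketch', DeadlockSketchTest)
```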
Great, thank you.
Just added a new pull request with the proposed test (see description in #8).
Resolved with #6 and covered by test in #9.
From: http://answers.ros.org/question/50276/python-simpleactionserver-stops-unexpectedly-deadlock/
I wasn't able to get the python debugger to produce the stack traces that explained the problem. But after looking at the roslog debug messages and the source code, I came up with the following reconstruction of the events:
Thread A (rospy's message dispatcher?)
- Deadlocked in: self.execute_condition.acquire()
- In function: SimpleActionServer.internal_goal_callback() [simple_action_server.py:211]
- Which was called from: ActionServer.internal_goal_callback() [action_server.py:293]
- This thread holds ActionServer.lock and wants to acquire SimpleActionServer.lock (the condition variable was initialised with the latter lock).

Thread B (SimpleActionServer's executeLoop thread)
- Deadlocked in: with self.action_server.lock
- In function: ServerGoalHandle.set_accepted() [server_goal_handle.py:71]
- Which was called from: SimpleActionServer.accept_new_goal() [simple_action_server.py:131]
- Which was called from: SimpleActionServer.executeLoop() [simple_action_server.py:284]
- This thread holds SimpleActionServer.lock and wants ActionServer.lock.

In summary, if a new goal arrives at the same time executeLoop is trying to accept a previous (but still new) goal, SimpleActionServer will deadlock.
I suspect the solution involves calling accept_new_goal() [simple_action_server.py:284] without holding SimpleActionServer.lock. My intuition is that simply setting a flag will do, but I will have to study the code a bit more to make sure there are no side effects.
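To make the opposing acquisition orders explicit, a bare-bones sketch using plain threading.Lock objects; the lock variables are stand-ins, and the real code paths are the ones listed in the reconstruction above.

```python
import threading

# Stand-ins for the two locks (not the real actionlib objects).
action_server_lock = threading.Lock()         # ActionServer.lock
simple_action_server_lock = threading.Lock()  # SimpleActionServer.lock


def thread_a_goal_callback():
    # rospy dispatch thread: ActionServer.lock is already held when
    # SimpleActionServer.internal_goal_callback() tries to acquire the
    # condition variable built on SimpleActionServer.lock.
    with action_server_lock:
        with simple_action_server_lock:
            pass


def thread_b_execute_loop():
    # executeLoop thread: holds SimpleActionServer.lock while
    # accept_new_goal() -> set_accepted() asks for ActionServer.lock.
    with simple_action_server_lock:
        with action_server_lock:
            pass


# Run concurrently with the wrong timing, each thread ends up holding one
# lock while waiting for the other: a classic lock-ordering deadlock.
if __name__ == '__main__':
    thread_a_goal_callback()   # fine in isolation
    thread_b_execute_loop()    # fine in isolation
```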