Open bit-pirate opened 10 years ago
After I implemented better exception handling in the rapp manager, I can see that there seems to be a mix up with the launchers PIDs:
[INFO] [WallTime: 1399896717.099010] Rapp Manager : stopped rapp [rocon_apps/teleop]'.
[INFO] [WallTime: 1399896717.099275] Rapp Manager : Stopping required capabilities.
[INFO] [WallTime: 1399896717.102842] Request to free usage of capability 'std_capabilities/DifferentialMobileBase' (bond id '85351b84-d9cd-11e3-b0f9-f46d04929542')
[INFO] [WallTime: 1399896717.103181] Capability 'std_capabilities/DifferentialMobileBase' being stopped because it has zero references
[depthimage_to_laserscan-18] killing on exit
[camera/disparity_registered_hw-17] killing on exit
[camera/disparity_registered_sw-16] killing on exit
[camera/points_xyzrgb_hw_registered-14] killing on exit
[camera/disparity_depth-15] killing on exit
[camera/depth_registered_rectify_depth-13] killing on exit
[camera/points_xyzrgb_sw_registered-12] killing on exit
[camera/register_depth_rgb-11] killing on exit
[camera/depth_points-10] killing on exit
[camera/depth_metric-9] killing on exit
[camera/depth_metric_rect-8] killing on exit
[camera/depth_rectify_depth-7] killing on exit
[camera/rectify_ir-6] killing on exit
[camera/rectify_color-5] killing on exit
[camera/rectify_mono-4] killing on exit
[camera/debayer-3] killing on exit
[camera/driver-2] killing on exit
[camera/camera_nodelet_manager-1] killing on exit
shutting down processing monitor...
... shutting down processing monitor complete
done
[INFO] [WallTime: 1399896719.540929] Capability Provider 'turtlebot_capabilities/rgbd_sensor' for Capability 'std_capabilities/RGBDSensor' has terminated.
[turtlebot_bringup-2] killing on exit
Traceback (most recent call last):
File "/opt/kobuki_workspace/turtlebot/src/turtlebot/turtlebot_capabilities/providers/launch/placeholder.py", line 6, in <module>
sleep(0.1)
KeyboardInterrupt
shutting down processing monitor...
... shutting down processing monitor complete
done
[robot_state_publisher-1] killing on exit
shutting down processing monitor...
... shutting down processing monitor complete
done
[ERROR] [WallTime: 1399896720.232718] Capability Instance 'turtlebot_capabilities/turtlebot2_bringup' cannot transition to 'stopping' from anything but 'launching' or 'running', current state is 'stopping'
[INFO] [WallTime: 1399896720.233108] Capability Provider 'turtlebot_capabilities/turtlebot2_bringup' for Capability 'turtlebot_capabilities/TurtleBotBringup' has terminated.
[INFO] [WallTime: 1399896720.234642] Stopping the 'turtlebot_capabilities/diagnostics' provider of the 'std_capabilities/Diagnostics' interface, because it has no dependents left.
[ERROR] [WallTime: 1399896720.235318] RuntimeError: No running launch file with PID of '13277'
[INFO] [WallTime: 1399896720.240342] Request to free usage of capability 'std_capabilities/RGBDSensor' (bond id '85351b84-d9cd-11e3-b0f9-f46d04929542')
[ERROR] [WallTime: 1399896720.240845] RuntimeError: No Capability 'std_capabilities/RGBDSensor' in use
[ERROR] [WallTime: 1399896720.241382] Rapp Manager : Failed to stop capabilities. Errors: ["Exception occurred while stopping cap 'std_capabilities/DifferentialMobileBase': service [/capability_server/free_capability] responded with an error: error processing request: No running launch file with PID of '13277'", "Exception occurred while stopping cap 'std_capabilities/RGBDSensor': service [/capability_server/free_capability] responded with an error: error processing request: No Capability 'std_capabilities/RGBDSensor' in use"]
As you can see in the above example (check the last few lines) , the capability server cannot find the corresponding launch file PID, when stopping the std_capabilities/DifferentialMobileBase
capability. As a result this capability remains running.
Interesting is, that this capability launcher seems to be registered under a different PID:
$ rosservice call /capability_server/get_running_capabilities
running_capabilities:
-
capability:
capability: kobuki_capabilities/KobukiBringup
provider: kobuki_capabilities/kobuki_bringup
dependent_capabilities:
-
capability: std_capabilities/DifferentialMobileBase
provider: kobuki_capabilities/kobuki_differential_mobile_base
started_by: kobuki_capabilities/kobuki_differential_mobile_base
pid: 13264
-
capability:
capability: std_capabilities/DifferentialMobileBase
provider: kobuki_capabilities/kobuki_differential_mobile_base
dependent_capabilities: []
started_by: user service call
pid: 13267
Also, stopping this capability by hand works fine.
t$ rosservice call /capability_server/stop_capability "capability: 'std_capabilities/DifferentialMobileBase'"
successful: True
[INFO] [WallTime: 1399896720.970543] Capability Provider 'turtlebot_capabilities/robot_state_publisher' for Capability 'std_capabilities/RobotStatePublisher' has terminated.
[INFO] [WallTime: 1399896720.971037] Capability Provider 'turtlebot_capabilities/diagnostics' for Capability 'std_capabilities/Diagnostics' has terminated.
[INFO] [WallTime: 1399896883.873052] Request to stop capability 'std_capabilities/DifferentialMobileBase'
[INFO] [WallTime: 1399896883.874446] Capability Provider 'kobuki_capabilities/kobuki_differential_mobile_base' for Capability 'std_capabilities/DifferentialMobileBase' has terminated.
[INFO] [WallTime: 1399896883.874859] Stopping the 'kobuki_capabilities/kobuki_bringup' provider of the 'kobuki_capabilities/KobukiBringup' interface, because it has no dependents left.
[mobile_base-1] killing on exit
[ INFO] [1399896883.960348772]: Kobuki : waiting for kobuki thread to finish [/mobile_base].
shutting down processing monitor...
... shutting down processing monitor complete
done
[INFO] [WallTime: 1399896885.430019] Capability Provider 'kobuki_capabilities/kobuki_bringup' for Capability 'kobuki_capabilities/KobukiBringup' has terminated.
Ok, I'll try to look into in what cases the PID would not be available and see if I can improve the error handling as well as figure out why it might be happening and close that loop hole.
Requesting to stop a stopping cap probably shouldn't be treated as an error
So what should happen if you try to stop a capability which is not running at all? I would say an error and so if you treat a "stopping" capability differently then you are in a race condition to get an exception, which I think is pretty bad.
I think instead that if the code which is calling stop capability might be doing so redundantly then it should explicitly catch and pass on that failure. Do you have some use case in mind where that doesn't make sense?
I still haven't found where this problem of missing PID's is coming in from. I'll keep digging.
Requesting to stop a stopping cap probably shouldn't be treated as an error
So what should happen if you try to stop a capability which is not running at all?
Well, here the capability is still running, but currently shutting down. Hence, stopping a already stopping cap could be treated with sending out a warning at max.
My user code (the rapp manager) actually doesn't actually call stop capability at all. It's only using use/free capability. In the present case it seems like the long shut down time of the RGBDSensor cap leads to some kind of race condition.
The strange things to me here are:
std_capabilities/DifferentialMobileBase
's launcher with PID 13277, when the actual PID is 13267?If you like to test this yourself, you need
caps_rapps_upgrade
branch of the turtlebot repo.caps_start_stop_fix
branch of the rocon_app_platform repo.Then
$ roslaunch turtlebot_bringup rapp_platform.launch --screen
And
$ rosservice call /turtlebot/start_rapp "name: 'rocon_apps/teleop'"
Well, here the capability is still running, but currently shutting down.
That's my point, if you treat calling "stop capability" on a terminated (not running) and a stopping capability differently then you will have race condition dependent behavior. I don't think that is great, therefore I would say calling stop on a stopping capability is an error.
My user code (the rapp manager) actually doesn't actually call stop capability at all. It's only using use/free capability.
There we agree, there is something fishy going on which is exposing this problem internally, but I think this is separate from what should be done when stop is called (externally or internally) on a stopping capability.
Thanks I'll try to set that up locally and reproduce the same issue.
Well, here the capability is still running, but currently shutting down.
That's my point, if you treat calling "stop capability" on a terminated (not running) and a stopping capability differently then you will have race condition dependent behavior. I don't think that is great, therefore I would say calling stop on a stopping capability is an error.
:+1:
@wjwwood is there any progress on this? It's a real show stopper right now. Without this fixed, I can't merge rapps with caps support with the main branches in ROCON.
@bit-pirate I made a little progress about three days ago getting the environment setup and identifying potential problems, but I haven't fixed the issue yet. I'll spend time on it today.
Currently testing the
rocon_apps/teleop
rapp with capability support and noticed a strange error when shutting down the rapp and its dependent capabilities.So, here an error is thrown when the
RGBDSensor
capability is requested to be stopped, while it is in the stopping state. This error is reported in the service response, which makes the rapp manager abort the stopping of the other caps.I see two improvements required.
I'll look into the stopping capability issue in the rapp manager (https://github.com/robotics-in-concert/rocon_app_platform/issues/233). @wjwwood could you take a look at the capability server part of this issue?